Quick Links

Dynamic Partitioning using Segment Visibility Maps

Lists:	pgsql-hackers

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-02 17:56:14
Message-ID:	1199296574.7260.149.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Happy New Year, everybody.

This proposal follows on from previous thinking about partitioning,
where I've taken up Andrew Sullivan's suggestion to re-examine the
current partitioning concept of using tables as partitions. So I've come
up with an alternative concept to allow us to discuss the particular
merits of each. ISTM that this new proposal has considerable potential.

Recap: Very Large Table Use Case
--------------------------------

Many large tables have a regular access pattern:
- new rows inserted frequently
- some updates and deletes on the "recent" data
- older data becomes effectively read-only
- at some point data may be removed in bulk from the table

Most tables vary on what "recent" means; some are insert only, many not.
A typical example might be a large history table where the last 6 months
data is updated, but after that the data has almost zero change.

Partitioning relies on knowledge of the physical location of data to
exclude sections of data from a scan.

Currently the partitioning knowledge is provided by user declarations in
the form of constraints on tables linked together by views or
inheritance. We also currently do this in "constraint-first" fashion,
i.e. we put the data where the constraints say we should, rather than
deriving the constraints from the data.

In-table Partitioning
---------------------

In the common use-case there is a clear affinity between certain data
columns and the physical placement of data in a table. This is true for
columns such as LOG_DATE or TRANSACTION_DATE, but is also true for
any columns where the id column is assigned by a Sequence. HOT helps
this to remain true over time.

Given the above use case, when there is an affinity between data values
and data placement *and* we know that older data tends to become read
only, we can then realise that certain physical sections of a table will
become effectively read only. (My experience is that the larger the
table, the less random the write pattern - whatever the guys that wrote
TPC-C say).

If we were able to keep track of which sections of a table are now
read-only then we would be able to record information about the data in
that section and use it to help solve queries. This is turning the
current thinking on its head: we could derive the constraints from the
data, rather than placing the data according to the constraints. That
would be much more natural: load data into the table and have the system
work out the rest.

We can imagine that we do this within a single table. Imagine: split a
table up into 100 sections, then record the min/max values of all tuples
in those sections. When we perform a SeqScan we could skip over sections
that could be proved would never satisfy the query predicate. Same thing
as partitioning, but within the table.

Currently tables are already broken up into 1GB file segments, so it
seems fairly natural to pick that as our section size. That fits
reasonably well with recommendations for ideal partition size.

Segment Exclusion
-----------------

After we note that a segment is read-only we can scan the segment and
record min/max values for all columns. These are then "implicit
constraints", which can then be used for segment exclusion in a similar
manner as we do with the current constraint exclusion mechanism.

The implicit constraints can then allow SeqScans to assess segment
exclusion for each segment at execution time, i.e. a scan can skip a
whole segment if constraints don't match. If a segment does not (yet)
have implicit constraints it will always be scanned in full. This allows
partitioning, synch scans and buffer recycling to work much better
together than they currently do.

Other scans types might also use segment exclusion, though this would
only be useful for scans retrieving many rows, otherwise the overhead of
segment exclusion might not be worthwhile.

This would also allow a Merge Join to utilise exclusion for the inner
plan at execution time. Instead of scanning the whole inner table plan
from the start, we would be able to start the scan from the appropriate
segment. This would require us to pass the current value of the outer
plan down into the inner plan. The typical executor nodes on the inner
plan would be a SortNode and below that a SeqScan. In that case the
executor would need to pass the outer value from the MergeJoinNode down
thru the SortNode to the SeqScan node. The SeqScan node could then
perform partition exclusion, reducing the time for that scan and also
reducing the time for the resulting sort. This sounds like it might be
worth doing in normal cases also. It might turn out that the potentially
applicable cases are already excluded during planning, I haven't thought
about that aspect in enough detail yet.

If we collect data for all columns then many of our implicit constraints
would be useless. e.g. if a column only has a few values and these are
present in all segments. Matching our predicate against all constraints
would be expensive, so we must weed out poor constraints. We would do
this by removing any constraint that overlapped more than 10% of other
segments. Various heuristics would likely need to be discovered during
development to make this work without resorting to manual commands.

Note that all of this exclusion is happening in the executor, not the
planner. That allows this form of exclusion to work with stable
functions and parameters without problem.

Noting which segments are read-only
-----------------------------------

Everything so far has relied upon our ability to note which segments of
a table are read-only. We could do this in two ways

1) have the system automatically keep track of non-changing data
2) have the DBA issue a command to "mark" a segment as read-only now

Probably a combination of the two is better, so we have three states for
segments
- READ_WRITE_ALLOWED
- EFFECTIVE_READ_ONLY (set by 1 only)
- MARKED_READ_ONLY (set by 2 only)

Having said that I want to concentrate on (1), though consider the other
one also in case requested by reviewers.

Visibility Map
--------------

So *how* would we mark segments as read only? It turns out that
attempting to do this is extremely similar to Heikki's Visibility Map
idea, so I'll describe it in those terms, even though there are some
differences. It also brings to mind Andrew Sullivan's read-only tuples
concept.

We would keep a dynamic visibility map at *segment* level, showing which
segments have all rows as 100% visible. No freespace map data would be
held at this level.

Most everything about Heikki's Visibility Map idea holds, just that we
have a bit for each segment in the table, rather than each block. The
map is very small and hardly ever changes for a table, so we can cache
it easily on each backend. If the map does change we can perform a
relcache invalidation to make everybody re-read the map.

No dynamic shared memory cache is required because any concurrent
changes to the table would be ignored by a Scan anyway, so it doesn't
matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
new scans that start will attempt to lock the table and then perform a
rel cache check before continuing. So the visibility will be set
correctly for *that* scan at least.

In most cases the visibility map can be summarised as a single boolean
to show whether any 100% visible segments exist. That makes accessing
the map very cheap in the common, unset case.

Setting the Visibility Map
--------------------------

VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of
them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to
all current backends. Marking the map will cause a rel cache
invalidation.

We would never mark the highest numbered segment
EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur.
This is also a useful way of excluding smaller tables from the overheads
involved for this feature.

In an insert-only table this would mean that only the last 1 or 2
segments would be read write, so VACUUM would always run in a bounded
time, no matter how large the table.

When would we attempt to set the visibility map? If we do this
aggressively, then it will be quickly unset. If we wait too long then we
might lose some of the benefits of this approach.

First, we recognise that read only segments will change the autovacuum
calculation - we simply exclude them entirely. This will mean that
VACUUMs are run more regularly than they are now, but when they do run
they will be quicker. (Note that currently, VACUUMs on a growing table
get further and further apart as the table grows, assuming constant
write rate).

My feeling is that VACUUMs are sufficiently infrequent that we should
set the visibility map aggressively after each VACUUM. If the map is
quickly unset, then we wait some time before trying again. Unsetting
seems to be relatively low contention (see below), so that seems
acceptable.

We would then re-scan newly marked segments to derive min/max values for
implicit constraints, performed by autovacuum worker processes. The best
way seems to be that an ANALYZE command would automatically detect that
a segment has been recently set EFFECTIVE_READ_ONLY and perform an
all-rows scan rather than just a sample. (We would keep min/max values
only for datatypes with valid btree sort operators, just as ANALYZE
already does). The data for this can be stored as part of the stats
histogram for each column. pg_class.reltuples would become an array of
row counts in each segment, known accurate for read only segments.

A later VACUUM will still be required to freeze the rows, though that
can happen later when we hit the frozen limit, or earlier if we run an
explicit VACUUM FREEZE. We would need to store the oldest xid for each
segment, so we know when to kick off a FREEZE on that part of the table.
We would do this by making pg_class.relfrozenxid an array also.

This departs completely from the idea that the visibility map and the
freespace map are related somehow. However, there is a change required
in the FSM to ensure the proposed technique is stable and useful: If we
scan a segment and it is 100% visible, if there is freespace in just one
of the blocks in the segment then we will soon find that our
visibility-bit will be set to off again very quickly.

If there is more than 5% freespace total in the segment then we will not
set EFFECTIVE_READ_ONLY, nor report blocks in a segment to the FSM. That
way, small amounts of freespace won't destroy the benefits of segment
exclusion. VACUUM currently overwrites FSM information, though if this
was not the case then we would have to actively remove it for the newly
read only segments. Perhaps that limit would be (100 - fillfactor)%,
which is normally set according to a user's expectation of the number of
updates likely on a table.

Unsetting the visibility map
----------------------------

For EFFECTIVE_READ_ONLY, INSERTs, UPDATEs and DELETEs are still
possible. For MARKED_READ_ONLY it would be explicitly refused.

The relcache invalidation caused by making a segment EFFECTIVE_READ_ONLY
can flush the rd_targBlock for the relation in each backend. Other
INSERTs are not possible since all current code-paths either use
extension or the FSM when the table is non-empty, and neither of those
will cause a problem. Extension only touches last segment, which will
never by EFFECTIVE_READ_ONLY. The FSM contains no entries for an
EFFECTIVE_READ_ONLY segment.

UPDATEs or DELETEs are possible on EFFECTIVE_READ_ONLY segments, but we
don't expect it often. If this occurs, we will simply reset the
visibility map for the table. This can be handled with a
non-transactional overwrite - the bit always exists already, so the
overwritten data is always same size. That's pessimistic, since it
resets the state even if the UPDATE/DELETE aborts, but its better than
holding the row locked until the transaction completes, which might
prevent other things from happening. We probably also want VACUUM to
clear up any aborted transactions anyway.

The user may also wish to clear down very old data, so allowing DELETEs
can ensure the user can still remove old data from the table. By
carefully choosing the values to be deleted, a whole segment can be
deleted and then returned to the FSM.

Running a huge DELETE will likely perform a SeqScan that avoids all but
the deleting data. The following VACUUM will also ignore the other
segments, so removing older data seems fairly well optimised already, in
comparison with now. It would be possible to optimise this to perform a
segment-level truncate, but since the partitioning will already improve
the DELETE and VACUUM, I'm not proposing that yet, if ever.

SHARE locks currently write to data blocks, but since they represent
transient state we do not need to unset either the partitioning or the
visibility map.

Overlap with Visibility Map Proposal
------------------------------------

This proposal offers many of the advantages of the earlier Visibility
Map proposal, yet without major changes to heap structure. Since the
segment-level visibility map is more granular it would only be 80% as
effective as the more detailed block-level map. Having said that, the
bookkeeping overheads will also be considerably reduced, so it does seem
a good joint approach. It also allows freezing to be handled fully,
which was a problem with the original visibility map proposal. WAL
logging visibility map changes is also now much easier.

My thoughts have been targeted directly at partitioning, yet I have to
admit that this idea overlaps, and in my world view, replaces the
Visibility Map proposal. I very much like the name Visibility Map
though. I've re-read the earlier thread and agree with Jeff Davis that
we *could* have multiple kinds of visibility map, but also agree with
Heikki that it seems like too much work. It was never my intent to
address both challenges in one proposal and this is just pure
coincidence, or possibly just luck, depending upon how you like the
proposal.

If we do need to differentiate between the two proposals, we can refer
to this one as the Segment Visibility Map (SVM).

Index Scans
-----------

Index-only scans would look at the Visibility Map, just as they would
have done in the earlier proposal.

It would be costly for an index scan to repeatedly check the visibility
map, but probably worth it to do so.

Other Changes
-------------

We can handle select count(*) by scanning the non-100% visible segments
of a table, then adding the stored counts for each segment to get a
final total. Not sure if its really worth doing, but it does sound like
an added bonus.

There would be additional complexity in selectivity estimation and plan
costing. The above proposal allows dynamic segment exclusion, which
cannot be assessed at planning time anyway, so suggestions welcome...

I think it would be prudent to have an override option on VACUUM to
allow us to scan a table in full, checking rather than using the
visibility map. This mode would be called VACUUM CHECK.

None of the above blocks earlier proposals for read only tables, and the
two features could work together very straightforwardly.

Comparison with other Partitioning approaches
---------------------------------------------

Once I realised this was possible in fairly automatic way, I've tried
hard to keep away from manual overrides, commands and new DDL.

Declarative partitioning is a big overhead, though worth it for large
databases. No overhead is *much* better though.

This approach to partitioning solves the following challenges
- allows automated table extension, so works automatically with Slony
- responds dynamically to changing data
- allows stable functions, nested loop joins and parametrised queries
- allows RI via SHARE locks
- avoids the need for operator push-down through Append nodes
- allows unique indexes
- allows both global indexes (because we only have one table)
- allows advanced planning/execution using read-only/visible data
- works naturally with synchronous scans and buffer recycling

All of the above are going to take considerably longer to do in any of
the other ways I've come up with so far...

And yes, this is a different conclusion to earlier discussions,
particularly those in 2005 where I was still thinking in terms of
declarative partitioning.

Conclusions
-----------

This technique would be useful for any table with historical data keyed
by date or timestamp. It would also be useful for data where a
time-of-insert component is implicit, such as many major entity tables
where the object ids are assigned by a sequence. e.g. an Orders table
with an OrderId as PK. Once all orders placed in a period have been
shipped/resolved/closed then the segments will be marked read-only.

Its not really going to change the code path much for small tables, yet
seems likely to work reasonably well for large tables of all shapes and
sizes. If a segment is being updated, we leave it alone, and maybe never
actually set the visibility map at all. So overall, this idea seems to
cover the main use case well, yet with only minimal impact on the
existing use cases we support.

As before, I will maintain this proposal on the PG developer Wiki, once
we get down to detailed design.

Like it?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Sam Mason <sam(at)samason(dot)me(dot)uk>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-03 00:41:04
Message-ID:	20080103004104.GK11262@frubble.xen.chris-lamb.co.uk
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
> Like it?

Sounds good. I've only given it a quick scan though. Would read-only
segments retain the same disk-level format as is currently? It seems
possible to remove the MVCC fields and hence get more tuples per page---
whether this would actually be a net performance gain/loss seems like
a difficult question question to answer, it would definitly be a
complexity increase though.

Reading this reminds me of the design of the store for a persistent
operating system called EROS. It has a very good paper[1] describing
the design (implementation and careful benchmarking thereof) that I
think could be a useful read.

A lot of your design sounds like the EROS store, with the the
"Checkpoint Area" being, in current and all previous versions of
Postgres, the only place data is stored. Data in EROS also has a "Home
Location" which is where the data ends up after a checkpoint, and sounds
somewhat like the proposed read-only.

Checkpoints here serve a similar purpose than checkpoints to PG, so the
following analogy may get a bit confusing. When you're reading the
paper I'd try and imagine the checkpoints not occurring as one event,
but spread across time as the database recognizes that data is now (or
has been marked as) read-only. The home locations would then store
only the read-only copies of the data and all the churn would (if the
recognition of read-only data works) be done in the checkpoint area.

Sam

[1] http://www.eros-os.org/papers/storedesign2002.pdf

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Sam Mason <sam(at)samason(dot)me(dot)uk>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-03 09:44:25
Message-ID:	1199353465.7260.173.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2008-01-03 at 00:41 +0000, Sam Mason wrote:
> On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
> > Like it?
>
> Sounds good. I've only given it a quick scan though. Would read-only
> segments retain the same disk-level format as is currently?

Yes, no changes at all to the table. So your app would just work,
without any DDL changes. Existing partitioning apps would not change.

> It seems
> possible to remove the MVCC fields and hence get more tuples per page---
> whether this would actually be a net performance gain/loss seems like
> a difficult question question to answer, it would definitly be a
> complexity increase though.

I've been looking at general compression at table and segment level, but
thats further down the track. Removing the MVCC fields is too much work,
I think.

> Reading this reminds me of the design of the store for a persistent
> operating system called EROS. It has a very good paper[1] describing
> the design (implementation and careful benchmarking thereof) that I
> think could be a useful read.

Thanks, will do.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"Sam Mason" <sam(at)samason(dot)me(dot)uk>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 07:36:19
Message-ID:	9362e74e0801032336u23f2453hfc97ffb4ac623021@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Simon,
Looks like a novel idea. I just want to confirm my
understanding of the proposal.
a) This proposal would work for the kind of table organizations which are
currently partitioned and maintained based on some kind of timestamp.
Consider one of the use-case. A large Retail firm has a lot of stores. DBA
maintains and updates the inventories of those stores in hash-partitions
based on store-no. As the inventory gets updated, the corresponding
partition receives the update and it goes like that..
Here all the partitions are going to get constantly updated. So no
partition can possibly become a read only partition. You have clearly called
it out in your para. i am just re-confirming on that. Or do you have
something for this in your soln.?

To my limited experience, most partition strategies are based on some form
of time-stamp. If the proposed soln. can cater to those, it has lot of
use-cases.

Thanks,
Gokul.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	Sam Mason <sam(at)samason(dot)me(dot)uk>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 09:59:05
Message-ID:	1199440745.18598.12.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-01-04 at 13:06 +0530, Gokulakannan Somasundaram wrote:

> a) This proposal would work for the kind of table organizations which
> are currently partitioned and maintained based on some kind of
> timestamp. Consider one of the use-case. A large Retail firm has a lot
> of stores. DBA maintains and updates the inventories of those stores
> in hash-partitions based on store-no. As the inventory gets updated,
> the corresponding partition receives the update and it goes like
> that..
> Here all the partitions are going to get constantly updated.
> So no partition can possibly become a read only partition. You have
> clearly called it out in your para. i am just re-confirming on that.
> Or do you have something for this in your soln.?
>
> To my limited experience, most partition strategies are based on some
> form of time-stamp. If the proposed soln. can cater to those, it has
> lot of use-cases.

I don't think it would apply to an Inventory table. That is a current
state table.

It is designed for any large tables that would grow naturally over time
if we left them to do so. Solutions that it would work for:

- any Fact table where measurements/observations/events are accumulated
e.g.
Web Hits (any Internet events)
Call Detail Records
Sales
Security Events
Scientific Measurements
Process Control

- any Major Entity where new entities are created from a sequence
e.g.
Orders, OrderItems
Invoices
Shipments, Returns
most SCM/DCM events

It's not aimed at any particular benchmark, just real usage scenarios.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Richard Huxton <dev(at)archonet(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 10:22:40
Message-ID:	477E08F0.7080308@archonet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> We would keep a dynamic visibility map at *segment* level, showing which
> segments have all rows as 100% visible. No freespace map data would be
> held at this level.

Small dumb-user question.

I take it you've considered some more flexible consecutive-run-of-blocks
unit of flagging rather than file-segments. That obviously complicates
the tracking but means you can cope with infrequent updates as well as
mark most of the "most recent" segment for log-style tables.

--
Richard Huxton
Archonet Ltd

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Richard Huxton <dev(at)archonet(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 10:40:17
Message-ID:	1199443217.18598.32.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-01-04 at 10:22 +0000, Richard Huxton wrote:
> Simon Riggs wrote:
> > We would keep a dynamic visibility map at *segment* level, showing which
> > segments have all rows as 100% visible. No freespace map data would be
> > held at this level.
>
> Small dumb-user question.
>
> I take it you've considered some more flexible consecutive-run-of-blocks
> unit of flagging rather than file-segments. That obviously complicates
> the tracking but means you can cope with infrequent updates as well as
> mark most of the "most recent" segment for log-style tables.

I'm writing the code to abstract that away, so yes.

Now you mention it, it does seem straightforward to have a table storage
parameter for partition size, which defaults to 1GB. The partition size
is simply a number of consecutive blocks, as you say.

The smaller the partition size the greater the overhead of managing it.
Also I've been looking at read-only tables and compression, as you may
know. My idea was that in the future we could mark segments as either
- read-only
- compressed
- able to be shipped off to hierarchical storage

Those ideas work best if the partitioning is based around the physical
file sizes we use for segments.

Changing the partition size would simply reset the visibility map for
that table, in its easiest implementation.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Richard Huxton <dev(at)archonet(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 10:59:43
Message-ID:	477E119F.1090102@archonet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> On Fri, 2008-01-04 at 10:22 +0000, Richard Huxton wrote:
>> Simon Riggs wrote:
>>> We would keep a dynamic visibility map at *segment* level, showing which
>>> segments have all rows as 100% visible. No freespace map data would be
>>> held at this level.
>> Small dumb-user question.
>>
>> I take it you've considered some more flexible consecutive-run-of-blocks
>> unit of flagging rather than file-segments. That obviously complicates
>> the tracking but means you can cope with infrequent updates as well as
>> mark most of the "most recent" segment for log-style tables.
>
> I'm writing the code to abstract that away, so yes.
>
> Now you mention it, it does seem straightforward to have a table storage
> parameter for partition size, which defaults to 1GB. The partition size
> is simply a number of consecutive blocks, as you say.
>
> The smaller the partition size the greater the overhead of managing it.

Oh, obviously, but with smaller partition sizes this also becomes useful
for low-end systems as well as high-end ones. Skipping 80% of a seq-scan
on a date-range query is a win for even small (by your standards)
tables. I shouldn't be surprised if the sensible-number-of-partitions
remained more-or-less constant as you scaled the hardware, but the
partition size grew.

> Also I've been looking at read-only tables and compression, as you may
> know. My idea was that in the future we could mark segments as either
> - read-only
> - compressed
> - able to be shipped off to hierarchical storage
>
> Those ideas work best if the partitioning is based around the physical
> file sizes we use for segments.

I can see why you've chosen file segments. It certainly makes things easier.

Hmm - thinking about the date-range scenario above, it occurs to me that
for seq-scan purposes the correct partition size depends upon the
data value you are interested in. What I want to know is what blocks Jan
07 covers (or rather what blocks it doesn't) rather than knowing blocks
1-9999999 cover 2005-04-12 to 2007-10-13. Of course that means that
you'd eventually want different partition sizes tracking visibility for
different columns (e.g. id, timestamp).

I suspect the same would be true for read-only/compressed/archived
flags, but I can see how they are tightly linked to physical files
(particularly the last two).

--
Richard Huxton
Archonet Ltd

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 12:29:55
Message-ID:	477E26C3.8040604@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello Simon,

Simon Riggs wrote:
> I've come
> up with an alternative concept to allow us to discuss the particular
> merits of each. ISTM that this new proposal has considerable potential.

Hm.. interesting idea.

> If we were able to keep track of which sections of a table are now
> read-only then we would be able to record information about the data in
> that section and use it to help solve queries. This is turning the
> current thinking on its head: we could derive the constraints from the
> data, rather than placing the data according to the constraints. That
> would be much more natural: load data into the table and have the system
> work out the rest.

Yeah, but that's also the most limiting factor of your approach: it
covers only horizontal partitioning by time (or to be more precise, by
columns which are very likely to increase or decrease with time). All
other columns will very likely contain values from the full range of
possible values.

As you have pointed out, that might be a very frequent use case. I can't
argue about that, however, I think it's important to be well aware of
that limitation.

> Other scans types might also use segment exclusion, though this would
> only be useful for scans retrieving many rows, otherwise the overhead of
> segment exclusion might not be worthwhile.

Uh.. the overhead of checking against min/max values doesn't seem that
big to me.

I rather think the gain for index scans would be prohibitively small,
because (given frequent enough vacuuming) an index scan shouldn't return
many pointers to tuples in segments which could be optimized away by
segment exclusion.

> If we collect data for all columns then many of our implicit constraints
> would be useless. e.g. if a column only has a few values and these are
> present in all segments. Matching our predicate against all constraints
> would be expensive, so we must weed out poor constraints. We would do
> this by removing any constraint that overlapped more than 10% of other
> segments. Various heuristics would likely need to be discovered during
> development to make this work without resorting to manual commands.

Uh, well, that's about the limitation I've pointed out above. But is it
really worth maintaining statistics about overlapping values and
removing min/max checks for certain columns?

It would save you the min/max check per segment and scan, but cost
maintaining the statistics and checking against the statistics once per
scan. AFAICS the block with the min/max tuple per segment will often
remain cached anyway... dunno.

> Noting which segments are read-only
> -----------------------------------
>
> Everything so far has relied upon our ability to note which segments of
> a table are read-only. We could do this in two ways
>
> 1) have the system automatically keep track of non-changing data
> 2) have the DBA issue a command to "mark" a segment as read-only now
>
> Probably a combination of the two is better, so we have three states for
> segments
> - READ_WRITE_ALLOWED
> - EFFECTIVE_READ_ONLY (set by 1 only)
> - MARKED_READ_ONLY (set by 2 only)
>
> Having said that I want to concentrate on (1), though consider the other
> one also in case requested by reviewers.

Hm.. AFAICT, horizontal partitioning often serves manageability, which
is quite limited having all data in one table (you can't move a single
segment to a different tablespace). Thus I think option 2 is pretty
constrained is usability. What would the DBA gain by setting a segment
to read only? How does the DBA figure out, in which segment a tuple is
stored in (so she can decide to mark it read only)?

> The user may also wish to clear down very old data, so allowing DELETEs
> can ensure the user can still remove old data from the table. By
> carefully choosing the values to be deleted, a whole segment can be
> deleted and then returned to the FSM.

Oh, yeah, that sounds like a good optimization. Bulk deletes, yummie!

> This proposal offers many of the advantages of the earlier Visibility
> Map proposal, yet without major changes to heap structure. Since the
> segment-level visibility map is more granular it would only be 80% as
> effective as the more detailed block-level map. Having said that, the
> bookkeeping overheads will also be considerably reduced, so it does seem
> a good joint approach. It also allows freezing to be handled fully,
> which was a problem with the original visibility map proposal. WAL
> logging visibility map changes is also now much easier.

I generally agree, although I'm somewhat dubious about the 80% factor.

> My thoughts have been targeted directly at partitioning, yet I have to
> admit that this idea overlaps, and in my world view, replaces the
> Visibility Map proposal. I very much like the name Visibility Map
> though.

I would even say, that partitioning is somewhat of a misleading term
here, because it normally allows the DBA to decide on where to split.

Given that we are operating on segments here, to which the DBA has very
limited information and access, I prefer the term "Segment Exclusion". I
think of that as an optimization of sequential scans on tables with the
above mentioned characteristics.

> If we do need to differentiate between the two proposals, we can refer
> to this one as the Segment Visibility Map (SVM).

I'm clearly in favor of separating between the two proposals. SVM is a
good name, IMHO.

> We can handle select count(*) by scanning the non-100% visible segments
> of a table, then adding the stored counts for each segment to get a
> final total. Not sure if its really worth doing, but it does sound like
> an added bonus.

Yup, sounds tempting. Although it's contrary to Postgres position so
far. And one could argue that you'd have to maintain not only count(),
but also avg(), sum(), etc... as well for all tuples in the 100% visible
segment.

> There would be additional complexity in selectivity estimation and plan
> costing. The above proposal allows dynamic segment exclusion, which
> cannot be assessed at planning time anyway, so suggestions welcome...

Hm.. that looks like a rather bad downside of an executor-only optimization.

> Comparison with other Partitioning approaches
> ---------------------------------------------
>
> Once I realised this was possible in fairly automatic way, I've tried
> hard to keep away from manual overrides, commands and new DDL.
>
> Declarative partitioning is a big overhead, though worth it for large
> databases. No overhead is *much* better though.
>
> This approach to partitioning solves the following challenges
> - allows automated table extension, so works automatically with Slony
> - responds dynamically to changing data
> - allows stable functions, nested loop joins and parametrised queries
> - allows RI via SHARE locks
> - avoids the need for operator push-down through Append nodes
> - allows unique indexes
> - allows both global indexes (because we only have one table)
> - allows advanced planning/execution using read-only/visible data
> - works naturally with synchronous scans and buffer recycling
>
> All of the above are going to take considerably longer to do in any of
> the other ways I've come up with so far...

I fully agree. But as I tried to point out above, the gains in
manageability from Segment Exclusion are also pretty close to zero. So
I'd argue they only fulfill parts of the needs for general horizontal
partitioning.

> This technique would be useful for any table with historical data keyed
> by date or timestamp. It would also be useful for data where a
> time-of-insert component is implicit, such as many major entity tables
> where the object ids are assigned by a sequence. e.g. an Orders table
> with an OrderId as PK. Once all orders placed in a period have been
> shipped/resolved/closed then the segments will be marked read-only.

Agreed. Just a minor note: I find "marked read-only" too strong, as it
implies an impossibility to write. I propose speaking about mostly-read
segments, or optimized for reading or similar.

> Its not really going to change the code path much for small tables, yet
> seems likely to work reasonably well for large tables of all shapes and
> sizes.

That sounds a bit too optimistic to me. For Segment Exclusion, it takes
only *one* tuple to enlarge the min/max range dramatically in any
direction. So it's not the overall correlation between column values and
storage location, but rather only the min/max column values which
matter. Have you ever checked, how well these min/max values correlate
with the segment number?

Pretty much the same argument applies to SVM: an update to only one
tuple in a segment is enough to remove the optimization for reading
(EFFECTIVE_READ_ONLY property) for the segment. The assumption here
(being that updates happen mostly to newer segments) is not quite the
same as above.

Maybe a combination with CLUSTERing would be worthwhile? Or even
enforced CLUSTERing for the older segments?

> If a segment is being updated, we leave it alone, and maybe never
> actually set the visibility map at all. So overall, this idea seems to
> cover the main use case well, yet with only minimal impact on the
> existing use cases we support.

Yup.

> As before, I will maintain this proposal on the PG developer Wiki, once
> we get down to detailed design.

Cool.

Thanks for working out yet another great proposal. I hope to have been
of help with my questions and remarks. ;-)

Regards

Markus

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Sam Mason <sam(at)samason(dot)me(dot)uk>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 12:39:42
Message-ID:	477E290E.2040100@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Simon Riggs wrote:
> - any Fact table where measurements/observations/events are accumulated
> e.g.
> Web Hits (any Internet events)
> Call Detail Records
> Sales
> Security Events
> Scientific Measurements
> Process Control
>
> - any Major Entity where new entities are created from a sequence
> e.g.
> Orders, OrderItems
> Invoices
> Shipments, Returns
> most SCM/DCM events

...and only changed very seldom after a while, I would add. Because
changing an old tuple would invalidate the optimization for the affected
segment.

That's why this optimization can't help for inventory tables, where an
id might correlate with time and storage location, but writing access
doesn't correlate with storage location (segment number) and time.

Regards

Markus

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Richard Huxton <dev(at)archonet(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 12:49:27
Message-ID:	477E2B57.9030708@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Simon Riggs wrote:
> The smaller the partition size the greater the overhead of managing it.
> Also I've been looking at read-only tables and compression, as you may
> know. My idea was that in the future we could mark segments as either
> - read-only
> - compressed
> - able to be shipped off to hierarchical storage
>
> Those ideas work best if the partitioning is based around the physical
> file sizes we use for segments.

As much as I'd like this simplification.. But I'm still thinking of
these segments as an implementation detail of Postgres, and not
something the user should have to deal with.

Allowing the DBA to move segments to a different table space and giving
him the possibility to check which tuples are in which segment seems
awkward from a users perspective, IMO.

Regards

Markus

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 13:40:03
Message-ID:	1199454003.18598.87.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-01-04 at 13:29 +0100, Markus Schiltknecht wrote:

> Given that we are operating on segments here, to which the DBA has very
> limited information and access, I prefer the term "Segment Exclusion". I
> think of that as an optimization of sequential scans on tables with the
> above mentioned characteristics.
>
> > If we do need to differentiate between the two proposals, we can refer
> > to this one as the Segment Visibility Map (SVM).
>
> I'm clearly in favor of separating between the two proposals. SVM is a
> good name, IMHO.

OK, I'll refer to this as proposal as SVM.

I think that's generally true. We already have that problem with planned
statements and work_mem, for example, and parameterised query planning
is a difficult problem. Stable functions are already estimated at plan
time, so we hopefully should be getting that right. I don't see any show
stoppers here, just more of the usual problems of query optimization.

> > Comparison with other Partitioning approaches
> > ---------------------------------------------
> >
> > Once I realised this was possible in fairly automatic way, I've tried
> > hard to keep away from manual overrides, commands and new DDL.
> >
> > Declarative partitioning is a big overhead, though worth it for large
> > databases. No overhead is *much* better though.
> >
> > This approach to partitioning solves the following challenges
> > - allows automated table extension, so works automatically with Slony
> > - responds dynamically to changing data
> > - allows stable functions, nested loop joins and parametrised queries
> > - allows RI via SHARE locks
> > - avoids the need for operator push-down through Append nodes
> > - allows unique indexes
> > - allows both global indexes (because we only have one table)
> > - allows advanced planning/execution using read-only/visible data
> > - works naturally with synchronous scans and buffer recycling
> >
> > All of the above are going to take considerably longer to do in any of
> > the other ways I've come up with so far...
>
> I fully agree. But as I tried to point out above, the gains in
> manageability from Segment Exclusion are also pretty close to zero. So
> I'd argue they only fulfill parts of the needs for general horizontal
> partitioning.

Agreed.

My focus for this proposal wasn't manageability, as it had been in other
recent proposals. I think there are some manageability wins to be had as
well, but we need to decide what sort of partitioning we want/need
first.

So in the case of SVM, enhanced manageability is really a phase 2 thing.

Plus, you can always combine a design with constraint and segment
exclusion.

> Maybe a combination with CLUSTERing would be worthwhile? Or even
> enforced CLUSTERing for the older segments?

I think there's merit in Heikki's maintain cluster order patch and that
should do an even better job of maintaining locality.

Thanks for detailed comments. I'll do my best to include all of the
viewpoints you've expressed as the design progresses.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 18:06:04
Message-ID:	20080104180604.GK821@crankycanuck.ca
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote:
>
> Agreed. Just a minor note: I find "marked read-only" too strong, as it
> implies an impossibility to write. I propose speaking about mostly-read
> segments, or optimized for reading or similar.

I do want some segments to be _marked_ read-only: I want attempted writes to
them to _fail_.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 19:38:24
Message-ID:	1199475504.18598.167.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote:
> On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote:
> >
> > Agreed. Just a minor note: I find "marked read-only" too strong, as it
> > implies an impossibility to write. I propose speaking about mostly-read
> > segments, or optimized for reading or similar.
>
> I do want some segments to be _marked_ read-only: I want attempted writes to
> them to _fail_.

I think Markus thought that we would mark them read only automatically,
which was not my intention. I believe its possible to have this in a way
that will make you both happy. Some more explanation:

There would be three different states for a segment:
1. read write
2. "optimized for reading", as Markus says it
3. marked read only by explicit command

Transition 1 -> 2 is by autovacuum under the SVM proposal, transition 2
-> 3 is by user command only. So throwing an ERROR is acceptable for
segments in state 3.

I came up with a complex scheme for going from 1 -> 3 previously, but I
don't think its needed any longer (for this, at least). It's trivial to
go from 2 -> 3 using an ALTER TABLE statement, along the lines of ALTER
TABLE .... WHERE ....

Files that are completely in state 3 can then be archived by a
hierarchical storage manager without problem.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 21:26:54
Message-ID:	477EA49E.8070401@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Simon Riggs wrote:
> On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote:
>> On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote:
>>> Agreed. Just a minor note: I find "marked read-only" too strong, as it
>>> implies an impossibility to write. I propose speaking about mostly-read
>>> segments, or optimized for reading or similar.

Hm.. yeah, after rereading, I realize that I've mixed up states no. 2
and 3 here, sorry.

>> I do want some segments to be _marked_ read-only: I want attempted writes to
>> them to _fail_.

Well, I can see use cases for marking tuples or complete relations as
read only. But segments?

I'm still puzzled about how a DBA is expected to figure out which
segments to mark. Simon, are you assuming we are going to pass on
segment numbers to the DBA one day?

If not, a more user friendly command like "MARK READ ONLY WHERE date <=
2006" would have to move tuples around between segments, so as to be
able to satisfy the split point exactly, right?

> I think Markus thought that we would mark them read only automatically,
> which was not my intention. I believe its possible to have this in a way
> that will make you both happy. Some more explanation:
>
> There would be three different states for a segment:
> 1. read write
> 2. "optimized for reading", as Markus says it
> 3. marked read only by explicit command

Thanks for clarification.

Regards

Markus

From:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 21:40:45
Message-ID:	20080104214045.GF821@crankycanuck.ca
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote:
>
> I'm still puzzled about how a DBA is expected to figure out which
> segments to mark.

I think that part might be hand-wavy still. But once this facility is
there, what's to prevent the current active segment (and the rest) from also
getting this mark, which would mean "the table is read only"?

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-04 22:01:31
Message-ID:	1199484091.18598.188.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote:

> I'm still puzzled about how a DBA is expected to figure out which
> segments to mark. Simon, are you assuming we are going to pass on
> segment numbers to the DBA one day?

No Way!

That would stop Richard's idea to make the segment stride configurable,
apart from being a generally ugly thing.

> If not, a more user friendly command like "MARK READ ONLY WHERE date <=
> 2006" would have to move tuples around between segments, so as to be
> able to satisfy the split point exactly, right?

Yes, just a simple WHERE clause that we can translate into segments
under the covers. It would be an alter table, so we get an exclusive
lock.

ALTER TABLE foo SET READ ONLY WHERE ....

possibly with a few restrictions on the WHERE clause. Anyway this is
just futures and dreams, so far, so lets just say something like that is
possible in the future and work out more when we pass the first few
hurdles.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-05 03:31:14
Message-ID:	200801042231.15295.xzilla@users.sourceforge.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Friday 04 January 2008 17:01, Simon Riggs wrote:
> On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote:
> > I'm still puzzled about how a DBA is expected to figure out which
> > segments to mark. Simon, are you assuming we are going to pass on
> > segment numbers to the DBA one day?
>
> No Way!
>
> That would stop Richard's idea to make the segment stride configurable,
> apart from being a generally ugly thing.
>
> > If not, a more user friendly command like "MARK READ ONLY WHERE date <=
> > 2006" would have to move tuples around between segments, so as to be
> > able to satisfy the split point exactly, right?
>
> Yes, just a simple WHERE clause that we can translate into segments
> under the covers. It would be an alter table, so we get an exclusive
> lock.
>
> ALTER TABLE foo SET READ ONLY WHERE ....
>
> possibly with a few restrictions on the WHERE clause. Anyway this is
> just futures and dreams, so far, so lets just say something like that is
> possible in the future and work out more when we pass the first few
> hurdles.

Not to be negative, but istm how this feature would be managed is as important
as the bits under the hood. Or at least we have to believe there will be
some practical way to manage this, which as of yet I am skeptical.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-05 09:33:45
Message-ID:	1199525625.18598.229.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-01-04 at 22:31 -0500, Robert Treat wrote:

> Not to be negative, but istm how this feature would be managed is as important
> as the bits under the hood.

Agreed. On this part of the thread, we've been discussing an extension
to the basic proposal, which is why I have not been concentrating there.

Core management wise, the basic proposal showed how we would be able to
have VACUUM run much faster than before and how DELETE will also be
optimised naturally by this approach. Loading isn't any slower than it
is now; loading does need some work, but that's another story.

> Or at least we have to believe there will be
> some practical way to manage this, which as of yet I am skeptical.

Skepticism is OK, but I'd like to get your detailed thoughts on this.
I've been an advocate of the multi-tables approach now for many years,
so I don't expect everybody to switch their beliefs on my say-so
overnight. Let me make a few more comments in this area:

The main proposal deliberately has few, if any, knobs and dials. That's
a point of philosophy that I've had views on previously: my normal
stance is that we need some knobs to allow the database to be tuned to
individual circumstances.

In this case, partitioning is way too complex to administer effectively
and requires application changes that make it impossible to use for
packaged applications. The latest Oracle TPC-H benchmark uses 10 pages
of DDL to set it up and if I can find a way to avoid that, I'd recommend
it to all. I do still want some knobs and dials, just not 10 pages
worth, though I'd like yours and others' guidance on what those should
be. Oracle have been responding to feedback with their new interval
partitioning, but its still a multi-table approach in essence.

My observation of partitioned databases is that they all work
beautifully at the design stage, but problems emerge over time. A
time-based range partitioned table can often have different numbers of
rows per partition, giving inconsistent response times. A
height-balanced approach where we make the partitions all the same size,
yet vary the data value boundaries will give much more consistent query
times and can be completely automated much more easily.

The SVM concept doesn't cover everything that you can do with
partitioning, but my feeling is it covers the main use cases well. If
that's not true, in broad strokes or in the detail, then we need to
uncover that. Everybody's help in doing that is appreciated, whatever
the viewpoint and whatever the outcome.

It's probably worth examining existing applications to see how well they
would migrate to segmented tables approach. The following query will
analyse one column of a table to produce a list of boundary values,
given a segment size of 131072 blocks (1 GB).

select
substr(ctid::text,2,strpos(ctid::text,',')-2)::integer/131072 as seg,
min(PrimaryKey), max(PrimaryKey)
from bigtable
group by seg;

We should be able to see whether this works for existing use cases, or
not fairly easily.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-05 10:13:12
Message-ID:	477F5838.7010904@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Sullivan wrote:
> On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote:
>> I'm still puzzled about how a DBA is expected to figure out which
>> segments to mark.
>
> I think that part might be hand-wavy still. But once this facility is
> there, what's to prevent the current active segment (and the rest) from also
> getting this mark, which would mean "the table is read only"?

Well, sure, marking *all* segments read-only is pretty easy. But that
was not quite my point.

Regards

Markus

From:	tomas(at)tuxteam(dot)de
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-05 12:45:47
Message-ID:	20080105124547.GA17275@www.trapp.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sat, Jan 05, 2008 at 09:33:45AM +0000, Simon Riggs wrote:

[...]

One thought I had back then, with partitioned tables was "gee -- B-tree
index is already doing a partition; why do a manual partition on top of
that?".

It's an exhilarating experience to watch you making a design before I
could even spell this nebulous and unclear thougt :-)

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFHf3v7Bcgs9XrR2kYRApP7AJwIW1cQwdK99fl+uzDelGEaqZFnEgCfUNjl
y0zGzkhf6qel/FC+rArxQu8=
=FZuv
-----END PGP SIGNATURE-----

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	tomas(at)tuxteam(dot)de
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-05 15:04:54
Message-ID:	477F9C96.1020309@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

tomas(at)tuxteam(dot)de wrote:
>> The main proposal deliberately has few, if any, knobs and dials. That's
>> a point of philosophy that I've had views on previously: my normal
>> stance is that we need some knobs to allow the database to be tuned to
>> individual circumstances.
>
> One thought I had back then, with partitioned tables was "gee -- B-tree
> index is already doing a partition; why do a manual partition on top of
> that?".

Well, that's why I'm so focused on manageability: I think the primary
purpose of partitioning is manageability (of the underlying storage for
a table).

Regards

Markus

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-05 15:42:28
Message-ID:	477FA564.9010704@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Simon Riggs wrote:
> On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote:
>
>> I'm still puzzled about how a DBA is expected to figure out which
>> segments to mark. Simon, are you assuming we are going to pass on
>> segment numbers to the DBA one day?
>
> No Way!

Ah, I'm glad ;-)

Simon Riggs wrote:
> Skepticism is OK, but I'd like to get your detailed thoughts on this.
> I've been an advocate of the multi-tables approach now for many years,
> so I don't expect everybody to switch their beliefs on my say-so
> overnight. Let me make a few more comments in this area:

I've so far always thought about some sort of multi-relations approach
for partitioning, yes. Let's see if I can get my mind around
single-table partitioning.

> The main proposal deliberately has few, if any, knobs and dials. That's
> a point of philosophy that I've had views on previously: my normal
> stance is that we need some knobs to allow the database to be tuned to
> individual circumstances.
>
> In this case, partitioning is way too complex to administer effectively
> and requires application changes that make it impossible to use for
> packaged applications. The latest Oracle TPC-H benchmark uses 10 pages
> of DDL to set it up and if I can find a way to avoid that, I'd recommend
> it to all. I do still want some knobs and dials, just not 10 pages
> worth, though I'd like yours and others' guidance on what those should
> be. Oracle have been responding to feedback with their new interval
> partitioning, but its still a multi-table approach in essence.

I can absolutely support your efforts to minimize knobs and
configuration DDL. However, my current feeling is, that segments based
partitioning complicates things, because the DBA doesn't have tools and
commands to handle segments.

To satisfy all the different requirements of partitioning with segments
based partitioning, we'd have to allow a table to span multiple table
spaces. I'm not very keen on going that way.

However, what I certainly like is the automated split point definition.
Instead of having to create tables by hand and "linking" them via
inheritance and constraint exclusion, I have something very similar in
mind, like what you proposed for marking read-only segments. Something like:

SPLIT TABLE customers AT cust_name > 'n';

or:

SPLIT TABLE inventory AT inv_id % 4 >= 2;

In my imagination, this should automatically create the underlying
relations, i.e.:

NOTICE: relations inventory__l and inventory__r have been created.

That way, the DBA could then handle those like normal relations,
querying them or moving them to different table spaces like all other
normal relations.

In a way, that's not so different from possible extensions on top of
Segment Exclusion, except that the DBA additionally get a relation name
to be able to address the set of segments which form a partition. Or put
it the other way around: go for Segment Exclusion, but add some sort of
a sentinel relation for each set of segments, to make them reachable for
the DBA.

> My observation of partitioned databases is that they all work
> beautifully at the design stage, but problems emerge over time. A
> time-based range partitioned table can often have different numbers of
> rows per partition, giving inconsistent response times. A
> height-balanced approach where we make the partitions all the same size,
> yet vary the data value boundaries will give much more consistent query
> times and can be completely automated much more easily.

Uh.. well, consistent query time isn't the first thing I'm expecting
from partitioning by time ranges. If I wanted consistent query times I'd
rather use hash partition or something, no?

I'd even state, that one *wants* inconsistent response times when using
time based range partitioning, by moving old, seldom used data to slower
storage and keeping only a small amount of often used tuples on the
faster disks, for example.

> The SVM concept doesn't cover everything that you can do with
> partitioning, but my feeling is it covers the main use cases well.

As I regard manageability to be the main advantage of partitioning,
which you've intentionally left out for now, I disagree here.

How could SVM or Segment Exclusion potentially be covering what hash
partitioning does? Maybe together with the ability to store different
segments of a table on different table spaces. That could be considered
an approach to range partitioning. But then, that would be the
partitioning, and not SVM or Segment Exclusion. To me, both of SVM and
SE look much more like an optimization for certain special cases and
don't have much to do with partitioning.

Regards

Markus

From:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-05 16:59:46
Message-ID:	200801051159.46527.xzilla@users.sourceforge.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Saturday 05 January 2008 10:42, Markus Schiltknecht wrote:
> > The main proposal deliberately has few, if any, knobs and dials. That's
> > a point of philosophy that I've had views on previously: my normal
> > stance is that we need some knobs to allow the database to be tuned to
> > individual circumstances.
> >
> > In this case, partitioning is way too complex to administer effectively
> > and requires application changes that make it impossible to use for
> > packaged applications. The latest Oracle TPC-H benchmark uses 10 pages
> > of DDL to set it up and if I can find a way to avoid that, I'd recommend
> > it to all. I do still want some knobs and dials, just not 10 pages
> > worth, though I'd like yours and others' guidance on what those should
> > be. Oracle have been responding to feedback with their new interval
> > partitioning, but its still a multi-table approach in essence.
>
> I can absolutely support your efforts to minimize knobs and
> configuration DDL. However, my current feeling is, that segments based
> partitioning complicates things, because the DBA doesn't have tools and
> commands to handle segments.
>

Personally I cant say it complicates things, because it isn't clear how it
will be managed. :-)

> To satisfy all the different requirements of partitioning with segments
> based partitioning, we'd have to allow a table to span multiple table
> spaces. I'm not very keen on going that way.
>

Why?

> However, what I certainly like is the automated split point definition.
> Instead of having to create tables by hand and "linking" them via
> inheritance and constraint exclusion, I have something very similar in
> mind, like what you proposed for marking read-only segments. Something
> like:
>
> SPLIT TABLE customers AT cust_name > 'n';
>
> or:
>
> SPLIT TABLE inventory AT inv_id % 4 >= 2;
>
> In my imagination, this should automatically create the underlying
> relations, i.e.:
>
> NOTICE: relations inventory__l and inventory__r have been created.
>
> That way, the DBA could then handle those like normal relations,
> querying them or moving them to different table spaces like all other
> normal relations.
>
> In a way, that's not so different from possible extensions on top of
> Segment Exclusion, except that the DBA additionally get a relation name
> to be able to address the set of segments which form a partition. Or put
> it the other way around: go for Segment Exclusion, but add some sort of
> a sentinel relation for each set of segments, to make them reachable for
> the DBA.
>

So the one thing that always scares me about these "define it all and let the
database sort it out" methods is they seem to lead to cases where the system
ends up rewriting the data to fit into some new partition layout. One thing
that is nice about the current partitioning scheme is you can control the
impact of this behavior in these scenarios, but moving around small portions
of the table at a time.

> > My observation of partitioned databases is that they all work
> > beautifully at the design stage, but problems emerge over time. A
> > time-based range partitioned table can often have different numbers of
> > rows per partition, giving inconsistent response times. A
> > height-balanced approach where we make the partitions all the same size,
> > yet vary the data value boundaries will give much more consistent query
> > times and can be completely automated much more easily.
>
> Uh.. well, consistent query time isn't the first thing I'm expecting
> from partitioning by time ranges. If I wanted consistent query times I'd
> rather use hash partition or something, no?
>

More to the point (I think) is that people define access to the data based on
the meaning of the data, not how it is stored on disk. For example, in some
tables we only need to be active on 1 months worth of data... how that is
laid out on disk (# partitions, which tablespaces) is a means to the end of
working actively on 1 months worth of data. I can't think of many cases where
people would actually say the want to work actively on the most recent GB of
data.

> I'd even state, that one *wants* inconsistent response times when using
> time based range partitioning, by moving old, seldom used data to slower
> storage and keeping only a small amount of often used tuples on the
> faster disks, for example.
>

Yes, we do this quite a bit; and at least one of our partitioned tables (in
total) is larger in size than our tablespace with the fastest disks.

> > The SVM concept doesn't cover everything that you can do with
> > partitioning, but my feeling is it covers the main use cases well.
>
> As I regard manageability to be the main advantage of partitioning,
> which you've intentionally left out for now, I disagree here.
>
> How could SVM or Segment Exclusion potentially be covering what hash
> partitioning does? Maybe together with the ability to store different
> segments of a table on different table spaces. That could be considered
> an approach to range partitioning. But then, that would be the
> partitioning, and not SVM or Segment Exclusion. To me, both of SVM and
> SE look much more like an optimization for certain special cases and
> don't have much to do with partitioning.
>

Even if this were true, it might still be a useful optimization. One table I
am thinking of in particular in my system has one query we need to run across
partitions, which ends up doing a slew of bitmap index scans for all the
partitions. If using segment exclusion on it meant that I could get a global
index to help that query, I'd be happy.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-05 19:02:41
Message-ID:	477FD451.1010406@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Robert Treat wrote:
> Personally I cant say it complicates things, because it isn't clear how it
> will be managed. :-)

Well, management of relations is easy enough, known to the DBA and most
importantly: it already exists. Having to set up something which is
*not* tied to a relation complicates things just because it's an
additional concept.

But as I've pointed out, maybe what we have in mind isn't that different
at all. Just have a sentinel relation mean a set of segments, i.e. all
read-only segments of a table. Then again, a table - in a way - is not
much else than a set of segments. So where's the real difference?

>> To satisfy all the different requirements of partitioning with segments
>> based partitioning, we'd have to allow a table to span multiple table
>> spaces. I'm not very keen on going that way.
>>
>
> Why?

Uh.. if a table (RELKIND_RELATION) can only span one table space, as it
is now, all of its segments are in the same table space. I don't quite
call that partitioning. Well, sure, you could call it so, but then, each
and every Postgres table is already partitioned in 1G segments.

It all depends on the definitions, but in my world, horizontal
partitioning for databases involves multiple table spaces (and is quite
useless without that). Calling anything else partitioning is confusing, IMO.

> So the one thing that always scares me about these "define it all and let the
> database sort it out" methods is they seem to lead to cases where the system
> ends up rewriting the data to fit into some new partition layout.

That holds true no matter if you shuffle between segments or relations.
To be able to let the DBA define an exact split point, the database
*will* have to shuffle tuples around. Why does that scare you? It's a
regular database system's maintenance procedure.

> One thing
> that is nice about the current partitioning scheme is you can control the
> impact of this behavior in these scenarios, but moving around small portions
> of the table at a time.

Uh.. I'm not quite following. What "current partitioning scheme" are you
referring to?

Why should that not be possible with other schemes? Moving the split
point between two partitions involves moving tuples around, no matter if
you are going to move them between segments or between relations
building the partitions.

> More to the point (I think) is that people define access to the data based on
> the meaning of the data, not how it is stored on disk. For example, in some
> tables we only need to be active on 1 months worth of data... how that is
> laid out on disk (# partitions, which tablespaces) is a means to the end of
> working actively on 1 months worth of data. I can't think of many cases where
> people would actually say the want to work actively on the most recent GB of
> data.

Agreed. I'd say that's why the DBA needs to be able to define the split
point between partitions: only he knows the meaning of the data.

>> To me, both of SVM and
>> SE look much more like an optimization for certain special cases and
>> don't have much to do with partitioning.
>
> Even if this were true, it might still be a useful optimization.

Possibly, yes. To me, the use case seems pretty narrow, though. For
example it doesn't affect index scans much.

> One table I
> am thinking of in particular in my system has one query we need to run across
> partitions, which ends up doing a slew of bitmap index scans for all the
> partitions. If using segment exclusion on it meant that I could get a global
> index to help that query, I'd be happy.

As proposed, Segment Exclusion works only on exactly one table. Thus, if
you already have your data partitioned into multiple relations, it most
probably won't affect your setup much. It certainly has nothing to do
with what I understand by 'global index' (that's an index spanning
multiple tables, right?).

Regards

Markus

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	tomas(at)tuxteam(dot)de
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Treat" <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Andrew Sullivan" <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-05 19:42:32
Message-ID:	9362e74e0801051142p50210q4f4073e38b57f98f@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jan 5, 2008 6:15 PM, <tomas(at)tuxteam(dot)de> wrote:

>
> One thought I had back then, with partitioned tables was "gee -- B-tree
> index is already doing a partition; why do a manual partition on top of
> that?".
>
> Can you please explain more on what you are trying to say here?

Thanks,
Gokul.

From:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-05 21:30:04
Message-ID:	200801051630.04706.xzilla@users.sourceforge.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote:
> >> To satisfy all the different requirements of partitioning with segments
> >> based partitioning, we'd have to allow a table to span multiple table
> >> spaces. I'm not very keen on going that way.
> >
> > Why?
>
> Uh.. if a table (RELKIND_RELATION) can only span one table space, as it
> is now, all of its segments are in the same table space. I don't quite
> call that partitioning. Well, sure, you could call it so, but then, each
> and every Postgres table is already partitioned in 1G segments.
>
> It all depends on the definitions, but in my world, horizontal
> partitioning for databases involves multiple table spaces (and is quite
> useless without that). Calling anything else partitioning is confusing,
> IMO.

I'm not following this. If we can work out a scheme, I see no reason not to
allow a single table to span multiple tablespaces. Do you see a problem with
that?

>
> > So the one thing that always scares me about these "define it all and let
> > the database sort it out" methods is they seem to lead to cases where the
> > system ends up rewriting the data to fit into some new partition layout.
>
> That holds true no matter if you shuffle between segments or relations.
> To be able to let the DBA define an exact split point, the database
> *will* have to shuffle tuples around. Why does that scare you? It's a
> regular database system's maintenance procedure.
>

It's a question of control.... see below.

> > One thing
> > that is nice about the current partitioning scheme is you can control the
> > impact of this behavior in these scenarios, but moving around small
> > portions of the table at a time.
>
> Uh.. I'm not quite following. What "current partitioning scheme" are you
> referring to?
>

The current, constraint exclusion/ inheritence based, partitioning we have now
in postgresql.

> Why should that not be possible with other schemes? Moving the split
> point between two partitions involves moving tuples around, no matter if
> you are going to move them between segments or between relations
> building the partitions.
>

The difference is that, if I currently have a table split by month, I
can "re-partition" it into weekly segments, and only shuffle one months data
at a time minimize impact on the system while I shuffle it. This can even be
used to do dynamic management, where data from the current month is archived
by day, data from the past year by week, and data beyond that done monthly.

On many other databases, if you change the partition scheme, it requires
exclusive locks and a shuffleing of all of the data, even data whose
partitions arent being redefined. Even worse are systems like mysql, where
you need to rewrite the indexes as well. To me, these requirements always
seem like show stoppers; I generally can't afford to lock a table while the
database rewrites a billion rows of data.

> > More to the point (I think) is that people define access to the data
> > based on the meaning of the data, not how it is stored on disk. For
> > example, in some tables we only need to be active on 1 months worth of
> > data... how that is laid out on disk (# partitions, which tablespaces) is
> > a means to the end of working actively on 1 months worth of data. I can't
> > think of many cases where people would actually say the want to work
> > actively on the most recent GB of data.
>
> Agreed. I'd say that's why the DBA needs to be able to define the split
> point between partitions: only he knows the meaning of the data.
>
> >> To me, both of SVM and
> >> SE look much more like an optimization for certain special cases and
> >> don't have much to do with partitioning.
> >
> > Even if this were true, it might still be a useful optimization.
>
> Possibly, yes. To me, the use case seems pretty narrow, though. For
> example it doesn't affect index scans much.
>
> > One table I
> > am thinking of in particular in my system has one query we need to run
> > across partitions, which ends up doing a slew of bitmap index scans for
> > all the partitions. If using segment exclusion on it meant that I could
> > get a global index to help that query, I'd be happy.
>
> As proposed, Segment Exclusion works only on exactly one table. Thus, if
> you already have your data partitioned into multiple relations, it most
> probably won't affect your setup much. It certainly has nothing to do
> with what I understand by 'global index' (that's an index spanning
> multiple tables, right?).
>

In a more general sense, a global index is a an index that spans multiple
partitions, as opposed to a local index, which is an index on specific
partitions; postgresql current supports the latter, not the former.

In any case, my thinking is if we had the segment exclusion technique, I could
convert that partitioned table into a regular table again, use segment
exclusion to handle what is currently handled by partitions, and create
a "global index" across all the other data for that other, currently killer,
query.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

From:	tomas(at)tuxteam(dot)de
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	tomas(at)tuxteam(dot)de, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-06 05:57:36
Message-ID:	20080106055736.GB32629@www.trapp.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, Jan 06, 2008 at 01:12:32AM +0530, Gokulakannan Somasundaram wrote:
> On Jan 5, 2008 6:15 PM, <tomas(at)tuxteam(dot)de> wrote:
>
> >
> > One thought I had back then, with partitioned tables was "gee -- B-tree
> > index is already doing a partition; why do a manual partition on top of
> > that?".

> Can you please explain more on what you are trying to say here?

Sure. A B-tree is just a device to partition something along some order.
If you have , say, a table of orders (to use the example upthread) and a
B-tree index on order date, this index partitions your set (at
recursively finer levels). Of course, you'd have to "sort" your data
alogn this index, but PostgreSQL knows how to do this trick: CLUSTER.

This was just a vague idea, many things were missing (for example to
separate out the more quiescent parts of the table into their own files)
which are spelled out in Simon Riggs' proposal.

This struck me when seeing people partition tables by hand -- and I was
delighted to actually watch Simon forging a real design.

Regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFHgG3QBcgs9XrR2kYRAgKhAJ93KUybgMfG07ta67DiR8bgAbHPrgCeOI2V
by/xeXKrDJ5O0JZHyFurego=
=R/vC
-----END PGP SIGNATURE-----

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Robert Treat" <xzilla(at)users(dot)sourceforge(dot)net>
Cc:	"Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Andrew Sullivan" <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-06 08:17:01
Message-ID:	9362e74e0801060017r1f6d5587wb8eac8906bf667e9@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jan 6, 2008 3:00 AM, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net> wrote:

> On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote:
> > >> To satisfy all the different requirements of partitioning with
> segments
> > >> based partitioning, we'd have to allow a table to span multiple table
> > >> spaces. I'm not very keen on going that way.
> > >
> > > Why?
> >
> > Uh.. if a table (RELKIND_RELATION) can only span one table space, as it
> > is now, all of its segments are in the same table space. I don't quite
> > call that partitioning. Well, sure, you could call it so, but then, each
> > and every Postgres table is already partitioned in 1G segments.
> >
> > It all depends on the definitions, but in my world, horizontal
> > partitioning for databases involves multiple table spaces (and is quite
> > useless without that). Calling anything else partitioning is confusing,
> > IMO.
>
> I'm not following this. If we can work out a scheme, I see no reason not
> to
> allow a single table to span multiple tablespaces. Do you see a problem
> with
> that?
>

If we can span two different partitions under two tablespaces, it would
help concepts like parallel queries, parallel updates etc in data
warehousing. If we allow a single table to span across multiple tablespaces,
the question arises how we would be able to move around different parts of
the table,as we like. Segments, i believe doesn't let the user/DBA draw the
split-points.

>
> In a more general sense, a global index is a an index that spans multiple
> partitions, as opposed to a local index, which is an index on specific
> partitions; postgresql current supports the latter, not the former.
>
> In any case, my thinking is if we had the segment exclusion technique, I
> could
> convert that partitioned table into a regular table again, use segment
> exclusion to handle what is currently handled by partitions, and create
> a "global index" across all the other data for that other, currently
> killer,
> query.
>
>
> That's a good idea. Local index , then would be equivalent to partial
indexes. But single Table partition maintenance might not be as flexible as
the multi-table partition. In real partitioning, dropping a single partition
should not affect the local indexes of other partitions. But i think that
would be very difficult to implement under this scheme.

Thanks,
Gokul.

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	tomas(at)tuxteam(dot)de
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Treat" <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Andrew Sullivan" <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-06 08:24:02
Message-ID:	9362e74e0801060024j4e6c3ee4maf189eabd3ab15b1@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jan 6, 2008 11:27 AM, <tomas(at)tuxteam(dot)de> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Sun, Jan 06, 2008 at 01:12:32AM +0530, Gokulakannan Somasundaram wrote:
> > On Jan 5, 2008 6:15 PM, <tomas(at)tuxteam(dot)de> wrote:
> >
> > >
> > > One thought I had back then, with partitioned tables was "gee --
> B-tree
> > > index is already doing a partition; why do a manual partition on top
> of
> > > that?".
>
> > Can you please explain more on what you are trying to say here?
>
> Sure. A B-tree is just a device to partition something along some order.
> If you have , say, a table of orders (to use the example upthread) and a
> B-tree index on order date, this index partitions your set (at
> recursively finer levels).

But the current index scans - Index Scan and Bitmap Index Scan, doesn't
provide the exact benefit of partitioning, even if all the columns are
covered by the index. It does a lot more disk reads than the partitioning
scheme. I think you are looking for something like Block indexes in
Multi-dimensional Clusters in DB2. Heikki did something like that in a more
subtle way.

Postgresql Clusters, as you may know doesn't maintain the order with
inserts. We might go for Index Organized Tables/Clustered indexes. But then,
B-tree would give lot of performance problems, if the Index Tuple size
increases beyond a certain point.

Thanks,
Gokul.

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	tomas(at)tuxteam(dot)de, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-06 10:39:47
Message-ID:	4780AFF3.7070904@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Gokulakannan Somasundaram wrote:
>
>
> On Jan 5, 2008 6:15 PM, <tomas(at)tuxteam(dot)de <mailto:tomas(at)tuxteam(dot)de>> wrote:
>
>
> One thought I had back then, with partitioned tables was "gee -- B-tree
> index is already doing a partition; why do a manual partition on top of
> that?".
>
> Can you please explain more on what you are trying to say here?

I think this has to do with SE not being of much use for index scans. Or
put it another way: SE is an optimization for sequential scans. For
tables where it works well, it could possibly replace the index entirely.

Without the index, you would rely on SE to always be able to exclude
enough segments, so that the seq scan is less expensive than an index
scan with the following table lookups.

With an index, the planner gets a hard time deciding between the index
scan and the (possibly SE optimized) seq scan.

Regards

Markus

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-06 10:48:51
Message-ID:	4780B213.7070303@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Robert Treat wrote:
> On Saturday 05 January 2008 14:02, Markus Schiltknecht wrote:
>>>> To satisfy all the different requirements of partitioning with segments
>>>> based partitioning, we'd have to allow a table to span multiple table
>>>> spaces. I'm not very keen on going that way.
>>> Why?
>> Uh.. if a table (RELKIND_RELATION) can only span one table space, as it
>> is now, all of its segments are in the same table space. I don't quite
>> call that partitioning. Well, sure, you could call it so, but then, each
>> and every Postgres table is already partitioned in 1G segments.
>>
>> It all depends on the definitions, but in my world, horizontal
>> partitioning for databases involves multiple table spaces (and is quite
>> useless without that). Calling anything else partitioning is confusing,
>> IMO.
>
> I'm not following this. If we can work out a scheme, I see no reason not to
> allow a single table to span multiple tablespaces. Do you see a problem with
> that?

Uhm... well, no. I was just pointing out that it's a requirement. It
depends on how you define things, but I'm seeing it that way:

table -- 1:n -- partition -- 1:1 -- table space -- 1:n -- segments

What I'm advocating is making partitions available to the DBA as some
kind of a relation, she can query separately and move around between
table spaces.

>> Why should that not be possible with other schemes? Moving the split
>> point between two partitions involves moving tuples around, no matter if
>> you are going to move them between segments or between relations
>> building the partitions.
>
> The difference is that, if I currently have a table split by month, I
> can "re-partition" it into weekly segments, and only shuffle one months data
> at a time minimize impact on the system while I shuffle it. This can even be
> used to do dynamic management, where data from the current month is archived
> by day, data from the past year by week, and data beyond that done monthly.

This should be possible for both schemes, I see no connection to what
we've discussed. SE doesn't magically give you this level of control you
are requesting here. Quite the opposite: referring to CLUSTERing to
makes me wonder, if that's not going to shuffle way too many tuples around.

What I'm saying is, that SE doesn't partition the segments into
different table spaces. Thus I don't consider it "database partitioning"
in the first place. As I currently understand it, it's:

table -- 1:1 -- table space -- 1:n -- partitions -- 1:n -- segments

> On many other databases, if you change the partition scheme, it requires
> exclusive locks and a shuffleing of all of the data, even data whose
> partitions arent being redefined. Even worse are systems like mysql, where
> you need to rewrite the indexes as well. To me, these requirements always
> seem like show stoppers; I generally can't afford to lock a table while the
> database rewrites a billion rows of data.

I fully agree here. How do you plan to solve that problem on top of SE?

> In a more general sense, a global index is a an index that spans multiple
> partitions, as opposed to a local index, which is an index on specific
> partitions; postgresql current supports the latter, not the former.
>
> In any case, my thinking is if we had the segment exclusion technique, I could
> convert that partitioned table into a regular table again,

... on a single table space ...

> use segment
> exclusion to handle what is currently handled by partitions,

... except, that there is no partitioning (!?!) (between table spaces)

> and create
> a "global index" across all the other data for that other, currently killer,
> query.

I thought the table you are referring to is bigger than your fastest
table space? That would even make it impossible.

See where I'm coming from? And why I'm stating that SE is an
optimization (for seq scans), but not partitioning?

Regards

Markus

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Markus Schiltknecht" <markus(at)bluegap(dot)ch>
Cc:	tomas(at)tuxteam(dot)de, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Treat" <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, "Andrew Sullivan" <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-06 11:13:47
Message-ID:	9362e74e0801060313yfb55391y60cb271eb37a34fd@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jan 6, 2008 4:09 PM, Markus Schiltknecht <markus(at)bluegap(dot)ch> wrote:

> Hi,
>
> Gokulakannan Somasundaram wrote:
> >
> >
> > On Jan 5, 2008 6:15 PM, <tomas(at)tuxteam(dot)de <mailto:tomas(at)tuxteam(dot)de>>
> wrote:
> >
> >
> > One thought I had back then, with partitioned tables was "gee --
> B-tree
> > index is already doing a partition; why do a manual partition on top
> of
> > that?".
> >
> > Can you please explain more on what you are trying to say here?
>
> I think this has to do with SE not being of much use for index scans. Or
> put it another way: SE is an optimization for sequential scans. For
> tables where it works well, it could possibly replace the index entirely.
>
> Without the index, you would rely on SE to always be able to exclude
> enough segments, so that the seq scan is less expensive than an index
> scan with the following table lookups.
>
> With an index, the planner gets a hard time deciding between the index
> scan and the (possibly SE optimized) seq scan.

That's a good point. But i think Simon is planning not to give the job to
the planner, but to the executor. So SE optimization will come into play,
only when planner has decided on Sequential scan.

From:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-06 17:25:55
Message-ID:	200801061225.56512.xzilla@users.sourceforge.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sunday 06 January 2008 05:48, Markus Schiltknecht wrote:
> What I'm saying is, that SE doesn't partition the segments into
> different table spaces. Thus I don't consider it "database partitioning"
> in the first place. As I currently understand it, it's:
>
> table -- 1:1 -- table space -- 1:n -- partitions -- 1:n -- segments
>

Ah, was a slight misunderstanding of terminology. I see what your getting at.

> > On many other databases, if you change the partition scheme, it requires
> > exclusive locks and a shuffleing of all of the data, even data whose
> > partitions arent being redefined. Even worse are systems like mysql,
> > where you need to rewrite the indexes as well. To me, these requirements
> > always seem like show stoppers; I generally can't afford to lock a table
> > while the database rewrites a billion rows of data.
>
> I fully agree here. How do you plan to solve that problem on top of SE?
>
> > In a more general sense, a global index is a an index that spans multiple
> > partitions, as opposed to a local index, which is an index on specific
> > partitions; postgresql current supports the latter, not the former.
> >
> > In any case, my thinking is if we had the segment exclusion technique, I
> > could convert that partitioned table into a regular table again,
>
> .... on a single table space ...
>
> > use segment
> > exclusion to handle what is currently handled by partitions,
>
> .... except, that there is no partitioning (!?!) (between table spaces)
>
> > and create
> > a "global index" across all the other data for that other, currently
> > killer, query.
>
> I thought the table you are referring to is bigger than your fastest
> table space? That would even make it impossible.
>

Nope, different table :-) Although the above global/local one would probably
live entirely on the slow disks, using SE and global index to satisfy the
queries. And really our slow disks aren't exactly slow, but do have poor
concurrency, so we put primarily read data on them to keep writes to a
minimum.

> See where I'm coming from? And why I'm stating that SE is an
> optimization (for seq scans), but not partitioning?
>

Yes, but aiui, SE should allow seq scans to achieve performance similar to
partitioning, especially if the planner can exclude segments based on values
in the data.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

From:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-07 11:14:43
Message-ID:	1199704483.16301.106.camel@PCD12478
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2008-01-02 at 17:56 +0000, Simon Riggs wrote:
> Like it?

Very cool :-)

One additional thought: what about a kind of "segment fill factor" ?
Meaning: each segment has some free space reserved for future
updates/inserts of records in the same range of it's partitioning
constraint. And when inserting/updating you put the new record into the
corresponding segment... just like a very coarse clustering. Then you
could vacuum the segments separately to keep the free space not running
out. For active segments you would then fix the partitioning constraint
range once the fill factor is reached, to allow for keeping it's
constraint even when heavily updating (heavily vacuuming it too as
response to that), and create a new segment for the unbounded range for
new inserts... this would work fine for tables where the constraint is
based on ever increasing keys and accidental inserts in old ranges
(which do happen occasionally in real life).

When the change rate of old segments get down, the segments could be
reorganized to have a smaller fill factor, so that you still allow for
accidental updates but keep space usage efficient. This would be some
similar action as a clustering, but hopefully not blocking (which might
be a hard thing to do)... and later again you could mark some of the
really old things as read only and put them in special segments with no
wasted space.

One problem would be when the segment's free space runs out, so you must
put records from the same constraint range in multiple segments - but
that could still work, you just would have multiple segments covering
the same range, but if the "segment fill factor" is chosen properly it
should not be the case... you could normally maintain a set of
non-overlapping segments in terms of the partitioning constraint.

Cheers,
Csaba.

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-07 12:59:01
Message-ID:	47822215.9080200@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Csaba,

Csaba Nagy wrote:
> One additional thought: what about a kind of "segment fill factor" ?
> Meaning: each segment has some free space reserved for future
> updates/inserts of records in the same range of it's partitioning
> constraint. And when inserting/updating you put the new record into the
> corresponding segment... just like a very coarse clustering.

Hm.. yeah. That way, a few writes to a "read optimized" segment could be
accepted, without having to drop the optimization immediately. And the
other way around: generally prevent having to drop the optimization by
forcing tuples to be written to a segment with matching min/max tuples.
Although, that's not exactly trivial, I think.

However, for tables which don't fit the use case of SE, people certainly
don't want such a fill factor to bloat their tables.

Regards

Markus

From:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-07 13:12:53
Message-ID:	1199711573.5646.5.camel@PCD12478
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-01-07 at 13:59 +0100, Markus Schiltknecht wrote:
> However, for tables which don't fit the use case of SE, people certainly
> don't want such a fill factor to bloat their tables.

Sure, but it could be configurable and should only be enabled if the
table is marked as partitioned on some condition... I think it would be
a bad idea anyway if the DB would start partitioning on some arbitrary
criteria based on analyzing he table, so the DBA should be the one to
decide on what criteria to partition. In particular it could be a bad
idea on occasions to partition on the clustering criteria for tables
which were clustered once.

Cheers,
Csaba.

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-07 13:20:02
Message-ID:	47822702.5040005@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Csaba Nagy wrote:
> Sure, but it could be configurable and should only be enabled if the
> table is marked as partitioned on some condition...

As I'm regarding SE as an optimization, I disagree here.. As all
optimizations, SE should conceptually be reasonably close to cost-less
when unneeded.

> I think it would be
> a bad idea anyway if the DB would start partitioning on some arbitrary
> criteria based on analyzing he table, so the DBA should be the one to
> decide on what criteria to partition.

I absolutely agree for real partitioning, which targets manageability of
table partitions.

> In particular it could be a bad
> idea on occasions to partition on the clustering criteria for tables
> which were clustered once.

Why is that? AFAIUI, Segment Exclusion combines perfectly well with
clustering. Or even better, with an upcoming feature to maintain
clustered ordering. Where do you see disadvantages such an optimization
for sequential scans?

Regards

Markus

From:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-07 13:32:03
Message-ID:	1199712723.5646.14.camel@PCD12478
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-01-07 at 14:20 +0100, Markus Schiltknecht wrote:
> Why is that? AFAIUI, Segment Exclusion combines perfectly well with
> clustering. Or even better, with an upcoming feature to maintain
> clustered ordering. Where do you see disadvantages such an optimization
> for sequential scans?

Well, as I understood it, this would be some kind of special case of
clustering, where the cluster key is expected to be ever increasing in
time and new rows would not be randomly distributed over the complete
possible range. In theory you could also have each segment in turn be
clustered on some other criteria than the partitioning criteria so
indexed access could also be better on the main selection criteria which
could be different than the partitioning criteria. All this is of course
just speculations - but I guess that's what you expected too in this
discussion :-)

Cheers,
Csaba.

From:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-07 15:41:46
Message-ID:	20080107154146.GA18581@crankycanuck.ca
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Jan 05, 2008 at 08:02:41PM +0100, Markus Schiltknecht wrote:
> Well, management of relations is easy enough, known to the DBA and most
> importantly: it already exists. Having to set up something which is
> *not* tied to a relation complicates things just because it's an
> additional concept.

But we're already dealing with some complicated concepts.

There isn't anything that will prevent current-style partitioning strategies
from continuing to work in the face of Simon's proposal. But let me see if
I can outline the sort of cases where I see real value in what he's
outlined.

There is a tendency in data systems to gather all manner of data that, in
retrospect, _might_ turn out to be be valuable; but which, at the time, is
not really valuable at all. Moreover, the value later on might be
relatively low: if you can learn something much later from that data, and do
so easily, then it will be worth doing. But if the work involved passes
some threshold (say 1/2 a day), it's suddenly not worth it any more. It's
simple economics: below a certain cost, the data is valuable. Above a
certain cost, you simply shouldn't keep the data in the first place, because
the cost of using it is higher than any value you'll likely be able to
extract.

Simon's proposal changes the calculations you have to do. If keeping some
data online longer does not impose administrative or operational overhead
(you have it marked read only, so there's no I/O for vacuum; you don't need
to do anything to get the data marked read only; &c.), then all it costs is
a little more disk, which is relatively cheap these days. More importantly,
if the longer-term effect of this strategy is to make it possible to move
such data offline _without imposing a big cost_ when moving it back online,
then the value is potentially very high.

Without even trying, I can think of a dozen examples in the past 5 years
where I could have used that sort of functionality. Because the cost of
data retrieval was high enough, we had to decide that the question wasn't
worth answering. Some of those answers might have been quite valuable
indeed to the Internet community, to be frank; but because I had to pay the
cost without getting much direct benefit, it just wasn't worth the effort.

The thing about Simon's proposal that is beguiling is that it is aimed at
a very common use pattern. The potential for automatic management under
such a use pattern makes it seem to me to be worth exploring in some detail.

> Agreed. I'd say that's why the DBA needs to be able to define the split
> point between partitions: only he knows the meaning of the data.

I think this is only partly true. A casual glance at the -general list will
reveal all manner of false assumptions on the parts of administrators about
how their data is structured. My experience is that, given that the
computer has way more information about the data than I do, it is more
likely to make the right choice. To the extent it doesn't do so, that's a
problem in the planning (or whatever) algorithms, and it ought to be fixed
there.

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-07 18:16:35
Message-ID:	47826C83.3000306@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Andrew Sullivan wrote:
> On Sat, Jan 05, 2008 at 08:02:41PM +0100, Markus Schiltknecht wrote:
>> Well, management of relations is easy enough, known to the DBA and most
>> importantly: it already exists. Having to set up something which is
>> *not* tied to a relation complicates things just because it's an
>> additional concept.
>
> But we're already dealing with some complicated concepts.

Possibly, yes, but that's by far no reason to introduce even more
complicated concepts...

Does anything speak against letting the DBA handle partitions as relations?

> There isn't anything that will prevent current-style partitioning strategies
> from continuing to work in the face of Simon's proposal.

Agreed. Nor will Simon's proposal completely replace them.

> Without even trying, I can think of a dozen examples in the past 5 years
> where I could have used that sort of functionality. Because the cost of
> data retrieval was high enough, we had to decide that the question wasn't
> worth answering. Some of those answers might have been quite valuable
> indeed to the Internet community, to be frank; but because I had to pay the
> cost without getting much direct benefit, it just wasn't worth the effort.

Sure, there's value in Simon's proposal. But it has pretty strict
requirements. IMO, it's pretty hard to say, if it would have helped at
all for your cases. Any of them still available to check?

Remember the requirements: no single tuple in the segment may be
significantly out of the average bounds. Otherwise, the min/max gets
pretty useless and the segment can never be excluded.

As said before, combination with CLUSTERing might help, yes. But if you
need to maintain CLUSTERed ordering, aren't there better ways? For
example, you could use binary searching on the relation directly, much
like with indices, instead of sequentially scanning on the CLUSTERed
relation. That would even give us some sort of "indices with visibility".

>> Agreed. I'd say that's why the DBA needs to be able to define the split
>> point between partitions: only he knows the meaning of the data.
>
> I think this is only partly true. A casual glance at the -general list will
> reveal all manner of false assumptions on the parts of administrators about
> how their data is structured. My experience is that, given that the
> computer has way more information about the data than I do, it is more
> likely to make the right choice. To the extent it doesn't do so, that's a
> problem in the planning (or whatever) algorithms, and it ought to be fixed
> there.

Well, Postgres doesn't automatically create indices, for a counter example.

With regard to partitioning over multiple table spaces, I think the DBA
definitely has more information available, than the computer. A DBA
(hopefully) knows future plans and emergency strategies for the storage
system, for example. Lacking such information, the database system will
have a pretty hard time taking a good decision on how to partition
between table spaces, IMO.

Regards

Markus

From:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-07 19:08:00
Message-ID:	20080107190800.GF18581@crankycanuck.ca
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote:
>
> Does anything speak against letting the DBA handle partitions as relations?

Yes: it doesn't solve the problem I have, which is that I don't want to have
to manage a whole bunch of tables. I want one table, and I want to be able
to say, "That section is closed".

> Sure, there's value in Simon's proposal. But it has pretty strict
> requirements. IMO, it's pretty hard to say, if it would have helped at
> all for your cases. Any of them still available to check?

No, but one of your worries doesn't bother me:

> Remember the requirements: no single tuple in the segment may be
> significantly out of the average bounds. Otherwise, the min/max gets
> pretty useless and the segment can never be excluded.

The segment can never be excluded in a search on that key, yes. But
consider the likely cases we're looking at:

WHERE some_date >= '1999-01-01' AND some_date < '2001-01-01';
WHERE sequence_field BETWEEN 3000 AND 300000;

&c. These are the two obvious cases: you're searching for data in a given
date range or for primary (sometimes artificial) identifiers in a range,
and the source data increases (almost) monotonically. You have to do this
now anyway, because there's _some_ basis on which you're partitioning your
data; but today, you do this with a lot of fooling around with views and
nasty triggers that push data into the "right" table, assuming someone
doesn't screw it up.

> need to maintain CLUSTERed ordering, aren't there better ways? For
> example, you could use binary searching on the relation directly, much
> like with indices, instead of sequentially scanning on the CLUSTERed
> relation. That would even give us some sort of "indices with visibility".

I think this is a nice idea too :)

> Well, Postgres doesn't automatically create indices, for a counter example.

Yes, and it has no data-use analyser tools that automatically suggest
indexes, either. That's the sort of thing people coming from other (err,
"Other" ;-) products complain about, in fact.

> definitely has more information available, than the computer. A DBA
> (hopefully) knows future plans and emergency strategies for the storage
> system, for example.

Perhaps my jaundice comes from too much time spent in operational trenches,
but while good DBAs have some ideas about that, large numbers of them are
harried and overwhelmed just by the piles of work they already have.
Nevertheless, while what you say is true, I'm not sure what it has to do
with the present case. I don't think the current proposal is to address
partitioning across table spaces. It's to do with the way certain segments
of a table are interpreted by the system. It's undoubtedly true that this
strategy is of questionable utility for many kinds of use of PostgreSQL.
But it seems to offer very significant advantages for one use-pattern that
is very common.

That said, I am not trying to argue it should be adopted without poking at
its weaknesses. I just think it unfair to ask the proposal to address
problems it's not really aimed at.

From:	Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-07 20:53:31
Message-ID:	4782914B.6070400@cheapcomplexdevices.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Sullivan wrote:
> On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote:
>> ...the requirements: no single tuple in the segment may be
>> significantly out of the average bounds. Otherwise, the min/max gets
>> pretty useless and the segment can never be excluded.
>
> The segment can never be excluded in a search on that key, yes. But
> consider the likely cases we're looking at:
> ...you're searching for data in a given
> date range or for primary (sometimes artificial) identifiers in a range,
> and the source data increases (almost) monotonically.

It seems to me the idea discussed elsewhere in the thread[1]
about configurable segment sizes would make this limitation
much less problematic for some types of data.

Some of my biggest tables are clustered by zip-code; and are insert
mostly. Common queries are "where state_provence='TX'" or
"where city='Dallas'". While I doubt I have enough data to fill
a 1GB segment for any but the largest cities; I certainly have
runs of many consecutive blocks - since clustering by zip tends
to cluster cities as well. Even though the table's not monotonically
increasing or decreasing, like values for cities and states are
clustered together.

Is my understanding right that these Segment Visibility Maps could
help this case as well?

[1] http://archives.postgresql.org/pgsql-hackers/2008-01/msg00065.php

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-08 00:08:52
Message-ID:	4782BF14.7040808@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Andrew Sullivan wrote:
> On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote:
>> Does anything speak against letting the DBA handle partitions as relations?
>
> Yes: it doesn't solve the problem I have, which is that I don't want to have
> to manage a whole bunch of tables. I want one table, and I want to be able
> to say, "That section is closed".

That's a fair requirement. Wanting to manage partitions manually is
another fair requirement, IMO. They can well coexist.

>> Remember the requirements: no single tuple in the segment may be
>> significantly out of the average bounds. Otherwise, the min/max gets
>> pretty useless and the segment can never be excluded.
>
> The segment can never be excluded in a search on that key, yes. But
> consider the likely cases we're looking at:

Uh, which key are you talking about? AFAIU Simon's proposal, he suggests
maintaining min/max values for all columns of the table.

> WHERE some_date >= '1999-01-01' AND some_date < '2001-01-01';

Yeah, and if only *one* tuple in the 1G segment has:

some_date <= '1998-12-31' OR some_date >= '2001-01-01'

Segment Exclusion can't exclude that segment. That's all I'm saying.

> That said, I am not trying to argue it should be adopted without poking at
> its weaknesses. I just think it unfair to ask the proposal to address
> problems it's not really aimed at.

Huh? I'm certainly not the one asking for it. Quite the opposite, I'm
warning from over-estimating the use of SE.

In his proposal, Simon was explicitly comparing to "declarative
partitioning", pointing out lots of advantages and implicitly stating
that SE could cover most, if not all needs of what's commonly understand
by partitioning. That's where I disagree.

But certainly, SE and SVM has some merit for it's use case. And I'm
looking forward to test patches ;-)

Regards

Markus

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-08 02:12:28
Message-ID:	87abnh3w1f.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Andrew Sullivan" <ajs(at)crankycanuck(dot)ca> writes:

> On Mon, Jan 07, 2008 at 07:16:35PM +0100, Markus Schiltknecht wrote:
>>
>> Does anything speak against letting the DBA handle partitions as relations?
>
> Yes: it doesn't solve the problem I have, which is that I don't want to have
> to manage a whole bunch of tables. I want one table, and I want to be able
> to say, "That section is closed".

That's not your problem, that's the solution you're looking for. You're
assuming a particular solution in your problem statement.

I posit that the whole point of partitioning is to create an object which
serves to represent the semantically meaningful chunks of data. The reason
this is useful is precisely because it serves as a short-hand for the DBA
describe the data and how it will be used.

I think Simon's proposal loses the very feature that makes partitioning
useful. The DBA doesn't have a "thing" to describe, he has to define what
parts of the table he's describing for every operation. And if you define a
whole new object to name these "things" I think you'll end up with something
that looks a lot like tables.

I also don't understand how this proposal deals with the more common use case
of unloading and loading data. Normally in partitioned tables we build the
data in a side table until the data is all correct then load it as a
partition. If you treat it as a lower-level object then I don't see that
working. The layout of the new table won't often match the layout of the
target partitioned table.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

From:	Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-08 04:31:26
Message-ID:	4782FC9E.90404@paradise.net.nz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark wrote:
> I also don't understand how this proposal deals with the more common use case
> of unloading and loading data. Normally in partitioned tables we build the
> data in a side table until the data is all correct then load it as a
> partition. If you treat it as a lower-level object then I don't see that
> working. The layout of the new table won't often match the layout of the
> target partitioned table.
>
>
>
+1

In addition, a similar use case is archival of "old" partitions as they
are not longer (commonly) accessed - perhaps to a different tablespace
(or even to backup media). I don't see how dynamic in-table partitioning
handles this, and I think it would highly desirable to be able to do
these things!

best wishes

Mark

From:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-08 16:27:55
Message-ID:	20080108162755.GA23841@crankycanuck.ca
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jan 08, 2008 at 01:08:52AM +0100, Markus Schiltknecht wrote:
>
> Uh, which key are you talking about? AFAIU Simon's proposal, he suggests
> maintaining min/max values for all columns of the table.

Right, but I think that's just because that approach is automatable. Only
some use cases are going to be approproate to this.

> Yeah, and if only *one* tuple in the 1G segment has:
>
> some_date <= '1998-12-31' OR some_date >= '2001-01-01'
>
> Segment Exclusion can't exclude that segment. That's all I'm saying.

Correct.

> Huh? I'm certainly not the one asking for it. Quite the opposite, I'm
> warning from over-estimating the use of SE.

Right; I think one should be clear that there are many -- maybe most --
uses of PostgreSQL where the proposal will be of no use. I just think we
need to be clear that for the areas where it _can_ be useful, it could be
very useful indeed.

From:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-08 16:38:38
Message-ID:	20080108163838.GB23841@crankycanuck.ca
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jan 08, 2008 at 02:12:28AM +0000, Gregory Stark wrote:

> > Yes: it doesn't solve the problem I have, which is that I don't want to
> > have to manage a whole bunch of tables. I want one table, and I want to
> > be able to say, "That section is closed".
>
> That's not your problem, that's the solution you're looking for. You're
> assuming a particular solution in your problem statement.

Probably in that one, yes. I'm still waiting for permission to post my
original problem statement; I suspect it's not going to be forthcoming by
next Monday, so it's not going to happen.

But I did outline something like what I'm talking about elsewhere in this
thread. For my case, I'm thinking of the sort of data that builds up over
time, and most of which happens probably not to be useful at any moment, but
all of which _might_ be useful over the long haul. So what I wanted,
originally, was to be able to set arbitrary ranges of tuples to be
read-only, and to be able to set them offline if I wanted. Pseudo-DDL:

ALTER TABLE foo
SET read_only='t'
WHERE created_on < '2007-01-01';

ALTER TABLE foo
SET tuple_offline='t'
WHERE created_on < '2006-01-01';

Now, the second segment is marked "offline". If I query the table for
things in that range, I get an ERROR telling me there might be data in the
range, but it's not mounted at the moment. If I try to update records in
the read-only (first) range, I get an error telling me the tuple is marked
read only. The idea then is that these older tuples can be put off into
long-term storage (wave hands here about the management of that stuff), and
this keeps my online system compact but yet allows me, for just the cost
of mounting a backup tape and reading the segments back, to go back and
query those old ranges.

The case I was particularly aiming at originally was for a case of data that
cannot cost more than fractions of pennies to store, but that might
represent a hugely expensive liability if the answer is not always right.
Driving down that storage cost was mostly what I was aiming at, but people
gradually convinced me that slightly more generic implementations might be
useful. Simon's proposal came along, and it seems to me to be something
like the generic implementation that others already convinced me was needed.

> I think Simon's proposal loses the very feature that makes partitioning
> useful. The DBA doesn't have a "thing" to describe, he has to define what
> parts of the table he's describing for every operation. And if you define a
> whole new object to name these "things" I think you'll end up with something
> that looks a lot like tables.

I don't see how that's the case at all. In fact, I have the feeling it's
the opposite, so perhaps I've misunderstood something.

From:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 06:28:59
Message-ID:	20080109062859.GI6934@europa.idg.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
> This technique would be useful for any table with historical data keyed
> by date or timestamp. It would also be useful for data where a
> time-of-insert component is implicit, such as many major entity tables
> where the object ids are assigned by a sequence. e.g. an Orders table
> with an OrderId as PK. Once all orders placed in a period have been
> shipped/resolved/closed then the segments will be marked read-only.
>
> Its not really going to change the code path much for small tables, yet
> seems likely to work reasonably well for large tables of all shapes and
> sizes. If a segment is being updated, we leave it alone, and maybe never
> actually set the visibility map at all. So overall, this idea seems to
> cover the main use case well, yet with only minimal impact on the
> existing use cases we support.
>
>
> As before, I will maintain this proposal on the PG developer Wiki, once
> we get down to detailed design.
>
>
>
> Like it?

Simon,

A novel approach to the problem. For me, this proposal addresses some
of the other problems in postgres (visibility in the heap vs. index)
rather than the problems with partitioning itself.

It might seem otherwise, but for me partitioning is tool for managing
large volumes of data. It allows the user to encode what they know about
the nature of their data, and that's often about management. In this
way, your proposal actually makes partitioning too smart: the user wants
to tell the system how to organise the data, as opposed to the other way
around.

At Greenplum, we've been discussing this in depth. Interestingly, we
also felt that the storage layer could be much smarter with large tables
with predictable layouts and predictable data patterns. But the thing
is, people with large tables like partitioning, putting different
partitions on different storage; they like the ability to merge and
split partitions; they like truncating old partitions. In a word, they
seem to like the managability partitions give them -- as well as the
performance that comes with this.

To this end, we (well, Jeff Cohen) looked at the syntax and semantics of
partitining in leading databases (Oracle, Informix, DB2) and came up
with a highly expressive grammar which takes the best of each I think
(I'll post details on the grammar in a seperate thread). The idea is
that range (for example, a date range), list (a list of distinct values)
and hash partitioning be supported on multiple columns. Partitions can
be added, merged, truncated. Partitions can reside on different
tablespaces. The existing issues with the rewriter, COPY support and the
like go away by smartening up the backend.

To explore the grammar and semantics Jeff and I (to a small extent) have
implemented the parsing of such a grammar in the backend. At the moment,
this just results in the generation of a bunch of rules (i.e., it
produces the current behaviour, as if someone had written rules
themselves) and debugging output.

The fully fledged approach will see partitioning rules stored in
a new system catalog which can be interrogated by the planner or COPY to
determine which partition a given tuple belongs in or which partition(s)
a WHERE clause matches.

Yes, this is the traditional approach but I think it still has merit. We
shall continue working on this approach because it is what our customers
have asked for. We would also like to see it get into postgres too.

Thanks,

Gavin

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Cc:	Markus Schiltknecht <markus(at)bluegap(dot)ch>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 12:56:48
Message-ID:	1199883408.4266.203.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2008-01-05 at 16:30 -0500, Robert Treat wrote:

> I'm not following this. If we can work out a scheme, I see no reason not to
> allow a single table to span multiple tablespaces.

That seems to be something we might want anyway, so yes.

> The difference is that, if I currently have a table split by month, I
> can "re-partition" it into weekly segments, and only shuffle one months data
> at a time minimize impact on the system while I shuffle it. This can even be
> used to do dynamic management, where data from the current month is archived
> by day, data from the past year by week, and data beyond that done monthly.

Understood

Agreed

> In any case, my thinking is if we had the segment exclusion technique, I could
> convert that partitioned table into a regular table again, use segment
> exclusion to handle what is currently handled by partitions, and create
> a "global index" across all the other data for that other, currently killer,
> query.

Yes, that's what I have in mind.

Can I ask that you produce a "gap analysis" between what you have now
and what you would have in the future, so we can see what omissions or
errors there are in the segex proposal?

If we had indexes that spanned partitions, would we find that some of
the queries that were producing seq scans will now produce better join
and index plans, do you think?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 14:34:30
Message-ID:	1199889270.4266.267.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-01-07 at 12:14 +0100, Csaba Nagy wrote:
> On Wed, 2008-01-02 at 17:56 +0000, Simon Riggs wrote:
> > Like it?
>
> Very cool :-)

Thanks. As ever, a distillation of various thoughts, not all mine.

> One additional thought: what about a kind of "segment fill factor" ?
> Meaning: each segment has some free space reserved for future
> updates/inserts of records in the same range of it's partitioning
> constraint. And when inserting/updating you put the new record into the
> corresponding segment... just like a very coarse clustering. Then you
> could vacuum the segments separately to keep the free space not running
> out. For active segments you would then fix the partitioning constraint
> range once the fill factor is reached, to allow for keeping it's
> constraint even when heavily updating (heavily vacuuming it too as
> response to that), and create a new segment for the unbounded range for
> new inserts... this would work fine for tables where the constraint is
> based on ever increasing keys and accidental inserts in old ranges
> (which do happen occasionally in real life).

Lots of thoughts there, so I'll try to separate them out and analyse.

The way I originally described it is a very simple mechanism and we
could tweak that some more. All ideas welcome.

If we had dynamic segment constraints when the segment was not yet read
only that would lead to changes in the segment constraint for each
INSERT when we have increasing keys. It seems better to set the
constraints once only, when the segment was full, then prevent further
INSERTs. The accidental inserts in old ranges seem like something that
can be avoided if it is a real-world problem.

For UPDATEs on a segment with constraints we might choose to apply the
constraints to see what to do. You might want to allow the UPDATE and
have it stretch the constraints outwards or you might want to prevent it
and throw an error, or you might want to allow the UPDATE, yet migrate
the new tuple to the appropriate segment.

Dynamic partitioning works in multiple dimensions, so there isn't just
one single valid location for any row. i.e. if we update a have a row
with OrderId, OrderDate, RequiredByDate, ShipDate and LastModifiedDate
on it, we'll probably expand the constraints on at least one of those.
If we were lucky enough to have only changed one of those it might turn
out there was *in that case* a single more sensible location for the new
tuple, but that probably isn't the common case.

So the likely best behaviour for UPDATEs is to try to keep the new row
version in the same segment, then change the constraints.

The segment constraints concept and the read only concept were linked.
You're right we could separate them, thought that turns out not to be
that desirable. When we do an DELETE or an UPDATE we don't know whether
the deleted row version was the last tuple with that particular boundary
value. So we don't know whether the DELETE or UPDATE changes the
constraints or not, and however we try to avoid it, we'll probably need
to recalc the constraints in some circumstance. So updating the
constraints dynamically isn't anywhere near as easy as it sounds and
ultimately probably isn't worth the effort. So thats why constraints and
read-only go together.

HOT allows us to keep the new row version within the segment, in many
cases. What might be worth doing is marking the FSM space for that
segment as update-only to exclude inserts from using it, then forcing
UPDATEs to stay within the segment if possible by providing the current
block number in each call to the FSM. That would also mean that every
UPDATE that wants to do a multi-block update on a currently read-only
segment would need to call the FSM. Sounds good, but that would only
work for the first UPDATE on a segment after it is marked read only,
which isn't much use, or we would do it for *every* block-spanning
UPDATE, which would cause contention in other use cases.

So although I'm willing to listen and tweak, that hasn't yet resulted in
any additional design points, unless I missed something above?

> When the change rate of old segments get down, the segments could be
> reorganized to have a smaller fill factor, so that you still allow for
> accidental updates but keep space usage efficient. This would be some
> similar action as a clustering, but hopefully not blocking (which might
> be a hard thing to do)... and later again you could mark some of the
> really old things as read only and put them in special segments with no
> wasted space.

Yes, hopefully marking read only and compressed segments is a possible
future.

> One problem would be when the segment's free space runs out, so you must
> put records from the same constraint range in multiple segments - but
> that could still work, you just would have multiple segments covering
> the same range, but if the "segment fill factor" is chosen properly it
> should not be the case... you could normally maintain a set of
> non-overlapping segments in terms of the partitioning constraint.

Thanks very much for your detailed thoughts on the proposal.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, tomas(at)tuxteam(dot)de, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 14:47:48
Message-ID:	1199890068.4266.279.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2008-01-06 at 11:39 +0100, Markus Schiltknecht wrote:

> I think this has to do with SE not being of much use for index scans.

Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be
easy to apply the index condition and/or filters to see which segments
are excluded and then turn off bits in the bitmap appropriately.

Not fully sure about IndexScans yet. I don't think it would be worth
trying to apply SE until we estimated we would return say 100 rows. It
needs to be able to work without slowing down the common path.

> Or
> put it another way: SE is an optimization for sequential scans. For
> tables where it works well, it could possibly replace the index entirely.

True

> Without the index, you would rely on SE to always be able to exclude
> enough segments, so that the seq scan is less expensive than an index
> scan with the following table lookups.

It would have to be a very fat index scan for so large a table...

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 14:49:49
Message-ID:	1199890189.4266.281.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-01-07 at 14:20 +0100, Markus Schiltknecht wrote:

> AFAIUI, Segment Exclusion combines perfectly well with
> clustering.

Yes, seems like it would be possible to have a segment-aware CLUSTER, so
it was actually usable on large tables. Not planning that initially
though.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, tomas(at)tuxteam(dot)de, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 14:53:39
Message-ID:	4784DFF3.9040701@bluegap.ch
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> Hmmm. I think it fits rather neatly with BitmapIndexScans. It would be
> easy to apply the index condition and/or filters to see which segments
> are excluded and then turn off bits in the bitmap appropriately.

Yeah, good point.

> Not fully sure about IndexScans yet. I don't think it would be worth
> trying to apply SE until we estimated we would return say 100 rows. It
> needs to be able to work without slowing down the common path.

Yup.

>> Or
>> put it another way: SE is an optimization for sequential scans. For
>> tables where it works well, it could possibly replace the index entirely.
>
> True
>
>> Without the index, you would rely on SE to always be able to exclude
>> enough segments, so that the seq scan is less expensive than an index
>> scan with the following table lookups.
>
> It would have to be a very fat index scan for so large a table...

..for SE to be faster than an index scan, you mean? Yes.

Regards

Markus

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc:	Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 15:04:34
Message-ID:	1199891074.4266.292.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2008-01-05 at 16:42 +0100, Markus Schiltknecht wrote:

> Simon Riggs wrote:
> > On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote:
> >
> >> I'm still puzzled about how a DBA is expected to figure out which
> >> segments to mark. Simon, are you assuming we are going to pass on
> >> segment numbers to the DBA one day?
> >
> > No Way!
>
> Ah, I'm glad ;-)

But now you want names on partitions?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 15:08:08
Message-ID:	1199891288.4266.300.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gavin and all,

This is quite a long reply, so apologies for that.

On Wed, 2008-01-09 at 07:28 +0100, Gavin Sherry wrote:
> On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
> > This technique would be useful for any table with historical data keyed
> > by date or timestamp. It would also be useful for data where a
> > time-of-insert component is implicit, such as many major entity tables
> > where the object ids are assigned by a sequence. e.g. an Orders table
> > with an OrderId as PK. Once all orders placed in a period have been
> > shipped/resolved/closed then the segments will be marked read-only.
> >

> A novel approach to the problem. For me, this proposal addresses some
> of the other problems in postgres (visibility in the heap vs. index)
> rather than the problems with partitioning itself.

I think you're right that it isn't attempting to address the problems
with partitioning directly, it is attempting to address the core use
case itself. I think we have an opportunity to bypass the
legacy-of-thought that Oracle has left us and implement something more
usable.

There are some low-level technical issues associated with declarative
partitioning that I'll address last, which are in many ways the big
reason to want to go to non-declarative partitioning.

> It might seem otherwise, but for me partitioning is tool for managing
> large volumes of data. It allows the user to encode what they know about
> the nature of their data, and that's often about management. In this
> way, your proposal actually makes partitioning too smart: the user wants
> to tell the system how to organise the data, as opposed to the other way
> around.

> At Greenplum, we've been discussing this in depth. Interestingly, we
> also felt that the storage layer could be much smarter with large tables
> with predictable layouts and predictable data patterns. But the thing
> is, people with large tables like partitioning, putting different
> partitions on different storage; they like the ability to merge and
> split partitions; they like truncating old partitions. In a word, they
> seem to like the managability partitions give them -- as well as the
> performance that comes with this.

Do people really "like" running all that DDL? There is significant
manpower cost in implementing and maintaining a partitioning scheme,
plus significant costs in getting it wrong.

If people with large tables like partitioning why is Oracle moving
towards automated partitioning in 11g? Automated partitioning was one of
the major goals for this next set of Postgres partitioning functionality
also, whether or not we have declarative partitioning. My thinking is
lets go for the best ideas and skip over the stuff that will be (is)
obsolete before its left the design stage.

I see many more systems in the Terabyte range now than I did 10 years
ago, but I see about the same number of DBAs. We'll always need experts,
but I feel we should be using our expertise to simplify the standard
cases, not just maintain the same level of difficulty in managing them.
One of the big benefits of the dynamic partitioning approach is that it
needs no DDL. So it will work out-of-the-box, for anybody.

Deleting older data would be optimised under the proposed scheme, so
that's not really a problem. Loading data is actually slightly harder
and slower with declarative partitioning (see below).

Merging and splitting partitions are tools for fine tuning a very
complex partitioning scheme. They do also allow a non-linear
segmentation scheme, which might aid performance in some cases.

(On a different note, I'm strongly in favour of a declarative approach
to other optimizer information to allow the DBA to say what they know
about how a table will be used. But that's off-topic for now.)

One major advantage of the dynamic approach is that it can work on
multiple dimensions simultaneously, which isn't possible with
declarative partitioning. For example if you have a table of Orders then
you will be able to benefit from Segment Exclusion on all of these
columns, rather than just one of them: OrderId, OrderDate,
RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
in the partitioning, e.g. if we fill 1 partition a day of Orders, then
the OrderId and OrderData columns will start out perfectly arranged. Any
particular RequiredByDate will probably be spread out over 7 partitions,
but thats way better than being spread out over 365+ partitions.

When we look at the data in the partition we can look at any number of
columns. When we declaratively partition, you get only one connected set
of columns, which is one of the the reasons you want multi-dimensional
partitioning in the first place.

> To this end, we (well, Jeff Cohen) looked at the syntax and semantics of
> partitining in leading databases (Oracle, Informix, DB2) and came up
> with a highly expressive grammar which takes the best of each I think
> (I'll post details on the grammar in a seperate thread). The idea is
> that range (for example, a date range), list (a list of distinct values)
> and hash partitioning be supported on multiple columns. Partitions can
> be added, merged, truncated. Partitions can reside on different
> tablespaces. The existing issues with the rewriter, COPY support and the
> like go away by smartening up the backend.

> To explore the grammar and semantics Jeff and I (to a small extent) have
> implemented the parsing of such a grammar in the backend. At the moment,
> this just results in the generation of a bunch of rules (i.e., it
> produces the current behaviour, as if someone had written rules
> themselves) and debugging output.
>
> The fully fledged approach will see partitioning rules stored in
> a new system catalog which can be interrogated by the planner or COPY to
> determine which partition a given tuple belongs in or which partition(s)
> a WHERE clause matches.

When loading data we would need to examine each incoming tuple and
distribute it to the correct place in the partitioned heap. That's a
non-trivial overhead on data loading that I would like to avoid, since
if we rely on the natural locality of loaded data then that overhead is
zero.

> ...partitioning rules stored in
> a new system catalog which can be interrogated...

Agreed. I think this would be the best approach whatever we do. pg_class
would eventually impose a limit on the number of partitions for us.

> Yes, this is the traditional approach but I think it still has merit. We
> shall continue working on this approach because it is what our customers
> have asked for. We would also like to see it get into postgres too.

I think there very likely some merit hiding in the "traditional
approach", but we should be clear about what that is and is not. (I
still have half my brain in that way of doing things, so I'm working
through everything to make sure I miss as little as possible.)

I'd like to hear from anybody with some descriptions of use cases where
the differences between what's been proposed and what is missing provide
a significant gain. If nobody has ever run such tests or had those
thoughts, lets examine them now to make sure we're doing the right
thing. I don't want to copy what others have done without considering
whether it is correct for us.

After thinking about what you've said ISTM that the declarative
partition approach has these advantages over the dynamic approach

- allows partitions of different sizes, for when a non-linear
partitioning scheme gives considerable performance gain

- allows partitioning to continue to work whilst UPDATEs continue within
a particular partition

and these disadvantages

- only works with one set of columns, cannot exclude partitions by other
columns

- requires manual definition

- requires partition names and numbers to be user visible

- declarative partitioing can improve the situation, or make it worse,
depending upon whether the definitions match the reality of the data

- acquires table locks during DDL execution, to enforce validity of the
rules, possibly for extended periods while data moves/is checked

- requires loaded data to execute partitioning code to identify where it
should insert itself

My assessment is that the dynamic partitioning approach wins in most
common cases for large growing tables, though there are additional use
cases where more control is desirable.

(The two advantages of declarative partitioning might be something we
could change with the dynamic approach, but not in first phase. For
example we might see a way to break down segments into 4 sub-segments
recursively using a quad tree approach. Thus when segments are being
updated we simply isolate the issue around the changing data.)

I would like to proceed in a way that allows dynamic partitioning to
occur naturally as the default, with the option to add explicit
partitioning, if required and possible at a later stage.

At this stage I'm not sure whether those additional cases apply to truly
growing tables and whether there is sufficient additional advantage to
make declarative partitioning worth considering.

Technical Issues
----------------

The main problems I see occur in the storage layer. If we can discuss
what you think is needed at the storage layer then we might stand a
chance of laying the foundations for both options. The catalog changes
and executor changes are probably fairly similar for both.

My first problem sounds quite mundane: expandability. If you define a
partition manually and logically then it needs to cope with an arbitrary
number of rows. That means that the view of a partition as a range of
contiguous blocks disappears and is replaced by a more complex
definition. When you turn partitioning on its head and say "What data is
in this range of blocks?" you don't need to have expandability and the
partition can be a fixed set of contiguous blocks. That works well with
any shape of data, so merging/splitting operations are not required
because we never have fat or thin partitions to cope with. You can also
redefine partitioning simply by changing the segment size and
re-VACUUMing the table.

At the moment, smgr provides the executor with a single uniform view of
a table as a single contiguous range of blocks. The current
implementation of md.c allows a table to actually consist of a range of
files, without the executor knowing. If we violate that, then we'd need
to change all code that accepts a BlockNumber, i.e. indexes, buffer
manager etc. So anything we do that is of any complexity must be below
the smgr layer. Allowing expandable partitions eventually implies a
non-contiguous range of blocks.

In the worst case coping with a non-contiguous block range means you
need to consult a block index for each block, which would be horrible.
One better solution might be to have logical partitions broken up into
physical partitions all of the same size. Various solutions possible,
but most seem to detract from the thing which we were seeking in the
first place and/or seem to restrict the range of block addresses
available, unless we go for variable-size BlockNumbers also. Urgh.

If anybody thinks we can get around this by putting a maximum size on
partitions, then I'd say you're mad. The last thing we would want is a
mechanism that stops you from adding new data if you get the partition
definitions wrong, especially since my experience is that most people
set up partitioning in a way that is proven sub-optimal after some time
in real operation. The same comments would also apply to Oracle's Hash
Cluster and DB2's Multi-Dimensional Clustering techniques. They're great
performance features when declared correctly, but almost unusable in
real applications that can't control what business and the real world
will throw at them.

So it seems likely that we'd be heading for a level of change within the
storage layer that would not be easily accepted, given that not
everybody needs partitioning in Postgres.

We're probably caught on the question of whether this is a general
purpose or special purpose database system. Realistically, I have to set
limits, so I'd have to describe my current goal of being able to manage
16TB easily and quickly with Postgres, without too much additional
thought on the part of the DBA, when they have one or more naturally
growing tables. Special purpose requirements for parallelism or larger
databases can use another database system, e.g. GP, XDB, ParAccel etc.

So, there are specific disadvantages that I see to declarative
partitioning with respect to its deeper implementation for us. I see no
need to rule it out completely, but I do see reasons to de-prioritise
it. Detailed comments and thoughts most welcome.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Chris Browne <cbbrowne(at)acm(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 16:47:31
Message-ID:	60abnfymho.fsf@dba2.int.libertyrms.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

simon(at)2ndquadrant(dot)com (Simon Riggs) writes:
> I think we have an opportunity to bypass the legacy-of-thought that
> Oracle has left us and implement something more usable.

This seems like a *very* good thing to me, from a couple of
perspectives.

1. I think you're right on in terms of the issue of the cost of
"running all that DDL" in managing partitioning schemes.

When I was working as DBA, I was decidedly *NOT* interested in
doing a lot of low level partition management work, and those that
are in that role now would, I'm quite sure, agree that they are
not keen on spending a lot of their time trying to figure out what
tablespace to shift a particular table into, or what tablespace
filesystem to get sysadmins to set up.

2. Blindly following what Oracle does has always been a dangerous
sort of thing to do.

There are two typical risks:

a) There's always the worry that they may have patented some
part of how they implement things, and if you follow too
closely, There Be Dragons...

b) They have enough billion$ of development dollar$ and
development re$ource$ that they can follow strategies that
are too expensive for us to even try to follow.

3. If, rather than blindly following, we create something at least
quasi-new, there is the chance of doing fundamentally better.

This very thing happened when it was discovered that IBM had a
patent on the ARC cacheing scheme; the "clock" system that emerged
was a lot better than ARC ever was.

I think it's worth observing both the advantages and demerits of this
together.

In effect, with the dynamic approach, Segment Exclusion provides its
benefits as an emergent property of the patterns of how INSERTs get
drawn into segments.

The tendancy will correspondly be that Segment Exclusion will be able
to provide useful constraints for those patterns that can naturally
emerge from the INSERTs.

We can therefore expect useful constraints for attributes that are
assigned in some kind of more or less chronological order. Such
attributes will include:

- Object ID, if set by a sequence
- Processing dates

There may be a bit of sloppiness, but the constraints may still be
useful enough to exclude enough segments to improve efficiency.

_On The Other Hand_, there will be attributes that are *NOT* set in a
more-or-less chronological order, and Segment Exclusion will be pretty
useless for these attributes.

In order to do any sort of "Exclusion" for non-"chronological"
attributes, it will be necessary to use some mechanism other than the
patterns that fall out of "natural chronological insertions." If you
want exclusion on such attributes, then there needs to be some sort of
rule system to spread such items across additional partitions. Mind
you, if you do such, that will weaken the usefulness of Segment
Exclusion. For instance, suppose you have 4 regions, and scatter
insertions by region. In that case, there will be more segments that
overlap any given chronological range.

> When we look at the data in the partition we can look at any number of
> columns. When we declaratively partition, you get only one connected set
> of columns, which is one of the the reasons you want multi-dimensional
> partitioning in the first place.

Upside: Yes, you get to exclude based on examining any number of
columns.

Downside: You only get the exclusions that are "emergent properties"
of the data...

The more I'm looking at the dynamic approach, the more I'm liking
it...
--
"cbbrowne","@","cbbrowne.com"
http://linuxfinances.info/info/linuxxian.html
"Feel free to contribute build files. Or work on your motivational
skills, and maybe someone somewhere will write them for you..."
-- "Fredrik Lundh" <effbot(at)telia(dot)com>

From:	Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To:	Chris Browne <cbbrowne(at)acm(dot)org>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 18:53:47
Message-ID:	4785183B.2050709@cheapcomplexdevices.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Chris Browne wrote:
> _On The Other Hand_, there will be attributes that are *NOT* set in a
> more-or-less chronological order, and Segment Exclusion will be pretty
> useless for these attributes.

Really? I was hoping that it'd be useful for any data
with long runs of the same value repeated - regardless of ordering.

My biggest tables are clustered by zip/postal-code -- which means that
while the City, State, Country attributes aren't monotonically increasing
or decreasing; they are grouped tightly together. I'd expect all queries
for San Francisco to be able to come from at most 2 segments; and all queries
for Texas to be able to come from only a fraction of the whole.

If the segment sizes are configurable - I imagine this would even
be useful for other data - like a people table organized
by last_name,first_name. "John"'s may be scattered through out
the table -- but at least the John Smith's would all be on one
segment, while the Aaron-through-Jim Smith segments might get excluded.

Or am I missing something?

From:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To:	Chris Browne <cbbrowne(at)acm(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 19:03:13
Message-ID:	20080109190313.GM6934@europa.idg.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jan 09, 2008 at 11:47:31AM -0500, Chris Browne wrote:
> simon(at)2ndquadrant(dot)com (Simon Riggs) writes:
> > I think we have an opportunity to bypass the legacy-of-thought that
> > Oracle has left us and implement something more usable.
>
> This seems like a *very* good thing to me, from a couple of
> perspectives.
>
[snip]
> 2. Blindly following what Oracle does has always been a dangerous
> sort of thing to do.
>
> There are two typical risks:
>
> a) There's always the worry that they may have patented some
> part of how they implement things, and if you follow too
> closely, There Be Dragons...

I think that could be equally said of the dynamic partitioning approach.
In fact, it might be more likely since declarative partitioning has been
around for eons.

> b) They have enough billion$ of development dollar$ and
> development re$ource$ that they can follow strategies that
> are too expensive for us to even try to follow.

I don't see that as an argument against the declarative approach.
Reading the details on this thread so far, I think Simon's approach is
probably more complex from an implementation POV.

>
> 3. If, rather than blindly following, we create something at least
> quasi-new, there is the chance of doing fundamentally better.
>
> This very thing happened when it was discovered that IBM had a
> patent on the ARC cacheing scheme; the "clock" system that emerged
> was a lot better than ARC ever was.

Well, I don't think I'm proposing we /blindly follow/ others. I propose
we choose a grammar which takes the best of what others have tried to
do. Oracle's grammar is hideous, IBM's is too restrictive, for example.

> > One major advantage of the dynamic approach is that it can work on
> > multiple dimensions simultaneously, which isn't possible with
> > declarative partitioning. For example if you have a table of Orders then
> > you will be able to benefit from Segment Exclusion on all of these
> > columns, rather than just one of them: OrderId, OrderDate,
> > RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
> > in the partitioning, e.g. if we fill 1 partition a day of Orders, then
> > the OrderId and OrderData columns will start out perfectly arranged. Any
> > particular RequiredByDate will probably be spread out over 7 partitions,
> > but thats way better than being spread out over 365+ partitions.
>
> I think it's worth observing both the advantages and demerits of this
> together.
>
> In effect, with the dynamic approach, Segment Exclusion provides its
> benefits as an emergent property of the patterns of how INSERTs get
> drawn into segments.
>
> The tendancy will correspondly be that Segment Exclusion will be able
> to provide useful constraints for those patterns that can naturally
> emerge from the INSERTs.

Many people, in my experience, doing the kind of data processing which
benefits from partitioning are regularly loading data, rather than
collecting it in an OLTP fashion. Lets take the easily understandable
concept of processing web site traffic. If the amount of data is large
enough to benefit from partitioning, they probably have multiple web
servers and therefore almost certainly multiple log files. If these
files are not sorted into a single file, the records will not have a
naturally progressing chronology: every file we go back to the beginning
of the period of time the load covers. If you add parallelism to your load,
things look even more different. This means you could end up with a
bunch of partitions, under the dynamic model, which all cover the same
time range.

Then there's the way that really big databases are used (say, up around
Simon's upper bound of 16 TB). It is expensive to keep data online so
people aren't. They're loading and unloading data all the time, to
perform different forms of analysis. A common scenario in the example
above might be to unload all but the current month's data and then load
the same month from the previous year. The unload step needs to be
costly (i.e., TRUNCATE). Then, there's no guarantee that what they're
interested in is the date range at all. They may want to compare user
agent (look at bot activity). In this case, the partitioning is across a
small list of strings (well, numbers most likely). Here, the partitions
would all have the same range. Partitioning would be useless.

> We can therefore expect useful constraints for attributes that are
> assigned in some kind of more or less chronological order. Such
> attributes will include:
>
> - Object ID, if set by a sequence
> - Processing dates
>
> There may be a bit of sloppiness, but the constraints may still be
> useful enough to exclude enough segments to improve efficiency.

I really like the idea of the system being able to identify trends and
patterns in the data (actually useful at the user level) but it's the
uncertainty of how the partitioning will work for different kinds of
data that I don't like.

>
> _On The Other Hand_, there will be attributes that are *NOT* set in a
> more-or-less chronological order, and Segment Exclusion will be pretty
> useless for these attributes.
>
> In order to do any sort of "Exclusion" for non-"chronological"
> attributes, it will be necessary to use some mechanism other than the
> patterns that fall out of "natural chronological insertions." If you
> want exclusion on such attributes, then there needs to be some sort of
> rule system to spread such items across additional partitions. Mind
> you, if you do such, that will weaken the usefulness of Segment
> Exclusion. For instance, suppose you have 4 regions, and scatter
> insertions by region. In that case, there will be more segments that
> overlap any given chronological range.

Right, and so the system seems to 'degrade' to a declarative approach.
Consider all the kinds of data other than time series people are storing
in Postgres. Some of the biggest I've seen are GIS information or network
information. Also, what about people with there own data types? It may
be that the data types do not support ordering so range partitioning
does not make sense.

> > When we look at the data in the partition we can look at any number of
> > columns. When we declaratively partition, you get only one connected set
> > of columns, which is one of the the reasons you want multi-dimensional
> > partitioning in the first place.
>
> Upside: Yes, you get to exclude based on examining any number of
> columns.
>
> Downside: You only get the exclusions that are "emergent properties"
> of the data...

There's also a cost involved with calculating all this in the dymanic
approach. Consider tables with a non-trivial number of columns.

Thanks,

Gavin

From:	Chris Browne <cbbrowne(at)acm(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 19:38:21
Message-ID:	60k5miyeky.fsf@dba2.int.libertyrms.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> writes:
> Chris Browne wrote:
>> _On The Other Hand_, there will be attributes that are *NOT* set in a
>> more-or-less chronological order, and Segment Exclusion will be pretty
>> useless for these attributes.
>
> Really? I was hoping that it'd be useful for any data
> with long runs of the same value repeated - regardless of ordering.
>
> My biggest tables are clustered by zip/postal-code -- which means that
> while the City, State, Country attributes aren't monotonically increasing
> or decreasing; they are grouped tightly together. I'd expect all queries
> for San Francisco to be able to come from at most 2 segments; and all queries
> for Texas to be able to come from only a fraction of the whole.
>
>
> If the segment sizes are configurable - I imagine this would even
> be useful for other data - like a people table organized
> by last_name,first_name. "John"'s may be scattered through out
> the table -- but at least the John Smith's would all be on one
> segment, while the Aaron-through-Jim Smith segments might get excluded.
>
> Or am I missing something?

Well, this can head in two directions...

1. Suppose we're not using an "organize in CLUSTER order" approach.

If the data is getting added in roughly "by order of insertion" order,
then there's no reason to expect San Francisco data to be clustered
together.

Ditto for "John Smiths;" we can expect them to be as scattered as
their dates of creation.

1. Suppose we *are* using an "organize in CLUSTER order" approach.

In that case, yes, indeed, you can expect segments to contain specific
ranges of the data.

However, in that case, the ONLY dimension under which the Segment
Exclusion may be expected to be useful is that of the first column of
the index being used. "Smith" should be useful to SE, but not "John,"
because, as a secondary criteria, the first name values will be
scattered across all segments.
--
(reverse (concatenate 'string "ofni.sesabatadxunil" "@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/linuxdistributions.html
PALINDROME spelled backwards is EMORDNILAP.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc:	Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 20:17:41
Message-ID:	1199909861.4266.514.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote:

> I think Simon's approach is
> probably more complex from an implementation POV.

Much of the implementation is exactly the same, and I'm sure we agree on
more than 50% of how this should work already. We just need to close in
on the remainder.

My current opinion on the SE approach is the opposite of the one above,
though it gets us nowhere just to state it. I'm trying to avoid opinion
and look at the details, which is the reason my viewpoint recently
changed in favour of the dynamic approach as the main thrust for
implementation.

I've written a detailed email this morning to explain where and how the
problems lie, which are nowhere near the syntax level. I haven't ruled
out a declarative approach yet, but I need some detailed technical
review of the issues. I hope you'll be replying to that?

> > 3. If, rather than blindly following, we create something at least
> > quasi-new, there is the chance of doing fundamentally better.
> >
> > This very thing happened when it was discovered that IBM had a
> > patent on the ARC cacheing scheme; the "clock" system that emerged
> > was a lot better than ARC ever was.
>
> Well, I don't think I'm proposing we /blindly follow/ others. I propose
> we choose a grammar which takes the best of what others have tried to
> do. Oracle's grammar is hideous, IBM's is too restrictive, for example.

I assume the new grammar is good and if we do go that way, it sounds
like the right starting place.

> > > One major advantage of the dynamic approach is that it can work on
> > > multiple dimensions simultaneously, which isn't possible with
> > > declarative partitioning. For example if you have a table of Orders then
> > > you will be able to benefit from Segment Exclusion on all of these
> > > columns, rather than just one of them: OrderId, OrderDate,
> > > RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
> > > in the partitioning, e.g. if we fill 1 partition a day of Orders, then
> > > the OrderId and OrderData columns will start out perfectly arranged. Any
> > > particular RequiredByDate will probably be spread out over 7 partitions,
> > > but thats way better than being spread out over 365+ partitions.
> >
> > I think it's worth observing both the advantages and demerits of this
> > together.
> >
> > In effect, with the dynamic approach, Segment Exclusion provides its
> > benefits as an emergent property of the patterns of how INSERTs get
> > drawn into segments.
> >
> > The tendancy will correspondly be that Segment Exclusion will be able
> > to provide useful constraints for those patterns that can naturally
> > emerge from the INSERTs.
>
> Many people, in my experience, doing the kind of data processing which
> benefits from partitioning are regularly loading data, rather than
> collecting it in an OLTP fashion. Lets take the easily understandable
> concept of processing web site traffic. If the amount of data is large
> enough to benefit from partitioning, they probably have multiple web
> servers and therefore almost certainly multiple log files. If these
> files are not sorted into a single file, the records will not have a
> naturally progressing chronology: every file we go back to the beginning
> of the period of time the load covers. If you add parallelism to your load,
> things look even more different. This means you could end up with a
> bunch of partitions, under the dynamic model, which all cover the same
> time range.

Depends how big we make the partitions and how sloppy this is as to
whether that is a problem or not. We might still expect a x100 gain from
using the SE approach depending upon the data volume.

> Then there's the way that really big databases are used (say, up around
> Simon's upper bound of 16 TB). It is expensive to keep data online so
> people aren't. They're loading and unloading data all the time, to
> perform different forms of analysis.

That isn't my experience. That sounds very time consuming.

The storage cost issue was the reason Andrew wanted offline segments,
and why I have been talking about hierarchical storage.

> A common scenario in the example
> above might be to unload all but the current month's data and then load
> the same month from the previous year. The unload step needs to be
> costly (i.e., TRUNCATE). Then, there's no guarantee that what they're
> interested in is the date range at all. They may want to compare user
> agent (look at bot activity). In this case, the partitioning is across a
> small list of strings (well, numbers most likely). Here, the partitions
> would all have the same range. Partitioning would be useless.

I take it you mean SE-based partitioning would be useless, but
declarative partitioning would be useful? I would agree, assuming they
run queries with a few of the small list of strings. Seems like a
contrived case.

> Some of the biggest I've seen are GIS information or network
> information.

Those are good examples of where a declarative approach would be the
only way of getting partitioning to work well. So based on that I'll
agree that some people *need* a declarative approach to get the benefits
of partitioning, rather than just want.

We do already have constraint exclusion. Can we get by with constraint
and segment exclusion together?

> Also, what about people with there own data types? It may
> be that the data types do not support ordering so range partitioning
> does not make sense.

I'd like to hear some examples. Oleg?

> There's also a cost involved with calculating all this in the dymanic
> approach. Consider tables with a non-trivial number of columns.

True, but that balances the cost of doing that at load time.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 20:34:40
Message-ID:	1199910880.4266.517.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2008-01-08 at 02:12 +0000, Gregory Stark wrote:

> I also don't understand how this proposal deals with the more common use case
> of unloading and loading data. Normally in partitioned tables we build the
> data in a side table until the data is all correct then load it as a
> partition. If you treat it as a lower-level object then I don't see that
> working. The layout of the new table won't often match the layout of the
> target partitioned table.

We optimised for that in 8.2, but I would say that not many people
noticed and that it isn't normal.

The problem with that approach, and the reason many people don't use it
is that it requires all data for a partition to be available at the time
you add the partition. That necessarily implies a time delay into the
process of loading data, which is no long acceptable in the world of
straight-through-processing or whatever you call the need for zero
processing delay in an/your industry. So people choose to load data
directly into the main table, allowing it to be immediately available,
though at the cost of some performance.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To:	Chris Browne <cbbrowne(at)acm(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 22:36:43
Message-ID:	20080109223643.GB999@europa.idg.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jan 09, 2008 at 02:38:21PM -0500, Chris Browne wrote:
> Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> writes:
> > Or am I missing something?
>
> Well, this can head in two directions...
>
> 1. Suppose we're not using an "organize in CLUSTER order" approach.
>
> If the data is getting added in roughly "by order of insertion" order,
> then there's no reason to expect San Francisco data to be clustered
> together.
>
> Ditto for "John Smiths;" we can expect them to be as scattered as
> their dates of creation.
>
> 1. Suppose we *are* using an "organize in CLUSTER order" approach.
>
> In that case, yes, indeed, you can expect segments to contain specific
> ranges of the data.
>
> However, in that case, the ONLY dimension under which the Segment
> Exclusion may be expected to be useful is that of the first column of
> the index being used. "Smith" should be useful to SE, but not "John,"
> because, as a secondary criteria, the first name values will be
> scattered across all segments.

One of the reasons for using partitions is that the usefulness of
indexes breaks down because of the low cardinality of the indexed
column. Also, the random IO can kill your performance.

In my opinion, CLUSTER (and vacuum, mentioned else where) are not data
management tools.

Thanks,

Gavin

From:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-09 22:48:03
Message-ID:	20080109224803.GC999@europa.idg.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jan 09, 2008 at 08:17:41PM +0000, Simon Riggs wrote:
> On Wed, 2008-01-09 at 20:03 +0100, Gavin Sherry wrote:
>
> > I think Simon's approach is
> > probably more complex from an implementation POV.
>
> Much of the implementation is exactly the same, and I'm sure we agree on
> more than 50% of how this should work already. We just need to close in
> on the remainder.
>
> My current opinion on the SE approach is the opposite of the one above,
> though it gets us nowhere just to state it. I'm trying to avoid opinion
> and look at the details, which is the reason my viewpoint recently
> changed in favour of the dynamic approach as the main thrust for
> implementation.
>
> I've written a detailed email this morning to explain where and how the
> problems lie, which are nowhere near the syntax level. I haven't ruled
> out a declarative approach yet, but I need some detailed technical
> review of the issues. I hope you'll be replying to that?

I'm responding to that email seperately. Just a matter or addressing all
the points in the detail it deserves.

>
> > > 3. If, rather than blindly following, we create something at least
> > > quasi-new, there is the chance of doing fundamentally better.
> > >
> > > This very thing happened when it was discovered that IBM had a
> > > patent on the ARC cacheing scheme; the "clock" system that emerged
> > > was a lot better than ARC ever was.
> >
> > Well, I don't think I'm proposing we /blindly follow/ others. I propose
> > we choose a grammar which takes the best of what others have tried to
> > do. Oracle's grammar is hideous, IBM's is too restrictive, for example.
>
> I assume the new grammar is good and if we do go that way, it sounds
> like the right starting place.
>
> > > > One major advantage of the dynamic approach is that it can work on
> > > > multiple dimensions simultaneously, which isn't possible with
> > > > declarative partitioning. For example if you have a table of Orders then
> > > > you will be able to benefit from Segment Exclusion on all of these
> > > > columns, rather than just one of them: OrderId, OrderDate,
> > > > RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
> > > > in the partitioning, e.g. if we fill 1 partition a day of Orders, then
> > > > the OrderId and OrderData columns will start out perfectly arranged. Any
> > > > particular RequiredByDate will probably be spread out over 7 partitions,
> > > > but thats way better than being spread out over 365+ partitions.
> > >
> > > I think it's worth observing both the advantages and demerits of this
> > > together.
> > >
> > > In effect, with the dynamic approach, Segment Exclusion provides its
> > > benefits as an emergent property of the patterns of how INSERTs get
> > > drawn into segments.
> > >
> > > The tendancy will correspondly be that Segment Exclusion will be able
> > > to provide useful constraints for those patterns that can naturally
> > > emerge from the INSERTs.
> >
> > Many people, in my experience, doing the kind of data processing which
> > benefits from partitioning are regularly loading data, rather than
> > collecting it in an OLTP fashion. Lets take the easily understandable
> > concept of processing web site traffic. If the amount of data is large
> > enough to benefit from partitioning, they probably have multiple web
> > servers and therefore almost certainly multiple log files. If these
> > files are not sorted into a single file, the records will not have a
> > naturally progressing chronology: every file we go back to the beginning
> > of the period of time the load covers. If you add parallelism to your load,
> > things look even more different. This means you could end up with a
> > bunch of partitions, under the dynamic model, which all cover the same
> > time range.
>
> Depends how big we make the partitions and how sloppy this is as to
> whether that is a problem or not. We might still expect a x100 gain from
> using the SE approach depending upon the data volume.

This comes back to my problem with the dynamic approach: the user might
get a certain experience but with the declarative approach the
expectation matches the result.

>
> > Then there's the way that really big databases are used (say, up around
> > Simon's upper bound of 16 TB). It is expensive to keep data online so
> > people aren't. They're loading and unloading data all the time, to
> > perform different forms of analysis.
>
> That isn't my experience. That sounds very time consuming.

I cannot offer any evidence but it's what we see companies doing.

> The storage cost issue was the reason Andrew wanted offline segments,
> and why I have been talking about hierarchical storage.

The thing is, people with lots of data usually have good hierarchical
storage systems. Can we really do better?

>
> > A common scenario in the example
> > above might be to unload all but the current month's data and then load
> > the same month from the previous year. The unload step needs to be
> > costly (i.e., TRUNCATE). Then, there's no guarantee that what they're
> > interested in is the date range at all. They may want to compare user
> > agent (look at bot activity). In this case, the partitioning is across a
> > small list of strings (well, numbers most likely). Here, the partitions
> > would all have the same range. Partitioning would be useless.
>
> I take it you mean SE-based partitioning would be useless, but
> declarative partitioning would be useful? I would agree, assuming they
> run queries with a few of the small list of strings. Seems like a
> contrived case.

I don't think it's contrived. How about the use case else where in the
thread which is based on post code?

>
> > Some of the biggest I've seen are GIS information or network
> > information.
>
> Those are good examples of where a declarative approach would be the
> only way of getting partitioning to work well. So based on that I'll
> agree that some people *need* a declarative approach to get the benefits
> of partitioning, rather than just want.
>
> We do already have constraint exclusion. Can we get by with constraint
> and segment exclusion together?

The reason we are (Greenplum) looking at better partitioning is that
using the existing approach is too error prone, too different to other
vendors, too difficult to management and too hard to reconfigure.

>
> > Also, what about people with there own data types? It may
> > be that the data types do not support ordering so range partitioning
> > does not make sense.
>
> I'd like to hear some examples. Oleg?
>
> > There's also a cost involved with calculating all this in the dymanic
> > approach. Consider tables with a non-trivial number of columns.
>
> True, but that balances the cost of doing that at load time.

I agree, but I'll put more detail into the other email.

Thanks,

gavin

From:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-10 01:22:39
Message-ID:	20080110012238.GA2837@europa.idg.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Simon,

On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
> Segment Exclusion
> -----------------
>
> After we note that a segment is read-only we can scan the segment and
> record min/max values for all columns. These are then "implicit
> constraints", which can then be used for segment exclusion in a similar
> manner as we do with the current constraint exclusion mechanism.

What about columns which are varlen? The maximum value could be very long -- 1
GB in fact. Say we decide to not support those. Well, what about
numeric? What about user defined types?

> This would also allow a Merge Join to utilise exclusion for the inner
> plan at execution time. Instead of scanning the whole inner table plan
> from the start, we would be able to start the scan from the appropriate
> segment. This would require us to pass the current value of the outer
> plan down into the inner plan. The typical executor nodes on the inner
> plan would be a SortNode and below that a SeqScan. In that case the
> executor would need to pass the outer value from the MergeJoinNode down
> thru the SortNode to the SeqScan node. The SeqScan node could then
> perform partition exclusion, reducing the time for that scan and also
> reducing the time for the resulting sort. This sounds like it might be
> worth doing in normal cases also. It might turn out that the potentially
> applicable cases are already excluded during planning, I haven't thought
> about that aspect in enough detail yet.

I don't think that would work at all. Say you have have skew of data
across the partitions. You may end up doing a seq scan for every outer
tuple. It would look like a really expensive nested loop.

So, in that case, you've totally undercut the planner. All the steps up
the tree would be based on the stats for scanning the entire table but
you've only returned part of it.

To solve that, you could put the min/max values in a catalog somewhere.
For a 16 TB table, that's 16000 min/max values. That's pretty expensive.
And how do we handle making things non-read-only. You'd have to wait for
all current queries to finish... that could take a long time.

How do we handle crashes during setting of min/max? I presume it needs
WAL support.

What happens if the free space map gives us a free block in a read only
segment. Then we need to look at concurrency and transactionality of
min/max values don't we? If changing it makes it not read-only (which
seems to be the case) it would be trivial to totally degrade your
partitioning scheme to full seq scan. One a 16 TB table. And getting the
performance back might involve a day or more or VACUUMing. Impossible.

> Visibility Map
> --------------
[snip]
> No dynamic shared memory cache is required because any concurrent
> changes to the table would be ignored by a Scan anyway, so it doesn't
> matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
> new scans that start will attempt to lock the table and then perform a
> rel cache check before continuing. So the visibility will be set
> correctly for *that* scan at least.

Is that true in the presence of READ COMMITTED?

>
> In most cases the visibility map can be summarised as a single boolean
> to show whether any 100% visible segments exist. That makes accessing
> the map very cheap in the common, unset case.

What do we do in failure scenarios. Does this go into WAL? Is it
replicated via log shipping techniques. You mention Slony support below.
I don't know that Slony's target is Very Big Tables (up to 16 TB) but
even if it was being used for a 'small' system of a few hundred GB, when
you fail over the system has degraded if you aren't vacuuming it. More
over, the layout of data may be different and therefore the performance
of the segment exclusion. That's a bit of a surprise.

>
> Setting the Visibility Map
> --------------------------
>
> VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of
> them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to
> all current backends. Marking the map will cause a rel cache
> invalidation.

Argh. Remember, we're talking about 16 TB tables here. With the best of
tuning, that takes all day. So, day 1, load data; day 2 VACUUM it? With
the existing partitioning approach, there's no such thing.

>
> We would never mark the highest numbered segment
> EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur.
> This is also a useful way of excluding smaller tables from the overheads
> involved for this feature.

What about concurrent update or deletes -- during VACUUM. Surely we
don't have to vacuum full? Concurrent updates and deletes could change
the partition boundaries and that will impact on the accuracy of segment
exclusion (I do not think it leads to data corruption but it might lead
to the inclusion of a segment that shouldn't have been included). That
leads to unpredictable behaviour.

>
> In an insert-only table this would mean that only the last 1 or 2
> segments would be read write, so VACUUM would always run in a bounded
> time, no matter how large the table.

You said else where that unload was low cost. What about segments which
cross the unload boundary. Half of their data needs to be removed, say.
That sounds expensive. And, they'll have to be vacuumed.

[snip]

> Unsetting the visibility map
> ----------------------------
>

[snip]

With the existing partitioning, we could just truncate a partition. This
seems really expensive in comparison.

As I've noted elsewhere, you cannot just undermine the planner like
this. You're plans will be totally bogus. I hope you're not proposing
that. If you're proposing tracking information for each segment then I
there are problems there too.

I think you're making this out to be harder than it is. Dealing with
lots of data is difficult, issuing one DDL command is not such a big
deal.

Thanks,

Gavin

From:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-10 02:06:38
Message-ID:	20080110020638.GB2837@europa.idg.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Simon,

On Wed, Jan 09, 2008 at 03:08:08PM +0000, Simon Riggs wrote:
> Do people really "like" running all that DDL? There is significant
> manpower cost in implementing and maintaining a partitioning scheme,
> plus significant costs in getting it wrong.

Well... that's impossible for me to say. Most of the people who use lots
of data like using MDX (a large and quite complex reporting and
modelling language from MS, which is more complex than SQL IMHO) so...

>
> If people with large tables like partitioning why is Oracle moving
> towards automated partitioning in 11g? Automated partitioning was one of

Have you used Oracle's partitioning? I'm not surprised they're moving
away ;-).

More seriously, I didn't know that. Do you have a URL?

> the major goals for this next set of Postgres partitioning functionality
> also, whether or not we have declarative partitioning. My thinking is
> lets go for the best ideas and skip over the stuff that will be (is)
> obsolete before its left the design stage.

Well, you're assuming that because declarative partitioning has been around a
while, it's obselete. I don't think that's the case. Postgres has been
around about as long...

> I see many more systems in the Terabyte range now than I did 10 years
> ago, but I see about the same number of DBAs. We'll always need experts,
> but I feel we should be using our expertise to simplify the standard
> cases, not just maintain the same level of difficulty in managing them.
> One of the big benefits of the dynamic partitioning approach is that it
> needs no DDL. So it will work out-of-the-box, for anybody.

It's fine to say that, but people have already started talking about
grammar extensions for dynamic partitions. People want the tools to
manage it. There's difficulty in learning how and when to use the
different VACUUM options that have been discussed.

> Deleting older data would be optimised under the proposed scheme, so
> that's not really a problem. Loading data is actually slightly harder
> and slower with declarative partitioning (see below).
>
> Merging and splitting partitions are tools for fine tuning a very
> complex partitioning scheme. They do also allow a non-linear
> segmentation scheme, which might aid performance in some cases.

What about adding partitions into a set of partitions. Greg's post on
that was dismissed and it might not be representative but we have users
doing that all the time.

> One major advantage of the dynamic approach is that it can work on
> multiple dimensions simultaneously, which isn't possible with
> declarative partitioning. For example if you have a table of Orders then
> you will be able to benefit from Segment Exclusion on all of these
> columns, rather than just one of them: OrderId, OrderDate,
> RequiredByDate, LastModifiedDate. This will result in some "sloppiness"
> in the partitioning, e.g. if we fill 1 partition a day of Orders, then
> the OrderId and OrderData columns will start out perfectly arranged. Any
> particular RequiredByDate will probably be spread out over 7 partitions,
> but thats way better than being spread out over 365+ partitions.
>
> When we look at the data in the partition we can look at any number of
> columns. When we declaratively partition, you get only one connected set
> of columns, which is one of the the reasons you want multi-dimensional
> partitioning in the first place.
>
> > To this end, we (well, Jeff Cohen) looked at the syntax and semantics of
> > partitining in leading databases (Oracle, Informix, DB2) and came up
> > with a highly expressive grammar which takes the best of each I think
> > (I'll post details on the grammar in a seperate thread). The idea is
> > that range (for example, a date range), list (a list of distinct values)
> > and hash partitioning be supported on multiple columns. Partitions can
> > be added, merged, truncated. Partitions can reside on different
> > tablespaces. The existing issues with the rewriter, COPY support and the
> > like go away by smartening up the backend.
>
> > To explore the grammar and semantics Jeff and I (to a small extent) have
> > implemented the parsing of such a grammar in the backend. At the moment,
> > this just results in the generation of a bunch of rules (i.e., it
> > produces the current behaviour, as if someone had written rules
> > themselves) and debugging output.
> >
> > The fully fledged approach will see partitioning rules stored in
> > a new system catalog which can be interrogated by the planner or COPY to
> > determine which partition a given tuple belongs in or which partition(s)
> > a WHERE clause matches.
>
> When loading data we would need to examine each incoming tuple and
> distribute it to the correct place in the partitioned heap. That's a
> non-trivial overhead on data loading that I would like to avoid, since
> if we rely on the natural locality of loaded data then that overhead is
> zero.

Yes, but it's a lot faster than what we have at the moment and the
dynamic approach is not without its costs. I've pointed it out before,
but I will again: updating a key which occurs in all segments would mean
the whole thing needs to be vacuumed to get partitioning back. That's a
MASSIVE cost.

> I'd like to hear from anybody with some descriptions of use cases where
> the differences between what's been proposed and what is missing provide
> a significant gain. If nobody has ever run such tests or had those
> thoughts, lets examine them now to make sure we're doing the right
> thing. I don't want to copy what others have done without considering
> whether it is correct for us.
>

I don't think you're stating the question clearly enough. It's not just
a case of how many people hate typing out DDL commands. How many people
would be adversely affected by the difficulties I've pointed out. If you
update 100 rows in your big table, you're going to have to consider
VACUUMing it since your partition may now have 100 odd read-write tables
selected for scanning, not just 1 or 2.

>
> After thinking about what you've said ISTM that the declarative
> partition approach has these advantages over the dynamic approach
>
> - allows partitions of different sizes, for when a non-linear
> partitioning scheme gives considerable performance gain
>
> - allows partitioning to continue to work whilst UPDATEs continue within
> a particular partition

What about consistency over time? High levels of control for users if
they want it? Even simple things like tablespace support? High
performance unload and load performance much better than now?
Reliable and consistent query plans? Support for datatypes other than
those relating to time.

>
> and these disadvantages
>
> - only works with one set of columns, cannot exclude partitions by other
> columns
>
> - requires manual definition
>
> - requires partition names and numbers to be user visible

In fact the grammar we will propose does not make this mandatory. But
either way, how is that a disadvantage worth listing?

>
> - declarative partitioing can improve the situation, or make it worse,
> depending upon whether the definitions match the reality of the data

Which is the same as dynamic partitioning.

> - acquires table locks during DDL execution, to enforce validity of the
> rules, possibly for extended periods while data moves/is checked
>
> - requires loaded data to execute partitioning code to identify where it
> should insert itself

Sure, that's how it works.

> I would like to proceed in a way that allows dynamic partitioning to
> occur naturally as the default, with the option to add explicit
> partitioning, if required and possible at a later stage.

As the default? What about effects on people who do not need
partitioning?

> Technical Issues
> ----------------
>
> The main problems I see occur in the storage layer. If we can discuss
> what you think is needed at the storage layer then we might stand a
> chance of laying the foundations for both options. The catalog changes
> and executor changes are probably fairly similar for both.

I agree, there are major issues with the storage layer. I could list a
lot of things I'd like to see improved but they come down to taking
advantage of how storage works (reading 1 block is as good as reading
100), compression and much faster load.

>
> My first problem sounds quite mundane: expandability. If you define a
> partition manually and logically then it needs to cope with an arbitrary
> number of rows. That means that the view of a partition as a range of
> contiguous blocks disappears and is replaced by a more complex
> definition. When you turn partitioning on its head and say "What data is
> in this range of blocks?" you don't need to have expandability and the
> partition can be a fixed set of contiguous blocks. That works well with
> any shape of data, so merging/splitting operations are not required
> because we never have fat or thin partitions to cope with. You can also
> redefine partitioning simply by changing the segment size and
> re-VACUUMing the table.

And that's what I don't like about it. It makes VACUUM more expensive
and suddenly it becomes a data management tool. You said elsewhere that
load is faster but it's not. It's two times slower because you /must/
VACUUM after load.

> At the moment, smgr provides the executor with a single uniform view of
> a table as a single contiguous range of blocks. The current
> implementation of md.c allows a table to actually consist of a range of
> files, without the executor knowing. If we violate that, then we'd need
> to change all code that accepts a BlockNumber, i.e. indexes, buffer
> manager etc. So anything we do that is of any complexity must be below
> the smgr layer. Allowing expandable partitions eventually implies a
> non-contiguous range of blocks.
>
> In the worst case coping with a non-contiguous block range means you
> need to consult a block index for each block, which would be horrible.
> One better solution might be to have logical partitions broken up into
> physical partitions all of the same size. Various solutions possible,
> but most seem to detract from the thing which we were seeking in the
> first place and/or seem to restrict the range of block addresses
> available, unless we go for variable-size BlockNumbers also. Urgh.
>
> If anybody thinks we can get around this by putting a maximum size on
> partitions, then I'd say you're mad. The last thing we would want is a
> mechanism that stops you from adding new data if you get the partition
> definitions wrong, especially since my experience is that most people
> set up partitioning in a way that is proven sub-optimal after some time
> in real operation. The same comments would also apply to Oracle's Hash
> Cluster and DB2's Multi-Dimensional Clustering techniques. They're great
> performance features when declared correctly, but almost unusable in
> real applications that can't control what business and the real world
> will throw at them.
>
> So it seems likely that we'd be heading for a level of change within the
> storage layer that would not be easily accepted, given that not
> everybody needs partitioning in Postgres.

Originally, when Jeff and others at Greenplum were looking at better
partitioning they came up with a system quite similar to the one you're
proposing. There were two main reasons it was put to one side: it
undermines the planner and it doesn't give predictable results, the
performance changes and generally degrades over time..

I've talked about the latter else where so, to the former. If the
exclusion is executor driven, the planner cannot help but create a seq
scan plan. The planner will think you're returning 100X rows when really
you end up returning X rows. After that, all decisions made by the
planner are totally bogus. The solution means having the planner look at
those values for each partition so that it can generate an accurate
plan. But then we have locking issues: what if the partition range
changes underneath us?

Postgres wouldn't be the only database with this technology. Netezza
works in a similar way AND it degrades in the that I've said. You can
also assume that lots of obvious approaches to doing this are patented
by them, by the way.

>
> We're probably caught on the question of whether this is a general
> purpose or special purpose database system. Realistically, I have to set
> limits, so I'd have to describe my current goal of being able to manage
> 16TB easily and quickly with Postgres, without too much additional
> thought on the part of the DBA, when they have one or more naturally
> growing tables. Special purpose requirements for parallelism or larger
> databases can use another database system, e.g. GP, XDB, ParAccel etc.

Sure, but this still affects everyone. With the declarative approach,
those who want it, use it. With the dynamic approach, everyone
experiences the cost.

Thanks,

Gavin

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-10 07:25:00
Message-ID:	1199949900.4266.570.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> If the exclusion is executor driven, the planner cannot help but
> create a seq scan plan. The planner will think you're returning 100X
> rows when really you end up returning X rows. After that, all
> decisions made by the planner are totally bogus.

One of the most important queries on large tables is handling
WHERE Eventdate = CURRENT DATE;

We cannot perform partition exclusion using this type of WHERE clause at
planning time because the CURRENT DATE function is STABLE.

We can decide that some form of partition exclusion is possible, but the
actual partition we access can *only* be decided within the executor.

That necessarily effects the selectivity estimates. The planner can see
we want some data, but it can't tell which data, so it doesn't know
whether we will hit the day with the most data or the date with the
least data.

You've mentioned this a few times, as if its a problem with dynamic
partitioning, yet its clearly an issue for any form of partitioning.

So it seems clear that we need to make partition exclusion work at
executor time, whatever else we do.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-10 16:51:04
Message-ID:	1199983864.4266.823.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:

> > If people with large tables like partitioning why is Oracle moving
> > towards automated partitioning in 11g? Automated partitioning was one of
>
> Have you used Oracle's partitioning?

Since you ask, yep, certified on it, plus DB2, Teradata and Greenwich.

> What about consistency over time? High levels of control for users if
> they want it? Even simple things like tablespace support? High
> performance unload and load performance much better than now?
> Reliable and consistent query plans? Support for datatypes other than
> those relating to time.

Most of these comments seem like emotional appeals, so refuting them one
by one is just going to look and smell like an argument, and a fruitless
one as well since we agree on so many things.

So, I get the message that you really want the DDL approach and agree
that you've demonstrated there are use cases that need it that you are
interested in. That's fine by me as long as we can each work on parts of
it to get it done. Will it be possible for you to do that?

I feel I can say that because AFAICS we can in principle have both
dynamic and declarative techniques, even on the same table. Around half
of the mechanics would be identical anyway.

So from here I say lets work together to get the basic mechanics right.
My experience with PostgreSQL is that the team/list always comes up with
a better idea in the end and I'm sure we will this time too.

I have one main technical concern which is the storage model, which is
the source of the difference between the two types of partitioning we've
been discussing. I see difficulties originating there, so I will start
another thread to examine just those items in detail.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To:	Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc:	Chris Browne <cbbrowne(at)acm(dot)org>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-10 17:40:05
Message-ID:	47865875.3070100@cheapcomplexdevices.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

hris Browne wrote:
> _On The Other Hand_, there will be attributes that are *NOT* set in a
> more-or-less chronological order, and Segment Exclusion will be pretty
> useless for these attributes.

Short summary:

With the appropriate clustering, ISTM Segment Exclusion
can be useful on all columns in a table.

Cluster the table by "one-bit-of-column1 || one-bit-of-column-2"...
That way any "col2=value" query could exclude at least about
half the table regardless of if it's monotonically increasing
or even totally random.

It seems one could make Segment Exclusion even useful for
multiple unrelated columns in a table. Consider a large
table of people where one might want segment exclusion to
help with both first name, and last name.

One could cluster the table by "first-letter-of-last-name ||
first-letter-of-first-name". Then a query for last name Smith
could exclude all but the consecutive segments of S's; while
the query for first name John would only have to look in the
26 runs of segments with AJ, BJ, CJ, ...

Perhaps better - hash each column and interleave the bits
col1bit1, col2bit1, col3bit1, col1bit2, col2bit2, col3bit3
If I understand segment exclusion right, that way on any
table bigger than 8 segments any query of col1=val,
or col2=val or col3=val would scan at most half the table;
on a table bigger than 64 segments any such query would
scan at most 1/4 of the table.

Obviously this only benefits the rarely changing parts of
tables; and new and updated values would be in a very hot
segment at the end of the table that wouldn't be segment
excluded much.

From:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-10 20:43:48
Message-ID:	20080110204348.GQ6934@europa.idg.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jan 10, 2008 at 07:25:00AM +0000, Simon Riggs wrote:
> On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> > If the exclusion is executor driven, the planner cannot help but
> > create a seq scan plan. The planner will think you're returning 100X
> > rows when really you end up returning X rows. After that, all
> > decisions made by the planner are totally bogus.
>
> One of the most important queries on large tables is handling
> WHERE Eventdate = CURRENT DATE;

Really? Well, this isn't handled by the current partitioning approach
(i.e., constraint exclusion) so users have worked around it.

>
> We cannot perform partition exclusion using this type of WHERE clause at
> planning time because the CURRENT DATE function is STABLE.

We can do the exact same thing -- if it's a direction people want to
take. In fact, we can do it better/faster because once we've evaluated one
partition we know that there are no others to evaluate.

> So it seems clear that we need to make partition exclusion work at
> executor time, whatever else we do.

One example doesn't make the rule. Most people doing range partitioning
are going to be doing either specific dates or date ranges -- i.e.,
constants or things that can be folded to constants by the planner. At
least, that's what I've seen.

Thanks,

Gavin

From:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-10 20:49:03
Message-ID:	20080110204903.GR6934@europa.idg.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jan 10, 2008 at 04:51:04PM +0000, Simon Riggs wrote:
> On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
>
> > > If people with large tables like partitioning why is Oracle moving
> > > towards automated partitioning in 11g? Automated partitioning was one of
> >
> > Have you used Oracle's partitioning?
>
> Since you ask, yep, certified on it, plus DB2, Teradata and Greenwich.

It was just a joke.

>
> > What about consistency over time? High levels of control for users if
> > they want it? Even simple things like tablespace support? High
> > performance unload and load performance much better than now?
> > Reliable and consistent query plans? Support for datatypes other than
> > those relating to time.
>
> Most of these comments seem like emotional appeals, so refuting them one
> by one is just going to look and smell like an argument, and a fruitless
> one as well since we agree on so many things.

Sigh. You can call those questions what you like but your approach
doesn't deal with them and the current partitioning approach, for it's
problems, does deal with them.

> So, I get the message that you really want the DDL approach and agree
> that you've demonstrated there are use cases that need it that you are
> interested in. That's fine by me as long as we can each work on parts of
> it to get it done. Will it be possible for you to do that?

I assured you offlist that it was. This is something Greenplum needs
right now and I'm being paid to deliver it.

> I feel I can say that because AFAICS we can in principle have both
> dynamic and declarative techniques, even on the same table. Around half
> of the mechanics would be identical anyway.
>
> So from here I say lets work together to get the basic mechanics right.
> My experience with PostgreSQL is that the team/list always comes up with
> a better idea in the end and I'm sure we will this time too.

I agree. Scrutiny of ideas almost always leads to better ideas.

> I have one main technical concern which is the storage model, which is
> the source of the difference between the two types of partitioning we've
> been discussing. I see difficulties originating there, so I will start
> another thread to examine just those items in detail.

I look forward to it. As promised, I'll soon post some syntax of what we
have been working on.

Thanks,

Gavin

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-10 21:30:10
Message-ID:	1200000610.4266.876.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2008-01-10 at 21:43 +0100, Gavin Sherry wrote:
> On Thu, Jan 10, 2008 at 07:25:00AM +0000, Simon Riggs wrote:
> > On Thu, 2008-01-10 at 03:06 +0100, Gavin Sherry wrote:
> > > If the exclusion is executor driven, the planner cannot help but
> > > create a seq scan plan. The planner will think you're returning 100X
> > > rows when really you end up returning X rows. After that, all
> > > decisions made by the planner are totally bogus.
> >
> > One of the most important queries on large tables is handling
> > WHERE Eventdate = CURRENT DATE;
>
> Really? Well, this isn't handled by the current partitioning approach

Yes, exactly why we need to fix it and why I'm mentioning it here.

> > We cannot perform partition exclusion using this type of WHERE clause at
> > planning time because the CURRENT DATE function is STABLE.
>
> We can do the exact same thing -- if it's a direction people want to
> take. In fact, we can do it better/faster because once we've evaluated one
> partition we know that there are no others to evaluate.

Lost you completely here. I'm explaining to you that *nobody* can solve
those problems solely at planning time, by definition, so it has to be
done at execution time. I'm not saying anything about your way, my way.

> > So it seems clear that we need to make partition exclusion work at
> > executor time, whatever else we do.
>
> One example doesn't make the rule. Most people doing range partitioning
> are going to be doing either specific dates or date ranges -- i.e.,
> constants or things that can be folded to constants by the planner. At
> least, that's what I've seen.

It's not always true that planning time = execution time.

Using CURRENT DATE to access current day, week, month is common in many
applications, as is parameterised SQL for higher transaction rate apps.

We we can't re-write all the SQL, so we need to make it work.

I think if you demand a full function implementation of partitioning,
you'd better take into account *all* of the requirements.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-10 21:34:00
Message-ID:	1200000840.4266.879.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2008-01-10 at 21:49 +0100, Gavin Sherry wrote:

> > So, I get the message that you really want the DDL approach and agree
> > that you've demonstrated there are use cases that need it that you are
> > interested in. That's fine by me as long as we can each work on parts of
> > it to get it done. Will it be possible for you to do that?
>
> I assured you offlist that it was. This is something Greenplum needs
> right now and I'm being paid to deliver it.

Cool, I'll assist you where possible.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-11 01:28:53
Message-ID:	20080111012853.GS6934@europa.idg.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jan 10, 2008 at 09:30:10PM +0000, Simon Riggs wrote:
> > > We cannot perform partition exclusion using this type of WHERE clause at
> > > planning time because the CURRENT DATE function is STABLE.
> >
> > We can do the exact same thing -- if it's a direction people want to
> > take. In fact, we can do it better/faster because once we've evaluated one
> > partition we know that there are no others to evaluate.
>
> Lost you completely here. I'm explaining to you that *nobody* can solve
> those problems solely at planning time, by definition, so it has to be
> done at execution time. I'm not saying anything about your way, my way.

Sorry, I wasn't clear enough. I was trying to say, if we're going to do
something in the executor (for right or wrong) the declarative approach
can do it too. Since there will be partition bounding information
available, we can do partition selection in the executor (maybe the
planner should tell us to do it).

>
> > > So it seems clear that we need to make partition exclusion work at
> > > executor time, whatever else we do.
> >
> > One example doesn't make the rule. Most people doing range partitioning
> > are going to be doing either specific dates or date ranges -- i.e.,
> > constants or things that can be folded to constants by the planner. At
> > least, that's what I've seen.
>
> It's not always true that planning time = execution time.
>
> Using CURRENT DATE to access current day, week, month is common in many
> applications, as is parameterised SQL for higher transaction rate apps.
>
> We we can't re-write all the SQL, so we need to make it work.
>
> I think if you demand a full function implementation of partitioning,
> you'd better take into account *all* of the requirements.

Okay. As I said above, nothing in declarative partitioning rules out
partition selection with stable functions. So, we lets do it, assuming
everyone else thinks it is a good idea.

Thanks,

Gavin

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-11 08:07:18
Message-ID:	1200038838.4266.887.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-01-11 at 02:28 +0100, Gavin Sherry wrote:
> On Thu, Jan 10, 2008 at 09:30:10PM +0000, Simon Riggs wrote:
> > > > We cannot perform partition exclusion using this type of WHERE clause at
> > > > planning time because the CURRENT DATE function is STABLE.
> > >
> > > We can do the exact same thing -- if it's a direction people want to
> > > take. In fact, we can do it better/faster because once we've evaluated one
> > > partition we know that there are no others to evaluate.
> >
> > Lost you completely here. I'm explaining to you that *nobody* can solve
> > those problems solely at planning time, by definition, so it has to be
> > done at execution time. I'm not saying anything about your way, my way.
>
> Sorry, I wasn't clear enough. I was trying to say, if we're going to do
> something in the executor (for right or wrong) the declarative approach
> can do it too. Since there will be partition bounding information
> available, we can do partition selection in the executor (maybe the
> planner should tell us to do it).

Of course. It's an identical situation for both. Regrettably, none of
your comments about dynamic partitioning and planning were accurate as a
result.

> Okay. As I said above, nothing in declarative partitioning rules out
> partition selection with stable functions. So, we lets do it, assuming
> everyone else thinks it is a good idea.

If you check the archives this was long ago been identified as a
requirement. And I said exactly the things you said, BTW, when trying to
say it didn't matter.

I've kept a list of requests for improvement that I can share with you;
I've always been loathe to publish a list of bad points.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-11 09:25:14
Message-ID:	20080111092514.GT6934@europa.idg.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 11, 2008 at 08:07:18AM +0000, Simon Riggs wrote:
> On Fri, 2008-01-11 at 02:28 +0100, Gavin Sherry wrote:
> > On Thu, Jan 10, 2008 at 09:30:10PM +0000, Simon Riggs wrote:
> > > > > We cannot perform partition exclusion using this type of WHERE clause at
> > > > > planning time because the CURRENT DATE function is STABLE.
> > > >
> > > > We can do the exact same thing -- if it's a direction people want to
> > > > take. In fact, we can do it better/faster because once we've evaluated one
> > > > partition we know that there are no others to evaluate.
> > >
> > > Lost you completely here. I'm explaining to you that *nobody* can solve
> > > those problems solely at planning time, by definition, so it has to be
> > > done at execution time. I'm not saying anything about your way, my way.
> >
> > Sorry, I wasn't clear enough. I was trying to say, if we're going to do
> > something in the executor (for right or wrong) the declarative approach
> > can do it too. Since there will be partition bounding information
> > available, we can do partition selection in the executor (maybe the
> > planner should tell us to do it).
>
> Of course. It's an identical situation for both. Regrettably, none of
> your comments about dynamic partitioning and planning were accurate as a
> result.

That's not true. We will still have planning drive the partition
selection when the predicate is immutable, thus having more accurate
plans. Some but not all plans use current_date and similar.

>
> > Okay. As I said above, nothing in declarative partitioning rules out
> > partition selection with stable functions. So, we lets do it, assuming
> > everyone else thinks it is a good idea.
>
> If you check the archives this was long ago been identified as a
> requirement. And I said exactly the things you said, BTW, when trying to
> say it didn't matter.
>
> I've kept a list of requests for improvement that I can share with you;
> I've always been loathe to publish a list of bad points.

Why, please send them.

Thanks,

Gavin

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-11 09:42:59
Message-ID:	1200044579.4266.933.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gavin,

I've responded to the technical questions you raise here:

On Thu, 2008-01-10 at 02:22 +0100, Gavin Sherry wrote:

> > After we note that a segment is read-only we can scan the segment and
> > record min/max values for all columns. These are then "implicit
> > constraints", which can then be used for segment exclusion in a similar
> > manner as we do with the current constraint exclusion mechanism.
>
> What about columns which are varlen? The maximum value could be very long -- 1
> GB in fact. Say we decide to not support those. Well, what about
> numeric? What about user defined types?

No problem with any datatypes that I can see. Why would there be?

Very long values as boundary values would be a problem with any type of
partitioning, so some sensible heuristics would need to apply in any
case.

> > This would also allow a Merge Join to utilise exclusion for the inner
> > plan at execution time. Instead of scanning the whole inner table plan
> > from the start, we would be able to start the scan from the appropriate
> > segment. This would require us to pass the current value of the outer
> > plan down into the inner plan. The typical executor nodes on the inner
> > plan would be a SortNode and below that a SeqScan. In that case the
> > executor would need to pass the outer value from the MergeJoinNode down
> > thru the SortNode to the SeqScan node. The SeqScan node could then
> > perform partition exclusion, reducing the time for that scan and also
> > reducing the time for the resulting sort. This sounds like it might be
> > worth doing in normal cases also. It might turn out that the potentially
> > applicable cases are already excluded during planning, I haven't thought
> > about that aspect in enough detail yet.
>
> I don't think that would work at all. Say you have have skew of data
> across the partitions. You may end up doing a seq scan for every outer
> tuple. It would look like a really expensive nested loop.

No, you've misunderstood. I said start the plan, i.e. do this once, not
for every row. That would be ridiculous.

> > If we collect data for all columns then many of our implicit constraints
> > would be useless. e.g. if a column only has a few values and these are
> > present in all segments. Matching our predicate against all constraints
> > would be expensive, so we must weed out poor constraints. We would do
> > this by removing any constraint that overlapped more than 10% of other
> > segments. Various heuristics would likely need to be discovered during
> > development to make this work without resorting to manual commands.
> >
> > Note that all of this exclusion is happening in the executor, not the
> > planner. That allows this form of exclusion to work with stable
> > functions and parameters without problem.
>
> So, in that case, you've totally undercut the planner. All the steps up
> the tree would be based on the stats for scanning the entire table but
> you've only returned part of it.

Answered already. Not an issue solely for this particular approach.

> To solve that, you could put the min/max values in a catalog somewhere.
> For a 16 TB table, that's 16000 min/max values. That's pretty expensive.

We agreed the partitioning could and would be adjustable, so the number
need not be 16000.

The number of partitions we can support needs to be fairly high to
support a wide range of applications. Requests for 1-2000 partitions
seem fairly typical to me and we need to be able to handle that, however
we do partitioning.

> And how do we handle making things non-read-only. You'd have to wait for
> all current queries to finish... that could take a long time.

No, I explained that it would not require locking or waiting. The
existing queries would just not take advantage of the newly read-only
state until they complete.

> How do we handle crashes during setting of min/max? I presume it needs
> WAL support.

It's in a catalog table, as we've said. Handled.

> What happens if the free space map gives us a free block in a read only
> segment. Then we need to look at concurrency and transactionality of
> min/max values don't we?

I discussed that and explained how it would be handled.

> > Visibility Map
> > --------------
> [snip]
> > No dynamic shared memory cache is required because any concurrent
> > changes to the table would be ignored by a Scan anyway, so it doesn't
> > matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
> > new scans that start will attempt to lock the table and then perform a
> > rel cache check before continuing. So the visibility will be set
> > correctly for *that* scan at least.
>
> Is that true in the presence of READ COMMITTED?

Yes, because READ COMMITTED has nothing to do with it.

The SVM would be refreshed via the relcache invalidation mechanism,
which gets checked every time we obtain a lock we did not already hold.
In the case of a table with bits set in the SVM we would need to recheck
that each time we try to obtain a lock, even if we already hold it.

> > In most cases the visibility map can be summarised as a single boolean
> > to show whether any 100% visible segments exist. That makes accessing
> > the map very cheap in the common, unset case.
>
> What do we do in failure scenarios. Does this go into WAL?

Yes

> Is it
> replicated via log shipping techniques.

Yes

> > Setting the Visibility Map
> > --------------------------
> >
> > VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of
> > them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to
> > all current backends. Marking the map will cause a rel cache
> > invalidation.
>
> Argh. Remember, we're talking about 16 TB tables here. With the best of
> tuning, that takes all day. So, day 1, load data; day 2 VACUUM it? With
> the existing partitioning approach, there's no such thing.

I would differ with you there. The first query after a large COPY will
be forced to set the hint bits and re-write the table. If you just
loaded 16TB then you'd have to rewrite 16TB, which as you point out can
take a while. The VACUUM will kick in automatically following a large
load and either it or a query will perform the re-writing.

For a 16TB table with declared partitions, each one of those tuples
would need to have their partition calculated prior to insertion. I
think the cost of post-assessing the partition boundaries is the same or
cheaper than the cost of pre-assessing the partition location.

> > We would never mark the highest numbered segment
> > EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur.
> > This is also a useful way of excluding smaller tables from the overheads
> > involved for this feature.
>
> What about concurrent update or deletes -- during VACUUM. Surely we
> don't have to vacuum full? Concurrent updates and deletes could change
> the partition boundaries and that will impact on the accuracy of segment
> exclusion (I do not think it leads to data corruption but it might lead
> to the inclusion of a segment that shouldn't have been included). That
> leads to unpredictable behaviour.

The VACUUM would need to set a read-only pending flag. After the VACUUM
has finished any update or delete would clear the flag. As you say its
possible that concurrent updates and deletes could have changed rows in
blocks behind the first VACUUM scan. After the VACUUM we re-scan a
read-only pending segment to derive boundary values. During this scan,
if we find a change made by a transaction after the VACUUM's xmin then
we would clear the flag. So concurrent changes of any kind would be
caught and we don't need locks or VACUUM FULL.

> > In an insert-only table this would mean that only the last 1 or 2
> > segments would be read write, so VACUUM would always run in a bounded
> > time, no matter how large the table.
>
> You said else where that unload was low cost. What about segments which
> cross the unload boundary. Half of their data needs to be removed, say.
> That sounds expensive. And, they'll have to be vacuumed.

Well, yes, but if you wanted this to be properly optimised you would
need to delete using the boundary values of the partition, so its
effectively equivalent to deleting a declared partition.

> With the existing partitioning, we could just truncate a partition. This
> seems really expensive in comparison.

I explained it would be possible to do that if it was required.

> > There would be additional complexity in selectivity estimation and plan
> > costing. The above proposal allows dynamic segment exclusion, which
> > cannot be assessed at planning time anyway, so suggestions welcome...
>
> As I've noted elsewhere, you cannot just undermine the planner like
> this. You're plans will be totally bogus. I hope you're not proposing
> that. If you're proposing tracking information for each segment then I
> there are problems there too.

Discussed already, not an issue related solely to dynamic partitioning.

I agree, that's not really the hard part.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-11 11:49:50
Message-ID:	1200052190.4266.1018.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-01-11 at 10:25 +0100, Gavin Sherry wrote:
> >
> > Of course. It's an identical situation for both. Regrettably, none of
> > your comments about dynamic partitioning and planning were accurate as a
> > result.
>
> That's not true. We will still have planning drive the partition
> selection when the predicate is immutable, thus having more accurate
> plans.

Not really.

The planner already evaluates stable functions at plan time to estimate
selectivity against statistics. It can do the same here.

The boundary values can't be completely trusted at plan time because
they are dynamic, but they're at least as accurate as ANALYZE statistics
(and probably derived at identical times), so can be used as estimates.
So I don't see any reason to imagine the plans will be badly adrift.

We're back to saying that if the visibility map is volatile, then SE
won't help you much. I agree with that and haven't argued otherwise.
Does saying it make us throw away SE? No, at least, not yet and not for
that reason.

SE does what I was looking for it to do, but doesn't do all of what
you'd like to achieve with partitioning, because we're looking at
different use cases. I'm sure you'd agree that all large databases are
not the same and that they can have very different requirements. I'd
characterise our recent positions on this that I've been focused on
archival requirements, whereas you've been focused on data warehousing.
The difference really lies in how much activity and of what kind occurs
on a table. You don't unload and reload archives regularly, nor do you
perform random updates against the whole table. I'm sure we'd quickly
agree that many of the challenges you've seen recently at Greenplum
would not be possible with core Postgres, so I'm not really trying too
hard to compete.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-11 12:40:21
Message-ID:	1200055221.4266.1048.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-01-07 at 12:53 -0800, Ron Mayer wrote:

> Is my understanding right that these Segment Visibility Maps could
> help this case as well?

Yes, I think so.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	"Zeugswetter Andreas ADI SD" <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-11 16:31:14
Message-ID:	E1539E0ED7043848906A8FF995BDA57902B622EA@m0143.s-mxs.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> I've kept a list of requests for improvement that I can share with
you;
> I've always been loathe to publish a list of bad points.

I think it would help understand the proposal if you also present the
shortcomings.
When you presented the positive and negative points, the negative list
did look intentionally short :-)
This imho provokes negative replies, the average hackers that reply are
no dummies eighter.

Some of the issues I see, with the proposal made so far:
- segment boundaries will imho sometimes be too fuzzy for a reasonably
short segment describing where clause
- fault isolation
- huge global indexes (index create/reorg needs to repeatedly sort data
that is long static)
- unpredictability
- when you delete large parts of the table you want that to be
instantaneous and cause little work on the db side
- partitioning that also partitions the indexes, this is imho a must
have for huge non static tables
- when the user tweaks segment size this is already declarative
- some indexes only needed for recent data (a partial index would need
to slide == recreate)

The whole subject is imho very interesting, and I expect more feedback
after 8.3 is out.

I am also in the declarative camp. In my projects the partitioning is
the developer/designer's
responsibility, and thus all add/drop partition tasks are automated, no
dba.
Needless to say, all existing declarative implementations lack some of
the essential features on the implementation side, e.g.:
- use the btree order of partitions in plans that need more than one
partition
- a useful syntax that allows automatic creation/exclusion of partitions
(e.g. each month of a timestamp in one partition)
e.g. partition 'hugetable_'||to_char(ts, 'YYYYMM') with
extend(ts, year to month)
- unique indexes, equally partitioned like table, that don't contain the
partitioning column[s]
- some lack expression based
- some lack instantaneous attach using a prefilled preindexed table
- some lack detach
- some need separate tablespace per partition

Andreas

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Zeugswetter Andreas ADI SD <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>
Cc:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-11 17:31:30
Message-ID:	1200072690.4266.1191.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-01-11 at 17:31 +0100, Zeugswetter Andreas ADI SD wrote:
> > I've kept a list of requests for improvement that I can share with
> you;
> > I've always been loathe to publish a list of bad points.

The "list of bad points" doesn't refer to shortcomings in my proposal,
which I would never hide. It refers to a list of general requirements
for improvements to partitioning that I have accumulated over a number
of years and will publish as soon as I've retyped it. It's not a secret,
it all comes from things discussed on list at various points.

> I think it would help understand the proposal if you also present the
> shortcomings.
> When you presented the positive and negative points, the negative list
> did look intentionally short :-)

If I am aware of a downside in any proposal, I always identify it.
That's why the damn things are so long.

I design many things, but only post some of them. The proposals I do
post always have more positive than negative posts, otherwise I wouldn't
waste anybody's time.

I regularly get comments my proposal style. First, my proposals are too
long and they need a summary. Next that I don't explain myself enough
and need to stop leaving gaps that have people assume I've not thought
it through thoroughly. So I try to put the summary in for some people
and the detail for others.

For this proposal I identified the target use case at the very top of
the proposal, and IMHO it does meet the needs of that very well, as many
people have agreed. It doesn't do everything that everybody wants
because the list is very long indeed, probably man years of effort, all
things considered. We won't cross that gap in a single step.

I'm working on a revised version which will include all of the comments
and points that people have made, about 20 different topics in all from
my notes. In progress here:
http://developer.postgresql.org/index.php/Segment_Exclusion

> This imho provokes negative replies, the average hackers that reply are
> no dummies eighter.

I post to -hackers because I know there's no dummies here.

If I've provoked negative replies, I'm happy to apologise to all.

Thanks for your candid thoughts,

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-11 19:03:41
Message-ID:	20080111190341.GU6934@europa.idg.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 11, 2008 at 11:49:50AM +0000, Simon Riggs wrote:
> On Fri, 2008-01-11 at 10:25 +0100, Gavin Sherry wrote:
> > >
> > > Of course. It's an identical situation for both. Regrettably, none of
> > > your comments about dynamic partitioning and planning were accurate as a
> > > result.
> >
> > That's not true. We will still have planning drive the partition
> > selection when the predicate is immutable, thus having more accurate
> > plans.
>
> Not really.
>
> The planner already evaluates stable functions at plan time to estimate
> selectivity against statistics. It can do the same here.
>
> The boundary values can't be completely trusted at plan time because
> they are dynamic, but they're at least as accurate as ANALYZE statistics
> (and probably derived at identical times), so can be used as estimates.
> So I don't see any reason to imagine the plans will be badly adrift.

Okay, it's good that you want the planner to look at those. Did you
consider the point I made about the sheer amount of data the planner
would have to consider for large cases?

> We're back to saying that if the visibility map is volatile, then SE
> won't help you much. I agree with that and haven't argued otherwise.
> Does saying it make us throw away SE? No, at least, not yet and not for
> that reason.

Yes, I'm not against SE I just think that only having it would see a
serious regression for larger user. Personally, I think SE would be a
great idea for append only tables since it removes the thing I'm most
worried about with it: the need to vacuum to 'turn it on'.

>
> SE does what I was looking for it to do, but doesn't do all of what
> you'd like to achieve with partitioning, because we're looking at
> different use cases. I'm sure you'd agree that all large databases are
> not the same and that they can have very different requirements. I'd
> characterise our recent positions on this that I've been focused on
> archival requirements, whereas you've been focused on data warehousing.

I think that sums it up, although I'd also say that declarative
partitioning is suitable for all those with largish amounts of data
which they know how they want stored. This points to another case that
SE suits: those who don't know how or (maybe more importantly) don't
care to manage their data.

I'll go back to what I said above. SE looks like a good performance
boost for archival read only data. If we tighten up the definitions of
how some tables can be used -- append only -- then we can remove the
vacuum requirement and also change other characteristics of the storage.
For example, reduced visibilty information, compression, etc. These are
hot topics for people with that kind of data.

> The difference really lies in how much activity and of what kind occurs
> on a table. You don't unload and reload archives regularly, nor do you

I couldn't agree more.

> perform random updates against the whole table. I'm sure we'd quickly
> agree that many of the challenges you've seen recently at Greenplum
> would not be possible with core Postgres, so I'm not really trying too
> hard to compete.

There we diverge. Yes, Greenplum produces systems for very large amounts
of data, peta-byte range in fact. However, the architecture --
individual post-masters on a single CPU with their own storage -- means
that we see the problems of non-distributed database users when we look
at data at the node level. This is why I say that VACUUMing such
systems, under the SE model, after you do a load of data is just
impossible (I know that after time the cost of vacuum will stabilise but
there's always the initial data load).

Thanks,

Gavin

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-11 19:32:24
Message-ID:	1200079944.4266.1227.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2008-01-11 at 20:03 +0100, Gavin Sherry wrote:

> Okay, it's good that you want the planner to look at those. Did you
> consider the point I made about the sheer amount of data the planner
> would have to consider for large cases?

Sorry, thought I had somewhere in all those emails...

If you really do have 16TB of data and its in the use case of mostly
read only, volatile in last portion of table, then it would be
straightforward to increase segment size.

Whatever form of partitioning we go for we do need to allow 1-2,000
partitions fairly easily. We can't sustain a sequential method of
applying the rules, which gives O(n) behaviour. We need some form of
indexing/tree search mechanism that gives roughly O(logN) behaviour.

> > We're back to saying that if the visibility map is volatile, then SE
> > won't help you much. I agree with that and haven't argued otherwise.
> > Does saying it make us throw away SE? No, at least, not yet and not
> for
> > that reason.
>
> Yes, I'm not against SE I just think that only having it would see a
> serious regression for larger user.

If we do find regressions, then I'd suggest we look at ways of turning
it on/off automatically. We can keep track of the volatility in the map.
We can also have an off switch if you're really against it. But I'm
still skeptical.

I think we can sustain the last few segments in a table being volatile,
as long as the greater proportion of segments are not.

The volatile part of the table is the subject of most of the queries
anyway, so excluding it from seq scans isn't that important. Index
access to historical data doesn't slow down whether or not you have a
volatile map.

> Personally, I think SE would be a
> great idea for append only tables since it removes the thing I'm most
> worried about with it: the need to vacuum to 'turn it on'.

You do need to VACUUM anyway, so thats no problem. Sure you have to scan
it to derive the boundary values, but thats one scan that saves many in
the future.

> I'll go back to what I said above. SE looks like a good performance
> boost for archival read only data. If we tighten up the definitions of
> how some tables can be used -- append only -- then we can remove the
> vacuum requirement and also change other characteristics of the
> storage.
> For example, reduced visibilty information, compression, etc. These
> are
> hot topics for people with that kind of data.

SE isn't aimed at solely INSERT-only data. It's much wider than that.

An ERP/SCM type application would easily benefit, say where orders are
received and processed within a relatively short period, but order data
needs to be maintained online for 2+ years. Anything where the table
grows either by date, or by increasing key, that has a read-only "tail".
That's a lot of applications and a ton of data. It might not apply to
the "Customer" table in that app, but it will apply to Orders, OrderItem
etc.

We can compress older read-only data as a way of shrinking file size.
That's way easier, plus it applies to more real-world cases than trying
to remove all the visibility stuff. The Insert-only people will still be
able to take advantage of it compression, but the more general purpose
people will never be able to make use of the visibility removal code.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	August Zajonc <augustz(at)augustz(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Dynamic Partitioning using Segment Visibility Maps
Date:	2008-01-12 01:53:37
Message-ID:	47881DA1.3040404@augustz.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> Happy New Year, everybody.
>
> This proposal follows on from previous thinking about partitioning,
> where I've taken up Andrew Sullivan's suggestion to re-examine the
> current partitioning concept of using tables as partitions. So I've come
> up with an alternative concept to allow us to discuss the particular
> merits of each. ISTM that this new proposal has considerable potential.

I've been lurking and reading a huge number of threads on partitioning.
I see that postgresql is likely to give the user lots of knobs to define
partitions very flexibly, which is a good thing for things like sales
region reports etc.

All that to say, I hope some form of this very automatic tunning makes
it in. This automatic option would provide a benefit (not perfect but
improved) for a significant set of use cases. Even better, it is trivial
to setup, though I would want a knob for the underlying partition sizes,
1GB feels a bit too big for some situations.

Even expensive databases have found I think that there is a cost to
administrative complexity. In many cases someone may not be ready to go
down the declarative path, but be open to allowing the system take some
optimizing approaches. What I especially like is the ability to later
mark things read only. Perhaps a consultant who checks in on things
monthly might mark partitions in that form.

Good luck though with it all, great to see this.

- August