Re: Proposed adjustments in MaxTupleSize and toastthresholds

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Proposed adjustments in MaxTupleSize and toast thresholds
Date: 2007-02-02 20:11:55
Message-ID: 23594.1170447115@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've been looking into Pavan Deolasee's recent discovery that when
storing a maximum-length toast tuple, heap_insert uselessly recurses
to toast_insert_or_update, wasting a nontrivial number of cycles.
It turns out there are several interrelated mistakes here, which are
wasting space as well as cycles.

First off, as to the exact nature of what's happening: the toast code
is designed so that when breaking down a large datum, it's divided into
rows with data payloads of exactly TOAST_MAX_CHUNK_SIZE bytes each.
On a 4-byte-MAXALIGN machine, this means the rows have total t_len of
exactly TOAST_TUPLE_THRESHOLD, which is what was intended. However,
that value is not a multiple of 4. Hence when heapam.c compares
MAXALIGN(tup->t_len) > TOAST_TUPLE_THRESHOLD
it decides the tuple needs re-toasting.

I noted before that this does not happen on an 8-byte-MAXALIGN machine,
but had not understood exactly why. The reason is the outer MAXALIGN
call in the definition

#define TOAST_MAX_CHUNK_SIZE (TOAST_TUPLE_THRESHOLD - \
MAXALIGN( \
MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + \
sizeof(Oid) + \
sizeof(int32) + \
VARHDRSZ))

On a 4-byte machine that call doesn't do anything, but on an 8-byte
machine it causes the value of TOAST_MAX_CHUNK_SIZE to be reduced by
4, which means that t_len of a toast row comes out 4 bytes smaller
than on a 4-byte machine, which makes it smaller than
TOAST_TUPLE_THRESHOLD even after maxalign'ing. Hence no recursion.

That MAXALIGN is actually *wrong* now that I look at it: it's
effectively supposing that there is padding alignment after the varlena
length word for the chunk data, which of course there is not. But we
can't change TOAST_MAX_CHUNK_SIZE without forcing an initdb.

Instead we can fix the recursion by removing the MAXALIGN() operations in
heapam.c and tuptoaster.c that compare tuple lengths to the thresholds.
This effectively moves the threshold for tuple compression up a couple
bytes, which is a safe change to make, and makes the comparisons
slightly cheaper to boot. I propose doing that in 8.2 (and maybe older
branches after we get a bit more testing of it).

But the real problem is that we've got sloppy choices of the thresholds
and sizes. In the first place, TOAST_MAX_CHUNK_SIZE is being set at a
value that makes every toast row have two wasted padding bytes after it
(turns out it's the same on both 4- and 8-byte machines, though the
specific size of the rows differs). This is silly, we should be using a
TOAST_MAX_CHUNK_SIZE that makes the actual row length come out at
exactly a MAXALIGN multiple. In the second place, examination of toast
tables will show you that on a page with four maximum-length toast rows,
there are 12 free bytes on a 4-byte machine and 28 free on an 8-byte
machine (not counting the aforementioned padding bytes after each row).
That's fine at first glance; because of alignment considerations it's
actually the best we can do. The trouble is that TOAST_TUPLE_THRESHOLD
is derived from MaxTupleSize, which is derived on the assumption that we
should leave 32 bytes for "special space" on heap pages. If we actually
had such special space, it wouldn't fit. This happens because the
threshold calculation is just

#define TOAST_TUPLE_THRESHOLD (MaxTupleSize / 4)

which fails to account for the "line pointers" needed for all but the
first tuple. These errors cancel out at the moment, but wouldn't if we
changed anything about the page header or special space layout.

What I suggest we do about this in HEAD is:

1. Rename MaxTupleSize to MaxHeapTupleSize, and get rid of the
MaxSpecialSpace allotment in its calculation. We don't use special
space on heap pages and we shouldn't be artificially restricting tuple
length to allow for something that's unlikely to appear in the future.
(Note: yes, I know it's been suggested to keep free-space maps in some
heap pages, but that need not factor into a MaxHeapTupleSize limit: big
tuples can simply go into a page without any free-space map.)

2. Fix TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET to be correctly
calculated (properly allowing for line pointers) and to be MAXALIGN
multiples. The threshold value should be exactly the size of the
largest tuple that you can put four of onto one page. Fix
TOAST_MAX_CHUNK_SIZE so that it is *not* necessarily a MAXALIGN
multiple, but rather causes the total length of a toast tuple to come
out that way. This guarantees minimum space wastage on toast pages.

This will force initdb due to changing chunk sizes in toast tables, but
unless we're going to reject Heikki's patch to merge cmin/cmax, there is
no hope of an in-place upgrade for 8.3 anyway.

BTW, while I was looking at this I noticed that BTMaxItemSize is
incorrectly calculated as well: it's coming out a couple bytes smaller
than it could safely be. And with a different page header size it
could come out a couple bytes larger, instead :-(. Perhaps this is
related to Heikki's recent observation that there always seemed to be
some extra free space on btree pages? I think the correct calculation
would be

#define BTMaxItemSize(page) \
MAXALIGN_DOWN((PageGetPageSize(page) - \
MAXALIGN(sizeof(PageHeaderData) + 2 * sizeof(ItemIdData)) -
MAXALIGN(sizeof(BTPageOpaqueData))) / 3)

where MAXALIGN_DOWN rounds down to the nearest maxalign multiple,
instead of up.

Comments?

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Proposed adjustments in MaxTupleSize and toastthresholds
Date: 2007-02-05 16:45:36
Message-ID: 1170693936.3645.400.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2007-02-02 at 15:11 -0500, Tom Lane wrote:
> 2. Fix TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET to be correctly
> calculated (properly allowing for line pointers) and to be MAXALIGN
> multiples. The threshold value should be exactly the size of the
> largest tuple that you can put four of onto one page. Fix
> TOAST_MAX_CHUNK_SIZE so that it is *not* necessarily a MAXALIGN
> multiple, but rather causes the total length of a toast tuple to come
> out that way. This guarantees minimum space wastage on toast pages.

Jan suggested to me a while back that having a configurable toast
threshold would be a useful thing, when that table is also updated
reasonably frequently. ISTM like a good idea, so a prototype has been
written - nothing to do with Pavan's comments though. As you might
expect, it does help in cases where we would otherwise produce lots of
UPDATEd versions of a 1000 byte row, as well as on MIS queries that
often don't pay much attention to text strings.

This then allows the user some control over how much data gets toasted
out of the main row. Many applications have long text fields of 100s of
characters, for example a customer's stated, cleaned and previous
addresses might together be VARCHAR(750), yet we may also want to UPDATE
them regularly to store their current_balance.

TOAST_MAX_CHUNK_SIZE can be fixed, though TOAST_TUPLE_THRESHOLD and
TOAST_TUPLE_TARGET could be settable for a table using a WITH parameter.
It would seem like overkill to allow the threshold and target to differ
when setting the parameter.

If configurable, only MAXALIGNed values would be allowed.

Sounds like a good time to suggest making these values configurable,
within certain reasonable bounds to avoid bad behaviour.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Proposed adjustments in MaxTupleSize and toastthresholds
Date: 2007-02-05 16:52:50
Message-ID: 7501.1170694370@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> Sounds like a good time to suggest making these values configurable,
> within certain reasonable bounds to avoid bad behaviour.

Actually, given what we've just learned --- namely that choosing these
values at random is a bad idea --- I'd want to see a whole lot of
positive evidence before adding such a configuration knob.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Proposed adjustments in MaxTupleSize andtoastthresholds
Date: 2007-02-05 17:34:37
Message-ID: 1170696877.3645.424.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-02-05 at 11:52 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > Sounds like a good time to suggest making these values configurable,
> > within certain reasonable bounds to avoid bad behaviour.
>
> Actually, given what we've just learned --- namely that choosing these
> values at random is a bad idea --- I'd want to see a whole lot of
> positive evidence before adding such a configuration knob.

Sure. My understanding of the process we'd like to follow on this sort
of thing is:

1. make proposal, test for unseen negative effects or basic rejections
2. code performance prototype
3. assemble performance evidence
4. debate utility
5. complete coding
6. further review

Step 3 is always there for performance work, so even if you don't
mention it, I'll assume everybody wants to see that as soon as possible
before we progress.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgreSQL(dot)org, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Proposed adjustments in MaxTupleSize and toastthresholds
Date: 2007-02-06 00:18:26
Message-ID: 45C7C952.50607@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2/5/2007 11:52 AM, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>> Sounds like a good time to suggest making these values configurable,
>> within certain reasonable bounds to avoid bad behaviour.
>
> Actually, given what we've just learned --- namely that choosing these
> values at random is a bad idea --- I'd want to see a whole lot of
> positive evidence before adding such a configuration knob.

Some of the evidence is TOAST itself. Every time you do not SET a column
that has been toasted into external storage during an UPDATE, you win
because the columns data isn't read during the scan for the row to
update, it isn't read during heap_update(), it isn't actually updated at
all (the toast reference is copied as is and the external value reused),
and not a single byte of the external data is bloating WAL. If someone
knows that 99% of their updates will not hit certain text columns in
their tables, actually forcing them to be compressed no matter what and
to be stored external if they exceed 100 bytes will be a win.

Of course, this is a bit different from Simon's approach. What I
describe here is a per pg_attribute configuration to enforce a certain
new toaster behavior. Since we already have something that gives the
toaster a per column cluestick (like not to bother trying to compress),
it might be much easier to implement then Simon's proposal. It would
require that the toaster goes over the initial heap tuple for those
specially configured columns even if the tuple is below the toast
threshold, which suggests that a pg_class.relhasspecialtoastneeds could
be useful. But I think as for fine tuning capabilities, a column
insensitive maximum tuple size is insufficient anyway.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, teramoto(dot)junji(at)lab(dot)ntt(dot)co(dot)jp
Subject: Re: Proposed adjustments in MaxTupleSize andtoastthresholds
Date: 2007-02-06 03:10:31
Message-ID: 20070206115336.5FB1.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> wrote:

> > Actually, given what we've just learned --- namely that choosing these
> > values at random is a bad idea --- I'd want to see a whole lot of
> > positive evidence before adding such a configuration knob.
>
> 3. assemble performance evidence
>
> Step 3 is always there for performance work, so even if you don't
> mention it, I'll assume everybody wants to see that as soon as possible
> before we progress.

There was a performance evidence using TOAST in order to partial updates.
It added a flag of force toasting. The toast threshold suggested now is
more flexible than it, but I think it is one of the evidences.

Vertical Partitioning with TOAST
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00013.php

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgreSQL(dot)org>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Proposed adjustments in MaxTupleSize and toastthresholds
Date: 2007-02-06 03:45:19
Message-ID: 77125378-1BE6-46DF-BFC9-F3399B7CBDA2@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Feb 5, 2007, at 10:45 AM, Simon Riggs wrote:
> Jan suggested to me a while back that having a configurable toast
> threshold would be a useful thing, when that table is also updated
> reasonably frequently.

While we're in there it probably makes sense to allow a configurable
value for when to compress as well.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Jan Wieck" <JanWieck(at)Yahoo(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgreSQL(dot)org>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Proposed adjustments in MaxTupleSize andtoastthresholds
Date: 2007-02-06 10:05:53
Message-ID: 1170756353.3645.464.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-02-05 at 19:18 -0500, Jan Wieck wrote:
> On 2/5/2007 11:52 AM, Tom Lane wrote:
> > "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> >> Sounds like a good time to suggest making these values configurable,
> >> within certain reasonable bounds to avoid bad behaviour.
> >
> > Actually, given what we've just learned --- namely that choosing these
> > values at random is a bad idea --- I'd want to see a whole lot of
> > positive evidence before adding such a configuration knob.
>
> Some of the evidence is TOAST itself. Every time you do not SET a column
> that has been toasted into external storage during an UPDATE, you win
> because the columns data isn't read during the scan for the row to
> update, it isn't read during heap_update(), it isn't actually updated at
> all (the toast reference is copied as is and the external value reused),
> and not a single byte of the external data is bloating WAL. If someone
> knows that 99% of their updates will not hit certain text columns in
> their tables, actually forcing them to be compressed no matter what and
> to be stored external if they exceed 100 bytes will be a win.

Yes, thats the main use case.

> Of course, this is a bit different from Simon's approach. What I
> describe here is a per pg_attribute configuration to enforce a certain
> new toaster behavior. Since we already have something that gives the
> toaster a per column cluestick (like not to bother trying to compress),
> it might be much easier to implement then Simon's proposal. It would
> require that the toaster goes over the initial heap tuple for those
> specially configured columns even if the tuple is below the toast
> threshold, which suggests that a pg_class.relhasspecialtoastneeds could
> be useful. But I think as for fine tuning capabilities, a column
> insensitive maximum tuple size is insufficient anyway.

Well, sounds like we both want the same thing. The only discussion seems
to be about user interface.

Setting it per column is much better for very fine tuning, but setting
them in isolation doesn't help decide what to do when you have lots of
medium length strings where the sum exceeds the toast target.

IMHO it would be better to have an col-level "storage priority" (default
0) and then an table-level settable toast target. So we start applying
the storage handling mechanisms on the highest priority columns and keep
going in descending order until we are under the limit for the table.

ALTER TABLE foo
ALTER COLUMN foocol
SET STORAGE EXTERNAL PRIORITY 5
WITH
(toast_target = 400); /* must be MAXALIGNed value */

Equal priorities are allowed, in which case lowest attribute id wins,
i.e. current behaviour remains the default.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <teramoto(dot)junji(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: Proposed adjustments in MaxTupleSizeandtoastthresholds
Date: 2007-02-06 10:14:45
Message-ID: 1170756885.3645.469.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-02-06 at 12:10 +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> wrote:
>
> > > Actually, given what we've just learned --- namely that choosing these
> > > values at random is a bad idea --- I'd want to see a whole lot of
> > > positive evidence before adding such a configuration knob.
> >
> > 3. assemble performance evidence
> >
> > Step 3 is always there for performance work, so even if you don't
> > mention it, I'll assume everybody wants to see that as soon as possible
> > before we progress.
>
> There was a performance evidence using TOAST in order to partial updates.
> It added a flag of force toasting. The toast threshold suggested now is
> more flexible than it, but I think it is one of the evidences.
>
> Vertical Partitioning with TOAST
> http://archives.postgresql.org/pgsql-hackers/2005-12/msg00013.php
>

Apologies to Junji-san. I'd thought my idea was original, but it seems
we think along similar lines.

That is the kind of performance gain I see possible.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Proposed adjustments in MaxTupleSize andtoastthresholds
Date: 2007-02-19 21:08:16
Message-ID: 200702192108.l1JL8Gi14704@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Added to TODO:

* Consider allowing configuration of TOAST thresholds

http://archives.postgresql.org/pgsql-hackers/2007-02/msg00213.php

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Mon, 2007-02-05 at 19:18 -0500, Jan Wieck wrote:
> > On 2/5/2007 11:52 AM, Tom Lane wrote:
> > > "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > >> Sounds like a good time to suggest making these values configurable,
> > >> within certain reasonable bounds to avoid bad behaviour.
> > >
> > > Actually, given what we've just learned --- namely that choosing these
> > > values at random is a bad idea --- I'd want to see a whole lot of
> > > positive evidence before adding such a configuration knob.
> >
> > Some of the evidence is TOAST itself. Every time you do not SET a column
> > that has been toasted into external storage during an UPDATE, you win
> > because the columns data isn't read during the scan for the row to
> > update, it isn't read during heap_update(), it isn't actually updated at
> > all (the toast reference is copied as is and the external value reused),
> > and not a single byte of the external data is bloating WAL. If someone
> > knows that 99% of their updates will not hit certain text columns in
> > their tables, actually forcing them to be compressed no matter what and
> > to be stored external if they exceed 100 bytes will be a win.
>
> Yes, thats the main use case.
>
> > Of course, this is a bit different from Simon's approach. What I
> > describe here is a per pg_attribute configuration to enforce a certain
> > new toaster behavior. Since we already have something that gives the
> > toaster a per column cluestick (like not to bother trying to compress),
> > it might be much easier to implement then Simon's proposal. It would
> > require that the toaster goes over the initial heap tuple for those
> > specially configured columns even if the tuple is below the toast
> > threshold, which suggests that a pg_class.relhasspecialtoastneeds could
> > be useful. But I think as for fine tuning capabilities, a column
> > insensitive maximum tuple size is insufficient anyway.
>
> Well, sounds like we both want the same thing. The only discussion seems
> to be about user interface.
>
> Setting it per column is much better for very fine tuning, but setting
> them in isolation doesn't help decide what to do when you have lots of
> medium length strings where the sum exceeds the toast target.
>
> IMHO it would be better to have an col-level "storage priority" (default
> 0) and then an table-level settable toast target. So we start applying
> the storage handling mechanisms on the highest priority columns and keep
> going in descending order until we are under the limit for the table.
>
> ALTER TABLE foo
> ALTER COLUMN foocol
> SET STORAGE EXTERNAL PRIORITY 5
> WITH
> (toast_target = 400); /* must be MAXALIGNed value */
>
> Equal priorities are allowed, in which case lowest attribute id wins,
> i.e. current behaviour remains the default.
>
> --
> Simon Riggs
> EnterpriseDB http://www.enterprisedb.com
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +