Reducing overhead for repeat de-TOASTing

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Reducing overhead for repeat de-TOASTing
Date: 2008-06-16 19:35:26
Message-ID: 6861.1213644926@sss.pgh.pa.us

Recent discussions with the PostGIS hackers led me to think about ways
to reduce overhead when the same TOAST value is repeatedly detoasted.
In the example shown here
http://archives.postgresql.org/pgsql-hackers/2008-06/msg00384.php
90% of the runtime is being consumed by repeated detoastings of a single
datum. That is certainly an outlier case, but we've heard complaints of
TOAST being slow before. The solution I'm about to propose should fix
this, and as a bonus it will reduce the problem of memory leakage when
a detoasted value doesn't get pfreed.

What I am imagining is that the tuple toaster will maintain a cache of
recently-detoasted values, and that pg_detoast_datum() returns a pointer
to a cache entry rather than a freshly palloc'd value. The cache lookup
key is the toast table OID plus value OID. Now pg_detoast_datum() has no
idea where the pointer it returns will go, so the big problem with this
scheme is that it's hard to tell how long a cache entry needs to live.
But we can resolve that by ruling that the required lifetime is the same
as the value would have had if it'd really been palloc'd --- IOW, until
the memory context that was current at the time gets deleted or reset.
This can be implemented by allowing the toaster cache to hook into the
memory context delete/reset calls. (We have no such capability now, but
there have been some previous cases where it would've come in handy, so
I think this would be a useful extension anyhow. It should be an actual
hook that can be used by anyone, not something where mcxt.c is
specifically passing control to tuptoaster.) Once a cache entry is no
longer required anywhere, it can be removed, and will be if needed to
shrink the cache to some target size (work_mem perhaps, or is it worth
inventing a separate GUC for this?). But we can hang onto the value
longer if it's not taking up needed space. This is important since
the use pattern exhibited in the PostGIS problem involves repeat
detoastings that are separated by MemoryContextResets --- if we free
the cache entry as soon as possible, we'll not gain anything.

The other cache management problem that would have to be solved is
to invalidate entries when the underlying TOAST table or individual
TOAST value is deleted. This is exactly like the problem for tuples
in the syscaches, and can be solved via sinval messaging. (The
reason we need to worry about this is to guard against the possibility
that the same identifier is re-used for a new TOAST value; otherwise
we could just let dead values age out of the cache.)

A change of this sort has the potential to break a lot of code, but
I think the only thing we'd really have to do is turn PG_FREE_IF_COPY()
into a no-op. In general, functions are not supposed to scribble on
pass-by-reference input datums, and since most functions don't actually
distinguish whether their inputs got detoasted (except perhaps by using
PG_FREE_IF_COPY()), that means that they won't be scribbling on the
toast cache entry either.

It would be possible to extend this concept to caching toast slice
fetches (pg_detoast_datum_slice() calls) if we add the offset/length
arguments to the cache lookup key. I'm not certain if that's worth the
trouble though --- any feelings about that? Also, that case is much
more likely to have callers that think they can scribble on the result,
since they "know" it must be a palloc'd value.

One unsolved problem is that this scheme doesn't provide any way to cache
the result of decompressing an inline-compressed datum, because those have
no unique ID that could be used for a lookup key. This puts a bit of a
damper on the idea of making PG_FREE_IF_COPY() a no-op --- if it is, then
detoasting a datum of that type would indeed cause a memory leak. The
best idea I have at the moment is to continue to have inline decompression
produce a palloc'd value, leave PG_FREE_IF_COPY() as-is, and arrange
for pfree() on toast cache entries to be a no-op. (Which we can do by
setting up a special memory context methods pointer for them.) It'd
be cool if there were a way to cache decompression though. Ideas?
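The "pfree as a no-op" trick relies on the fact that every palloc'd chunk carries a hidden header through which pfree dispatches, so cache entries can be given a method table whose free does nothing. A minimal self-contained sketch, with all names invented (the real mechanism would hang off mcxt.c's MemoryContextMethods):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Each chunk is preceded by a header pointing at its method table */
typedef struct ChunkMethods { void (*free_p)(void *chunk); } ChunkMethods;
typedef struct ChunkHeader  { const ChunkMethods *methods; } ChunkHeader;

static void ordinary_free(void *chunk) { free((ChunkHeader *) chunk - 1); }
static void noop_free(void *chunk)     { (void) chunk; /* cache owns it */ }

static const ChunkMethods ordinary_methods    = { ordinary_free };
static const ChunkMethods toast_cache_methods = { noop_free };

static void *chunk_alloc(const ChunkMethods *m, size_t size)
{
    ChunkHeader *h = malloc(sizeof(ChunkHeader) + size);

    h->methods = m;
    return h + 1;                 /* caller sees only the usable space */
}

/* pfree() analogue: look one header back and dispatch */
static void chunk_free(void *chunk)
{
    ((ChunkHeader *) chunk - 1)->methods->free_p(chunk);
}
```

With this in place, PG_FREE_IF_COPY can go ahead and "pfree" a toast cache entry without harm, while genuinely palloc'd results (such as inline-decompressed values) still get released.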

Comments, better ideas? Anyone think this is too much trouble to take
for the problem?

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Reducing overhead for repeat de-TOASTing
Date: 2008-06-17 00:45:00
Message-ID: 20080617004500.GT31154@tamriel.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> One unsolved problem is that this scheme doesn't provide any way to cache
> the result of decompressing an inline-compressed datum, because those have
> no unique ID that could be used for a lookup key.

That's pretty unfortunate.

> Ideas?

Not at the moment, but given the situation it really does strike me as
something we want to solve. Inventing an ID would likely be overkill
or wouldn't solve the problem anyway, I'm guessing...

> Comments, better ideas? Anyone think this is too much trouble to take
> for the problem?

I definitely think it's worth it, even if it doesn't handle an
inline-compressed datum. PostGIS is certainly a good use case for why,
but I doubt it's the only one.

Thanks!

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reducing overhead for repeat de-TOASTing
Date: 2008-06-17 02:34:06
Message-ID: 12180.1213670046@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>> Comments, better ideas? Anyone think this is too much trouble to take
>> for the problem?

> I definitely think it's worth it, even if it doesn't handle an
> inline-compressed datum.

Yeah. I'm not certain how much benefit we could get there anyway.
If the datum isn't out-of-line then there's a small upper limit on how
big it can be and hence a small upper limit on how long it takes to
decompress. It's not clear that a complicated caching scheme would
pay for itself.

The profile shown here:
http://postgis.refractions.net/pipermail/postgis-devel/2008-June/003081.html
shows that the problem the PostGIS guys are looking at is definitely an
out-of-line case (in fact, it looks like the datum wasn't even compressed).

regards, tom lane


From: Jeff <threshar(at)torgo(dot)978(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Reducing overhead for repeat de-TOASTing
Date: 2008-06-17 12:18:54
Message-ID: A959131D-B4CF-4059-9BB3-7DDF3E596ADC@torgo.978.org
Lists: pgsql-hackers


On Jun 16, 2008, at 3:35 PM, Tom Lane wrote:

> to a cache entry rather than a freshly palloc'd value. The cache lookup
> key is the toast table OID plus value OID. Now pg_detoast_datum() has no
...
> the result of decompressing an inline-compressed datum, because those have
> no unique ID that could be used for a lookup key. This puts a bit of a

Wouldn't the tid fit this? or table oid + tid?

--
Jeff Trout <jeff(at)jefftrout(dot)com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff <threshar(at)threshar(dot)is-a-geek(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reducing overhead for repeat de-TOASTing
Date: 2008-06-17 13:49:15
Message-ID: 19905.1213710555@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff <threshar(at)threshar(dot)is-a-geek(dot)com> writes:
> On Jun 16, 2008, at 3:35 PM, Tom Lane wrote:
>> the result of decompressing an inline-compressed datum, because those
>> have no unique ID that could be used for a lookup key. This puts a
>> bit of a

> Wouldn't the tid fit this? or table oid + tid?

No. The killer reason why not is that at the time we need to decompress
a datum, we don't know what row it came from. There are some other
problems too...

regards, tom lane


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Reducing overhead for repeat de-TOASTing
Date: 2008-06-17 18:08:46
Message-ID: 4857FDAE.7040103@sigaev.ru
Lists: pgsql-hackers


> But we can resolve that by ruling that the required lifetime is the same
> as the value would have had if it'd really been palloc'd --- IOW, until
> the memory context that was current at the time gets deleted or reset.

Many support functions of GiST/GIN live in very short memory context - only for
one call. So, that cache invalidation technique doesn't give any advantage
without rearranging this part.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Reducing overhead for repeat de-TOASTing
Date: 2008-06-17 18:31:03
Message-ID: 3446.1213727463@sss.pgh.pa.us
Lists: pgsql-hackers

Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
>> But we can resolve that by ruling that the required lifetime is the same
>> as the value would have had if it'd really been palloc'd --- IOW, until
>> the memory context that was current at the time gets deleted or reset.

> Many support functions of GiST/GIN live in very short memory context - only for
> one call. So, that cache invalidation technique doesn't give any advantage
> without rearranging this part.

Right, but I think I've got that covered. The memory context reset
won't actually flush the toast cache entry, it effectively just drops its
reference count. We'll only drop cache entries when under memory
pressure (or if they're invalidated by toast table updates/deletes).

regards, tom lane


From: "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing overhead for repeat de-TOASTing
Date: 2008-06-17 21:39:29
Message-ID: 51494DB187D98F4C88DBEBF1F5F6D42303A1E50B@edb06.mail01.enterprisedb.com
Lists: pgsql-hackers

> > I definitely think it's worth it, even if it doesn't handle an
> > inline-compressed datum.
>
> Yeah. I'm not certain how much benefit we could get there anyway.
> If the datum isn't out-of-line then there's a small upper limit on how
> big it can be and hence a small upper limit on how long it takes to
> decompress. It's not clear that a complicated caching scheme would
> pay for itself.

Well there's a small upper limit per-instance but the aggregate could still be significant if you have a situation like btree scans which are repeatedly detoasting the same datum. Note that the "inline compressed" case includes packed varlenas which are being copied just to get their alignment right. It would be nice to get rid of that palloc/pfree bandwidth.

I don't really see a way to do this though. If we hook into the original datum's mcxt we could use the pointer itself as a key. But if the original datum comes from a buffer that doesn't work.

One thought I had -- which doesn't seem to go anywhere, but I thought was worth mentioning in case you see a way to leverage it that I don't -- is that if the toast key is already in the cache then deform_tuple could substitute the cached value directly instead of waiting for someone to detoast it. That means we can save all the subsequent trips to the toast cache manager. I'm not sure that would give us a convenient way to know when to unpin the toast cache entry though. It's possible that some code is aware that deform_tuple doesn't allocate anything currently and therefore doesn't set the memory context to anything that will live as long as the data it returns.

Incidentally, I'm on vacation and reading this via an awful webmail interface. So I'm likely to miss some interesting stuff for a couple weeks. I suppose the S/N ratio of the list is likely to move but I'm not sure which direction...


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Reducing overhead for repeat de-TOASTing
Date: 2008-06-18 10:52:01
Message-ID: 1213786321.9468.93.camel@ebony.site
Lists: pgsql-hackers


On Mon, 2008-06-16 at 15:35 -0400, Tom Lane wrote:
> Recent discussions with the PostGIS hackers led me to think about ways
> to reduce overhead when the same TOAST value is repeatedly detoasted.
> In the example shown here
> http://archives.postgresql.org/pgsql-hackers/2008-06/msg00384.php
> 90% of the runtime is being consumed by repeated detoastings of a single
> datum. That is certainly an outlier case, but we've heard complaints of
> TOAST being slow before. The solution I'm about to propose should fix
> this, and as a bonus it will reduce the problem of memory leakage when
> a detoasted value doesn't get pfreed.
>
> What I am imagining is that the tuple toaster will maintain a cache of
> recently-detoasted values...

> Comments, better ideas? Anyone think this is too much trouble to take
> for the problem?

You've not covered the idea that we just alter the execution so we just
detoast once. If we tried harder to reduce the number of detoastings
then we would benefit all of the cases you mention, including internal
decompression. We would use memory, yes, but then so would a cache of
recently detoasted values.

If we see that the index scan key is toastable/ed then the lowest levels
of the plan can create an expanded copy of the tuple and pass that
upwards. We may need to do this in a longer lived context and explicitly
free previous tuples to avoid memory bloat, but we'd have the same
memory usage and same memory freeing issues as with caching. It just
seems more direct and more obvious, especially since it is just an
internal version of the workaround, which was to create a function to
perform early detoasting. Maybe this could be done inside the IndexScan
node when a tuple arrives with toasted value(s) for the scan key
attribute(s).

I presume there's various reasons why you've ruled that out, but with
such a complex proposal it seems worth revisiting the alternatives, even
if just to document them for the archives.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Reducing overhead for repeat de-TOASTing
Date: 2008-06-18 13:45:35
Message-ID: 26164.1213796735@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> You've not covered the idea that we just alter the execution so we just
> detoast once.

That's because I already considered and rejected that idea. There's
no very good place to do it. See thread on postgis-devel:

http://postgis.refractions.net/pipermail/postgis-devel/2008-June/003091.html

Aside from the problems mentioned there, there's the issue that a lower
plan level doesn't have any way to know whether the value will be needed
at all. We could look for references to the Var but it's entirely
possible that the Var is being passed to some function that doesn't
require a fully detoasted result. It wouldn't do for this
"optimization" to disable the slice-fetch feature...

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Reducing overhead for repeat de-TOASTing
Date: 2008-06-18 14:07:33
Message-ID: 1213798053.9468.132.camel@ebony.site
Lists: pgsql-hackers


On Wed, 2008-06-18 at 09:45 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > You've not covered the idea that we just alter the execution so we just
> > detoast once.
>
> That's because I already considered and rejected that idea. There's
> no very good place to do it. See thread on postgis-devel:
>
> http://postgis.refractions.net/pipermail/postgis-devel/2008-June/003091.html
>
> Aside from the problems mentioned there, there's the issue that a lower
> plan level doesn't have any way to know whether the value will be needed
> at all. We could look for references to the Var but it's entirely
> possible that the Var is being passed to some function that doesn't
> require a fully detoasted result. It wouldn't do for this
> "optimization" to disable the slice-fetch feature...

Agreed. Yet I'm thinking that a more coherent approach to optimising the
tuple memory usage in the executor tree might be better than the special
cases we seem to have in various places. I don't know what that is, or
even if its possible though.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Reducing overhead for repeat de-TOASTing
Date: 2008-06-18 16:01:35
Message-ID: 28020.1213804895@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Agreed. Yet I'm thinking that a more coherent approach to optimising the
> tuple memory usage in the executor tree might be better than the special
> cases we seem to have in various places. I don't know what that is, or
> even if its possible though.

Yeah. I had tried to think of a way to manage the cached detoasted
value as part of the TupleTableSlot in which the toasted datum is
(normally) stored, but there seems no way to know which slot that is
at the point where pg_detoast_datum is invoked --- and as mentioned
earlier, speculatively detoasting things at the point of the slot access
seems a loser.

[ thinks a bit more ... ] But there's always more than one way to
skin a cat. Right now, when you fetch a toasted attribute value
out of a Slot, what you get is a pointer to a stored-on-disk TOAST
pointer, ie

0x80 or 0x01
length (18)
struct varatt_external

Now the Slot knows which attributes are varlena (it has a tupdesc)
so it could easily check whether it's about to return one of these.
It could instead return a pointer to, say

0x80 or 0x01
length (more than 18)
struct varatt_external
pointer to Slot
pointer to detoasted value, or NULL if not detoasted yet

and that pointer-to-Slot would give us the hook we need to manage
the detoasting when and if pg_detoast_datum gets called. Both
this struct and the ultimately decompressed value would be auxiliary
memory belonging to the Slot, and would go away at slot clear.
(This is certain to work since a not-toasted pass-by-ref datum
in the tuple would have that same lifetime.)

Come to think of it, if Slots are going to manage detoasted copies
of attributes, we could have them auto-detoast inline-compressed
Datums at the time of fetch. The argument that this might be
wasted work has a lot less force for that case.

I am not sure this is a better scheme than the backend-wide cache,
but it's worth thinking about. It would have a lot less management
overhead. On the other hand it couldn't amortize detoastings across
repeated tuple fetches (such as could happen in a join, or successive
queries on the same value).

regards, tom lane