Re: [COMMITTERS] pgsql: Permit dump/reload of not-too-large >1GB tuples

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: pgsql-committers(at)postgresql(dot)org
Subject: pgsql: Permit dump/reload of not-too-large >1GB tuples
Date: 2016-12-02 04:05:29
Message-ID: E1cCf6H-0002Y6-F5@gemulon.postgresql.org
Lists: pgsql-committers pgsql-hackers

Permit dump/reload of not-too-large >1GB tuples

Our documentation states that our maximum field size is 1 GB, and that
our maximum row size is 1.6 TB. However, while this might be attainable
in theory with enough contortions, it is not workable in practice; for
starters, pg_dump fails to dump tables containing rows larger than 1 GB,
even if individual columns are well below the limit; and even if one
does manage to manufacture a dump file containing a row that large, the
server refuses to load it anyway.

This commit enables dumping and reloading of such tuples, provided two
conditions are met:

1. no single column is larger than 1 GB (in output size -- for bytea
this includes the formatting overhead)
2. the whole row is not larger than 2 GB

There are three related changes to enable this:

a. StringInfo's API now has two additional functions that allow creating
a string that grows beyond the typical 1 GB limit (a "long" string).
ABI compatibility is maintained. We still limit these strings to 2 GB,
though, for reasons explained below.
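
As a minimal sketch of how such an API can be layered on top of the
existing one -- the names makeLongStringInfo/initLongStringInfo and the
long_ok flag are taken from the patch as far as I can tell, but treat
them as assumptions rather than gospel:

    /* sketch only; names and field assumed, not quoted from the commit */
    typedef struct StringInfoData
    {
        char       *data;
        int         len;
        int         maxlen;
        int         cursor;
        bool        long_ok;    /* may grow past MaxAllocSize, up to 2 GB */
    } StringInfoData;

    StringInfo
    makeLongStringInfo(void)
    {
        StringInfo  res = (StringInfo) palloc(sizeof(StringInfoData));

        initLongStringInfo(res);
        return res;
    }

    void
    initLongStringInfo(StringInfo str)
    {
        initStringInfo(str);    /* the usual 1 GB-limited setup ... */
        str->long_ok = true;    /* ... then enlargeStringInfo() may use
                                 * huge allocations, capped at 2 GB */
    }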

b. COPY now uses long StringInfos, so that pg_dump doesn't choke
trying to emit rows longer than 1GB.
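
A hedged sketch of what (b) amounts to in copy.c, assuming the
fe_msgbuf row buffer is the relevant call site:

    /* before: the COPY TO row buffer errors out at 1 GB */
    cstate->fe_msgbuf = makeStringInfo();

    /* after (sketch): same buffer, now allowed to grow to ~2 GB */
    cstate->fe_msgbuf = makeLongStringInfo();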

c. heap_form_tuple now uses the MCXT_ALLOC_HUGE flag in its allocation
for the input tuple, which means that large tuples are accepted on
input. Note that at this point we do not apply any further limit to the
input tuple size.
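
Again a sketch rather than the committed hunk, but heap_form_tuple's
allocation changes roughly like this:

    /* before: plain palloc0() ereports for requests of 1 GB or more */
    tuple = (HeapTuple) palloc0(HEAPTUPLESIZE + len);

    /* after (sketch): explicitly permit a "huge" (>1 GB) allocation */
    tuple = (HeapTuple) MemoryContextAllocExtended(CurrentMemoryContext,
                                                   HEAPTUPLESIZE + len,
                                                   MCXT_ALLOC_HUGE |
                                                   MCXT_ALLOC_ZERO);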

The main reason to limit to 2 GB is that the FE/BE protocol uses 32-bit
length words to describe each row; and because the documentation is
ambiguous about their signedness and libpq does consider them signed, we
cannot use the highest-order bit. Additionally, the StringInfo API uses
"int" (which is 4 bytes wide on most platforms) in many places, so we'd
need to change that API as well to do better, which has lots of fallout.
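
To make the signedness point concrete, an illustrative fragment (the
surrounding buffer and the error handler are made up for the example):

    uint32      len_word;       /* 4 bytes copied straight off the wire */
    int32       msglen;

    memcpy(&len_word, buf, sizeof(len_word));
    msglen = (int32) ntohl(len_word);

    /*
     * libpq treats the length as signed, so anything >= 2^31 shows up
     * negative here and is rejected as a corrupt message.
     */
    if (msglen < 0)
        handle_protocol_error();    /* hypothetical handler */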

Backpatch to 9.5, which is the oldest branch that has
MemoryContextAllocExtended, a necessary piece of infrastructure. We
could apply this to 9.4 with minimal additional effort, but going any
further back would require backpatching "huge" allocations too.

This is the largest set of changes we could find that can be
back-patched without breaking compatibility with existing systems.
Fixing a bigger set of problems (for example, dumping tuples bigger than
2GB, or dumping fields bigger than 1GB) would require changing the FE/BE
protocol and/or changing the StringInfo API in an ABI-incompatible way,
neither of which would be back-patchable.

Authors: Daniel Vérité, Álvaro Herrera
Reviewed by: Tomas Vondra
Discussion: https://postgr.es/m/20160229183023.GA286012@alvherre.pgsql

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/fa2fa995528023b2e6ba1108f2f47558c6b66dcd

Modified Files
--------------
src/backend/access/common/heaptuple.c | 4 ++-
src/backend/commands/copy.c | 8 ++---
src/backend/lib/stringinfo.c | 66 ++++++++++++++++++++++++++++-------
src/include/lib/stringinfo.h | 18 +++++++---
4 files changed, 74 insertions(+), 22 deletions(-)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [COMMITTERS] pgsql: Permit dump/reload of not-too-large >1GB tuples
Date: 2016-12-06 03:10:46
Message-ID: 27737.1480993846@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> Permit dump/reload of not-too-large >1GB tuples

I apologize for not having paid close enough attention earlier, but:
this patch is absolutely unacceptable for the back branches and MUST
be reverted there. Adding another field to StringInfoData is an ABI
change that will certainly break large numbers of extensions. We
can't expect those to get recompiled for minor releases.
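
Concretely, the hazard looks like this (the long_ok field name follows
the reverted patch, as best I can tell; treat it as illustrative):

    /* the layout existing extensions were compiled against ... */
    typedef struct StringInfoData
    {
        char       *data;
        int         len;
        int         maxlen;
        int         cursor;
    } StringInfoData;

    /* ... versus the patched layout, with a different sizeof() */
    typedef struct StringInfoData
    {
        char       *data;
        int         len;
        int         maxlen;
        int         cursor;
        bool        long_ok;    /* new field grows the struct */
    } StringInfoData;

Any extension that embeds StringInfoData in its own structs, or
allocates one itself, has the old sizeof() baked into its compiled code.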

regards, tom lane


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [COMMITTERS] pgsql: Permit dump/reload of not-too-large >1GB tuples
Date: 2016-12-06 03:38:03
Message-ID: 20161206033802.dklbbnrzfnx6ddhc@alvherre.pgsql
Lists: pgsql-committers pgsql-hackers

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> > Permit dump/reload of not-too-large >1GB tuples
>
> I apologize for not having paid close enough attention earlier, but:
> this patch is absolutely unacceptable for the back branches and MUST
> be reverted there. Adding another field to StringInfoData is an ABI
> change that will certainly break large numbers of extensions. We
> can't expect those to get recompiled for minor releases.

Oh, I see the problem now -- it can be (and frequently is) stack
allocated, not just palloc'ed through the StringInfo API, and changing
the struct size breaks that. Rats. I'll revert tomorrow.
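
The breaking pattern, sketched:

    /* common idiom in extension code: the struct lives on the stack,
     * with the size the extension was *compiled* against */
    StringInfoData buf;

    initStringInfo(&buf);   /* a patched server would also write the
                             * new long_ok field -- one byte past the
                             * extension's smaller stack variable */
    appendStringInfo(&buf, "...");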

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Permit dump/reload of not-too-large >1GB tuples
Date: 2016-12-09 14:51:47
Message-ID: 20161209145147.d5m4bo7b2b2gy5cl@alvherre.pgsql
Lists: pgsql-committers pgsql-hackers

Alvaro Herrera wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> > > Permit dump/reload of not-too-large >1GB tuples
> >
> > I apologize for not having paid close enough attention earlier, but:
> > this patch is absolutely unacceptable for the back branches and MUST
> > be reverted there. Adding another field to StringInfoData is an ABI
> > change that will certainly break large numbers of extensions. We
> > can't expect those to get recompiled for minor releases.
>
> Oh, I see the problem now -- it can be (and frequently is) stack
> allocated, not just palloc'ed through the StringInfo API, and changing
> the struct size breaks that. Rats. I'll revert tomorrow.

Done.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [COMMITTERS] pgsql: Permit dump/reload of not-too-large >1GB tuples
Date: 2017-01-16 22:48:42
Message-ID: 21407.1484606922@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> Permit dump/reload of not-too-large >1GB tuples

I noticed that this commit has created an overflow hazard on 64-bit
hardware: it's not difficult to attempt to make a tuple exceeding 4GB.
You just need a few GB-sized input values. But that's uncool for
a couple of reasons:

* HeapTupleData.t_len can't represent it (it's only uint32)

* heap_form_tuple sets up the tuple to possibly become a composite
datum, meaning it applies SET_VARSIZE. That's going to silently
overflow for any size above 1GB.

In short, as this stands it's a security problem.

I'm not sure whether it's appropriate to try to change t_len to type Size
to address the first problem. We could try, but I don't know what the
fallout would be. Might be better to just add a check disallowing tuple
size >= 4GB.
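
One possible shape for that check (illustrative only, not committed
code):

    /*
     * Refuse to build anything t_len cannot represent.  The cap is
     * 4 GB rather than 1 GB, since a plain in-memory tuple need not
     * ever become a varlena datum.
     */
    if ((Size) len > (Size) PG_UINT32_MAX - HEAPTUPLESIZE)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("row is too big: size %zu", (Size) len)));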

As for the second problem, we clearly don't have any prospect of allowing
composite datums that exceed 1GB, since they have to have varlena length
words. We can allow plain HeapTuples to exceed that, but it'd be
necessary to distinguish whether the value being built is going to become
a composite datum or not. That requires an API break for heap_form_tuple,
or else a separate function to convert a HeapTuple to Datum (and apply a
suitable length check). Either way it's a bit of a mess.
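
A sketch of that second option, the separate conversion function (the
name and details here are hypothetical):

    Datum
    heap_tuple_to_datum_checked(HeapTuple tuple)
    {
        /*
         * SET_VARSIZE has only 30 bits for the length, so anything at
         * or above 1 GB would silently wrap -- refuse it instead.
         */
        if (tuple->t_len >= (uint32) 0x40000000)
            ereport(ERROR,
                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                     errmsg("composite value is too large")));

        HeapTupleHeaderSetDatumLength(tuple->t_data, tuple->t_len);
        /* real code would also set the composite type id and typmod */
        return PointerGetDatum(tuple->t_data);
    }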

In any case, a great deal more work is needed before this can
possibly be safe. I recommend reverting it pending that.

regards, tom lane