Re: Performance Improvement by reducing WAL for Update Operation

From: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: "hlinnakangas(at)vmware(dot)com" <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-25 17:34:14
Message-ID: 6C0B27F7206C9E4CA54AE035729E9C382854435D@szxeml509-mbx
Lists: pgsql-hackers

On Thursday, October 25, 2012 5:43 AM, Noah Misch wrote:
On Tue, Oct 23, 2012 at 08:21:54PM -0400, Noah Misch wrote:
>> -Patch- -tps(at)-c1- -tps(at)-c2- -tps(at)-c8- -WAL(at)-c8-
>> HEAD,-F80 816 1644 6528 1821 MiB
>> xlogscale,-F80 824 1643 6551 1826 MiB
>> xlogscale+lz,-F80 717 1466 5924 1137 MiB
>> xlogscale+lz,-F100 753 1508 5948 1548 MiB
>
>> Those are short runs with no averaging of multiple iterations; don't put too
>> much faith in the absolute numbers.

> I decided to rerun those measurements with three 15-minute runs. I removed
> the -F100 test and added wal_update_changes_v2.patch (delta encoding version)
> to the mix. Median results:

> -Patch- -tps(at)-c1- -tps(at)-c2- -tps(at)-c8- -WAL(at)-c8-
> HEAD,-F80 832 1679 6797 44 GiB
> scale,-F80 830 1679 6798 44 GiB
> scale+lz,-F80 736 1498 6169 11 GiB
> scale+delta,-F80 841 1713 7056 10 GiB

> The numbers varied little across runs. So we see the same general trends as
> with the short runs; overall performance is slightly higher across the board,
> and the fraction of WAL avoided is much higher. I'm suspecting the patches
> shrink WAL better in these longer runs because the WAL of a short run contains
> a higher density of full-page images.

I have fixed all the review comments you raised on the delta encoding approach (for the needs-toast case, I have for now kept the existing code so that such tuples do not go through delta encoding; however, this can be changed).
I have also addressed the major comment on this patch from Heikki and Tom [use memcmp to find the modified columns].
The patch with the review comments fixed for the delta encoding method is attached to this mail.
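To illustrate the memcmp idea, here is a minimal standalone sketch (not the patch's actual code, which works on HeapTuple/TupleDesc; fixed-width attributes with no NULLs are assumed, and the per-attribute offsets are supplied by the caller):

/*
 * Sketch only: given the old and new tuple data as raw bytes plus each
 * attribute's (offset, length), record which columns changed.
 */
#include <stdbool.h>
#include <string.h>

typedef struct
{
    int offset;   /* byte offset of the attribute within the tuple data */
    int length;   /* byte length of the attribute */
} AttrSpan;

static void
find_modified_columns(const char *old_data, const char *new_data,
                      const AttrSpan *attrs, int natts, bool *modified)
{
    for (int i = 0; i < natts; i++)
        modified[i] = (memcmp(old_data + attrs[i].offset,
                              new_data + attrs[i].offset,
                              attrs[i].length) != 0);
}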

The readings with the modified patch are below; the detailed configuration used is in the attached file:

-Patch- -tps(at)-c1- -tps(at)-c2- -tps(at)-c8-
scale,-F80 834 1451 2701
scale+lz,-F80 659 1276 4650
scale+delta+review_fixed,-F80 873 1704 5509

The results are similar to your findings except for the 8-client result.
One possible reason for the difference is that the machine I am taking data on has 4 cores, whereas yours has 8.

So tomorrow I shall post the results with 1, 2, 4, and 8 clients as well.
Any further suggestions?

With Regards,
Amit Kapila.

Attachment Content-Type Size
wal_update_changes_v3.patch application/octet-stream 30.5 KB
pgbench_wal_memcmp.htm text/html 59.7 KB

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Amit kapila'" <amit(dot)kapila(at)huawei(dot)com>, "'Noah Misch'" <noah(at)leadboat(dot)com>
Cc: <hlinnakangas(at)vmware(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-26 10:59:43
Message-ID: 001b01cdb369$021d8740$065895c0$@kapila@huawei.com
Lists: pgsql-hackers

On Thursday, October 25, 2012 11:04 PM Amit kapila wrote:
> On Thursday, October 25, 2012 5:43 AM, Noah Misch wrote:
> On Tue, Oct 23, 2012 at 08:21:54PM -0400, Noah Misch wrote:
> >> -Patch- -tps(at)-c1- -tps(at)-c2- -tps(at)-c8- -WAL(at)-c8-
> >> HEAD,-F80 816 1644 6528 1821 MiB
> >> xlogscale,-F80 824 1643 6551 1826 MiB
> >> xlogscale+lz,-F80 717 1466 5924 1137 MiB
> >> xlogscale+lz,-F100 753 1508 5948 1548 MiB
> >
> >> Those are short runs with no averaging of multiple iterations; don't
> >> put too much faith in the absolute numbers.
>
> > I decided to rerun those measurements with three 15-minute runs. I
> > removed the -F100 test and added wal_update_changes_v2.patch (delta
> > encoding version) to the mix. Median results:
>
> > -Patch- -tps(at)-c1- -tps(at)-c2- -tps(at)-c8- -WAL(at)-c8-
> > HEAD,-F80 832 1679 6797 44 GiB
> > scale,-F80 830 1679 6798 44 GiB
> > scale+lz,-F80 736 1498 6169 11 GiB
> > scale+delta,-F80 841 1713 7056 10 GiB
>
> > The numbers varied little across runs. So we see the same general
> > trends as with the short runs; overall performance is slightly higher
> > across the board, and the fraction of WAL avoided is much higher. I'm
> > suspecting the patches shrink WAL better in these longer runs because
> > the WAL of a short run contains a higher density of full-page images.
>
> I have fixed all the review comments you raised on the delta encoding
> approach (for the needs-toast case, I have for now kept the existing code
> so that such tuples do not go through delta encoding; however, this can be
> changed).
> I have also addressed the major comment on this patch from Heikki and Tom
> [use memcmp to find the modified columns].
> The patch with the review comments fixed for the delta encoding method is
> attached to this mail.
>
> The readings with the modified patch are below; the detailed configuration
> used is in the attached file:
>
> -Patch- -tps(at)-c1- -tps(at)-c2- -tps(at)-c8-
> scale,-F80 834 1451 2701
> scale+lz,-F80 659 1276 4650
> scale+delta+review_fixed,-F80 873 1704 5509
>
>
> The results are similar to your findings except for the 8-client result.
> One possible reason for the difference is that the machine I am taking data
> on has 4 cores, whereas yours has 8.
>
> So tomorrow I shall post the results with 1, 2, 4, and 8 clients as well.
> Any further suggestions?

The data with 1, 2, 4, and 8 clients is below:

-Patch- -tps(at)-c1- -tps(at)-c2- -tps(at)-c4- -tps(at)-c8-
HEAD,-F80 874 1350 2129 2733
scale,-F80 884 1393 2106 2671
scale+lz,-F80 722 1342 2646 4568
scale+delta,-F80 892 1670 3323 5481

From the above readings, it is clear that the delta encoding approach has better
performance and scaling.
Do you see the need for any further investigation from my side?

With Regards,
Amit Kapila.


From: Noah Misch <noah(at)leadboat(dot)com>
To: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: "hlinnakangas(at)vmware(dot)com" <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-26 22:32:33
Message-ID: 20121026223233.GD24744@tornado.leadboat.com
Lists: pgsql-hackers

On Thu, Oct 25, 2012 at 05:34:14PM +0000, Amit kapila wrote:
> I have fixed all the review comments you raised on the delta encoding approach (for the needs-toast case, I have for now kept the existing code so that such tuples do not go through delta encoding; however, this can be changed).
> I have also addressed the major comment on this patch from Heikki and Tom [use memcmp to find the modified columns].

Could you elaborate on your reason for continuing to treat TOAST as a special
case? As best I recall, the only reason to do so before was the fact that
TOAST can change the physical representation of a column even when the
executor did not change its logical content. Since you're no longer relying
on the executor's opinion of what changed, a TOASTed value is not special.

> The patch with the review comments fixed for the delta encoding method is attached to this mail.

In my previous review, I said:

Given [not relying on the executor to know which columns changed], why not
treat the tuple as an opaque series of bytes and not worry about datum
boundaries? When several narrow columns change together, say a sequence
of sixteen smallint columns, you will use fewer binary delta commands by
representing the change with a single 32-byte substitution. If an UPDATE
changes just part of a long datum, the delta encoding algorithm will still
be able to save considerable space. That case arises in many forms:
changing one word in a long string, changing one element in a long array,
changing one field of a composite-typed column. Granted, this makes the
choice of delta encoding algorithm more important.

We may be leaving considerable savings on the table by assuming that column
boundaries are the only modified-range boundaries worth recognizing. What is
your willingness to explore general algorithms for choosing such boundaries?
Such an investigation may, of course, be a dead end.

If you conclude that finding sub-column similarity is not worthwhile, at least
teach your algorithm to aggregate runs of changing or unchanging columns into
fewer delta instructions. If a table contains twenty unchanging bool columns,
you currently use at least 80 bytes to encode that fact. By treating the run
of columns as a unit for delta encoding purposes, you could encode it in 23
bytes. The same applies for runs of changing columns.
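A minimal sketch of the aggregation being suggested (emit_changed_run and emit_unchanged_run are placeholders for the patch's ADD and COPY/IGN commands; the per-column spans and changed flags are assumed to come from the memcmp pass):

#include <stdbool.h>

typedef struct { int offset; int length; } AttrSpan;

void emit_changed_run(int offset, int length);    /* placeholder: one ADD      */
void emit_unchanged_run(int offset, int length);  /* placeholder: one COPY/IGN */

/* Coalesce adjacent columns with the same changed/unchanged status into one
 * run, so a single delta command can cover the whole run. */
static void
emit_runs(const AttrSpan *attrs, const bool *modified, int natts)
{
    int i = 0;

    while (i < natts)
    {
        bool changed = modified[i];
        int  start = attrs[i].offset;
        int  end = attrs[i].offset + attrs[i].length;

        /* extend the run while the next column has the same status */
        while (i + 1 < natts && modified[i + 1] == changed)
        {
            i++;
            end = attrs[i].offset + attrs[i].length;
        }

        if (changed)
            emit_changed_run(start, end - start);
        else
            emit_unchanged_run(start, end - start);
        i++;
    }
}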

Like Heikki, I'm left wondering why your custom delta encoding is
preferable to an encoding from the literature. Your encoding has much in
common with VCDIFF, even sharing two exact command names. If a custom
encoding is the right thing, code comments or a README section should at
least discuss the advantages over an established alternative.

Reflecting on those comments, I'm no longer too concerned about your use of a
novel format to encode the delta. For example, VCDIFF is designed to be
flexible and self-describing; we don't need that. But I would still review it
for useful ideas when designing the encoding for PostgreSQL.

Idle thought: it might pay off to use 1-byte sizes and offsets most of the
time. Tuples shorter than 256 bytes are common; for longer tuples, we can
afford wider offsets.

I still think this deserves attention. You currently need two bits to select
the delta command. Here are a few strategies to consider:

1) Store the commands in groups of eight. Each group starts with two bytes
representing the eight commands, followed by the corresponding 2-byte lengths
and ADD data blocks. The last group may end early.

2) Add extra delta commands HEAP_UPDATE_WAL_OPT_COPY16, _ADD16, _IGN16. Now
each command specifier needs three bits. Again store commands in groups of
eight, this time three bytes per group header. Expect a 1-byte length for the
original commands and a 2-byte length for _*16 commands.

3) Store each 2-bit command with a 14-bit length in a uint16. We would need
to do something grotty for --with-blocksize=32, where a 14-bit length is not
always enough.
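A sketch of what option 3 could look like (the command codes here are illustrative, not the patch's actual values):

/* Pack a 2-bit delta command and a 14-bit length into one uint16. */
#include <stdint.h>
#include <assert.h>

#define DELTA_CMD_COPY 0u                /* copy bytes from the old tuple    */
#define DELTA_CMD_ADD  1u                /* insert new bytes from the stream */
#define DELTA_CMD_IGN  2u                /* skip bytes of the old tuple      */
#define DELTA_LEN_MAX  ((1u << 14) - 1)  /* 16383; see the blocksize caveat  */

static inline uint16_t
delta_pack(unsigned cmd, unsigned len)
{
    assert(cmd <= 3 && len <= DELTA_LEN_MAX);
    return (uint16_t) ((cmd << 14) | len);
}

static inline unsigned
delta_cmd(uint16_t packed)
{
    return packed >> 14;
}

static inline unsigned
delta_len(uint16_t packed)
{
    return packed & DELTA_LEN_MAX;
}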

I mentioned that heap_delta_{encode,decode} should have essentially-inverse
argument lists. That remains unchanged in this version.

Review the comments added and moved in your patch; some have become obsolete.
At least one is missing detail I requested in the previous review.

> + /*
> + * get_tuple_info - Gets the tuple offset and value.
> + *
> + * calculates the attribute value and offset, where the attribute ends in the
> + * tuple based on the attribute number and previous fetched attribute info.
> + *
> + * offset (I/P and O/P variable) - Input as end of previous attribute offset
> + * and incase if it is a first attribute then it's value is zero.
> + * Output as end of the current attribute in the tuple.
> + * usecacheoff (I/P and O/P variable) - Attribute cacheoff can be used or not.
> + */
> + static void
> + get_tuple_info(Form_pg_attribute *att, HeapTuple tuple, bits8 *bp,
> + bool hasnulls, int attnum, Datum *value, uint16 *offset,
> + bool *usecacheoff)

This name is too generic and not particularly tied to what the function does.
As far as I can see, this is a variant of heap_getattr() that also returns the
offset and has some redundant arguments. I suggest heap_getattr_with_offset()
for a name and making its argument list more like heap_getattr().

> + if (wal_offset < heaptup->t_len)
> + {
> + wal_tup->t_len = wal_offset;
> + wal_tup->t_self = heaptup->t_self;
> + wal_tup->t_tableOid = heaptup->t_tableOid;
> +
> + return true;
> + }

The passed-in wal_tup happens to always have MaxHeapTupleSize of storage.
However, heap_delta_encode() cannot verify that and does not document it as
a precondition. Furthermore, it makes no effort to avoid overrunning the
buffer
before finally checking the accumulated length here. The encoded data could
have grown well beyond MaxHeapTupleSize.
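A sketch of the kind of guard this calls for (the names are illustrative; in the patch the destination would be the caller-supplied wal_tup of MaxHeapTupleSize bytes):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct
{
    char   *data;       /* destination buffer            */
    size_t  capacity;   /* e.g. MaxHeapTupleSize         */
    size_t  used;       /* bytes written so far          */
} EncodeBuf;

/* Returns false instead of overrunning; the caller can then fall back to
 * logging the whole new tuple rather than a delta. */
static bool
encode_append(EncodeBuf *buf, const void *src, size_t len)
{
    if (len > buf->capacity - buf->used)
        return false;
    memcpy(buf->data + buf->used, src, len);
    buf->used += len;
    return true;
}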

> +
> + memcpy(wal_tup, heaptup, sizeof(HeapTuple));
> + return false;

You would need heap_copytuple_with_tuple() here, but perhaps just specify the
contents of wal_tup as undefined when heap_delta_encode() returns false.

> + }

> + #define MinHeapTupleSizeForDeltaUpdate 64

This is now unused.

Overall, there's still plenty to do before this patch is ready. I'm marking
it Returned with Feedback. I look forward to the benefits it will bring when
finished.

nm


From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Noah Misch'" <noah(at)leadboat(dot)com>
Cc: <hlinnakangas(at)vmware(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-27 11:27:46
Message-ID: 004201cdb436$16fa52b0$44eef810$@kapila@huawei.com
Lists: pgsql-hackers

On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:
> On Thu, Oct 25, 2012 at 05:34:14PM +0000, Amit kapila wrote:
> > I have fixed all the review comments you raised on the delta encoding
> > approach (for the needs-toast case, I have for now kept the existing code
> > so that such tuples do not go through delta encoding; however, this can
> > be changed).
> > I have also addressed the major comment on this patch from Heikki and Tom
> > [use memcmp to find the modified columns].


> Could you elaborate on your reason for continuing to treat TOAST as a
> special case? As best I recall, the only reason to do so before was the
> fact that TOAST can change the physical representation of a column even
> when the executor did not change its logical content. Since you're no
> longer relying on the executor's opinion of what changed, a TOASTed value
> is not special.

I thought that for the initial version of the patch, leaving this change out
would keep the patch's impact smaller and reduce testing.
However, in the new version I shall take care of TOASTed values as you
suggested.

> > The patch with the review comments fixed for the delta encoding method
> > is attached to this mail.
>
> In my previous review, I said:
>
> Given [not relying on the executor to know which columns changed], why not
> treat the tuple as an opaque series of bytes and not worry about datum
> boundaries? When several narrow columns change together, say a sequence of
> sixteen smallint columns, you will use fewer binary delta commands by
> representing the change with a single 32-byte substitution. If an UPDATE
> changes just part of a long datum, the delta encoding algorithm will still
> be able to save considerable space. That case arises in many forms:
> changing one word in a long string, changing one element in a long array,
> changing one field of a composite-typed column. Granted, this makes the
> choice of delta encoding algorithm more important.
>
> We may be leaving considerable savings on the table by assuming that column
> boundaries are the only modified-range boundaries worth recognizing. What
> is your willingness to explore general algorithms for choosing such
> boundaries? Such an investigation may, of course, be a dead end.

For this patch I would like to stay with the delta encoding approach based
on column boundaries.

However, I shall try the more general approach separately, and if it gives
positive results I will share them with hackers.
I will try VCDIFF first; let me know if you have any other algorithm in mind.
The other positive point of exploring a standard diff algorithm is that, if
it gives positive results, we can also try to explore a standard compression
algorithm for full-page writes to reduce WAL further.

> If you conclude that finding sub-column similarity is not worthwhile, at
> least teach your algorithm to aggregate runs of changing or unchanging
> columns into fewer delta instructions. If a table contains twenty
> unchanging bool columns, you currently use at least 80 bytes to encode that
> fact. By treating the run of columns as a unit for delta encoding
> purposes, you could encode it in 23 bytes.

Do you mean handling non-contiguous unchanged columns?
I believe contiguous unchanged columns are already handled, unless there are
alignment changes in between. Example:

create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool,
                 f7 bool, f8 bool, f9 bool, f10 bool, f11 bool, f12 bool,
                 f13 bool, f14 bool, f15 bool, f16 bool, f17 bool, f18 bool,
                 f19 bool, f20 bool, f21 bool);

insert into tbl values(10,
    't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t',
    't','t');

update tbl set f1 = 20;

The delta algorithm for the above operation reduced the size of the tuple
from 24 bytes to 12 bytes.

4 bytes - IGN command and LEN
4 bytes - ADD command and LEN
4 bytes - Data block

> The same applies for runs of changing columns.

I shall handle changed columns similarly to unchanged columns.

> Like Heikki, I'm left wondering why your custom delta encoding is
> preferable to an encoding from the literature. Your encoding has much in
> common with VCDIFF, even sharing two exact command names. If a custom
> encoding is the right thing, code comments or a README section should at
> least discuss the advantages over an established alternative.
>
> Reflecting on those comments, I'm no longer too concerned about your use of
> a novel format to encode the delta. For example, VCDIFF is designed to be
> flexible and self-describing; we don't need that. But I would still review
> it for useful ideas when designing the encoding for PostgreSQL.
>
> Idle thought: it might pay off to use 1-byte sizes and offsets most of the
> time. Tuples shorter than 256 bytes are common; for longer tuples, we can
> afford wider offsets.
>
> I still think this deserves attention. You currently need two bits to
> select the delta command. Here are a few strategies to consider:
>
> 1) Store the commands in groups of eight. Each group starts with two bytes
> representing the eight commands, followed by the corresponding 2-byte
> lengths and ADD data blocks. The last group may end early.
>
> 2) Add extra delta commands HEAP_UPDATE_WAL_OPT_COPY16, _ADD16, _IGN16.
> Now each command specifier needs three bits. Again store commands in
> groups of eight, this time three bytes per group header. Expect a 1-byte
> length for the original commands and a 2-byte length for _*16 commands.
>
> 3) Store each 2-bit command with a 14-bit length in a uint16. We would
> need to do something grotty for --with-blocksize=32, where a 14-bit length
> is not always enough.

I shall also refer to the VCDIFF format and, based on your suggestions,
modify the encoding algorithm.


> I mentioned that heap_delta_{encode,decode} should have essentially-inverse
> argument lists. That remains unchanged in this version.

> Review the comments added and moved in your patch; some have become
> obsolete. At least one is missing detail I requested in the previous
> review.
>
> > + /*
> > +  * get_tuple_info - Gets the tuple offset and value.
> > +  *
> > +  * calculates the attribute value and offset, where the attribute ends
> > +  * in the tuple based on the attribute number and previous fetched
> > +  * attribute info.
> > +  *
> > +  * offset (I/P and O/P variable) - Input as end of previous attribute
> > +  *      offset and incase if it is a first attribute then it's value
> > +  *      is zero.
> > +  *      Output as end of the current attribute in the tuple.
> > +  * usecacheoff (I/P and O/P variable) - Attribute cacheoff can be used
> > +  *      or not.
> > +  */
> > + static void
> > + get_tuple_info(Form_pg_attribute *att, HeapTuple tuple, bits8 *bp,
> > +                bool hasnulls, int attnum, Datum *value, uint16 *offset,
> > +                bool *usecacheoff)
>
> This name is too generic and not particularly tied to what the function
> does. As far as I can see, this is a variant of heap_getattr() that also
> returns the offset and has some redundant arguments. I suggest
> heap_getattr_with_offset() for a name and making its argument list more
> like heap_getattr().
>
> > + if (wal_offset < heaptup->t_len)
> > + {
> > +     wal_tup->t_len = wal_offset;
> > +     wal_tup->t_self = heaptup->t_self;
> > +     wal_tup->t_tableOid = heaptup->t_tableOid;
> > +
> > +     return true;
> > + }
>
> The passed-in wal_tup happens to always have MaxHeapTupleSize of storage.
> However, heap_delta_encode() cannot verify that and does not document it as
> a precondition. Furthermore, it makes no effort to avoid overrunning the
> buffer before finally checking the accumulated length here. The encoded
> data could have grown well beyond MaxHeapTupleSize.
>
> > +
> > + memcpy(wal_tup, heaptup, sizeof(HeapTuple));
> > + return false;
>
> You would need heap_copytuple_with_tuple() here, but perhaps just specify
> the contents of wal_tup as undefined when heap_delta_encode() returns
> false.
>
> > + }
>
> > + #define MinHeapTupleSizeForDeltaUpdate 64
>
> This is now unused.

I shall handle these review comments in the new version.

> Overall, there's still plenty to do before this patch is ready. I'm
> marking it Returned with Feedback. I look forward to the benefits it will
> bring when finished.

Thank you for the review.

With Regards,
Amit Kapila.


From: Noah Misch <noah(at)leadboat(dot)com>
To: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: hlinnakangas(at)vmware(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-27 18:04:01
Message-ID: 20121027180401.GA1870@tornado.leadboat.com
Lists: pgsql-hackers

On Sat, Oct 27, 2012 at 04:57:46PM +0530, Amit Kapila wrote:
> On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:
> > Could you elaborate on your reason for continuing to treat TOAST as a
> > special case? As best I recall, the only reason to do so before was the
> > fact that TOAST can change the physical representation of a column even
> > when the executor did not change its logical content. Since you're no
> > longer relying on the executor's opinion of what changed, a TOASTed value
> > is not special.
>
> > I thought that for the initial version of the patch, leaving this change
> > out would keep the patch's impact smaller and reduce testing.

Not that I'm aware. If you still think so, please explain.

> > For this patch I would like to stay with the delta encoding approach
> > based on column boundaries.

Fair enough.

> > If you conclude that finding sub-column similarity is not worthwhile, at
> > least teach your algorithm to aggregate runs of changing or unchanging
> > columns into fewer delta instructions. If a table contains twenty
> > unchanging bool columns, you currently use at least 80 bytes to encode
> > that fact. By treating the run of columns as a unit for delta encoding
> > purposes, you could encode it in 23 bytes.
>
> Do you mean handling non-contiguous unchanged columns?

My statement above was a mess.

> I believe contiguous unchanged columns are already handled, unless there
> are alignment changes in between. Example:
>
> create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool,
>                  f7 bool, f8 bool, f9 bool, f10 bool, f11 bool, f12 bool,
>                  f13 bool, f14 bool, f15 bool, f16 bool, f17 bool, f18 bool,
>                  f19 bool, f20 bool, f21 bool);
>
> insert into tbl values(10,
>     't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t',
>     't','t','t');
>
> update tbl set f1 = 20;
>
> The delta algorithm for the above operation reduced the size of the tuple
> from 24 bytes to 12 bytes.
>
> 4 bytes - IGN command and LEN
> 4 bytes - ADD command and LEN
> 4 bytes - Data block

I now see that this case is already handled. Sorry for the noise.
Incidentally, I tried this variant:

create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool, f7 bool,
f8 bool, f9 bool, f10 bool, f11 bool, f12 bool, f13 bool,
f14 bool, f15 bool, f16 bool, f17 bool, f18 bool, f19 bool,
f20 bool, f21 bool, f22 int, f23 int);
insert into tbl values(1,
't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t',
't', 2, 3);
update tbl set f1 = 2, f22 = 4, f23 = 6;

It yielded an erroneous delta: IGN 4, ADD 4, COPY 24, IGN 4, ADD 4, COPY 28,
IGN 4, ADD 4. (The delta happens to be longer than the data and goes unused).

nm


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: 'Noah Misch' <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-27 18:57:31
Message-ID: 508C2E9B.5070201@vmware.com
Lists: pgsql-hackers

On 27.10.2012 14:27, Amit Kapila wrote:
> On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:
>> In my previous review, I said:
>>
>> Given [not relying on the executor to know which columns changed], why not
>> treat the tuple as an opaque series of bytes and not worry about datum
>> boundaries? When several narrow columns change together, say a sequence
>> of sixteen smallint columns, you will use fewer binary delta commands by
>> representing the change with a single 32-byte substitution. If an UPDATE
>> changes just part of a long datum, the delta encoding algorithm will still
>> be able to save considerable space. That case arises in many forms:
>> changing one word in a long string, changing one element in a long array,
>> changing one field of a composite-typed column. Granted, this makes the
>> choice of delta encoding algorithm more important.
>>
>> We may be leaving considerable savings on the table by assuming that
>> column boundaries are the only modified-range boundaries worth
>> recognizing. What is your willingness to explore general algorithms for
>> choosing such boundaries? Such an investigation may, of course, be a dead
>> end.
>
> For this patch I would like to stay with the delta encoding approach based
> on column boundaries.
>
> However, I shall try the more general approach separately, and if it gives
> positive results I will share them with hackers.
> I will try VCDIFF first; let me know if you have any other algorithm in
> mind.

One idea is to use the LZ format in the WAL record, but use your
memcmp() code to construct it. I believe the slow part in LZ compression
is in trying to locate matches in the "history", so if you just replace
that with your code that's aware of the column boundaries and uses
simple memcmp() to detect what parts changed, you could create LZ
compressed output just as quickly as the custom encoded format. It would
leave the door open for making the encoding smarter or to do actual
compression in the future, without changing the format and the code to
decode it.
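A rough sketch of that construction (emit_literal and emit_match stand in for whatever writes the control bits and data of the chosen LZ format; the changed/unchanged ranges are assumed to come from the column-boundary memcmp pass):

#include <stdbool.h>

typedef struct { int start; int end; bool changed; } Range;

void emit_literal(char byte);            /* placeholder: literal item          */
void emit_match(int offset, int length); /* placeholder: match into "history"  */

static void
encode_as_lz(const char *new_data, const Range *ranges, int nranges)
{
    for (int i = 0; i < nranges; i++)
    {
        if (ranges[i].changed)
        {
            /* changed bytes go out as plain literals */
            for (int off = ranges[i].start; off < ranges[i].end; off++)
                emit_literal(new_data[off]);
        }
        else
        {
            /* an unchanged range becomes one match against the old tuple,
             * which serves as the compression "history" */
            emit_match(ranges[i].start, ranges[i].end - ranges[i].start);
        }
    }
}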

- Heikki


From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>
Cc: "'Noah Misch'" <noah(at)leadboat(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-29 14:32:11
Message-ID: 00a201cdb5e2$2f6d9700$8e48c500$@kapila@huawei.com
Lists: pgsql-hackers

On Sunday, October 28, 2012 12:28 AM Heikki Linnakangas wrote:
> On 27.10.2012 14:27, Amit kapila wrote:
> > On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:
> >> In my previous review, I said:
> >>
> >> Given [not relying on the executor to know which columns changed], why
> >> not
> >
> > For this patch I would like to stay with the delta encoding approach
> > based on column boundaries.
> >
> > However, I shall try the more general approach separately, and if it
> > gives positive results I will share them with hackers.
> > I will try VCDIFF first; let me know if you have any other algorithm in
> > mind.

> One idea is to use the LZ format in the WAL record, but use your
> memcmp() code to construct it. I believe the slow part in LZ compression
> is in trying to locate matches in the "history", so if you just replace
> that with your code that's aware of the column boundaries and uses
> simple memcmp() to detect what parts changed, you could create LZ
> compressed output just as quickly as the custom encoded format. It would
> leave the door open for making the encoding smarter or to do actual
> compression in the future, without changing the format and the code to
> decode it.

This is a good idea. I shall try it.

In the existing algorithm, storing new data that is not present in the
history needs 1 control byte for every 8 bytes of new data, which can
increase the size of the compressed output compared to our delta encoding
approach.

Shall we modify the LZ algorithm a little so that it works best for our
case? Two possibilities:

Approach-1
---------------
Is it possible to increase the control data from 1 bit to 2 bits [0 - new
data, 1 - pick from history based on OFFSET-LENGTH, 2 - LENGTH and new data]?
The new value (2) is to handle new field data as a continuous stream of
bytes, instead of treating every byte as separate new data.

Approach-2
---------------
Use only one bit for control data [0 - LENGTH and new data, 1 - pick from
history based on OFFSET-LENGTH].
The modified value (0) is to handle new field data as a continuous stream of
bytes, instead of treating every byte as separate new data.
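For illustration only (not a committed format), Approach-1's three control values could be spelled out as below; Approach-2 would keep a single control bit, with value 0 meaning a length-prefixed run of new data:

typedef enum
{
    LZ_CTRL_NEW_BYTE = 0,   /* a single byte of new data, as in plain LZ       */
    LZ_CTRL_HISTORY  = 1,   /* OFFSET-LENGTH reference into the history        */
    LZ_CTRL_NEW_RUN  = 2    /* LENGTH followed by that many bytes of new data  */
} LzControlTag;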

With Regards,
Amit Kapila.


From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Noah Misch'" <noah(at)leadboat(dot)com>
Cc: <hlinnakangas(at)vmware(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-29 14:38:58
Message-ID: 00a301cdb5e3$21ae7f20$650b7d60$@kapila@huawei.com
Lists: pgsql-hackers

On Saturday, October 27, 2012 11:34 PM Noah Misch wrote:
> On Sat, Oct 27, 2012 at 04:57:46PM +0530, Amit Kapila wrote:
> > On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:
> > > Could you elaborate on your reason for continuing to treat TOAST as a
> > > special case? As best I recall, the only reason to do so before was
> > > the fact that TOAST can change the physical representation of a column
> > > even when the executor did not change its logical content. Since
> > > you're no longer relying on the executor's opinion of what changed, a
> > > TOASTed value is not special.
> >
> > I thought that for the initial version of the patch, leaving this change
> > out would keep the patch's impact smaller and reduce testing.
>
> Not that I'm aware. If you still think so, please explain.
>
> > For this patch I would like to stay with the delta encoding approach
> > based on column boundaries.
>
> Fair enough.
>
> > > If you conclude that finding sub-column similarity is not worthwhile,
> > > at least teach your algorithm to aggregate runs of changing or
> > > unchanging columns into fewer delta instructions. If a table contains
> > > twenty unchanging bool columns, you currently use at least 80 bytes to
> > > encode that fact. By treating the run of columns as a unit for delta
> > > encoding purposes, you could encode it in 23 bytes.
> >
> > Do you mean handling non-contiguous unchanged columns?
>
> My statement above was a mess.
>
> > I believe contiguous unchanged columns are already handled, unless there
> > are alignment changes in between. Example:
> >
> > create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool,
> >                  f7 bool, f8 bool, f9 bool, f10 bool, f11 bool, f12 bool,
> >                  f13 bool, f14 bool, f15 bool, f16 bool, f17 bool,
> >                  f18 bool, f19 bool, f20 bool, f21 bool);
> >
> > insert into tbl values(10,
> >     't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t',
> >     't','t','t');
> >
> > update tbl set f1 = 20;
> >
> > The delta algorithm for the above operation reduced the size of the tuple
> > from 24 bytes to 12 bytes.
> >
> > 4 bytes - IGN command and LEN
> > 4 bytes - ADD command and LEN
> > 4 bytes - Data block
>
> I now see that this case is already handled. Sorry for the noise.
> Incidentally, I tried this variant:
>
> create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool,
>                  f7 bool, f8 bool, f9 bool, f10 bool, f11 bool, f12 bool,
>                  f13 bool, f14 bool, f15 bool, f16 bool, f17 bool, f18 bool,
>                  f19 bool, f20 bool, f21 bool, f22 int, f23 int);
> insert into tbl values(1,
>     't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t',
>     't','t','t', 2, 3);
> update tbl set f1 = 2, f22 = 4, f23 = 6;
>
> It yielded an erroneous delta: IGN 4, ADD 4, COPY 24, IGN 4, ADD 4, COPY
> 28, IGN 4, ADD 4. (The delta happens to be longer than the data and goes
> unused.)

I think that with a new algorithm based on your inputs, this case will be
handled much better.
I am planning to try 2 approaches:
1. Use LZ compression in the manner suggested by Heikki; if it works, it can
be simpler.
2. Devise a new algorithm based on your suggestions, referring to the
LZ/VCDIFF algorithms.

With Regards,
Amit Kapila.