Re: [WIP] Performance Improvement by reducing WAL for Update Operation

Lists:	pgsql-hackers

From:	Amit kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	[WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-03 11:46:42
Message-ID:	6C0B27F7206C9E4CA54AE035729E9C3828520311@szxeml509-mbs

Problem statement:

-----------------------------------

Reduce the WAL volume generated by an update operation in order to improve performance.

Advantages:

---------------------
1. Observed increase in performance with pgbench when the server is running with synchronous_commit = off:
a. with pgbench (TPC-B) - 13%
b. with modified pgbench (such that the modified columns are smaller than the whole row) - 83%

2. WAL size is reduced

Design/Implementation:

------------------------------

Currently the change is done only for fixed-length columns of simple tables, and the tuple must not contain NULLs.

This is a proof of concept; the design and implementation will need to change based on the final design required for handling other scenarios.

Update operation:
-----------------------------
1. Check whether the table is a simple table (no TOAST, no BEFORE UPDATE triggers).
2. The optimization works only for tuples without NULLs.
3. Identify the modified columns from the target entry.
4. Based on the modified column list, check whether any variable-length columns are modified; if so, this optimization is not applied.
5. Identify the offset and length of the modified columns and store them as an optimized WAL tuple in the following format.
Note: The WAL update header is modified to denote whether the WAL update optimization was applied.
WAL update header + Tuple header(no change from previous format) +
[offset(2bytes)] [length(2 bytes)] [changed data value]
[offset(2bytes)] [length(2 bytes)] [changed data value]
....
....
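
The record body described in step 5 can be sketched as follows. This is only an illustration in Python, not the code from the patch: the function name encode_update_delta, the little-endian byte order and the sample values are assumptions. Only the [offset(2 bytes)] [length(2 bytes)] [changed data value] layout comes from the description above.

```python
import struct
from typing import List, Tuple


def encode_update_delta(tuple_len: int, changes: List[Tuple[int, bytes]]) -> bytes:
    """Serialize changed fixed-length columns as [offset][length][value] entries.

    tuple_len is the length of the tuple data area; changes holds
    (offset, new_value) pairs. The function name and the little-endian
    byte order are illustrative assumptions, not taken from the patch.
    """
    body = bytearray()
    for offset, value in sorted(changes):
        if offset + len(value) > tuple_len:
            raise ValueError("change runs past the end of the tuple data")
        # 2-byte offset, 2-byte length, then the new column bytes.
        body += struct.pack("<HH", offset, len(value))
        body += value
    return bytes(body)


# A 12-byte tuple whose middle 4-byte column changes yields one 8-byte entry.
delta = encode_update_delta(12, [(4, b"XXXX")])
```

The WAL update header and the tuple header that precede these entries are left out of the sketch.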

Recovery:

----------------
The following steps apply only when the tuple is optimized.

6. To form the new tuple, the old tuple is required (even if the old tuple itself does not require any modification).
7. Form the new tuple based on the format specified in point 5.
8. Once the new tuple is formed, follow the existing behavior.

Frame the new tuple from old tuple and WAL record:

1. The length of the data that needs to be copied from the old tuple is calculated as
the difference between the offset present in the WAL record and the old tuple offset.
(The first time, the old tuple offset is zero.)
2. Once the old tuple data is copied, increase the old tuple offset by the
copied length.
3. Get the length and value of the modified column from the WAL record and copy it into the new tuple.
4. Increase the old tuple offset by the modified column length.
5. Repeat this procedure until the end of the WAL record is reached.
6. Any remaining old tuple data is then copied.
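
The six steps above can be sketched as a small replay routine. This is a minimal Python illustration under stated assumptions, not the patch's recovery code: the name apply_update_delta and the little-endian 2-byte fields are hypothetical, and the tuple header is ignored so that only the data area is rebuilt.

```python
import struct


def apply_update_delta(old: bytes, delta: bytes) -> bytes:
    """Rebuild the new tuple data from the old tuple data and the delta entries.

    delta is a sequence of [offset(2 bytes)][length(2 bytes)][value] entries,
    assumed here to be little-endian and sorted by offset.
    """
    new = bytearray()
    old_off = 0  # old tuple offset, zero the first time
    pos = 0      # read position inside the delta
    while pos < len(delta):
        offset, length = struct.unpack_from("<HH", delta, pos)
        pos += 4
        # Steps 1-2: copy the unchanged bytes up to the changed column.
        new += old[old_off:offset]
        old_off = offset
        # Step 3: copy the changed value from the WAL record.
        new += delta[pos:pos + length]
        pos += length
        # Step 4: skip the replaced bytes in the old tuple.
        old_off += length
    # Step 6: copy whatever remains of the old tuple.
    new += old[old_off:]
    return bytes(new)


old_tuple = b"AAAABBBBCCCC"
delta = struct.pack("<HH", 4, 4) + b"XXXX"
new_tuple = apply_update_delta(old_tuple, delta)
```

Because only fixed-length columns are handled, the new tuple always has the same length as the old one.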

Test results:

----------------------
1. The pgbench tests were run for 10 minutes.

2. The pgbench result for TPC-B is attached to this mail as pgbench_org.

3. The modified pgbench (such that the modified columns are smaller than the whole row) result for TPC-B is attached to this mail as pgbench_1800_300.

Modified pgbench code:

---------------------------------------
1. The table schemas are modified by adding some extra fields to increase the record size to 1800 bytes.
2. The TPC-B benchmark suite is changed to do only update operations.
3. The update operation is changed to update 3 columns totalling 300 bytes out of the total size of 1800 bytes.
4. NULL value insertions are removed from table initialization.

I am working on a solution to handle other scenarios such as variable-length columns, tuples containing NULLs, and BEFORE triggers.

Please provide suggestions or objections.

With Regards,
Amit Kapila.

Attachment	Content-Type	Size
wal_update_changes.patch	text/plain	20.7 KB
pgbench_modified.c	text/plain	63.2 KB
pgbench_org.htm	text/html	17.9 KB
pgbench_1800_300.htm	text/html	17.8 KB

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Amit kapila <amit(dot)kapila(at)huawei(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-03 20:03:00
Message-ID:	501C2E74.7060002@enterprisedb.com

On 03.08.2012 14:46, Amit kapila wrote:
> Currently the change is done only for fixed length columns for simple tables and the tuple should not contain NULLS.
>
> This is a Proof of concept, the design and implementation needs to be changed based on final design required for handling other scenario's
>
> Update operation:
> -----------------------------
> 1. Check for the simple table or not.(No toast, No before update triggers)
> 2. Works only for not null tuples.
> 3. Identify the modified columns from the target entry.
> 4. Based on the modified column list, check for any variable length columns are modified, if so this optimization is not applied.
> 5. Identify the offset and length for the modified columns and store it as an optimized WAL tuple in the following format.
> Note: Wal update header is modified to denote whether wal update optimization is done or not.
> WAL update header + Tuple header(no change from previous format) +
> [offset(2bytes)] [length(2 bytes)] [changed data value]
> [offset(2bytes)] [length(2 bytes)] [changed data value]
> ....
> ....

The performance will need to be re-verified after you fix these
limitations. Those limitations need to be fixed before this can be applied.

It would be nice to use some well-known binary delta algorithm for this,
rather than invent our own. OTOH, we have more knowledge of the
attribute boundaries, so a custom algorithm might work better. In any
case, I'd like to see the code to do the delta encoding/decoding to be
put into separate functions, outside of heapam.c. It would be good for
readability, and we might want to reuse this in other places too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-04 06:41:44
Message-ID:	002001cd720c$36e89a80$a4b9cf80$@kapila@huawei.com

From: Heikki Linnakangas [mailto:heikki(dot)linnakangas(at)enterprisedb(dot)com]
Sent: Saturday, August 04, 2012 1:33 AM
On 03.08.2012 14:46, Amit kapila wrote:
>> Currently the change is done only for fixed length columns for simple tables and the tuple should not contain NULLS.
>>
>> This is a Proof of concept, the design and implementation needs to be changed based on final design required for handling other scenario's
>>
>> Update operation:
>> -----------------------------
>> 1. Check for the simple table or not.(No toast, No before update triggers)
>> 2. Works only for not null tuples.
>> 3. Identify the modified columns from the target entry.
>> 4. Based on the modified column list, check for any variable length columns are modified, if so this optimization is not applied.
>> 5. Identify the offset and length for the modified columns and store it as an optimized WAL tuple in the following format.
>> Note: Wal update header is modified to denote whether wal update optimization is done or not.
>> WAL update header + Tuple header(no change from previous format) +
>> [offset(2bytes)] [length(2 bytes)] [changed data value]
>> [offset(2bytes)] [length(2 bytes)] [changed data value]
>> ....
>> ....

> The performance will need to be re-verified after you fix these
> limitations. Those limitations need to be fixed before this can be
> applied.

Yes, I agree that the solution should fix these limitations and that the performance
numbers need to be re-verified.
Currently, the work I have in mind is as follows:

1. A solution which can handle variable-length columns and NULLs
2. Handling of BEFORE triggers
3. Whether the solution for fixed-length columns can be the same as for variable-length
columns and NULLs
4. A final patch which addresses all of the above

Please suggest whether there are more things that need to be handled.

For the 3rd point, the current solution for fixed-length columns cannot
handle variable-length columns and NULLs. The reason is that for
fixed-length columns there is no need for a diff between the old and new
tuple, whereas for the other cases it will be required.
For fixed-length columns, if we just record the OFFSET, LENGTH, VALUE of the
changed columns of the new tuple in WAL, that is sufficient to replay the
WAL. However, to handle the other cases we need a diff mechanism.

Could we do something like this: if the changed columns are fixed-length and don't
contain NULLs, store the [OFFSET, LENGTH, VALUE] format in WAL, and for
other cases store the diff format?

This has the advantage that updates touching only fixed-length columns
don't have to pay the penalty of doing a diff between the new and old tuple. Also, we
can do the whole work in 2 parts, one for fixed-length columns and a second to
handle the other cases.

> It would be nice to use some well-known binary delta algorithm for this,
> rather than invent our own. OTOH, we have more knowledge of the
> attribute boundaries, so a custom algorithm might work better.

I shall work on this and post after initial work.

> In any case, I'd like to see the code to do the delta encoding/decoding to be
> put into separate functions, outside of heapam.c. It would be good for
> readability, and we might want to reuse this in other places too.

Agreed. I shall take care of doing it in the suggested way.

With Regards,
Amit Kapila.

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Amit kapila <amit(dot)kapila(at)huawei(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 07:06:29
Message-ID:	CA+U5nML63d+uO51qQEMqFGwfbPRAzEyepm8y4g2nfD=y7J-d+Q@mail.gmail.com

On 3 August 2012 12:46, Amit kapila <amit(dot)kapila(at)huawei(dot)com> wrote:

> Frame the new tuple from old tuple and WAL record:

Sounds good.

I'd suggest we do this only when the saving is large enough for
benefit, rather than do this every time.

You don't mention whether or not the old and the new tuple are on the
same data block.

Personally, I think it will be important to ensure the above,
otherwise recovery will require much additional code for that case.
And that code will be prone to race conditions and performance issues.

Please also bear in mind that Andres will be looking to include the PK
columns in every WAL record for BDR. That could be an option, but I
doubt there is much value in excluding PK columns. I think I'd want
them to be there for debugging purposes so we can prove this code is
correct in production, since otherwise this could be a source of data
loss bugs.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Simon Riggs'" <simon(at)2ndQuadrant(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 08:49:34
Message-ID:	001801cd760b$e6b6f440$b424dcc0$@kapila@huawei.com

From: Simon Riggs [mailto:simon(at)2ndQuadrant(dot)com]
Sent: Thursday, August 09, 2012 12:36 PM
On 3 August 2012 12:46, Amit kapila <amit(dot)kapila(at)huawei(dot)com> wrote:

>> Frame the new tuple from old tuple and WAL record:

> Sounds good.
Thanks.

> I'd suggest we do this only when the saving is large enough for
> benefit, rather than do this every time.
Do you mean applying it only when the length of the updated values of the tuple is less
than some threshold (1/3 or 2/3, etc.) of
the total length?

> You don't mention whether or not the old and the new tuple are on the
> same data block.

WAL reduction is done even when the old and new tuples are on different
data blocks.

> Personally, I think it will be important to ensure the above,
> otherwise recovery will require much additional code for that case.

Recovery currently also handles the case where the old and new tuples are on
different pages, in that
it has to read the old page to get the old tuple.

The modifications need to handle the following cases:

a. When there is a backup block and the old and new tuples are on different pages:
currently it doesn't read the old page;
however, the new implementation needs to read the old page in this case
as well.

b. When the changes are already applied on the page [line: if (XLByteLE(lsn,
PageGetLSN(page))); function: heap_xlog_update]:
currently it doesn't read the old page;
however, the new implementation needs to read the old page in this case
as well.

> And that code will be prone to race conditions and performance issues.

Do you mean performance issues because we may now need to read the old page in
some cases
where earlier it was not read?
If yes, then as I mentioned above, I think those 2
cases are not very common.
However, the benefit for update operations on a running server is good enough,
as it reduces the WAL volume.
If you mean something other than the above, please let me know.

> Please also bear in mind that Andres will be looking to include the PK
> columns in every WAL record for BDR. That could be an option, but I
> doubt there is much value in excluding PK columns.

Agreed. However, once the implementation by Andres is done I can merge both
code changes and
collect the performance data again, based on which we can take a decision.

With Regards,
Amit Kapila.

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 09:18:53
Message-ID:	CA+U5nMLTPY1BxnwXKBwNcKHPhUh9jf8egJQPGdYN+vb+Lz-rkA@mail.gmail.com

On 9 August 2012 09:49, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:

>> I'd suggest we do this only when the saving is large enough for
>> benefit, rather than do this every time.
> Do you mean to say that when length of updated values of tuple is less
> than some threshold(1/3 or 2/3, etc..) value of
> total length?

Some heuristic, yes, similar to TOAST's minimum threshold. To attempt
removal of rows in all cases would not be worth it, so we need a fast
path way of saying let's just take all of the columns.

>> You don't mention whether or not the old and the new tuple are on the
>> same data block.
>
> WAL reduction is done for the case even when old and new are on different
> data blocks as well.

That makes me feel nervous. I doubt the marginal gain is worth it.
Most updates don't cross blocks.

>> Please also bear in mind that Andres will be looking to include the PK
>> columns in every WAL record for BDR. That could be an option, but I
>> doubt there is much value in excluding PK columns.
>
> Agreed. However once the implementation by Andres is done I can merge both
> codes and
> take the performance data again, based on which we can take decision.

It won't happen like that because there won't be a single point where
Andres is done. If you agree, then it's worth doing it that way to
begin with, rather than requiring us to revisit the same section of
code twice.

One huge point that needs to be thought through is how we prove this
code actually works on WAL/recovery side. A normal regression test
won't prove that and we don't have a framework in place for that.

If you think about what you'll need to do to prove you haven't made
some fatal corruption of WAL, it's going to look a lot like logical
replication tests. Worst case here is that mistakes on this patch will
show up as Andres' mistakes. So there is a stronger connection to
Andres' work than it first appears.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 10:30:48
Message-ID:	50239158.6010601@enterprisedb.com

On 09.08.2012 12:18, Simon Riggs wrote:
> On 9 August 2012 09:49, Amit Kapila<amit(dot)kapila(at)huawei(dot)com> wrote:
>
>> WAL reduction is done for the case even when old and new are on different
>> data blocks as well.
>
> That makes me feel nervous. I doubt the marginal gain is worth it.
> Most updates don't cross blocks.

That was my first instinctive reaction too. But if the mechanism works
just as well for cross-page updates, seems a bit strange to not use it.

One argument would be that if for some reason the old block is corrupt
or lost, you would not be able to recover the new version of the tuple
from the WAL alone. At the moment, it's nice that the WAL record
contains all the information required to reconstruct the new tuple,
regardless of the old data block contents. But then again, full-page
writes cover that too. There will be a full-page image of the old block
in the WAL anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 11:11:54
Message-ID:	CA+U5nML=3CoCdrjwMZYuDC2sbFmO03fxpdHrrCgEdnBdB4=kDQ@mail.gmail.com

On 9 August 2012 11:30, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 09.08.2012 12:18, Simon Riggs wrote:
>>
>> On 9 August 2012 09:49, Amit Kapila<amit(dot)kapila(at)huawei(dot)com> wrote:
>>
>>> WAL reduction is done for the case even when old and new are on
>>> different
>>> data blocks as well.
>>
>>
>> That makes me feel nervous. I doubt the marginal gain is worth it.
>> Most updates don't cross blocks.
>
>
> That was my first instinctive reaction too. But if the mechanism works just
> as well for cross-page updates, seems a bit strange to not use it.
>
> One argument would be that if for some reason the old block is corrupt or
> lost, you would not be able to recover the new version of the tuple from the
> WAL alone. At the moment, it's nice that the WAL record contains all the
> information required to reconstruct the new tuple, regardless of the old
> data block contents.

Exactly. If we lose the first block in a checkpoint, we could lose all
updates to rows in that page and all other pages linked to it over a
whole checkpoint duration. Basically, page corruption will propagate
from block to block if we do this.

Given the marginal gain because of a low percentage of cross-block
updates, I'm not keen. Low percentage because HOT tries hard to keep
things on same block - even for non-HOT updates (which is the case,
even though it sounds weird).

> But then again, full-page writes cover that too. There
> will be a full-page image of the old block in the WAL anyway.

Right, but we're planning to remove that, so it's not a safe assumption
to use when building new code.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Simon Riggs'" <simon(at)2ndQuadrant(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 11:17:15
Message-ID:	001901cd7620$887a9b60$996fd220$@kapila@huawei.com

From: pgsql-hackers-owner(at)postgresql(dot)org
[mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Simon Riggs
Sent: Thursday, August 09, 2012 2:49 PM
On 9 August 2012 09:49, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:

>>> I'd suggest we do this only when the saving is large enough for
>>> benefit, rather than do this every time.
>> Do you mean to say that when length of updated values of tuple is less
>> than some threshold(1/3 or 2/3, etc..) value of
>> total length?

> Some heuristic, yes, similar to TOAST's minimum threshold. To attempt
> removal of rows in all cases would not be worth it, so we need a fast
> path way of saying lets just take all of the columns.

Yes, it has to be done. Currently I have 2 ideas to take care of this:
a. Based on the number of updated columns
b. Based on the length of the updated values
If you have any other idea, or you favor one of the above, let me
know your opinion.
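
To make ideas (a) and (b) concrete, here is a minimal Python sketch of such a fast-path check. The function name worth_encoding_delta and the two limits max_columns and max_fraction are hypothetical tuning knobs chosen for illustration; nothing in this thread fixes their values. The 4 bytes of per-entry overhead correspond to the 2-byte offset and 2-byte length of the proposed format.

```python
from typing import List


def worth_encoding_delta(tuple_len: int, changed_lens: List[int],
                         max_columns: int = 4, max_fraction: float = 0.5) -> bool:
    """Decide whether to log a delta instead of the full new tuple.

    Idea (a): give up if too many columns changed.
    Idea (b): give up if the encoded delta is not clearly smaller than the tuple.
    max_columns and max_fraction are made-up values for illustration only.
    """
    if len(changed_lens) > max_columns:
        return False
    # Each entry carries a 2-byte offset and a 2-byte length before the value.
    delta_size = sum(changed_lens) + 4 * len(changed_lens)
    return delta_size <= tuple_len * max_fraction


# Modified pgbench case: 3 columns totalling 300 bytes in an 1800-byte row.
# The even 100/100/100 split per column is an assumption.
use_delta = worth_encoding_delta(1800, [100, 100, 100])
```

With these numbers the delta body is 312 bytes against 1800 bytes for the whole row, so the check passes; for a short row where most bytes change, it falls back to logging all columns.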

>>> You don't mention whether or not the old and the new tuple are on the
>>> same data block.
>
>> WAL reduction is done for the case even when old and new are on different
>> data blocks as well.

> That makes me feel nervous. I doubt the marginal gain is worth it.
> Most updates don't cross blocks.

How can it be shown whether the gain from handling this case is marginal or substantial?

One way is to test after the modification.
I have updated the pgbench TPC-B case as follows:
1. The schema is such that it contains rows of length 1800
2. TPC-B only has updates
3. The length of the updated column values is 300
4. All tables have 100% fill factor
5. Vacuum is OFF

So in such a run, I think many of the updates should be across blocks. But I am not
sure, and I have not verified it in any way.
The above run has given a good performance improvement.

>>> Please also bear in mind that Andres will be looking to include the PK
>>> columns in every WAL record for BDR. That could be an option, but I
>>> doubt there is much value in excluding PK columns.
>
>> Agreed. However once the implementation by Andres is done I can merge both
>> codes and
>> take the performance data again, based on which we can take decision.

> It won't happen like that because there won't be a single point where
> Andres is done. If you agree, then its worth doing it that way to
> begin with, rather than requiring us to revisit the same section of
> code twice.

This optimization is meant to reduce the amount of WAL, so adding
anything extra will definitely
have some impact.
However, if there is no better way than including the PK in WAL, then I
don't have any problem with it.

> One huge point that needs to be thought through is how we prove this
> code actually works on WAL/recovery side. A normal regression test
> won't prove that and we don't have a framework in place for that.

My initial idea for validating recovery:
1. Manual test:
a. Generate enough scenarios for the update operation.
b. For each scenario, make sure replay happens properly.
2. Community review.

With Regards,
Amit Kapila.

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 11:29:09
Message-ID:	50239F05.5090004@enterprisedb.com

On 09.08.2012 14:11, Simon Riggs wrote:
> Given the marginal gain because of a low percentage of cross-block
> updates, I'm not keen. Low percentage because HOT tries hard to keep
> things on same block - even for non-HOT updates (which is the case,
> even though it sounds weird).

That depends entirely on the workload. If you do a bulk update that
updates every row on the table, most are going to be cross-block
updates, and the WAL size does matter.

>> But then again, full-page writes cover that too. There
>> will be a full-page image of the old block in the WAL anyway.
>
> Right, but we're planning to remove that, so its not a safe assumption
> to use when building new code.

I don't think we're going to get rid of full-page images any time soon.
I guess you could easily check if full-page writes are enabled, though,
and only do it for cross-page updates if it is.
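
A minimal sketch of that gating rule, in Python for illustration only. The function name may_use_delta and its parameters are hypothetical; in the server the decision would be made where the update's WAL record is built, using the full_page_writes setting.

```python
def may_use_delta(old_block: int, new_block: int, full_page_writes: bool) -> bool:
    """Allow the delta format for same-block updates always, and for
    cross-block updates only when full-page writes are enabled."""
    if old_block == new_block:
        return True
    # Cross-block update: rely on the full-page image of the old block.
    return full_page_writes


same_block = may_use_delta(7, 7, full_page_writes=False)
cross_block_no_fpw = may_use_delta(7, 8, full_page_writes=False)
cross_block_fpw = may_use_delta(7, 8, full_page_writes=True)
```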

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 11:59:19
Message-ID:	CA+U5nMJBGcuHkVELHRMsU6fsMCijVw0-SiGsh1wE5E-mn538rQ@mail.gmail.com

On 9 August 2012 12:17, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:

> This optimization is to reduce the amount of WAL and definitely adding
> anything extra will have some impact.

Of course. The question is "How much impact?". Each tweak has
progressively less and less gain. This isn't a binary choice.

Squeezing the last ounce of performance at the expense of all other
concerns is not a sensible goal, IMHO, nor do we attempt that
elsewhere.

Given we're making no attempt to remove full page writes, which is
clearly the biggest source of WAL volume currently, micro optimisation
of other factors seems unwarranted at this stage.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "'Simon Riggs'" <simon(at)2ndQuadrant(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 12:56:04
Message-ID:	001d01cd762e$563e7780$02bb6680$@kapila@huawei.com

From: Heikki Linnakangas [mailto:heikki(dot)linnakangas(at)enterprisedb(dot)com]
Sent: Thursday, August 09, 2012 4:59 PM
On 09.08.2012 14:11, Simon Riggs wrote:

>>> But then again, full-page writes cover that too. There
>>> will be a full-page image of the old block in the WAL anyway.
>
>> Right, but we're planning to remove that, so its not a safe assumption
>> to use when building new code.

> I don't think we're going to get rid of full-page images any time soon.
> I guess you could easily check if full-page writes are enabled, though,
> and only do it for cross-page updates if it is.

As I understand it, you are talking about corruption due to
partial page writes, which can be handled by the full-page images in WAL. Correct
me if I misunderstood.
Based on that, even if full-page images are removed, corrupt page handling will be covered by
double-buffer writes [an alternative solution to full-page writes for some of
the paths].

With Regards,
Amit Kapila.

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Cc:	'Simon Riggs' <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 13:09:21
Message-ID:	5023B681.2020705@enterprisedb.com

On 09.08.2012 15:56, Amit Kapila wrote:
> From: Heikki Linnakangas [mailto:heikki(dot)linnakangas(at)enterprisedb(dot)com]
> Sent: Thursday, August 09, 2012 4:59 PM
> On 09.08.2012 14:11, Simon Riggs wrote:
>
>>>> But then again, full-page writes cover that too. There
>>>> will be a full-page image of the old block in the WAL anyway.
>>
>>> Right, but we're planning to remove that, so its not a safe assumption
>>> to use when building new code.
>
>> I don't think we're going to get rid of full-page images any time soon.
>> I guess you could easily check if full-page writes are enabled, though,
>> and only do it for cross-page updates if it is.
>
> According to my understanding you are talking about corruption due to
> partial page writes which can be handled by full-page image of WAL. Correct
> me if I misunderstood.

I meant corruption caused by anything, like disk failure, bugs, cosmic
rays, etc. The point is that currently the WAL record contains all the
information required to reconstruct the new tuple. With a diff method,
that's no longer the case, so if the old tuple gets corrupt for whatever
reason, that error will be propagated to the new tuple.

It's not an issue as long as everything works correctly, but some
redundancy is nice when you're trying to resurrect a corrupt database.
That's what we're talking about here. That said, I don't think it's a
big deal for this patch, at least not as long as full-page writes are
enabled.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Simon Riggs'" <simon(at)2ndQuadrant(dot)com>, "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 13:10:40
Message-ID:	001e01cd7630$609ef3b0$21dcdb10$@kapila@huawei.com

From: Simon Riggs [mailto:simon(at)2ndQuadrant(dot)com]
Sent: Thursday, August 09, 2012 5:29 PM
On 9 August 2012 12:17, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:

>> This optimization is to reduce the amount of WAL and definitely adding
>> anything extra will have some impact.

> Of course. The question is "How much impact?". Each tweak has
> progressively less and less gain. This isn't a binary choice.

> Squeezing the last ounce of performance at the expense of all other
> concerns is not a sensible goal, IMHO, nor do we attempt that
> elsewhere.

> Given we're making no attempt to remove full page writes, which is
> clearly the biggest source of WAL volume currently, micro optimisation
> of other factors seems unwarranted at this stage.

My point about WAL reduction concerns update operation performance,
and
full-page writes have no direct correlation with the update operation
except in
the case of the first update of a page after a checkpoint.

With Regards,
Amit Kapila.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 16:39:03
Message-ID:	CA+TgmoZXJwZt9u3ipk=a+Gipw6ShcDFvs+uCd-sUqUQxryP=XA@mail.gmail.com

On Thu, Aug 9, 2012 at 9:09 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I meant corruption caused by anything, like disk failure, bugs, cosmic rays,
> etc. The point is that currently the WAL record contains all the information
> required to reconstruct the old tuple. With a diff method, that's no longer
> the case, so if the old tuple gets corrupt for whatever reason, that error
> will be propagated to the new tuple.
>
> It's not an issue as long as everything works correctly, but some redundancy
> is nice when you're trying to resurrect a corrupt database. That's what
> we're talking about here. That said, I don't think it's a big deal for this
> patch, at least not as long as full-page writes are enabled.

So suppose that the following sequence of events occurs:

1. Tuple A on page 1 is updated. The new version, tuple B, is placed on page 2.
2. The table is vacuumed, removing tuple A.
3. Page 1 is written durably to disk.
4. Crash.

If reconstructing tuple B requires possession of tuple A, it seems
that we are now screwed.

No?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 16:43:48
Message-ID:	5023E8C4.2030804@enterprisedb.com

On 09.08.2012 19:39, Robert Haas wrote:
> On Thu, Aug 9, 2012 at 9:09 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> I meant corruption caused by anything, like disk failure, bugs, cosmic rays,
>> etc. The point is that currently the WAL record contains all the information
>> required to reconstruct the old tuple. With a diff method, that's no longer
>> the case, so if the old tuple gets corrupt for whatever reason, that error
>> will be propagated to the new tuple.
>>
>> It's not an issue as long as everything works correctly, but some redundancy
>> is nice when you're trying to resurrect a corrupt database. That's what
>> we're talking about here. That said, I don't think it's a big deal for this
>> patch, at least not as long as full-page writes are enabled.
>
> So suppose that the following sequence of events occurs:
>
> 1. Tuple A on page 1 is updated. The new version, tuple B, is placed on page 2.
> 2. The table is vacuumed, removing tuple A.
> 3. Page 1 is written durably to disk.
> 4. Crash.
>
> If reconstructing tuple B requires possession of tuple A, it seems
> that we are now screwed.

Not with full_page_writes=on, as crash recovery will restore the old
page contents. But you're right, with full_page_writes=off you are screwed.
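
The scenario can be sketched as a small simulation. This is only an illustration under stated assumptions: the delta format is modelled on the [offset][length][data] triples from the original proposal, pages are plain dictionaries, and `make_delta`, `apply_delta` and `replay` are hypothetical helpers, not PostgreSQL functions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical delta: (offset, changed bytes) runs against the old tuple.
Delta = List[Tuple[int, bytes]]


def make_delta(old: bytes, new: bytes) -> Delta:
    """Record the byte runs that differ; assumes equal-length tuples."""
    assert len(old) == len(new)
    delta: Delta = []
    i = 0
    while i < len(new):
        if old[i] != new[i]:
            start = i
            while i < len(new) and old[i] != new[i]:
                i += 1
            delta.append((start, new[start:i]))
        else:
            i += 1
    return delta


def apply_delta(old: Optional[bytes], delta: Delta) -> bytes:
    """Rebuild the new tuple; this is impossible without the old tuple."""
    if old is None:
        raise LookupError("old tuple is gone; delta cannot be replayed")
    buf = bytearray(old)
    for offset, data in delta:
        buf[offset:offset + len(data)] = data
    return bytes(buf)


def replay(disk_page1: Dict[str, bytes], delta: Delta,
           full_page_image: Optional[Dict[str, bytes]]) -> bytes:
    """Redo the update that created tuple B from tuple A on page 1."""
    # With full_page_writes=on, recovery first restores an earlier image of
    # page 1 that still contains tuple A; otherwise it sees the disk state.
    page1 = full_page_image if full_page_image is not None else disk_page1
    return apply_delta(page1.get("A"), delta)


tuple_a = b"id=0001 balance=0100"
tuple_b = b"id=0001 balance=0250"
delta = make_delta(tuple_a, tuple_b)

page1_after_vacuum: Dict[str, bytes] = {}   # steps 2-3: A removed, page flushed
page1_image = {"A": tuple_a}                # image logged while A still existed

recovered = replay(page1_after_vacuum, delta, page1_image)   # full_page_writes=on
try:
    replay(page1_after_vacuum, delta, None)                  # full_page_writes=off
    lost = False
except LookupError:
    lost = True
```

With the page image available, tuple B is rebuilt; without it, replay has nothing to apply the delta to, which is the "screwed" case above.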

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-09 17:48:03
Message-ID:	CA+Tgmobi7ZA=A1z4TdEoUwv5tRy4E0RV2DOHna_nxy8wpY0P8A@mail.gmail.com

On Thu, Aug 9, 2012 at 12:43 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> So suppose that the following sequence of events occurs:
>>
>> 1. Tuple A on page 1 is updated. The new version, tuple B, is placed on
>> page 2.
>> 2. The table is vacuumed, removing tuple A.
>> 3. Page 1 is written durably to disk.
>> 4. Crash.
>>
>> If reconstructing tuple B requires possession of tuple A, it seems
>> that we are now screwed.
>
> Not with full_page_writes=on, as crash recovery will restore the old page
> contents. But you're right, with full_page_writes=off you are screwed.

I think the property that recovery only needs to worry about each
block individually is one that we want to preserve. Supporting this
optimization only when full_page_writes=off seems ugly, and I also
agree with Simon's objection upthread: the current design minimizes
the chances of corruption propagating from block to block. Even if
the proposed design is bullet-proof as of this moment (at least with
full_page_writes=on) it seems very possible that it could get
accidentally broken by future code changes, leading to hard-to-find
data corruption bugs. It might also complicate other things that we
will want to do down the line, like parallelizing recovery.

In the pgbench testing I've done, almost all of the updates are HOT,
provided you run the test long enough to reach steady state, so
restricting this optimization to HOT updates shouldn't hurt that case
(or similar real-world cases) very much. Of course there are probably
also real-world cases where HOT applies only seldom, and those cases
won't get the benefit of this, but you can't win them all.
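
A minimal sketch of that restriction, assuming the test is simply "old and new tuple versions live on the same block" (HOT updates always satisfy this, since the new version stays on the same page). `TupleLocation`, `can_use_delta_wal` and the record labels are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TupleLocation:
    block: int    # heap page number
    offset: int   # line pointer slot within the page


def can_use_delta_wal(old: TupleLocation, new: TupleLocation) -> bool:
    """Allow delta encoding only if redo needs to look at a single block."""
    return old.block == new.block


def choose_wal_record(old: TupleLocation, new: TupleLocation) -> str:
    # "delta" and "full_tuple" are illustrative labels, not real record types.
    return "delta" if can_use_delta_wal(old, new) else "full_tuple"


same_page = choose_wal_record(TupleLocation(1, 3), TupleLocation(1, 7))
cross_page = choose_wal_record(TupleLocation(1, 3), TupleLocation(2, 1))
```

Because the old tuple and the reconstructed tuple are then on the same block, recovery still only has to reason about one block at a time.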

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Robert Haas'" <robertmhaas(at)gmail(dot)com>, "'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	"'Simon Riggs'" <simon(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-10 05:25:39
Message-ID:	002701cd76b8$94da31c0$be8e9540$@kapila@huawei.com

From: Robert Haas [mailto:robertmhaas(at)gmail(dot)com]
Sent: Thursday, August 09, 2012 11:18 PM
On Thu, Aug 9, 2012 at 12:43 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> So suppose that the following sequence of events occurs:
>>
>>> 1. Tuple A on page 1 is updated. The new version, tuple B, is placed on
>>> page 2.
>>> 2. The table is vacuumed, removing tuple A.
>>> 3. Page 1 is written durably to disk.
>>> 4. Crash.
>>
>>> If reconstructing tuple B requires possession of tuple A, it seems
>>> that we are now screwed.
>
>> Not with full_page_writes=on, as crash recovery will restore the old page
>> contents. But you're right, with full_page_writes=off you are screwed.

> I think the property that recovery only needs to worry about each
> block individually is one that we want to preserve. Supporting this
> optimization only when full_page_writes=off seems ugly,

I think recovery needs to worry about multiple blocks in some cases as well.
Please see the case below and correct me if I am wrong.
I think that even today crash recovery can run into problems with
full_page_writes=off:
1. Tuple A on page 1 is updated. The new version, tuple B, is placed on
page 2.
2. Page 1 is partially written to disk.
3. During recovery, it can appear that there is no need to update XMAX
and the other related fields in the old tuple, because the page LSN is
greater than the WAL record's LSN.
4. This can then lead to further problems related to tuple visibility.
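
The case above can be sketched as follows, assuming redo skips a record whenever the page LSN is not older than the record's LSN. The `Page` class, `redo_set_xmax` and the LSN and xid values are hypothetical and only model the torn-page effect.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Page:
    lsn: int = 0
    xmax: Dict[str, int] = field(default_factory=dict)  # tuple name -> updating xid


def redo_set_xmax(page: Page, record_lsn: int, tup: str, xid: int) -> bool:
    """Replay 'mark old tuple as updated'; returns True if the change was applied."""
    if page.lsn >= record_lsn:
        return False          # page claims to already contain this change
    page.xmax[tup] = xid
    page.lsn = record_lsn
    return True


# In memory before the crash: the update (WAL LSN 100, xid 42) was applied.
in_memory = Page(lsn=100, xmax={"A": 42})

# Torn write: the sector with the page header (new LSN) reached disk, but the
# sector holding tuple A's XMAX did not.
on_disk = Page(lsn=in_memory.lsn, xmax={})

applied = redo_set_xmax(on_disk, record_lsn=100, tup="A", xid=42)
```

Redo is skipped because the LSN looks current, so tuple A keeps no XMAX and both A and B can appear visible.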

> and I also
> agree with Simon's objection upthread: the current design minimizes
> the chances of corruption propagating from block to block. Even if
> the proposed design is bullet-proof as of this moment (at least with
> full_page_writes=on) it seems very possible that it could get
> accidentally broken by future code changes, leading to hard-to-find
> data corruption bugs. It might also complicate other things that we
> will want to do down the line, like parallelizing recovery.

I can see the problem in case we remove the full-page-writes concept and
replace it with some other equivalent concept that lacks the current
flexibility.

With Regards,
Amit Kapila.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-30 17:53:29
Message-ID:	CA+TgmoaJBr8VFkpb5RiebtHTqMpObu-NVyObhoPE+qohwgwcMA@mail.gmail.com

On Fri, Aug 10, 2012 at 1:25 AM, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
>> I think the property that recovery only needs to worry about each
>> block individually is one that we want to preserve. Supporting this
>> optimization only when full_page_writes=off seems ugly,
>
> I think recovery needs to worry about multiple blocks in some cases as well.
> Please see the case below and correct me if I am wrong.
> I think that even today crash recovery can run into problems with
> full_page_writes=off:
> 1. Tuple A on page 1 is updated. The new version, tuple B, is placed on
> page 2.
> 2. Page 1 is partially written to disk.
> 3. During recovery, it can appear that there is no need to update XMAX
> and the other related fields in the old tuple, because the page LSN is
> greater than the WAL record's LSN.
> 4. This can then lead to further problems related to tuple visibility.

Well, you're only supposed to turn full_page_writes=off if partial
page writes are impossible on your system. If you turn off
full_page_writes on a system where partial page writes are impossible,
then you've intentionally broken crash recovery, and you get to keep
both pieces.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Robert Haas'" <robertmhaas(at)gmail(dot)com>
Cc:	"'Heikki Linnakangas'" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "'Simon Riggs'" <simon(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date:	2012-08-31 03:09:52
Message-ID:	002701cd8726$17714a60$4653df20$@kapila@huawei.com

On Thursday, August 30, 2012 11:23 PM Robert Haas
[mailto:robertmhaas(at)gmail(dot)com] wrote:
On Fri, Aug 10, 2012 at 1:25 AM, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
>>> I think the property that recovery only needs to worry about each
>>> block individually is one that we want to preserve. Supporting this
>>> optimization only when full_page_writes=off seems ugly,
>
>> I think recovery needs to worry about multiple blocks in some cases as well.
>> Please see the case below and correct me if I am wrong.
>> I think that even today crash recovery can run into problems with
>> full_page_writes=off:
>> 1. Tuple A on page 1 is updated. The new version, tuple B, is placed on
>> page 2.
>> 2. Page 1 is partially written to disk.
>> 3. During recovery, it can appear that there is no need to update XMAX
>> and the other related fields in the old tuple, because the page LSN is
>> greater than the WAL record's LSN.
>> 4. This can then lead to further problems related to tuple visibility.

> Well, you're only supposed to turn full_page_writes=off if partial
> page writes are impossible on your system. If you turn off
> full_page_writes on a system where partial page writes are impossible,

I think you meant to say "full_page_writes on a system where partial page
writes are possible", because if partial page writes are impossible, the
user can safely keep full_page_writes = off.

> then you've intentionally broken crash recovery, and you get to keep
> both pieces.

Robert, broadly I understand your and Simon's idea that we should apply
the WAL reduction only when the update stays on the same page. I have
implemented the final patch, which performs the WAL optimization only when
the updated tuple is on the same page as the old one. We have also
observed that with fillfactor 80 the performance improvement is good.

With Regards,
Amit Kapila.