Quick Links

BUG #6748: sequence value may be conflict in some cases

Lists:	pgsql-bugspgsql-hackers

From:	meixiangming(at)huawei(dot)com
To:	pgsql-bugs(at)postgresql(dot)org
Subject:	BUG #6748: sequence value may be conflict in some cases
Date:	2012-07-23 01:51:56
Message-ID:	E1St7oS-0004Rg-Oc@wrigleys.postgresql.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs pgsql-hackers

The following bug has been logged on the website:

Bug reference: 6748
Logged by: Xiangming Mei
Email address: meixiangming(at)huawei(dot)com
PostgreSQL version: 9.1.4
Operating system: SUSE 11 X64
Description:

Reproduce steps:
1. create a table using serial such as:
CREATE TABLE tbl_aaa
(
sno serial NOT NULL, //自增序列，等同于sequence
userid int NOT NULL,
CONSTRAINT tbl_aaa_pkey PRIMARY KEY (sno)
);
2. insert a tuple:
INSERT INTO tbl_aaa values (DEFAULT, 1);
3. postgersql instance crashed, may be power lost or killed by someone:
4. restart the instance then insert another tuple into tbl_aaa, you will get
a error message:
INSERT INTO tbl_aaa values (DEFAULT, 2);
ERROR: SQLSTATE 23505: duplicate key value violates unique constraint
"tbl_aaa_pkey"
LOCATION: _bt_check_unique, nbtinsert.c:300

I think this is because there is no logit when 'is_called' become from 'f'
to 't'.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	meixiangming(at)huawei(dot)com
Cc:	pgsql-bugs(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: BUG #6748: sequence value may be conflict in some cases
Date:	2012-07-23 18:43:34
Message-ID:	4360.1343069014@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs pgsql-hackers

meixiangming(at)huawei(dot)com writes:
> [ freshly-created sequence has wrong state after crash ]

I didn't believe this at first, but sure enough, it fails just as
described if you force a crash between the first and second nextval
calls for the sequence. This used to work ...

The change that broke it turns out to be the ALTER SEQUENCE OWNED BY
call that we added to serial-column creation circa 8.2; although on
closer inspection I think any ALTER SEQUENCE before the first nextval
call would be problematic. The real issue is the ancient kluge in
sequence creation that writes something different into the WAL log
than what it leaves behind in shared buffers:

/* We do not log first nextval call, so "advance" sequence here */
/* Note we are scribbling on local tuple, not the disk buffer */
newseq->is_called = true;
newseq->log_cnt = 0;

The tuple in buffers has log_cnt = 1, is_called = false, but the initial
XLOG_SEQ_LOG record shows log_cnt = 0, is_called = true. So if we crash
at this point, after recovery it looks like one nextval() has already
been done. However, AlterSequence generates another XLOG_SEQ_LOG record
based on what's in shared buffers, so after replay of that, we're back
to the "original" state where it does not appear that any nextval() has
been done.

I'm of the opinion that this kluge needs to be removed; it's just insane
that we're not logging the same state we leave in our buffers. To do
that, we need to fix nextval() so that the first nextval call generates
an xlog entry; that is, if we are changing is_called to true we ought to
consider that as a reason to force an xlog entry. I think way back when
we thought it was a good idea to avoid making two xlog entries during
creation and immediate use of a sequence, but considering all the other
xlog entries involved in creation of a sequence object, this is a pretty
silly "optimization". (Besides, it merely postpones the first
nextval-driven xlog entry from the first to the second nextval call.)

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: [BUGS] BUG #6748: sequence value may be conflict in some cases
Date:	2012-07-25 01:21:06
Message-ID:	9025.1343179266@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs pgsql-hackers

I wrote:
> The tuple in buffers has log_cnt = 1, is_called = false, but the initial
> XLOG_SEQ_LOG record shows log_cnt = 0, is_called = true. So if we crash
> at this point, after recovery it looks like one nextval() has already
> been done. However, AlterSequence generates another XLOG_SEQ_LOG record
> based on what's in shared buffers, so after replay of that, we're back
> to the "original" state where it does not appear that any nextval() has
> been done.

> I'm of the opinion that this kluge needs to be removed; it's just insane
> that we're not logging the same state we leave in our buffers.

Attached is a WIP patch for that. I had to leave nextval() logging
a different state from what it puts into buffers, since without that
we can't have the optimization of one-WAL-record-per-N-nextval-calls,
which seems worth preserving. But I was able to get rid of the other
routines' doing it. I think it's all right for nextval to do it because
the only difference between what it logs and what it leaves in the
buffer is that last_value is advanced and log_cnt is zero; in particular
there is not a difference in is_called state.

For awhile I was convinced that nextval() was broken in another way,
because it does MarkBufferDirty() before changing the buffer contents,
which sure *looks* wrong; what if a checkpoint writes the
not-yet-modified buffer contents and then clears the dirty bit?
However, I now think this is safe because we hold the buffer content
lock exclusively until we've finished making the changes. A checkpoint
might observe the dirty bit set and attempt to write the page, but it
will block trying to get shared buffer content lock until we complete
the changes. So that should be all right, and I've added some comments
to explain it, but if anyone thinks it's wrong speak up!

Another thing that I think is a real bug is that AlterSequence needs
to reset log_cnt to zero any time it changes any of the sequence
generation parameters, to ensure that the new values will be applied
promptly. I've done that in the attached.

I also modified the code so that log_cnt is initialized to zero not one
when a sequence is created or reset to is_called = false. This doesn't
make any functional difference given the change to force logging a WAL
record when is_called becomes set, but I think it's a good change
anyway: it doesn't make any sense anymore to imply that one sequence
value has been pre-reserved. However, if anyone's particularly annoyed
by the first diff hunk in the regression tests, we could undo that.

Lastly, I modified read_seq_tuple (nee read_info) to return a HeapTuple
pointer to the sequence's tuple, and made use of that to make the rest
of the code simpler and less dependent on page-layout assumptions.
At this point this is only a code-beautification change; it was
functionally necessary in an earlier version of the patch, but I didn't
back it out when I got rid of the change that made it necessary.

Oh, there are also some hunks in here to add debug logging, which I was
using for testing. I'm undecided whether to remove those or clean them
up some more and wrap them in something like #ifdef SEQUENCE_DEBUG.

One other point about the regression test diffs: formerly, the typical
state seen at line 191 of sequence.out was log_cnt = 32. Now it is 31,
because the first nextval call increases log_cnt to 32 and then the
second decreases it to 31. The failure mode described in
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01359.php
is no longer possible, because a checkpoint before the first nextval
call won't change its behavior. But instead, if a checkpoint happens
*between* those two nextvals, the second nextval will decide to create
a WAL entry and it will increase log_cnt to 32. So we still need the
variant expected file sequence_1.out, but now it shows log_cnt = 32
instead of the other way around.

Comments anyone?

regards, tom lane

Attachment	Content-Type	Size
sequence-fixes.patch	text/x-patch	28.2 KB