Freezing without write I/O

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Freezing without write I/O
Date: 2013-09-19 01:20:00
Message-ID: CAMkU=1y4wjBFYTQRVVn-v2Etz6yJrGOpTNd3d01wXjE6+dOokw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 18, 2013 at 12:55 PM, Jeff Janes
<jeff(dot)janes(at)gmail(dot)com> wrote:

> On Mon, Sep 16, 2013 at 6:59 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>
>>
>> Here's a rebased version of the patch, including the above-mentioned
>> fixes. Nothing else new.
>
>
> I've applied this to 0892ecbc015930d, the last commit to which it applies
> cleanly.
>
> When I test this by repeatedly incrementing a counter in a randomly chosen
> row, then querying the whole table and comparing the results to what my
> driver knows they should be, I get discrepancies.
>
> No crash/recovery needs to be done to get the behavior.
>
> The number of rows is correct, so one version of every row is visible, but
> it is sometimes the wrong version.
>
> The discrepancy arises shortly after the first time this type of message
> appears:
>
> 6930 UPDATE 2013-09-18 12:36:34.519 PDT:LOG: started new XID range, XIDs
> 1000033-, MultiXIDs 1-, tentative LSN 0/FA517F8
> 6930 UPDATE 2013-09-18 12:36:34.519 PDT:STATEMENT: update foo set
> count=count+1 where index=$1
> 6928 UPDATE 2013-09-18 12:36:34.521 PDT:LOG: closed old XID range at
> 1000193 (LSN 0/FA58A08)
> 6928 UPDATE 2013-09-18 12:36:34.521 PDT:STATEMENT: update foo set
> count=count+1 where index=$1
>
> I'll work on getting the driver to shut down the database the first time it
> finds a problem so that autovac doesn't destroy evidence.
>

I have uploaded the script to reproduce the problem, along with a tarball
of the data directory. (When started, it will go through recovery. Table
"foo" is in the jjanes database, under the jjanes role.)

https://drive.google.com/folderview?id=0Bzqrh1SO9FcEek51NGEzRmFDVEE&usp=sharing

The row with index=8499 should have a count of 8, but actually has a count
of 4. It can be found only by a seq scan; an index scan finds no such row.

select ctid,* from foo where index=8499;
select ctid,* from foo where index+0=8499;

select * from heap_page_items(get_raw_page('foo',37)) where lp=248 \x\g\x
Expanded display is on.
-[ RECORD 1 ]---------
lp          | 248
lp_off      | 8160
lp_flags    | 1
lp_len      | 32
t_xmin      | 2
t_xmax      | 0
t_field3    | 0
t_ctid      | (37,248)
t_infomask2 | 32770
t_infomask  | 10496
t_hoff      | 24
t_bits      |
t_oid       |

So the xmax is 0 when it really should not be.
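
As a reading aid, the t_infomask and t_infomask2 values in the record above
can be decoded into flag names. The following is a minimal sketch, not part
of the original report. It assumes the bit values from stock PostgreSQL 9.3
(src/include/access/htup_details.h); the patch under test may redefine some
of them. The helper names decode_flags and decode_infomask2 are hypothetical.

```python
# Flag bits as in stock PostgreSQL 9.3 htup_details.h.
# Assumption: the patch under test does not change these definitions.
INFOMASK_FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0008: "HEAP_HASOID",
    0x0010: "HEAP_XMAX_KEYSHR_LOCK",
    0x0020: "HEAP_COMBOCID",
    0x0040: "HEAP_XMAX_EXCL_LOCK",
    0x0080: "HEAP_XMAX_LOCK_ONLY",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x1000: "HEAP_XMAX_IS_MULTI",
    0x2000: "HEAP_UPDATED",
    0x4000: "HEAP_MOVED_OFF",
    0x8000: "HEAP_MOVED_IN",
}

INFOMASK2_FLAGS = {
    0x2000: "HEAP_KEYS_UPDATED",
    0x4000: "HEAP_HOT_UPDATED",
    0x8000: "HEAP_ONLY_TUPLE",
}

# The low 11 bits of t_infomask2 hold the attribute count.
HEAP_NATTS_MASK = 0x07FF


def decode_flags(value, table):
    """Return the names of all flags set in value, lowest bit first."""
    return [name for bit, name in sorted(table.items()) if value & bit]


def decode_infomask2(value):
    """Return (number of attributes, list of flag names) for t_infomask2."""
    return value & HEAP_NATTS_MASK, decode_flags(value, INFOMASK2_FLAGS)


if __name__ == "__main__":
    # Values taken from the heap_page_items output for (37,248).
    print(decode_flags(10496, INFOMASK_FLAGS))
    print(decode_infomask2(32770))
```

Under those assumptions, 10496 (0x2900) decodes to HEAP_XMIN_COMMITTED,
HEAP_XMAX_INVALID and HEAP_UPDATED, and 32770 (0x8002) decodes to two
attributes plus HEAP_ONLY_TUPLE. In stock PostgreSQL, t_xmin = 2 is
FrozenTransactionId.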

What I really want to do is find the non-visible tuples (by ctid) that have
index=8499, but I don't know how to do that.

Cheers,

Jeff
