Re: streaming header too small

Lists: pgsql-hackers
From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: streaming header too small
Date: 2012-01-09 06:34:11
Message-ID: CAJKUy5hLChrLCGN2SrRuEFqXVTyRE3LyiqjRpsEq4e8YN99d9g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I was trying pg_basebackup on head, i used this command:
"""
postgres(at)jaime:/usr/local/pgsql/9.2$ bin/pg_basebackup -D $PWD/data2
-x stream -P -p 54392
"""

i got this error
"""
19093/19093 kB (100%), 1/1 tablespace
pg_basebackup: streaming header too small: 17
pg_basebackup: child process exited with error 1
"""

now, this streaming header size is defined in
src/bin/pg_basebackup/receivelog.c as "#define STREAMING_HEADER_SIZE
(1+8+8+8)", so WTF is this?
what are these numbers? shouldn't be at least a comment explaining
those? more important it's seems obvious something broke that, unless
i misunderstood something which is completely possible, and that the
way is do it it will broke again in the future if the header change

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming header too small
Date: 2012-01-09 10:09:35
Message-ID: CABUevEx3TRDDvqa_Af9BzAo2W7QpAsMTfb8OrKfK8iwOJtewUw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 9, 2012 at 07:34, Jaime Casanova <jaime(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> I was trying pg_basebackup on head, i used this command:
> """
> postgres(at)jaime:/usr/local/pgsql/9.2$ bin/pg_basebackup -D $PWD/data2
> -x stream -P -p 54392
> """
>
> i got this error
> """
> 19093/19093 kB (100%), 1/1 tablespace
> pg_basebackup: streaming header too small: 17
> pg_basebackup: child process exited with error 1
> """
>
> now, this streaming header size is defined in
> src/bin/pg_basebackup/receivelog.c as "#define STREAMING_HEADER_SIZE
> (1+8+8+8)", so WTF is this?
> what are these numbers? shouldn't be at least a comment explaining
> those? more important it's seems obvious something broke that, unless

Those numbers are the size of WalDataMessageHeader - a struct which is
not available in the frontend, or at least wasn't at the time.

> i misunderstood something which is completely possible, and that the
> way is do it it will broke again in the future if the header change

Without looking at the details, I'm pretty sure it's the keepalive
message patch (64233902d22ba42846397cb7551894217522fad4).That one does
introduce a new message that's exactly that size.

pg_basebackup assumes the only kind of messages that can arrive are
the data messages, and this is no longer true. But if you check the
code for pg_basebackup, you'll see it checks the size of the message
*before* it checks the type of the message, which is why you get a
misleading error.

I'll dig into the details later - but you could try backing out that
patch to confirm if that's the problem.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming header too small
Date: 2012-01-09 11:00:15
Message-ID: CABUevEzFpa7Q7ANYpx+2-PZhSndqh7OFYGidCUm3Fy_MhEvbGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 9, 2012 at 11:09, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> On Mon, Jan 9, 2012 at 07:34, Jaime Casanova <jaime(at)2ndquadrant(dot)com> wrote:
>> Hi,
>>
>> I was trying pg_basebackup on head, i used this command:
>> """
>> postgres(at)jaime:/usr/local/pgsql/9.2$ bin/pg_basebackup -D $PWD/data2
>> -x stream -P -p 54392
>> """
>>
>> i got this error
>> """
>> 19093/19093 kB (100%), 1/1 tablespace
>> pg_basebackup: streaming header too small: 17
>> pg_basebackup: child process exited with error 1
>> """
>>
>> now, this streaming header size is defined in
>> src/bin/pg_basebackup/receivelog.c as "#define STREAMING_HEADER_SIZE
>> (1+8+8+8)", so WTF is this?
>> what are these numbers? shouldn't be at least a comment explaining
>> those? more important it's seems obvious something broke that, unless
>
> Those numbers are the size of WalDataMessageHeader - a struct which is
> not available in the frontend, or at least wasn't at the time.
>
>> i misunderstood something which is completely possible, and that the
>> way is do it it will broke again in the future if the header change
>
> Without looking at the details, I'm pretty sure it's the keepalive
> message patch (64233902d22ba42846397cb7551894217522fad4).That one does
> introduce a new message that's exactly that size.
>
> pg_basebackup assumes the only kind of messages that can arrive are
> the data messages, and this is no longer true. But if you check the
> code for pg_basebackup, you'll see it checks the size of the message
> *before* it checks the type of the message, which is why you get a
> misleading error.
>
> I'll dig into the details later - but you could try backing out that
> patch to confirm if that's the problem.

Confirmed that is it, and attached are two patches to fix it. The
first one I intend to backport to 9.1, since it just fixes the error
message. The other one is for 9.2. I'll also look at a better way to
get that structure size. comments?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Attachment Content-Type Size
0001-Reorder-check-for-streaming-message-and-header-size.patch text/x-patch 1.3 KB
0002-Accept-and-ignore-keepalive-messages-in-pg_basebacku.patch text/x-patch 1.4 KB

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming header too small
Date: 2012-01-09 17:11:01
Message-ID: CABUevEyv+YCQ26eXFKgBetDekeSp_YCJs80m1N=bFy3a47-GOg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 9, 2012 at 12:00, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> On Mon, Jan 9, 2012 at 11:09, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
>> On Mon, Jan 9, 2012 at 07:34, Jaime Casanova <jaime(at)2ndquadrant(dot)com> wrote:
>>> Hi,
>>>
>>> I was trying pg_basebackup on head, i used this command:
>>> """
>>> postgres(at)jaime:/usr/local/pgsql/9.2$ bin/pg_basebackup -D $PWD/data2
>>> -x stream -P -p 54392
>>> """
>>>
>>> i got this error
>>> """
>>> 19093/19093 kB (100%), 1/1 tablespace
>>> pg_basebackup: streaming header too small: 17
>>> pg_basebackup: child process exited with error 1
>>> """
>>>
>>> now, this streaming header size is defined in
>>> src/bin/pg_basebackup/receivelog.c as "#define STREAMING_HEADER_SIZE
>>> (1+8+8+8)", so WTF is this?
>>> what are these numbers? shouldn't be at least a comment explaining
>>> those? more important it's seems obvious something broke that, unless
>>
>> Those numbers are the size of WalDataMessageHeader - a struct which is
>> not available in the frontend, or at least wasn't at the time.
>>
>>> i misunderstood something which is completely possible, and that the
>>> way is do it it will broke again in the future if the header change
>>
>> Without looking at the details, I'm pretty sure it's the keepalive
>> message patch (64233902d22ba42846397cb7551894217522fad4).That one does
>> introduce a new message that's exactly that size.
>>
>> pg_basebackup assumes the only kind of messages that can arrive are
>> the data messages, and this is no longer true. But if you check the
>> code for pg_basebackup, you'll see it checks the size of the message
>> *before* it checks the type of the message, which is why you get a
>> misleading error.
>>
>> I'll dig into the details later - but you could try backing out that
>> patch to confirm if that's the problem.
>
> Confirmed that is it, and attached are two patches to fix it. The
> first one I intend to backport to 9.1, since it just fixes the error
> message. The other one is for 9.2. I'll also look at a better way to
> get that structure size.  comments?

Patch applied.

Realized there is no need to backpatch, because this code didn't even
exist in 9.1. The streaming mode of pg_basebackup (which is the only
affected one) didn't exist then...

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Selena Deckelmann <selena(at)chesnok(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming header too small
Date: 2013-02-20 00:11:23
Message-ID: CAN1EF+wsF3TRFjji0QmTYZe7g3FRCn47Ux2KcLyFMafe+5S0tg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 9, 2012 at 9:11 AM, Magnus Hagander <magnus(at)hagander(dot)net> wrote:

> On Mon, Jan 9, 2012 at 12:00, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> > On Mon, Jan 9, 2012 at 11:09, Magnus Hagander <magnus(at)hagander(dot)net>
> wrote:
> >> On Mon, Jan 9, 2012 at 07:34, Jaime Casanova <jaime(at)2ndquadrant(dot)com>
> wrote:
> >>> Hi,
> >>>
> >>> I was trying pg_basebackup on head, i used this command:
> >>> """
> >>> postgres(at)jaime:/usr/local/pgsql/9.2$ bin/pg_basebackup -D $PWD/data2
> >>> -x stream -P -p 54392
> >>> """
> >>>
> >>> i got this error
> >>> """
> >>> 19093/19093 kB (100%), 1/1 tablespace
> >>> pg_basebackup: streaming header too small: 17
> >>> pg_basebackup: child process exited with error 1
> >>> """
> >>>
> >>> now, this streaming header size is defined in
> >>> src/bin/pg_basebackup/receivelog.c as "#define STREAMING_HEADER_SIZE
> >>> (1+8+8+8)", so WTF is this?
> >>> what are these numbers? shouldn't be at least a comment explaining
> >>> those? more important it's seems obvious something broke that, unless
> >>
> >> Those numbers are the size of WalDataMessageHeader - a struct which is
> >> not available in the frontend, or at least wasn't at the time.
> >>
> >>> i misunderstood something which is completely possible, and that the
> >>> way is do it it will broke again in the future if the header change
> >>
> >> Without looking at the details, I'm pretty sure it's the keepalive
> >> message patch (64233902d22ba42846397cb7551894217522fad4).That one does
> >> introduce a new message that's exactly that size.
> >>
> >> pg_basebackup assumes the only kind of messages that can arrive are
> >> the data messages, and this is no longer true. But if you check the
> >> code for pg_basebackup, you'll see it checks the size of the message
> >> *before* it checks the type of the message, which is why you get a
> >> misleading error.
> >>
> >> I'll dig into the details later - but you could try backing out that
> >> patch to confirm if that's the problem.
> >
> > Confirmed that is it, and attached are two patches to fix it. The
> > first one I intend to backport to 9.1, since it just fixes the error
> > message. The other one is for 9.2. I'll also look at a better way to
> > get that structure size. comments?
>
> Patch applied.
>
> Realized there is no need to backpatch, because this code didn't even
> exist in 9.1. The streaming mode of pg_basebackup (which is the only
> affected one) didn't exist then...
>

So, I just ran into a similar issue backing up a 9.2.1 server using
pg_basebackup version 9.2.3:

pg_basebackup: starting background WAL receiver
pg_basebackup: streaming header too small: 25

I've had it happen two times in a row. I'm going to try again...

But -- what would be helpful here? I can recompile pg_basebackup with more
debugging...

-selena

--
http://chesnok.com


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Selena Deckelmann <selena(at)chesnok(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming header too small
Date: 2013-02-20 10:29:20
Message-ID: 5124A580.2000304@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 20.02.2013 02:11, Selena Deckelmann wrote:
> So, I just ran into a similar issue backing up a 9.2.1 server using
> pg_basebackup version 9.2.3:
>
> pg_basebackup: starting background WAL receiver
> pg_basebackup: streaming header too small: 25
>
>
> I've had it happen two times in a row. I'm going to try again...
>
> But -- what would be helpful here? I can recompile pg_basebackup with more
> debugging...

Hmm, 25 bytes would be the size of the WAL data packet, if it contains
just the header and no actual WAL data. I think pg_basebackup should
accept that - it's not unreasonable that the server might send such a
packet sometimes.

Looking at the walsender code, it's not supposed to ever send such a
packet. But I suspect there's one corner-case where it might: if the
current send location is at an xlogid boundary, so that we previously
sent the last byte from the last WAL segment in the previous logical
xlog file, and the WAL flush position points to byte 0 in the beginning
of the new WAL file. Both of those positions are in fact the same thing,
but we have two different ways to represent the same position. For
example, if we've already sent up to WAL position (sentPtr in walsender.c):

xlogid = 4
xrecoff = XLogFileSize

and GetFlushRecPtr() returns:

xlogid = 5
xrecoff = 0

Those both point to the same position. But the check in XLogSend that
decides if there is any work to do uses XLByteLE() to check if they are
equal, and XLByteLE() treats the latter to be greater than the former.
So, in that situation, XLogSend() would decide that it has work to do,
but there actually isn't, so it would send 0 bytes of WAL data.

I'm not sure how GetFlushRecPtr() could return such a position, though.
But I'm also not convinced that it can't happen.

It would be fairly easy to fix walsender to not send anything in that
situation. It would also be easy to fix pg_basebackup to not treat it as
an error. We probably should do both.

In 9.3, the XLogRecPtr representation changed so that there is only one
value for a boundary position like that, so this is a 9.2-only issue.

- Heikki


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Jaime Casanova <jaime(at)2ndquadrant(dot)com>, Selena Deckelmann <selena(at)chesnok(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming header too small
Date: 2013-02-20 14:23:45
Message-ID: CABUevEz5b_=0gQoDy2MaeFpFcQznsqekU6DvCLF-XsnzytNDwQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Feb 20, 2013 11:29 AM, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
wrote:
>
> On 20.02.2013 02:11, Selena Deckelmann wrote:
>>
>> So, I just ran into a similar issue backing up a 9.2.1 server using
>> pg_basebackup version 9.2.3:
>>
>> pg_basebackup: starting background WAL receiver
>> pg_basebackup: streaming header too small: 25
>>
>>
>> I've had it happen two times in a row. I'm going to try again...
>>
>> But -- what would be helpful here? I can recompile pg_basebackup with
more
>> debugging...
>
>
> Hmm, 25 bytes would be the size of the WAL data packet, if it contains
just the header and no actual WAL data. I think pg_basebackup should accept
that - it's not unreasonable that the server might send such a packet
sometimes.
>
> Looking at the walsender code, it's not supposed to ever send such a
packet. But I suspect there's one corner-case where it might: if the
current send location is at an xlogid boundary, so that we previously sent
the last byte from the last WAL segment in the previous logical xlog file,
and the WAL flush position points to byte 0 in the beginning of the new WAL
file. Both of those positions are in fact the same thing, but we have two
different ways to represent the same position. For example, if we've
already sent up to WAL position (sentPtr in walsender.c):
>
> xlogid = 4
> xrecoff = XLogFileSize
>
> and GetFlushRecPtr() returns:
>
> xlogid = 5
> xrecoff = 0
>
> Those both point to the same position. But the check in XLogSend that
decides if there is any work to do uses XLByteLE() to check if they are
equal, and XLByteLE() treats the latter to be greater than the former. So,
in that situation, XLogSend() would decide that it has work to do, but
there actually isn't, so it would send 0 bytes of WAL data.
>
> I'm not sure how GetFlushRecPtr() could return such a position, though.
But I'm also not convinced that it can't happen.
>
> It would be fairly easy to fix walsender to not send anything in that
situation. It would also be easy to fix pg_basebackup to not treat it as an
error. We probably should do both.
>
> In 9.3, the XLogRecPtr representation changed so that there is only one
value for a boundary position like that, so this is a 9.2-only issue.

That does sound like a reasonable explanation and fix. Heck, probably
enough to just put the fix in pg_basebackup since it's gone in 9.3 anyway.

But I'd really like to confirm this is the actual situation before
considering it fixed, since it's clearly very intermittent.

Selena, was this reasonably reproducible for you? Would it be possible to
get a network trace of it to show of that's the kind of package coming
across, or by hacking up pg_basebackup to print the exact position it was
at when the problem occurred?

/Magnus


From: Selena Deckelmann <selena(at)chesnok(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming header too small
Date: 2013-02-20 15:53:54
Message-ID: CAN1EF+xw-4FfGY-L+eZSqe2rE7EaAbhf4kGuRZgPXw+svPokqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 20, 2013 at 6:23 AM, Magnus Hagander <magnus(at)hagander(dot)net>wrote:

>
> On Feb 20, 2013 11:29 AM, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
> wrote:
> >
> > On 20.02.2013 02:11, Selena Deckelmann wrote:
> >>
> >> So, I just ran into a similar issue backing up a 9.2.1 server using
> >> pg_basebackup version 9.2.3:
> >>
> >> pg_basebackup: starting background WAL receiver
> >> pg_basebackup: streaming header too small: 25
> >>
> >>
> >> I've had it happen two times in a row. I'm going to try again...
> >>
> >> But -- what would be helpful here? I can recompile pg_basebackup with
> more
> >> debugging...
> >
> >
> > Hmm, 25 bytes would be the size of the WAL data packet, if it contains
> just the header and no actual WAL data. I think pg_basebackup should accept
> that - it's not unreasonable that the server might send such a packet
> sometimes.
> >
> > Looking at the walsender code, it's not supposed to ever send such a
> packet. But I suspect there's one corner-case where it might: if the
> current send location is at an xlogid boundary, so that we previously sent
> the last byte from the last WAL segment in the previous logical xlog file,
> and the WAL flush position points to byte 0 in the beginning of the new WAL
> file. Both of those positions are in fact the same thing, but we have two
> different ways to represent the same position. For example, if we've
> already sent up to WAL position (sentPtr in walsender.c):
> >
> > xlogid = 4
> > xrecoff = XLogFileSize
> >
> > and GetFlushRecPtr() returns:
> >
> > xlogid = 5
> > xrecoff = 0
> >
> > Those both point to the same position. But the check in XLogSend that
> decides if there is any work to do uses XLByteLE() to check if they are
> equal, and XLByteLE() treats the latter to be greater than the former. So,
> in that situation, XLogSend() would decide that it has work to do, but
> there actually isn't, so it would send 0 bytes of WAL data.
> >
> > I'm not sure how GetFlushRecPtr() could return such a position, though.
> But I'm also not convinced that it can't happen.
> >
> > It would be fairly easy to fix walsender to not send anything in that
> situation. It would also be easy to fix pg_basebackup to not treat it as an
> error. We probably should do both.
> >
> > In 9.3, the XLogRecPtr representation changed so that there is only one
> value for a boundary position like that, so this is a 9.2-only issue.
>

Thanks for the debugging, Heikki!

> That does sound like a reasonable explanation and fix. Heck, probably
> enough to just put the fix in pg_basebackup since it's gone in 9.3 anyway.
>
> But I'd really like to confirm this is the actual situation before
> considering it fixed, since it's clearly very intermittent.
>
> Selena, was this reasonably reproducible for you? Would it be possible to
> get a network trace of it to show of that's the kind of package coming
> across, or by hacking up pg_basebackup to print the exact position it was
> at when the problem occurred?
>

This is happening with a very busy 700 GB system, so I'm going to rule out
a network trace out for the moment. The error is occurring "sometime" in
the middle of the backup. Last time it was at least 30-40 minutes into a 2
hr backup.

I'll see about hacking up pg_basebackup. I'm doing nightly backups so
hopefully (?) it will happen again in the next few days.

-selena

--
http://chesnok.com


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Selena Deckelmann <selena(at)chesnok(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming header too small
Date: 2013-02-20 15:55:22
Message-ID: CABUevEzwipq1YCGsnBZKJCgJkJZDm8V2X=_UftNgYniQmmKjew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 20, 2013 at 4:53 PM, Selena Deckelmann <selena(at)chesnok(dot)com> wrote:
>
>
>
> On Wed, Feb 20, 2013 at 6:23 AM, Magnus Hagander <magnus(at)hagander(dot)net>
> wrote:
>>
>>
>> On Feb 20, 2013 11:29 AM, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
>> wrote:
>> >
>> > On 20.02.2013 02:11, Selena Deckelmann wrote:
>> >>
>> >> So, I just ran into a similar issue backing up a 9.2.1 server using
>> >> pg_basebackup version 9.2.3:
>> >>
>> >> pg_basebackup: starting background WAL receiver
>> >> pg_basebackup: streaming header too small: 25
>> >>
>> >>
>> >> I've had it happen two times in a row. I'm going to try again...
>> >>
>> >> But -- what would be helpful here? I can recompile pg_basebackup with
>> >> more
>> >> debugging...
>> >
>> >
>> > Hmm, 25 bytes would be the size of the WAL data packet, if it contains
>> > just the header and no actual WAL data. I think pg_basebackup should accept
>> > that - it's not unreasonable that the server might send such a packet
>> > sometimes.
>> >
>> > Looking at the walsender code, it's not supposed to ever send such a
>> > packet. But I suspect there's one corner-case where it might: if the current
>> > send location is at an xlogid boundary, so that we previously sent the last
>> > byte from the last WAL segment in the previous logical xlog file, and the
>> > WAL flush position points to byte 0 in the beginning of the new WAL file.
>> > Both of those positions are in fact the same thing, but we have two
>> > different ways to represent the same position. For example, if we've already
>> > sent up to WAL position (sentPtr in walsender.c):
>> >
>> > xlogid = 4
>> > xrecoff = XLogFileSize
>> >
>> > and GetFlushRecPtr() returns:
>> >
>> > xlogid = 5
>> > xrecoff = 0
>> >
>> > Those both point to the same position. But the check in XLogSend that
>> > decides if there is any work to do uses XLByteLE() to check if they are
>> > equal, and XLByteLE() treats the latter to be greater than the former. So,
>> > in that situation, XLogSend() would decide that it has work to do, but there
>> > actually isn't, so it would send 0 bytes of WAL data.
>> >
>> > I'm not sure how GetFlushRecPtr() could return such a position, though.
>> > But I'm also not convinced that it can't happen.
>> >
>> > It would be fairly easy to fix walsender to not send anything in that
>> > situation. It would also be easy to fix pg_basebackup to not treat it as an
>> > error. We probably should do both.
>> >
>> > In 9.3, the XLogRecPtr representation changed so that there is only one
>> > value for a boundary position like that, so this is a 9.2-only issue.
>
>
> Thanks for the debugging, Heikki!
>
>>
>> That does sound like a reasonable explanation and fix. Heck, probably
>> enough to just put the fix in pg_basebackup since it's gone in 9.3 anyway.
>>
>> But I'd really like to confirm this is the actual situation before
>> considering it fixed, since it's clearly very intermittent.
>>
>> Selena, was this reasonably reproducible for you? Would it be possible to
>> get a network trace of it to show of that's the kind of package coming
>> across, or by hacking up pg_basebackup to print the exact position it was at
>> when the problem occurred?
>
>
> This is happening with a very busy 700 GB system, so I'm going to rule out a
> network trace out for the moment. The error is occurring "sometime" in the
> middle of the backup. Last time it was at least 30-40 minutes into a 2 hr
> backup.

Yikes. Yeah, that doesn't seem like the best choice of method :)

> I'll see about hacking up pg_basebackup. I'm doing nightly backups so
> hopefully (?) it will happen again in the next few days.

Thanks!

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Selena Deckelmann <selena(at)chesnok(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming header too small
Date: 2013-02-20 16:02:05
Message-ID: 5124F37D.9020303@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 20.02.2013 17:53, Selena Deckelmann wrote:
> On Wed, Feb 20, 2013 at 6:23 AM, Magnus Hagander<magnus(at)hagander(dot)net>wrote:
>
>> Selena, was this reasonably reproducible for you? Would it be possible to
>> get a network trace of it to show of that's the kind of package coming
>> across, or by hacking up pg_basebackup to print the exact position it was
>> at when the problem occurred?
>
> This is happening with a very busy 700 GB system, so I'm going to rule out
> a network trace out for the moment. The error is occurring "sometime" in
> the middle of the backup. Last time it was at least 30-40 minutes into a 2
> hr backup.

If you could pinpoint the WAL position where the error happens, that
would already help somewhat. For starters, put pg_receivexlog to verbose
mode, so that it will print a line after each WAL segment. If my theory
is correct, the error should happen at xlogid boundaries, ie. just after
finishing a WAL segment whose filename ends with "FE".

- Heikki