Re: streaming header too small

From: Selena Deckelmann <selena(at)chesnok(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming header too small
Date: 2013-02-20 15:53:54
Message-ID: CAN1EF+xw-4FfGY-L+eZSqe2rE7EaAbhf4kGuRZgPXw+svPokqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 20, 2013 at 6:23 AM, Magnus Hagander <magnus(at)hagander(dot)net>wrote:

>
> On Feb 20, 2013 11:29 AM, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
> wrote:
> >
> > On 20.02.2013 02:11, Selena Deckelmann wrote:
> >>
> >> So, I just ran into a similar issue backing up a 9.2.1 server using
> >> pg_basebackup version 9.2.3:
> >>
> >> pg_basebackup: starting background WAL receiver
> >> pg_basebackup: streaming header too small: 25
> >>
> >>
> >> I've had it happen two times in a row. I'm going to try again...
> >>
> >> But -- what would be helpful here? I can recompile pg_basebackup with
> more
> >> debugging...
> >
> >
> > Hmm, 25 bytes would be the size of the WAL data packet, if it contains
> just the header and no actual WAL data. I think pg_basebackup should accept
> that - it's not unreasonable that the server might send such a packet
> sometimes.
> >
> > Looking at the walsender code, it's not supposed to ever send such a
> packet. But I suspect there's one corner-case where it might: if the
> current send location is at an xlogid boundary, so that we previously sent
> the last byte from the last WAL segment in the previous logical xlog file,
> and the WAL flush position points to byte 0 in the beginning of the new WAL
> file. Both of those positions are in fact the same thing, but we have two
> different ways to represent the same position. For example, if we've
> already sent up to WAL position (sentPtr in walsender.c):
> >
> > xlogid = 4
> > xrecoff = XLogFileSize
> >
> > and GetFlushRecPtr() returns:
> >
> > xlogid = 5
> > xrecoff = 0
> >
> > Those both point to the same position. But the check in XLogSend that
> decides if there is any work to do uses XLByteLE() to check if they are
> equal, and XLByteLE() treats the latter to be greater than the former. So,
> in that situation, XLogSend() would decide that it has work to do, but
> there actually isn't, so it would send 0 bytes of WAL data.
> >
> > I'm not sure how GetFlushRecPtr() could return such a position, though.
> But I'm also not convinced that it can't happen.
> >
> > It would be fairly easy to fix walsender to not send anything in that
> situation. It would also be easy to fix pg_basebackup to not treat it as an
> error. We probably should do both.
> >
> > In 9.3, the XLogRecPtr representation changed so that there is only one
> value for a boundary position like that, so this is a 9.2-only issue.
>

Thanks for the debugging, Heikki!

> That does sound like a reasonable explanation and fix. Heck, probably
> enough to just put the fix in pg_basebackup since it's gone in 9.3 anyway.
>
> But I'd really like to confirm this is the actual situation before
> considering it fixed, since it's clearly very intermittent.
>
> Selena, was this reasonably reproducible for you? Would it be possible to
> get a network trace of it to show of that's the kind of package coming
> across, or by hacking up pg_basebackup to print the exact position it was
> at when the problem occurred?
>

This is happening with a very busy 700 GB system, so I'm going to rule out
a network trace out for the moment. The error is occurring "sometime" in
the middle of the backup. Last time it was at least 30-40 minutes into a 2
hr backup.

I'll see about hacking up pg_basebackup. I'm doing nightly backups so
hopefully (?) it will happen again in the next few days.

-selena

--
http://chesnok.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2013-02-20 15:54:53 Re: posix_fadvise missing in the walsender
Previous Message Robert Haas 2013-02-20 15:42:46 Re: JSON Function Bike Shedding