Re: Questions about connection clean-up and "invalid page header"

From: Herouth Maoz <herouth(at)unicell(dot)co(dot)il>
To: pgsql-general(at)postgresql(dot)org
Subject: Questions about connection clean-up and "invalid page header"
Date: 2010-01-24 10:17:11
Message-ID: 201001241217.11751.herouth@unicell.co.il
Lists: pgsql-general pgsql-hackers

Hi Everybody.

I have two questions.

1. We have a system that is accessed by Crystal Reports, which is in turn
controlled by another (3rd-party) system. Now, when a report takes too long or
the user cancels it, that system doesn't send a cancel request to Postgres. It
just kills the Crystal process that works on it.

As a result, the query is left alive on the Postgres backend. Eventually I get
the message "Unexpected End of file" and the query is cancelled. But this
doesn't happen soon enough for me - these are usually very heavy queries, and
I'd like them to be cleaned up as soon as possible if the client connection
has ended.

Is there a parameter to set in the configuration or some other means to
shorten the time before an abandoned backend's query is cancelled?

2. I get the following message in my development database:

vacuumdb: vacuuming of database "reports" failed: ERROR: invalid page header
in block 6200 of relation "rb"

I had this already a couple of months ago. Looking around the web, I saw this
error is supposed to indicate a hardware error. I informed my sysadmin, but
since this is just the dev system and the data was not important, I did a
TRUNCATE TABLE on the "rb" relation, and the errors stopped...

But now the error is back, and I'm a bit suspicious. If this is a hardware
issue, it's rather suspicious that it returned in the exact same relation
after I did a "truncate table". I have many other relations in the system,
ones that fill up a lot faster. So I suspect this might be a PostgreSQL issue
after all. What can I do about this?

We are currently using PostgreSQL v. 8.3.1 on the server side.

TIA,
Herouth


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Herouth Maoz <herouth(at)unicell(dot)co(dot)il>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Questions about connection clean-up and "invalid page header"
Date: 2010-01-25 08:15:07
Message-ID: dcc563d11001250015v12fac7a7l7f93082baadfc913@mail.gmail.com
Lists: pgsql-general pgsql-hackers

On Sun, Jan 24, 2010 at 3:17 AM, Herouth Maoz <herouth(at)unicell(dot)co(dot)il> wrote:
> Hi Everybody.
>
> I have two questions.
>
> 1. We have a system that is accessed by Crystal Reports, which is in turn
> controlled by another (3rd-party) system. Now, when a report takes too long or
> the user cancels it, that system doesn't send a cancel request to Postgres. It
> just kills the Crystal process that works on it.
>
> As a result, the query is left alive on the Postgres backend. Eventually I get
> the message "Unexpected End of file" and the query is cancelled. But this
> doesn't happen soon enough for me - these are usually very heavy queries, and
> I'd like them to be cleaned up as soon as possible if the client connection
> has ended.

The real solution is to fix the application. But I understand
sometimes you can't do that.

> Is there a parameter to set in the configuration or some other means to
> shorten the time before an abandoned backend's query is cancelled?

You can shorten the tcp_keepalive settings so that dead connections
get detected faster.
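
In postgresql.conf that would be something along these lines (the values
here are just an illustration, not a recommendation; 0 means "use the
operating system's default", and not every platform supports all three):

    tcp_keepalives_idle = 60        # seconds of idle time before the first probe is sent
    tcp_keepalives_interval = 10    # seconds between probes
    tcp_keepalives_count = 5        # lost probes before the connection is considered dead

These only apply to TCP connections - they're ignored for Unix-domain
socket connections - and you can also set the equivalent kernel-wide
defaults at the OS level.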

> 2. I get the following message in my development database:
>
> vacuumdb: vacuuming of database "reports" failed: ERROR:  invalid page header
> in block 6200 of relation "rb"
>
> I had this already a couple of months ago. Looking around the web, I saw this
> error is supposed to indicate a hardware error. I informed my sysadmin, but
> since this is just the dev system and the data was not important, I did a
> TRUNCATE TABLE on the "rb" relation, and the errors stopped...
>
> But now the error is back, and I'm a bit suspicious. If this is a hardware
> issue, it's rather suspicious that it returned in the exact same relation
> after I did a "truncate table". I have many other relations in the system,
> ones that fill up a lot faster. So I suspect this might be a PostgreSQL issue
> after all. What can I do about this?

Might be, but not very likely. I and many others run pgsql in
production environments where it handles thousands of updates /
inserts per minute with no corruption. We run on server class
hardware with ECC memory and large RAID arrays with no corruption.

Have you run something as simple as memtest86+ on your machine to see
if it's got bad memory?

> We are currently using PostgreSQL v. 8.3.1 on the server side.

You should really update to the latest 8.3.x version (around 8.3.8 or
so). It's simple and easy, and it's possible you've hit a bug in an
older version of 8.3.


From: Herouth Maoz <herouth(at)unicell(dot)co(dot)il>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Questions about connection clean-up and "invalid page header"
Date: 2010-01-25 10:09:33
Message-ID: 4B5D6DDD.1030606@unicell.co.il
Lists: pgsql-general pgsql-hackers

Scott Marlowe wrote:

> You can shorten the tcp_keepalive settings so that dead connections
> get detected faster.
>
Thanks, I'll ask my sysadmin to do that.
>
> Might be, but not very likely. I and many others run pgsql in
> production environments where it handles thousands of updates /
> inserts per minute with no corruption. We run on server class
> hardware with ECC memory and large RAID arrays with no corruption.
>
Someone pointed out to me, though, that comparing data warehouse systems
to production systems is like apples and oranges - we also have a
production system that, as you say, makes millions of inserts and
updates per hour. It works very well with PostgreSQL - a lot better than
it did with Sybase, which we used previously. But the reports system on
which I work makes bulk inserts using calculations based on complicated
joins, and each transaction is long and memory-consuming, as opposed to
the production system, where each transaction takes a few milliseconds
and is cleared immediately.

So far this has only happened on the development server, and if it really
is a matter of hardware, I'm not worried. What I am worried about is
whether there really is some sort of bug that may carry over to our
production reports system.
> Have you run something as simple as memtest86+ on your machine to see
> if it's got bad memory?
>
I'll tell my sysadmin to do that. Thank you.
>
>> We are currently using PostgreSQL v. 8.3.1 on the server side.
>>
>
> You should really update to the latest 8.3.x version (around 8.3.8 or
> so). It's simple and easy, and it's possible you've hit a bug in an
> older version of 8.3.
>
OK, I'll also try to get that done.

Thanks for your help,
Herouth


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: Herouth Maoz <herouth(at)unicell(dot)co(dot)il>, pgsql-general(at)postgresql(dot)org
Subject: Re: Questions about connection clean-up and "invalid page header"
Date: 2010-01-25 11:26:17
Message-ID: 407d949e1001250326j47c812d4v932ba9a85bb78dcb@mail.gmail.com
Lists: pgsql-general pgsql-hackers

On Mon, Jan 25, 2010 at 8:15 AM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
>> Is there a parameter to set in the configuration or some other means to
>> shorten the time before an abandoned backend's query is cancelled?
>
> You can shorten the tcp_keepalive settings so that dead connections
> get detected faster.
>

This won't help. The TCP connection is already being closed (or I
think only half-closed). The problem is that in the Unix socket API
you don't find out about that unless you check or try to read or write
to it.

The tcp_keepalive setting would only come into play if the remote
machine crashed or was disconnected from the network.

--
greg


From: Herouth Maoz <herouth(at)unicell(dot)co(dot)il>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Questions about connection clean-up and "invalid page header"
Date: 2010-01-25 11:37:31
Message-ID: 4B5D827B.1050205@unicell.co.il
Lists: pgsql-general pgsql-hackers

Greg Stark wrote:

> On Mon, Jan 25, 2010 at 8:15 AM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
>
>>> Is there a parameter to set in the configuration or some other means to
>>> shorten the time before an abandoned backend's query is cancelled?
>>>
>> You can shorten the tcp_keepalive settings so that dead connections
>> get detected faster.
>>
>>
>
> This won't help. The TCP connection is already being closed (or I
> think only half-closed). The problem is that in the Unix socket API
> you don't find out about that unless you check or try to read or write
> to it.
>
> The tcp_keepalive setting would only come into play if the remote
> machine crashed or was disconnected from the network.
>
That's the situation I'm having, so it's OK. Crystal, being a Windows
application, obviously runs on a different server than the database
itself, so the connection between them is TCP/IP, not Unix domain
sockets. And furthermore, that was exactly the problem as I described it
- the fact that the third party software, instead of somehow instructing
Crystal to send a cancel request to PostgreSQL, just kills the
client process on the Windows side.

Herouth


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Herouth Maoz <herouth(at)unicell(dot)co(dot)il>
Cc: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Questions about connection clean-up and "invalid page header"
Date: 2010-01-25 13:01:35
Message-ID: 407d949e1001250501tc9c9850s21fb683b23eff664@mail.gmail.com
Lists: pgsql-general pgsql-hackers

On Mon, Jan 25, 2010 at 11:37 AM, Herouth Maoz <herouth(at)unicell(dot)co(dot)il> wrote:
> The tcp_keepalive setting would only come into play if the remote
> machine crashed or was disconnected from the network.
>
>
> That's the situation I'm having, so it's OK. Crystal, being a Windows
> application, obviously runs on a different server than the database itself,
> so the connection between them is TCP/IP, not Unix domain sockets.

The unix socket api is used for both unix domain sockets and internet
domain sockets. The point is that in the api there's no way to find
out about a connection the other side has closed except for when you
write or read from it or when you explicitly check.

> And
> furthermore, that was exactly the problem as I described it - the fact that
> the third party software, instead of somehow instructing Crystal to send a
> cancel request to PostgreSQL, just kills the client process on the
> Windows side.

Killing the client process doesn't mean the machine has crashed or
been disconnected from the network. I'm assuming Crystal isn't
crashing the machine just to stop the report... And even if it did and
tcp_keepalives kicked in, the server *still* wouldn't notice until it
checked or tried to read or write to that socket.

--
greg


From: Herouth Maoz <herouth(at)unicell(dot)co(dot)il>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Questions about connection clean-up and "invalid page header"
Date: 2010-01-25 13:16:18
Message-ID: 4B5D99A2.2070004@unicell.co.il
Lists: pgsql-general pgsql-hackers

Greg Stark wrote:

> On Mon, Jan 25, 2010 at 11:37 AM, Herouth Maoz <herouth(at)unicell(dot)co(dot)il> wrote:
>
>> The tcp_keepalive setting would only come into play if the remote
>> machine crashed or was disconnected from the network.
>>
>>
>> That's the situation I'm having, so it's OK. Crystal, being a Windows
>> application, obviously runs on a different server than the database itself,
>> so the connection between them is TCP/IP, not Unix domain sockets.
>>
>
> The unix socket api is used for both unix domain sockets and internet
> domain sockets. The point is that in the api there's no way to find
> out about a connection the other side has closed except for when you
> write or read from it or when you explicitly check.
>
>
>
>> And
>> furthermore, that was exactly the problem as I described it - the fact that
>> the third party software, instead of somehow instructing Crystal to send a
>> cancel request to PostgreSQL, just kills the client process on the
>> Windows side.
>>
>
> Killing the client process doesn't mean the machine has crashed or
> been disconnected from the network. I'm assuming Crystal isn't
> crashing the machine just to stop the report... And even if it did and
> tcp_keepalives kicked in, the server *still* wouldn't notice until it
> checked or tried to read or write to that socket.
>
>
Well, judging by the fact that I eventually get an "Unexpected end of
file" message for those queries, I assume something does go in and check
them. Do you have any suggestion as to how to cause the PostgreSQL
server to do so earlier?

Herouth


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Herouth Maoz <herouth(at)unicell(dot)co(dot)il>
Cc: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Questions about connection clean-up and "invalid page header"
Date: 2010-01-25 13:52:24
Message-ID: 407d949e1001250552i725f93e9h727865c863933825@mail.gmail.com
Lists: pgsql-general pgsql-hackers

On Mon, Jan 25, 2010 at 1:16 PM, Herouth Maoz <herouth(at)unicell(dot)co(dot)il> wrote:
> Well, judging by the fact that I eventually get an "Unexpected end of file"
> message for those queries, I assume something does go in and check them. Do you
> have any suggestion as to how to cause the PostgreSQL server to do so
> earlier?

No, Postgres pretty intentionally doesn't check because checking would
be quite slow.

If this is a plpgsql function looping you can put a RAISE NOTICE in
the loop periodically. I suppose you could write such a function and
add it to your query but whether it does what you want will depend on
the query plan.
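
Something like this, say - completely untested, and the function and
table names are invented here just to illustrate the shape of it:

    create or replace function report_with_heartbeat()
    returns setof big_report_table as $$
    declare
        r record;
        n bigint := 0;
    begin
        -- the big, slow report query goes here
        for r in select * from big_report_table loop
            n := n + 1;
            if n % 100000 = 0 then
                -- writing the notice to the client socket is what gives the
                -- backend a chance to notice the client has gone away
                raise notice 'still running, % rows so far', n;
            end if;
            return next r;
        end loop;
    end;
    $$ language plpgsql;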

--
greg


From: Greg Stark <gsstark(at)mit(dot)edu>
To: "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Fwd: Questions about connection clean-up and "invalid page header"
Date: 2010-01-25 13:56:53
Message-ID: 407d949e1001250556t36ede117x3bf424da81d83bb3@mail.gmail.com
Lists: pgsql-general pgsql-hackers

Given this thread on pgsql-general I wonder if we should have
something like 1 in every 1,000 CHECK_FOR_INTERRUPTS check if the
client socket is still open.

This has come up in the past, and the concern was that syscalls would
be way too slow to put in critical loops, but perhaps if it's only done
once every n checks it would be possible to find a good tradeoff?

---------- Forwarded message ----------
From: Greg Stark <gsstark(at)mit(dot)edu>
Date: Mon, Jan 25, 2010 at 1:52 PM
Subject: Re: Questions about connection clean-up and "invalid page header"
To: Herouth Maoz <herouth(at)unicell(dot)co(dot)il>
Cc: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org

On Mon, Jan 25, 2010 at 1:16 PM, Herouth Maoz <herouth(at)unicell(dot)co(dot)il> wrote:
> Well, judging by the fact that I eventually get an "Unexpected end of file"
> message for those queries, I assume something does go in and check them. Do you
> have any suggestion as to how to cause the PostgreSQL server to do so
> earlier?

No, Postgres pretty intentionally doesn't check because checking would
be quite slow.

If this is a plpgsql function looping you can put a RAISE NOTICE in
the loop periodically. I suppose you could write such a function and
add it to your query but whether it does what you want will depend on
the query plan.

--
greg

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: Questions about connection clean-up and "invalid page header"
Date: 2010-01-25 18:31:01
Message-ID: 24716.1264444261@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Given this thread on pgsql-general I wonder if we should have
> something like 1 in every 1,000 CHECK_FOR_INTERRUPTS check if the
> client socket is still open.

... and do what? If so, why?

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: Questions about connection clean-up and "invalid page header"
Date: 2010-01-26 03:56:39
Message-ID: 407d949e1001251956g38199672t4cbac757dbcd82e4@mail.gmail.com
Lists: pgsql-general pgsql-hackers

On Mon, Jan 25, 2010 at 6:31 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
>> Given this thread on pgsql-general I wonder if we should have
>> something like 1 in every 1,000 CHECK_FOR_INTERRUPTS check if the
>> client socket is still open.
>
> ... and do what?  If so, why?

Hm, an interesting question.

Any time the client closes the server->client connection, the server
_could_ abort the transaction and the client would be none the wiser.
Since it isn't sticking around to see whether the transaction commits,
for all it knows the server is already aborting some of the
connections. At least in cases where it's running a SELECT, no useful
work is going to get done anyway, since once the first results are
ready the server is only going to get a socket error and abort at that
point.

When the client closes the client->server connection, I don't think
it's wise to take any action. Though I suppose if the connection is in
an explicit transaction and the operation currently running produces
no output (i.e. UPDATE, DELETE or INSERT), an argument could be made
that no useful work can be done there either, since the work will only
be rolled back.

--
greg