Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Date: 2013-02-19 21:42:43
Message-ID: 5123F1D3.30303@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 19.2.2013 11:27, Tom Lane wrote:
> Tomas Vondra <tv(at)fuzzy(dot)cz> writes:
>> Dne 19.02.2013 05:46, Alvaro Herrera napsal:
>>> Mastodon failed:
>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2013-02-19%2000%3A00%3A01
>>>
>>> probably worth investigating a bit; we might have broken something.
>
>> Hmmm, interesting. A single Windows machine, while the other Windows
>> machines seem to work fine (although some of them were not built for a
>> few weeks).
>
> Could be random chance --- we've seen the same failure before, eg
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2012-11-25%2006%3A00%3A00
>
> regards, tom lane

I'm looking at that test, and I'm not really sure about a few details.

First, this function seems pretty useless to me:

=======================================================================
create function wait_for_stats() returns void as $$
declare
start_time timestamptz := clock_timestamp();
updated bool;
begin
-- we don't want to wait forever; loop will exit after 30 seconds
for i in 1 .. 300 loop

-- check to see if indexscan has been sensed
SELECT (st.idx_scan >= pr.idx_scan + 1) INTO updated
FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';

exit when updated;

-- wait a little
perform pg_sleep(0.1);

-- reset stats snapshot so we can test again
perform pg_stat_clear_snapshot();

end loop;

-- report time waited in postmaster log (where it won't change test
output)
raise log 'wait_for_stats delayed % seconds',
extract(epoch from clock_timestamp() - start_time);
end
$$ language plpgsql;
=======================================================================

AFAIK the stats remain the same within a transaction, and as a function
runs within a transaction, it will either get new data on the first
iteration, or it will run all 300 of them. I've checked several
buildfarm members and I'm yet to see a single duration between 12ms and
30sec.

So IMHO we can replace the function call with pg_sleep(30) and we'll get
about the same effect.

But this obviously does not answer the question why it failed, although
on both occasions there's this log message:

[50b1b7fa.0568:14] LOG: wait_for_stats delayed 34.75 seconds

which essentialy means the stats were not updated before the call to
wait_for_stats().

Anyway, there are these two failing queries:

=======================================================================
-- check effects
SELECT st.seq_scan >= pr.seq_scan + 1,
st.seq_tup_read >= pr.seq_tup_read + cl.reltuples,
st.idx_scan >= pr.idx_scan + 1,
st.idx_tup_fetch >= pr.idx_tup_fetch + 1
FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
t | t | t | t
(1 row)

SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
st.idx_blks_read + st.idx_blks_hit >= pr.idx_blks + 1
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
?column? | ?column?
----------+----------
t | t
(1 row)
=======================================================================

The first one returns just falses, the second one retuns either (t,f) or
(f,f) - for the two failures posted by Alvaro and TL earlier today.

I'm really wondering how that could happen. The only thing that I can
think of is some strange timing issue, causing lost requests to write
the stats or maybe some of the stats updates. Hmmm, IIRC the stats are
sent over UDP - couldn't that be related?

Tomas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gurjeet Singh 2013-02-19 21:50:45 Patch to make pgindent work cleanly
Previous Message Heikki Linnakangas 2013-02-19 20:43:32 Re: Call for Google Summer of Code mentors, admins