Re: Buildfarm feature request: some way to track/classify failures

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 03:22:13
Message-ID: 20700.1174274533@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> OK, for anyone that wants to play, I have created an extract that
> contains a summary of every non-CVS-related failure we've had. It's a
> single table looking like this:

I did some analysis on this data. Attached is a text dump of a table
declared as

CREATE TABLE mreasons (
    sysname text,
    snapshot timestamp without time zone,
    branch text,
    reason text,
    known boolean
);

where the sysname/snapshot/branch data is taken from your table,
"reason" is a brief sketch of the failure, and "known" indicates
whether the cause is known ... although as I went along it sort
of evolved into "does this seem worthy of more investigation?".

I looked at every failure back through early December. I'd intended to
go back further, but decided I'd hit a point of diminishing returns.
However, failures back to the beginning of July that matched grep
searches for recent symptoms are classified in the table.
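
A minimal sketch of the grep-style matching described above, assuming each failure's log is available as a string. The reason codes are taken from the tables in this message; the patterns and the classify() helper are hypothetical and are not the searches actually used.

```python
import re

# Hypothetical symptom patterns mapped to reason codes. The reason strings
# appear in the tables in this message; the patterns are illustrative guesses.
SYMPTOMS = [
    (re.compile(r"No rule to make target"), "No rule to make target"),
    (re.compile(r"could not open relation with OID"),
     "could not open relation with OID"),
    (re.compile(r"No space left on device"), "Out of disk space"),
]


def classify(log_text):
    """Return the reason code of the first matching pattern, else None."""
    for pattern, reason in SYMPTOMS:
        if pattern.search(log_text):
            return reason
    return None
```

Logs that match nothing return None and would still need to be read by hand.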

The gross stats are: 2231 failures classified, 71 distinct reason
codes, 81 failures (with 18 reasons) that seem worthy of closer
investigation:

bfarm=# select reason,branch,max(snapshot) as latest, count(*) from mreasons where not known group by 1,2 order by 1,2 ;
                              reason                              |    branch     |       latest        | count
------------------------------------------------------------------+---------------+---------------------+-------
 Input/output error - possible hardware problem                   | HEAD          | 2007-03-06 10:30:01 |     1
 No rule to make target                                           | HEAD          | 2007-02-08 15:30:01 |     6
 No rule to make target                                           | REL8_0_STABLE | 2007-02-28 03:15:02 |     9
 No rule to make target                                           | REL8_2_STABLE | 2006-12-17 20:00:01 |     1
 could not open relation with OID                                 | HEAD          | 2007-03-16 16:45:01 |     2
 could not open relation with OID                                 | REL8_1_STABLE | 2006-08-29 23:30:07 |     2
 createlang not found?                                            | REL8_1_STABLE | 2007-02-28 02:50:00 |     1
 irreproducible contrib/sslinfo build failure, likely not our bug | HEAD          | 2007-02-03 07:03:02 |     1
 irreproducible opr_sanity failure                                | HEAD          | 2006-12-18 19:15:02 |     2
 libintl.h rejected by configure                                  | HEAD          | 2007-01-11 20:35:00 |     3
 libintl.h rejected by configure                                  | REL8_0_STABLE | 2007-03-01 20:28:04 |    22
 postmaster failed to start                                       | REL7_4_STABLE | 2007-02-28 22:23:20 |     1
 postmaster failed to start                                       | REL8_0_STABLE | 2007-02-28 22:30:44 |     1
 random Solaris configure breakage                                | HEAD          | 2007-01-14 05:30:00 |     1
 random Windows breakage                                          | HEAD          | 2007-03-16 09:48:31 |     3
 random Windows breakage                                          | REL8_0_STABLE | 2007-03-15 03:15:09 |     7
 segfault during bootstrap                                        | HEAD          | 2007-03-12 23:03:03 |     1
 server does not shut down                                        | HEAD          | 2007-01-08 03:03:03 |     3
 tablespace is not empty                                          | HEAD          | 2007-02-24 15:00:10 |     6
 tablespace is not empty                                          | REL8_1_STABLE | 2007-01-25 02:30:01 |     2
 unexpected statement_timeout failure                             | HEAD          | 2007-01-25 05:05:06 |     1
 unexplained tsearch2 crash                                       | HEAD          | 2007-01-10 22:05:02 |     1
 weird DST-transition-like timestamp test failure                 | HEAD          | 2007-02-04 07:25:04 |     1
 weird assembler failure, likely not our bug                      | HEAD          | 2006-12-26 17:02:01 |     1
 weird assembler failure, likely not our bug                      | REL8_2_STABLE | 2007-02-03 23:47:01 |     1
 weird install failure                                            | HEAD          | 2007-01-25 12:35:00 |     1
(26 rows)

I think I know the cause of the recent 'could not open relation with
OID' failures in HEAD, but the rest of these may need a look.
Any volunteers?

Also, for completeness, the causes I wrote off as not interesting
(anymore, in some cases):

bfarm=# select reason,max(snapshot) as latest, count(*) from mreasons where known group by 1 order by 1 ;
                                reason                                |       latest        | count
----------------------------------------------------------------------+---------------------+-------
 DST transition test failure                                          | 2007-03-13 04:04:47 |    26
 ISO-week-patch regression test breakage                              | 2007-02-16 15:00:08 |    23
 No rule to make Makefile.port                                        | 2007-03-02 12:30:02 |    40
 Out of disk space                                                    | 2007-02-16 22:30:01 |    67
 Out of semaphores                                                    | 2007-02-20 02:03:31 |    14
 Python not installed                                                 | 2007-02-19 22:45:05 |     2
 Solaris random conn-refused bug                                      | 2007-03-06 01:20:00 |    37
 TCP socket already in use                                            | 2007-01-09 07:03:04 |    13
 Too many clients                                                     | 2007-02-26 06:06:02 |    90
 Too many open files in system                                        | 2007-02-27 20:30:59 |    17
 another icc crash                                                    | 2007-02-03 10:50:01 |     1
 apparently a malloc bug                                              | 2007-03-04 23:00:20 |    27
 bogus system clock setting                                           | 1997-12-21 15:20:11 |     6
 breakage from changing := to = in makefiles                          | 2007-02-10 02:15:01 |     4
 broken GUC patch                                                     | 2007-03-13 15:15:01 |    92
 broken float8 hacking                                                | 2007-01-06 20:00:09 |   120
 broken fsync-revoke patch                                            | 2007-01-17 16:21:01 |    77
 broken inet hacking                                                  | 2007-01-03 00:05:01 |     4
 broken log_error patch                                               | 2007-01-28 08:15:01 |    15
 broken money patch                                                   | 2007-01-03 19:05:01 |    78
 broken pg_regress change for msvc support                            | 2007-01-19 22:03:00 |    46
 broken plpython patch                                                | 2007-01-25 14:21:00 |    22
 broken sys_siglist patch                                             | 2007-01-28 06:06:02 |    18
 bug in btree page split patch                                        | 2007-02-08 11:35:03 |     7
 buildfarm pilot error                                                | 2007-01-19 03:28:07 |    69
 cache flush bug in operator-family patch                             | 2006-12-31 10:30:03 |     8
 ccache failure                                                       | 2007-01-25 23:00:34 |     2
 could not create shared memory                                       | 2007-02-13 07:00:05 |    32
 ecpg regression test teething pains                                  | 2007-02-03 13:30:02 |   516
 failure to update PL expected files for may/can/might rewording      | 2007-02-01 20:15:01 |     8
 failure to update contrib expected files for may/can/might rewording | 2007-02-01 21:15:02 |    11
 failure to update expected files for may/can/might rewording         | 2007-02-01 19:35:02 |     3
 icc "internal error"                                                 | 2007-03-16 16:30:01 |    29
 image not found (possibly related to too-many-open-files)            | 2006-10-25 08:05:02 |     1
 largeobject test bugs                                                | 2007-02-17 23:35:03 |     4
 ld segfaulted                                                        | 2007-03-16 15:30:02 |     3
 missing BYTE_ORDER definition for Solaris                            | 2007-01-10 14:18:23 |     1
 pg_regress patch breakage                                            | 2007-02-08 18:30:01 |     1
 plancache test race condition                                        | 2007-03-16 11:15:01 |     5
 pltcl regression test broken by ORDER BY semantics tightening        | 2007-01-09 03:15:01 |     9
 previous contrib test still running                                  | 2007-02-13 20:49:33 |    21
 random Solaris breakage                                              | 2007-01-05 17:20:01 |     1
 random Windows breakage                                              | 2006-12-27 03:15:07 |     1
 random Windows permission-denied failures                            | 2007-02-12 11:00:09 |     5
 random ccache breakage                                               | 2007-01-04 01:34:33 |     1
 readline misconfiguration                                            | 2007-02-12 17:19:41 |    33
 row-ordering discrepancy in rowtypes test                            | 2007-02-10 03:00:02 |     3
 stats test failed                                                    | 2007-03-14 13:00:02 |   319
 threaded Python library                                              | 2007-01-10 04:05:02 |     6
 undefined symbol pg_mic2ascii                                        | 2007-02-03 01:13:40 |   101
 unexpected signal 9                                                  | 2006-12-31 06:30:02 |    15
 unportable uuid patch                                                | 2007-01-31 17:30:01 |    16
 use of // comment                                                    | 2007-02-16 09:23:02 |     1
 xml code teething problems                                           | 2007-02-16 16:01:05 |    79
(54 rows)
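
For anyone who wants to recompute the gross stats quoted earlier (total failures, distinct reason codes, and the same two counts restricted to "not known"), here is a rough sketch. It assumes the table has the mreasons layout shown above, but it runs against an in-memory SQLite database with made-up sample rows rather than the real dump; the host names and the gross_stats() helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mreasons ("
    " sysname text, snapshot text, branch text, reason text, known boolean)"
)

# Made-up sample rows for illustration only; not taken from the real dump.
rows = [
    ("hostA", "2007-03-16 16:45:01", "HEAD",
     "could not open relation with OID", 0),
    ("hostB", "2007-02-24 15:00:10", "HEAD", "tablespace is not empty", 0),
    ("hostB", "2007-01-25 02:30:01", "REL8_1_STABLE",
     "tablespace is not empty", 0),
    ("hostC", "2007-02-16 22:30:01", "HEAD", "Out of disk space", 1),
]
conn.executemany("INSERT INTO mreasons VALUES (?, ?, ?, ?, ?)", rows)


def gross_stats(conn):
    """Return (failures, reasons, open failures, open reasons)."""
    total, reasons = conn.execute(
        "SELECT count(*), count(DISTINCT reason) FROM mreasons"
    ).fetchone()
    open_failures, open_reasons = conn.execute(
        "SELECT count(*), count(DISTINCT reason) FROM mreasons WHERE NOT known"
    ).fetchone()
    return total, reasons, open_failures, open_reasons


print(gross_stats(conn))
```

The two SELECT statements use only count(*) and count(DISTINCT ...), so they should carry over to the real table unchanged.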

Some of these might possibly be interesting to other people ...

regards, tom lane

Attachment Content-Type Size
mreasons.dump.gz application/octet-stream 17.3 KB
