Re: Buildfarm feature request: some way to track/classify failures

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-16 19:10:58
Message-ID: 9467.1174072258@sss.pgh.pa.us

The current buildfarm webpages make it easy to see when a branch tip
is seriously broken, but it's not very easy to investigate transient
failures, such as a regression test race condition that only
materializes once in a while. I would like to have a way of seeing
just the failed build attempts across all machines running a given
branch. Ideally it would be possible to tag failures as to the cause
(if known) and/or symptom pattern, and then be able to examine just
the ones without known cause or having similar symptoms.

I'm not sure how much of this is reasonable to try to do with webpages
similar to what we've got. But the data is all in a database AIUI,
so another possibility is to do this work via SQL. That'd require
having the ability to pull the information from the buildfarm database
so someone else could manipulate it.

So I guess the first question is can you make the build data available,
and the second is whether you're interested in building more flexible
views or just want to let someone else do that. Also, if anyone does
make an effort to tag failures, it'd be good to somehow push that data
back into the master database, so that we don't end up duplicating such
work.

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-16 19:25:33
Message-ID: 45FAEF2D.6070804@commandprompt.com

Tom Lane wrote:
> The current buildfarm webpages make it easy to see when a branch tip
> is seriously broken, but it's not very easy to investigate transient
> failures, such as a regression test race condition that only
> materializes once in a while. I would like to have a way of seeing
> just the failed build attempts across all machines running a given
> branch. Ideally it would be possible to tag failures as to the cause
> (if known) and/or symptom pattern, and then be able to examine just
> the ones without known cause or having similar symptoms.
>
> I'm not sure how much of this is reasonable to try to do with webpages
> similar to what we've got. But the data is all in a database AIUI,
> so another possibility is to do this work via SQL. That'd require
> having the ability to pull the information from the buildfarm database
> so someone else could manipulate it.
>
> So I guess the first question is can you make the build data available,
> and the second is whether you're interested in building more flexible
> views or just want to let someone else do that. Also, if anyone does
> make an effort to tag failures, it'd be good to somehow push that data
> back into the master database, so that we don't end up duplicating such
> work.

If the data is already there and just not represented, just let me know
exactly what you want and I will implement pages for that data happily.

Joshua D. Drake


--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-16 19:35:23
Message-ID: 45FAF17B.7050906@dunslane.net

Tom Lane wrote:
> The current buildfarm webpages make it easy to see when a branch tip
> is seriously broken, but it's not very easy to investigate transient
> failures, such as a regression test race condition that only
> materializes once in a while. I would like to have a way of seeing
> just the failed build attempts across all machines running a given
> branch. Ideally it would be possible to tag failures as to the cause
> (if known) and/or symptom pattern, and then be able to examine just
> the ones without known cause or having similar symptoms.
>
> I'm not sure how much of this is reasonable to try to do with webpages
> similar to what we've got. But the data is all in a database AIUI,
> so another possibility is to do this work via SQL. That'd require
> having the ability to pull the information from the buildfarm database
> so someone else could manipulate it.
>
> So I guess the first question is can you make the build data available,
> and the second is whether you're interested in building more flexible
> views or just want to let someone else do that. Also, if anyone does
> make an effort to tag failures, it'd be good to somehow push that data
> back into the master database, so that we don't end up duplicating such
> work.

Well, the db is currently running around 13Gb, so that's not something
to be exported lightly ;-)

If we upgraded from Postgres 8.0.x to 8.2.x we could make use of some
features, like dynamic partitioning and copy from queries, that might
make life easier (CP people: that's a hint :-) )

I don't want to fragment effort, but I also know CP don't want open
access, for obvious reasons.

We can also look at a safe API that we could make available freely. I've
already done this over SOAP (see example client at
http://people.planetpostgresql.org/andrew/index.php?/archives/14-SOAP-server-for-Buildfarm-dashboard.html
). Doing updates is a whole other matter, of course.

Lastly, note that some buildfarm enhancements are on the SOC project
list. I have no idea if anyone will express any interest in that, of
course. It's not very glamorous work.

cheers

andrew


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-16 19:39:44
Message-ID: 45FAF280.8020201@commandprompt.com


> Well, the db is currently running around 13Gb, so that's not something
> to be exported lightly ;-)
>
> If we upgraded from Postgres 8.0.x to 8.2.x we could make use of some
> features, like dynamic partitioning and copy from queries, that might
> make life easier (CP people: that's a hint :-) )

Yeah, Yeah... I need to get you off that machine as a whole :) Which is
on the list but I am waiting for 8.3 *badda bing*.

Sincerely,

Joshua D. Drake



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-16 19:40:33
Message-ID: 9887.1174074033@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Well, the db is currently running around 13Gb, so that's not something
> to be exported lightly ;-)

Yeah. I would assume though that the vast bulk of that is captured log
files. For the purposes I'm imagining, it'd be sufficient to export
only the rest of the database --- or ideally, records including all the
other fields and a URL for each log file. For the small number of log
files you actually need to examine, you'd chase the URL.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-16 21:32:04
Message-ID: 45FB0CD4.1020005@dunslane.net

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Well, the db is currently running around 13Gb, so that's not something
>> to be exported lightly ;-)
>>
>
> Yeah. I would assume though that the vast bulk of that is captured log
> files. For the purposes I'm imagining, it'd be sufficient to export
> only the rest of the database --- or ideally, records including all the
> other fields and a URL for each log file. For the small number of log
> files you actually need to examine, you'd chase the URL.

OK, for anyone that wants to play, I have created an extract that
contains a summary of every non-CVS-related failure we've had. It's a
single table looking like this:

CREATE TABLE mfailures (
sysname text,
snapshot timestamp without time zone,
stage text,
conf_sum text,
branch text,
changed_this_run text,
changed_since_success text,
log_archive_filenames text[],
build_flags text[]
);

The dump is just under 1Mb and can be downloaded from
http://www.pgbuildfarm.org/mfailures.dump

If this is useful we can create it or something like it on a regular
basis (say nightly).

The summary log for a given build can be got from:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=<sysname>&dt=<snapshot>

To look at the log for a given run stage, use
http://www.pgbuildfarm.org/cgi-bin/show_stage_log.pl?nm=<sysname>&dt=<snapshot>&stg=<stagename>
- the stage names available (if any) are the entries in
log_archive_filenames, stripped of the ".log" suffix.
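
As a rough sketch of how those two URL patterns could be driven from a row of the extract, the Python below builds the summary-log URL and one stage-log URL per entry of log_archive_filenames. The function names are made up for illustration, and passing the snapshot as a 'YYYY-MM-DD HH:MM:SS' string is an assumption; the thread does not spell out the exact format the dt parameter expects.

```python
from urllib.parse import urlencode

# Base of the CGI scripts named in the message above.
BASE = "http://www.pgbuildfarm.org/cgi-bin"


def summary_log_url(sysname, snapshot):
    """URL of the summary log for one build (snapshot format is assumed)."""
    return f"{BASE}/show_log.pl?" + urlencode({"nm": sysname, "dt": snapshot})


def stage_names(log_archive_filenames):
    """Stage names are the archive file names with the '.log' suffix removed."""
    return [
        name[: -len(".log")]
        for name in (log_archive_filenames or [])  # column may be NULL/empty
        if name.endswith(".log")
    ]


def stage_log_urls(sysname, snapshot, log_archive_filenames):
    """Map each available stage name to its show_stage_log.pl URL."""
    return {
        stage: f"{BASE}/show_stage_log.pl?"
        + urlencode({"nm": sysname, "dt": snapshot, "stg": stage})
        for stage in stage_names(log_archive_filenames)
    }
```

Members that report no log_archive_filenames simply yield an empty mapping, so only the summary log is reachable for them.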

We can make these available over an API that isn't plain HTTP if people
want. Or we can provide a version of the build log that is stripped of the
HTML.

cheers

andrew


From: Jeremy Drake <pgsql(at)jdrake(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-16 21:48:56
Message-ID: Pine.BSO.4.64.0703161446350.19070@resin.csoft.net

On Fri, 16 Mar 2007, Andrew Dunstan wrote:

> OK, for anyone that wants to play, I have created an extract that contains a
> summary of every non-CVS-related failure we've had. It's a single table
> looking like this:
>
> CREATE TABLE mfailures (
> sysname text,
> snapshot timestamp without time zone,
> stage text,
> conf_sum text,
> branch text,
> changed_this_run text,
> changed_since_success text,
> log_archive_filenames text[],
> build_flags text[]
> );

Sweet. Should be interesting to look at.

>
>
> The dump is just under 1Mb and can be downloaded from
> http://www.pgbuildfarm.org/mfailures.dump

Sure about that?

--14:45:45-- http://www.pgbuildfarm.org/mfailures.dump
=> `mfailures.dump'
Resolving www.pgbuildfarm.org... 207.173.203.146
Connecting to www.pgbuildfarm.org|207.173.203.146|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9,184,142 (8.8M) [text/plain]

--
BOO! We changed Coke again! BLEAH! BLEAH!


From: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
To: "Jeremy Drake" <pgsql(at)jdrake(dot)com>
Cc: "PostgreSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-16 22:10:16
Message-ID: 62691.75.177.135.163.1174083016.squirrel@www.dunslane.net

Jeremy Drake wrote:
>>
>>
>> The dump is just under 1Mb and can be downloaded from
>> http://www.pgbuildfarm.org/mfailures.dump
>
> Sure about that?
>
> HTTP request sent, awaiting response... 200 OK
> Length: 9,184,142 (8.8M) [text/plain]
>

Damn these new specs. They made me skip a digit.

cheers

andrew


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-18 12:12:48
Message-ID: 45FD2CC0.4000000@agliodbs.com

Andrew,

> Lastly, note that some buildfarm enhancements are on the SOC project
> list. I have no idea if anyone will express any interest in that, of
> course. It's not very glamorous work.

On the other hand, I think there are a lot more student perl hackers and
web people than there are folks with the potential to do backend stuff.
So who knows?

--Josh


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 03:22:13
Message-ID: 20700.1174274533@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> OK, for anyone that wants to play, I have created an extract that
> contains a summary of every non-CVS-related failure we've had. It's a
> single table looking like this:

I did some analysis on this data. Attached is a text dump of a table
declared as

CREATE TABLE mreasons (
sysname text,
snapshot timestamp without time zone,
branch text,
reason text,
known boolean
);

where the sysname/snapshot/branch data is taken from your table,
"reason" is a brief sketch of the failure, and "known" indicates
whether the cause is known ... although as I went along it sort
of evolved into "does this seem worthy of more investigation?".

I looked at every failure back through early December. I'd intended to
go back further, but decided I'd hit a point of diminishing returns.
However, failures back to the beginning of July that matched grep
searches for recent symptoms are classified in the table.
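
One way to find the failures that have not been classified yet is an anti-join between the two tables on sysname/snapshot/branch. The sketch below uses Python's sqlite3 module only so that it is self-contained; the rows are made-up placeholders, and mfailures is cut down to the columns the join needs.

```python
import sqlite3


def unclassified_failures(conn):
    """Failures in mfailures that have no matching row in mreasons."""
    return conn.execute(
        """
        SELECT f.sysname, f.snapshot, f.branch, f.stage
        FROM mfailures AS f
        LEFT JOIN mreasons AS r
          ON r.sysname = f.sysname
         AND r.snapshot = f.snapshot
         AND r.branch = f.branch
        WHERE r.sysname IS NULL      -- no classification row matched
        ORDER BY f.snapshot
        """
    ).fetchall()


# Placeholder data, not taken from the real extract.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mfailures (sysname text, snapshot text, stage text, branch text);
    CREATE TABLE mreasons (sysname text, snapshot text, branch text,
                           reason text, known boolean);
    INSERT INTO mfailures VALUES
        ('hostA', '2007-03-01 00:00:00', 'Check', 'HEAD'),
        ('hostB', '2007-03-02 00:00:00', 'Make', 'HEAD');
    INSERT INTO mreasons VALUES
        ('hostA', '2007-03-01 00:00:00', 'HEAD', 'Out of disk space', 1);
    """
)
```

The same LEFT JOIN should work unchanged against the real tables in PostgreSQL, since it relies only on standard SQL.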

The gross stats are: 2231 failures classified, 71 distinct reason
codes, 81 failures (with 18 reasons) that seem worthy of closer
investigation:

bfarm=# select reason,branch,max(snapshot) as latest, count(*) from mreasons where not known group by 1,2 order by 1,2 ;
                              reason                              |    branch     |       latest        | count
------------------------------------------------------------------+---------------+---------------------+-------
 Input/output error - possible hardware problem                   | HEAD          | 2007-03-06 10:30:01 |     1
 No rule to make target                                           | HEAD          | 2007-02-08 15:30:01 |     6
 No rule to make target                                           | REL8_0_STABLE | 2007-02-28 03:15:02 |     9
 No rule to make target                                           | REL8_2_STABLE | 2006-12-17 20:00:01 |     1
 could not open relation with OID                                 | HEAD          | 2007-03-16 16:45:01 |     2
 could not open relation with OID                                 | REL8_1_STABLE | 2006-08-29 23:30:07 |     2
 createlang not found?                                            | REL8_1_STABLE | 2007-02-28 02:50:00 |     1
 irreproducible contrib/sslinfo build failure, likely not our bug | HEAD          | 2007-02-03 07:03:02 |     1
 irreproducible opr_sanity failure                                | HEAD          | 2006-12-18 19:15:02 |     2
 libintl.h rejected by configure                                  | HEAD          | 2007-01-11 20:35:00 |     3
 libintl.h rejected by configure                                  | REL8_0_STABLE | 2007-03-01 20:28:04 |    22
 postmaster failed to start                                       | REL7_4_STABLE | 2007-02-28 22:23:20 |     1
 postmaster failed to start                                       | REL8_0_STABLE | 2007-02-28 22:30:44 |     1
 random Solaris configure breakage                                | HEAD          | 2007-01-14 05:30:00 |     1
 random Windows breakage                                          | HEAD          | 2007-03-16 09:48:31 |     3
 random Windows breakage                                          | REL8_0_STABLE | 2007-03-15 03:15:09 |     7
 segfault during bootstrap                                        | HEAD          | 2007-03-12 23:03:03 |     1
 server does not shut down                                        | HEAD          | 2007-01-08 03:03:03 |     3
 tablespace is not empty                                          | HEAD          | 2007-02-24 15:00:10 |     6
 tablespace is not empty                                          | REL8_1_STABLE | 2007-01-25 02:30:01 |     2
 unexpected statement_timeout failure                             | HEAD          | 2007-01-25 05:05:06 |     1
 unexplained tsearch2 crash                                       | HEAD          | 2007-01-10 22:05:02 |     1
 weird DST-transition-like timestamp test failure                 | HEAD          | 2007-02-04 07:25:04 |     1
 weird assembler failure, likely not our bug                      | HEAD          | 2006-12-26 17:02:01 |     1
 weird assembler failure, likely not our bug                      | REL8_2_STABLE | 2007-02-03 23:47:01 |     1
 weird install failure                                            | HEAD          | 2007-01-25 12:35:00 |     1
(26 rows)

I think I know the cause of the recent 'could not open relation with
OID' failures in HEAD, but the rest of these maybe need a look.
Any volunteers?

Also, for completeness, the causes I wrote off as not interesting
(anymore, in some cases):

bfarm=# select reason,max(snapshot) as latest, count(*) from mreasons where known group by 1 order by 1 ;
                                reason                                |       latest        | count
----------------------------------------------------------------------+---------------------+-------
 DST transition test failure                                          | 2007-03-13 04:04:47 |    26
 ISO-week-patch regression test breakage                              | 2007-02-16 15:00:08 |    23
 No rule to make Makefile.port                                        | 2007-03-02 12:30:02 |    40
 Out of disk space                                                    | 2007-02-16 22:30:01 |    67
 Out of semaphores                                                    | 2007-02-20 02:03:31 |    14
 Python not installed                                                 | 2007-02-19 22:45:05 |     2
 Solaris random conn-refused bug                                      | 2007-03-06 01:20:00 |    37
 TCP socket already in use                                            | 2007-01-09 07:03:04 |    13
 Too many clients                                                     | 2007-02-26 06:06:02 |    90
 Too many open files in system                                        | 2007-02-27 20:30:59 |    17
 another icc crash                                                    | 2007-02-03 10:50:01 |     1
 apparently a malloc bug                                              | 2007-03-04 23:00:20 |    27
 bogus system clock setting                                           | 1997-12-21 15:20:11 |     6
 breakage from changing := to = in makefiles                          | 2007-02-10 02:15:01 |     4
 broken GUC patch                                                     | 2007-03-13 15:15:01 |    92
 broken float8 hacking                                                | 2007-01-06 20:00:09 |   120
 broken fsync-revoke patch                                            | 2007-01-17 16:21:01 |    77
 broken inet hacking                                                  | 2007-01-03 00:05:01 |     4
 broken log_error patch                                               | 2007-01-28 08:15:01 |    15
 broken money patch                                                   | 2007-01-03 19:05:01 |    78
 broken pg_regress change for msvc support                            | 2007-01-19 22:03:00 |    46
 broken plpython patch                                                | 2007-01-25 14:21:00 |    22
 broken sys_siglist patch                                             | 2007-01-28 06:06:02 |    18
 bug in btree page split patch                                        | 2007-02-08 11:35:03 |     7
 buildfarm pilot error                                                | 2007-01-19 03:28:07 |    69
 cache flush bug in operator-family patch                             | 2006-12-31 10:30:03 |     8
 ccache failure                                                       | 2007-01-25 23:00:34 |     2
 could not create shared memory                                       | 2007-02-13 07:00:05 |    32
 ecpg regression test teething pains                                  | 2007-02-03 13:30:02 |   516
 failure to update PL expected files for may/can/might rewording      | 2007-02-01 20:15:01 |     8
 failure to update contrib expected files for may/can/might rewording | 2007-02-01 21:15:02 |    11
 failure to update expected files for may/can/might rewording         | 2007-02-01 19:35:02 |     3
 icc "internal error"                                                 | 2007-03-16 16:30:01 |    29
 image not found (possibly related to too-many-open-files)            | 2006-10-25 08:05:02 |     1
 largeobject test bugs                                                | 2007-02-17 23:35:03 |     4
 ld segfaulted                                                        | 2007-03-16 15:30:02 |     3
 missing BYTE_ORDER definition for Solaris                            | 2007-01-10 14:18:23 |     1
 pg_regress patch breakage                                            | 2007-02-08 18:30:01 |     1
 plancache test race condition                                        | 2007-03-16 11:15:01 |     5
 pltcl regression test broken by ORDER BY semantics tightening        | 2007-01-09 03:15:01 |     9
 previous contrib test still running                                  | 2007-02-13 20:49:33 |    21
 random Solaris breakage                                              | 2007-01-05 17:20:01 |     1
 random Windows breakage                                              | 2006-12-27 03:15:07 |     1
 random Windows permission-denied failures                            | 2007-02-12 11:00:09 |     5
 random ccache breakage                                               | 2007-01-04 01:34:33 |     1
 readline misconfiguration                                            | 2007-02-12 17:19:41 |    33
 row-ordering discrepancy in rowtypes test                            | 2007-02-10 03:00:02 |     3
 stats test failed                                                    | 2007-03-14 13:00:02 |   319
 threaded Python library                                              | 2007-01-10 04:05:02 |     6
 undefined symbol pg_mic2ascii                                        | 2007-02-03 01:13:40 |   101
 unexpected signal 9                                                  | 2006-12-31 06:30:02 |    15
 unportable uuid patch                                                | 2007-01-31 17:30:01 |    16
 use of // comment                                                    | 2007-02-16 09:23:02 |     1
 xml code teething problems                                           | 2007-02-16 16:01:05 |    79
(54 rows)

Some of these might possibly be interesting to other people ...

regards, tom lane

Attachment: mreasons.dump.gz (application/octet-stream, 17.3 KB)

From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 04:46:47
Message-ID: 45FE15B7.60208@commandprompt.com

> unportable uuid patch | 2007-01-31 17:30:01 | 16
> use of // comment | 2007-02-16 09:23:02 | 1
> xml code teething problems | 2007-02-16 16:01:05 | 79
> (54 rows)
>
> Some of these might possibly be interesting to other people ...

If you provide the various greps, etc... I will put it into the website
proper...

Joshua D. Drake



From: Jeremy Drake <pgsql(at)jdrake(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 05:11:18
Message-ID: Pine.BSO.4.64.0703182155350.19070@resin.csoft.net

On Sun, 18 Mar 2007, Tom Lane wrote:

> another icc crash | 2007-02-03 10:50:01 | 1
> icc "internal error" | 2007-03-16 16:30:01 | 29

These on mongoose are most likely a result of flaky hardware. They tend
to occur most often when either
a) I am doing something else on the box when the build runs, or
b) the ambient temperature in the room is > ~72degF

I need to bring down this box at some point and try to figure out if it is
bad memory or what.

Anyway, ICC seems to be one of the few things that are really susceptible
to hardware issues (on this box at least, it is mostly ICC and firefox),
and I apologize for the noise this caused in the buildfarm logs...

--
American business long ago gave up on demanding that prospective
employees be honest and hardworking. It has even stopped hoping for
employees who are educated enough that they can tell the difference
between the men's room and the women's room without having little
pictures on the doors.
-- Dave Barry, "Urine Trouble, Mister"


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 05:39:49
Message-ID: 21808.1174282789@sss.pgh.pa.us

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>> Some of these might possibly be interesting to other people ...

> If you provide the various greps, etc... I will put it into the website
> proper...

Unfortunately I didn't keep notes on exactly what I searched for in each
case. Some of them were not based on grep at all, but rather "this
failure looks similar to those others and happened in the period between
a known bad patch commit and its fix". The goal was essentially to
group together failures that probably arose from the same cause --- I
may have made a mistake or two along the way ...
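
For anyone wanting to make that grouping repeatable, the two kinds of evidence described above (a grep-style match on the log, or a failure falling between a known bad commit and its fix) could be written down as rules. The Python below is only a sketch: the patterns, the date window, and the "example broken patch" label are invented placeholders, not the searches actually used, and a window match on its own is weaker evidence than "looks similar to those others", so a person would still want to review what it tags.

```python
import re
from datetime import datetime

# Hypothetical grep-style rules: (compiled pattern, reason label).
PATTERN_RULES = [
    (re.compile(r"No space left on device"), "Out of disk space"),
    (re.compile(r"too many clients", re.IGNORECASE), "Too many clients"),
]

# Hypothetical window rules: (branch, bad commit time, fix time, reason).
# The dates are placeholders, not real commit times.
WINDOW_RULES = [
    ("HEAD", datetime(2007, 1, 1), datetime(2007, 1, 3), "example broken patch"),
]


def classify(branch, snapshot, log_text,
             pattern_rules=PATTERN_RULES, window_rules=WINDOW_RULES):
    """Return a reason label for one failure, or None if nothing matches."""
    # Log-text evidence is checked first because it is the more specific.
    for pattern, reason in pattern_rules:
        if pattern.search(log_text):
            return reason
    # Otherwise fall back to "failed while a known-bad patch was in the tree".
    for rule_branch, start, end, reason in window_rules:
        if branch == rule_branch and start <= snapshot <= end:
            return reason
    return None
```

Failures for which classify() returns None are the ones that still need a human look.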

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeremy Drake <pgsql(at)jdrake(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 05:50:54
Message-ID: 21900.1174283454@sss.pgh.pa.us

Jeremy Drake <pgsql(at)jdrake(dot)com> writes:
> These on mongoose are most likely a result of flaky hardware.

Yeah, I saw a pretty fair number of irreproducible issues that are
probably hardware flake-outs. Of course you can't tell which are those
and which are low-probability software bugs for many moons...

I believe that a large fraction of the buildfarm consists of
semi-retired equipment that is probably more prone to this sort of
problem than newer stuff would be. But that's the price we must pay
for building such a large test farm on a shoestring. What we need to do
to deal with it, I think, is institutionalize some kind of long-term
tracking so that we can tell the recurrent from the non-recurrent
issues. I don't quite know how to do that; what I did over this past
weekend was labor-intensive and not scalable.

SoC project perhaps?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 06:34:06
Message-ID: 22205.1174286046@sss.pgh.pa.us

BTW, before I forget, this little project turned up a couple of
small improvements for the current buildfarm infrastructure:

1. There are half a dozen entries with obviously bogus timestamps:

bfarm=# select sysname,snapshot,branch from mfailures where snapshot < '2004-01-01';
  sysname   |      snapshot       | branch
------------+---------------------+--------
 corgi      | 1997-10-14 14:20:10 | HEAD
 kookaburra | 1970-01-01 01:23:00 | HEAD
 corgi      | 1997-09-30 11:47:08 | HEAD
 corgi      | 1997-10-17 14:20:11 | HEAD
 corgi      | 1997-12-21 15:20:11 | HEAD
 corgi      | 1997-10-15 14:20:10 | HEAD
 corgi      | 1997-09-28 11:47:09 | HEAD
 corgi      | 1997-09-28 11:47:08 | HEAD
(8 rows)

indicating wrong system clock settings on these buildfarm machines.
(Indeed, IIRC these failures were actually caused by the ridiculous
clock settings --- we have at least one regression test that checks
century >= 21 ...) Perhaps the buildfarm server should bounce
reports with timestamps more than a day in the past or a few minutes in
the future. I think though that a more useful answer would be to
include "time of receipt of report" in the permanent record, and then
subsequent analysis could make its own decisions about whether to
believe the snapshot timestamp --- plus we could track elapsed times for
builds, which could be interesting in itself.

2. I was annoyed repeatedly that some buildfarm members weren't
reporting log_archive_filenames entries, which forced going the long
way round in the process I was using. Seems like we need some more
proactive means for getting buildfarm owners to keep their script
versions up-to-date. Not sure what that should look like exactly,
as long as it's not "you can run an ancient version as long as you
please".

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 13:04:03
Message-ID: 87hcshjsfw.fsf@stark.xeocode.com


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Also, for completeness, the causes I wrote off as not interesting
> (anymore, in some cases):
>
> missing BYTE_ORDER definition for Solaris | 2007-01-10 14:18:23 | 1

What is this BYTE_ORDER macro? Should I be using it instead of the
AC_C_BIGENDIAN test in configure for the packed varlena patch?

> row-ordering discrepancy in rowtypes test | 2007-02-10 03:00:02 | 3

Is this because the test is fixed or unfixable? If not shouldn't the test get
an ORDER BY clause so that it will reliably pass on future versions?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 13:17:24
Message-ID: 45FE8D64.40309@kaltenbrunner.cc

Gregory Stark wrote:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
>> Also, for completeness, the causes I wrote off as not interesting
>> (anymore, in some cases):
>>
>> missing BYTE_ORDER definition for Solaris | 2007-01-10 14:18:23 | 1
>
> What is this BYTE_ORDER macro? Should I be using it instead of the
> AC_C_BIGENDIAN test in configure for the packed varlena patch?

FYI: this is the relevant commit (the affected buildfarm member was
clownfish)
http://archives.postgresql.org/pgsql-committers/2007-01/msg00154.php

Stefan


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 13:45:15
Message-ID: 45FE93EB.7030006@dunslane.net

Tom Lane wrote:
> BTW, before I forget, this little project turned up a couple of
> small improvements for the current buildfarm infrastructure:
>
> 1. There are half a dozen entries with obviously bogus timestamps:
>
> bfarm=# select sysname,snapshot,branch from mfailures where snapshot < '2004-01-01';
>   sysname   |      snapshot       | branch
> ------------+---------------------+--------
>  corgi      | 1997-10-14 14:20:10 | HEAD
>  kookaburra | 1970-01-01 01:23:00 | HEAD
>  corgi      | 1997-09-30 11:47:08 | HEAD
>  corgi      | 1997-10-17 14:20:11 | HEAD
>  corgi      | 1997-12-21 15:20:11 | HEAD
>  corgi      | 1997-10-15 14:20:10 | HEAD
>  corgi      | 1997-09-28 11:47:09 | HEAD
>  corgi      | 1997-09-28 11:47:08 | HEAD
> (8 rows)
>
> indicating wrong system clock settings on these buildfarm machines.
> (Indeed, IIRC these failures were actually caused by the ridiculous
> clock settings --- we have at least one regression test that checks
> century >= 21 ...) Perhaps the buildfarm server should bounce
> reports with timestamps more than a day in the past or a few minutes in
> the future. I think though that a more useful answer would be to
> include "time of receipt of report" in the permanent record, and then
> subsequent analysis could make its own decisions about whether to
> believe the snapshot timestamp --- plus we could track elapsed times for
> builds, which could be interesting in itself.
>

We actually do timestamp the reports - I just didn't include that in the
extract. I will alter the view it's based on. We started doing this in
Nov 2005, so I'm going to restrict the view to cases where the
report_time is not null - I doubt we're interested in ancient history.

A revised extract is available at
http://www.pgbuildfarm.org/mfailures2.dump

We already reject snapshot times that are in the future.

Use of NTP is highly recommended to buildfarm members, but I'm reluctant
to make it mandatory, as they might not have it available. I think we
can do this: alter the client script to report its idea of current time
at the time it makes the web transaction. If it's off from the server
time by more than some small value (say 60 secs), adjust the snapshot
time accordingly. If they don't report it then we can reject insane
dates (more than 24 hours ago seems about right).
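
A minimal Python sketch of the adjustment described above. The function
name, its parameters and the way the offset is applied are assumptions
made for illustration; this is not the buildfarm server's actual code.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_SKEW = timedelta(seconds=60)  # "some small value (say 60 secs)"
MAX_AGE = timedelta(hours=24)     # "more than 24 hours ago"


def normalize_snapshot(snapshot: datetime,
                       server_now: datetime,
                       client_now: Optional[datetime] = None) -> Optional[datetime]:
    """Return the snapshot time to store, or None to reject the report."""
    if client_now is not None:
        # The client told us what it thinks "now" is: shift the snapshot
        # by the clock offset when the offset is larger than the tolerance.
        skew = server_now - client_now
        if abs(skew) > MAX_SKEW:
            snapshot = snapshot + skew
    elif server_now - snapshot > MAX_AGE:
        # No client clock reported: fall back to rejecting insane dates.
        return None
    if snapshot > server_now:
        # Snapshot times in the future are rejected in every case.
        return None
    return snapshot
```

With this shape, a member whose clock is stuck in 1997 still gets a
usable snapshot time as long as it reports its own idea of the current
time, while an old client with the same clock is simply bounced.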

So I agree with both your suggestions ;-)

> 2. I was annoyed repeatedly that some buildfarm members weren't
> reporting log_archive_filenames entries, which forced going the long
> way round in the process I was using. Seems like we need some more
> proactive means for getting buildfarm owners to keep their script
> versions up-to-date. Not sure what that should look like exactly,
> as long as it's not "you can run an ancient version as long as you
> please".
>
>
>

Modern clients report the versions of the two scripts involved (see
script_version and web_script_version in reported config) so we could
easily enforce a minimum version on these.
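
A sketch of such a check in Python, assuming the reported versions are
dotted numeric strings. The minimum values and the function names are
hypothetical; only the two config keys come from the message above.

```python
from typing import List, Mapping, Tuple

# Hypothetical minimum versions, for illustration only.
MIN_VERSIONS = {"script_version": "1.71", "web_script_version": "1.12"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.71' into (1, 71) so that comparison is numeric, not textual."""
    return tuple(int(part) for part in text.strip().split("."))


def outdated_scripts(config: Mapping[str, str],
                     minimums: Mapping[str, str] = MIN_VERSIONS) -> List[str]:
    """List the config keys whose reported version is missing or too old."""
    outdated = []
    for key, minimum in minimums.items():
        reported = config.get(key)
        if reported is None or parse_version(reported) < parse_version(minimum):
            outdated.append(key)
    return outdated
```

Comparing tuples of integers matters here: as plain strings, "1.9"
would sort after "1.71" and an old client would slip through.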

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 13:55:38
Message-ID: 4246.1174312538@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> missing BYTE_ORDER definition for Solaris | 2007-01-10 14:18:23 | 1

> What is this BYTE_ORDER macro? Should I be using it instead of the
> AC_C_BIGENDIAN test in configure for the packed varlena patch?

Actually, if we start to rely on AC_C_BIGENDIAN, I'd prefer to see us
get rid of direct usages of BYTE_ORDER. It looks like only
contrib/pgcrypto is depending on it today, but we've got lots of
cruft in the include/port/ files supporting that.

>> row-ordering discrepancy in rowtypes test | 2007-02-10 03:00:02 | 3

> Is this because the test is fixed or unfixable?

It's fixed.
http://archives.postgresql.org/pgsql-committers/2007-02/msg00228.php

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 14:15:23
Message-ID: 873b41jp50.fsf@stark.xeocode.com
Lists: pgsql-hackers


"Gregory Stark" <stark(at)enterprisedb(dot)com> writes:

> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
>> row-ordering discrepancy in rowtypes test | 2007-02-10 03:00:02 | 3
>
> Is this because the test is fixed or unfixable? If not shouldn't the test get
> an ORDER BY clause so that it will reliably pass on future versions?

Hm, I took a quick look at this test and while there are a couple tests
missing ORDER BY clauses I can't see how they could possibly generate results
that are out of order. Perhaps the ones that do have ORDER BY clauses only
recently acquired them?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 17:14:03
Message-ID: 14031.1174324443@sss.pgh.pa.us
Lists: pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> The current buildfarm webpages make it easy to see when a branch tip
>> is seriously broken, but it's not very easy to investigate transient
>> failures, such as a regression test race condition that only
>> materializes once in awhile.

> If the data is already there and just not represented, just let me know
> exactly what you want and I will implement pages for that data happily.

I think what would be nice is some way to view all the failures for a
given branch, extending back not-sure-how-far. Right now the only way
to see past failures is to look at individual machines' histories, which
is not real satisfactory when you want a broader view.

Actually what I *really* want is something closer to "show me all the
unexplained failures", but unless Andrew is willing to support some way
of tagging failures in the master database, I suppose that won't happen.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 21:25:13
Message-ID: 45FEFFB9.3060908@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> I think what would be nice is some way to view all the failures for a
> given branch, extending back not-sure-how-far. Right now the only way
> to see past failures is to look at individual machines' histories, which
> is not real satisfactory when you want a broader view.
>
> Actually what I *really* want is something closer to "show me all the
> unexplained failures", but unless Andrew is willing to support some way
> of tagging failures in the master database, I suppose that won't happen.
>
>
>

Well, if I understood how it might work it might happen.

Who would do the tagging, and how?

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 21:27:47
Message-ID: 45FF0053.2050705@dunslane.net
Lists: pgsql-hackers

I wrote:
>> 2. I was annoyed repeatedly that some buildfarm members weren't
>> reporting log_archive_filenames entries, which forced going the long
>> way round in the process I was using. Seems like we need some more
>> proactive means for getting buildfarm owners to keep their script
>> versions up-to-date. Not sure what that should look like exactly,
>> as long as it's not "you can run an ancient version as long as you
>> please".
>>
>>
>>
>
> Modern clients report the versions of the two scripts involved (see
> script_version and web_script_version in reported config) so we could
> easily enforce a minimum version on these.
>

Meanwhile, the owner of the main 2 offending machines has said he will
upgrade them.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-19 22:58:28
Message-ID: 18456.1174345108@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> Actually what I *really* want is something closer to "show me all the
>> unexplained failures", but unless Andrew is willing to support some way
>> of tagging failures in the master database, I suppose that won't happen.

> Who would do the tagging, and how?

Well, that's the hard part isn't it? I was sort of envisioning a group
of users who'd be authorized to log in and set tags on database entries
somehow. I'm not sure about details. One issue is that the majority
of failures come in batches (when one of us commits a bad patch).
With the current web interface it would be real tedious to verify which
of the failures in a particular time interval matched the symptoms of
a failure. What I did for my experiment this weekend was to download
the last-stage-log of each failed build, which required an hour or so
of setup time; then I could use grep to confirm which logs matched a
failure that I'd identified. Doing that through the current webpage
would involve lots of clicking and waiting. If we could expose a
text-search-style API for grepping the stage logs, it'd be a lot easier
to collect related failures. Then maybe a few widgets to let authorized
users apply a tag to the search results ...
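
The grep step over already-downloaded logs could look roughly like this
in Python. The directory layout (one "*.log" file per failed build) and
the function name are assumptions, not a description of the actual
setup.

```python
import re
from pathlib import Path
from typing import List


def logs_matching(log_dir: Path, pattern: str) -> List[str]:
    """Return the names of downloaded stage logs whose text matches pattern."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(log_dir.glob("*.log")):
        # Logs may contain stray bytes; replace them rather than fail.
        text = path.read_text(errors="replace")
        if regex.search(text):
            hits.append(path.name)
    return hits
```

A search API on the server would do the same thing without the hour of
setup needed to fetch every log first.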

I'm not entirely sure that this infrastructure would pay for itself,
though. Without some users willing to take the time to separate
explained from unexplained failures, it'd be a waste of effort.
But we've already had a couple of cases of interesting failures going
unnoticed because of the noise level. Between duplicate reports about
busted patches and transient problems on particular build machines
(out of disk space, misconfiguration, etc) it's pretty hard to not miss
the once-in-a-while failures. Is there some other way we could attack
that problem?

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 02:14:32
Message-ID: 45FF4388.6060406@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Tom Lane wrote:
>>
>>> Actually what I *really* want is something closer to "show me all the
>>> unexplained failures", but unless Andrew is willing to support some way
>>> of tagging failures in the master database, I suppose that won't happen.
>>>
>
>
>> Who would do the tagging, and how?
>>
>
> Well, that's the hard part isn't it? I was sort of envisioning a group
> of users who'd be authorized to log in and set tags on database entries
> somehow. I'm not sure about details. One issue is that the majority
> of failures come in batches (when one of us commits a bad patch).
> With the current web interface it would be real tedious to verify which
> of the failures in a particular time interval matched the symptoms of
> a failure. What I did for my experiment this weekend was to download
> the last-stage-log of each failed build, which required an hour or so
> of setup time; then I could use grep to confirm which logs matched a
> failure that I'd identified. Doing that through the current webpage
> would involve lots of clicking and waiting. If we could expose a
> text-search-style API for grepping the stage logs, it'd be a lot easier
> to collect related failures. Then maybe a few widgets to let authorized
> users apply a tag to the search results ...
>
> I'm not entirely sure that this infrastructure would pay for itself,
> though. Without some users willing to take the time to separate
> explained from unexplained failures, it'd be a waste of effort.
> But we've already had a couple of cases of interesting failures going
> unnoticed because of the noise level. Between duplicate reports about
> busted patches and transient problems on particular build machines
> (out of disk space, misconfiguration, etc) it's pretty hard to not miss
> the once-in-a-while failures. Is there some other way we could attack
> that problem?
>
>

I'm not too sanguine about having a team of eager taggers.

I think we probably need to work on a usable API for extracting data in
small or large amounts, and maybe some good text search facilities.

The real issue is the one you identify of stuff getting lost in the
noise. But I'm not sure there's any realistic cure for that.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 06:57:13
Message-ID: 5293.1174373833@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> But we've already had a couple of cases of interesting failures going
>> unnoticed because of the noise level. Between duplicate reports about
>> busted patches and transient problems on particular build machines
>> (out of disk space, misconfiguration, etc) it's pretty hard to not miss
>> the once-in-a-while failures. Is there some other way we could attack
>> that problem?

> The real issue is the one you identify of stuff getting lost in the
> noise. But I'm not sure there's any realistic cure for that.

Maybe we should think about filtering the noise. Like, say, discarding
every report from mongoose that involves an icc core dump ...
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mongoose&dt=2007-03-20%2006:30:01

That's only semi-serious, but I do think that it's getting harder to
pluck the wheat from the chaff. My investigations over the weekend
showed that we have got basically three categories of reports:

1. genuine code breakage from unportable patches: normally multiple
reports over a short period until we fix or revert the cause.
2. failures on a single buildfarm member due to misconfiguration,
hardware flakiness, etc. These are sometimes repeatable and sometimes
not.
3. all the rest, of which some fraction represents bugs we need to fix,
only we don't know they're there.

In category 1 the buildfarm certainly pays for itself, but we'd hoped
that it would help us spot less-reproducible errors too. The problem
I'm seeing is that category 2 is overwhelming our ability to recognize
patterns within category 3. How can we dial down the noise level?

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 10:22:42
Message-ID: 20070320102242.GC19221@svana.org
Lists: pgsql-hackers

On Tue, Mar 20, 2007 at 02:57:13AM -0400, Tom Lane wrote:
> Maybe we should think about filtering the noise. Like, say, discarding
> every report from mongoose that involves an icc core dump ...
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mongoose&dt=2007-03-20%2006:30:01

Maybe a simple compromise would be being able to set up a set of regexes
that search the output and set a flag if that string is found. If you
find the string, it gets marked with a flag, which means that when you
look at mongoose, any failures that don't have the flag become easier
to spot.

It also means that once you've found a common failure, you can create
the regex and then any other failures with the same string get tagged
also, making unexplained ones easier to spot.
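
A small Python sketch of the idea, assuming each failure's log is
available as text. The flag names and patterns below are made up for
illustration and would need checking against real logs.

```python
import re
from typing import List, Mapping, Optional

# Hypothetical flags and patterns; real ones would come from log analysis.
KNOWN_FAILURES = {
    "compiler-core-dump": re.compile(r"core dumped"),
    "disk-full": re.compile(r"No space left on device"),
}


def flag_for(log_text: str) -> Optional[str]:
    """Return the first flag whose regex matches the log, or None."""
    for flag, regex in KNOWN_FAILURES.items():
        if regex.search(log_text):
            return flag
    return None


def unexplained(logs: Mapping[str, str]) -> List[str]:
    """Names of the failure reports that no known pattern accounts for."""
    return [name for name, text in logs.items() if flag_for(text) is None]
```

Whatever `unexplained` returns is the list worth a human look; the
flagged reports are the ones already accounted for.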

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 13:24:26
Message-ID: 45FFE08A.4030100@dunslane.net
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Tue, Mar 20, 2007 at 02:57:13AM -0400, Tom Lane wrote:
>
>> Maybe we should think about filtering the noise. Like, say, discarding
>> every report from mongoose that involves an icc core dump ...
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mongoose&dt=2007-03-20%2006:30:01
>>
>
> Maybe a simple compromise would be being able to set up a set of regexes
> that search the output and set a flag if that string is found. If you
> find the string, it gets marked with a flag, which means that when you
> look at mongoose, any failures that don't have the flag become easier
> to spot.
>
> It also means that once you've found a common failure, you can create
> the regex and then any other failures with the same string get tagged
> also, making unexplained ones easier to spot.
>
>
>

You need to show first that this is an adequate tagging mechanism, both
in tagging things adequately and in not picking up false positives,
which would make things worse, not better. And even then you need
someone to do the analysis to create the regex.

The buildfarm works because it leverages our strength, namely automating
things. But all the tagging suggestions I've seen will involve regular,
repetitive and possibly boring work, precisely the thing we are not good
at as a group.

If we had some staff they could be given this task (among others),
assuming we show that it actually works. We don't, so they can't.

cheers

andrew


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 13:40:37
Message-ID: 45FFE455.4080909@kaltenbrunner.cc
Lists: pgsql-hackers

Andrew Dunstan wrote:
> Martijn van Oosterhout wrote:
>> On Tue, Mar 20, 2007 at 02:57:13AM -0400, Tom Lane wrote:
>>
>>> Maybe we should think about filtering the noise. Like, say, discarding
>>> every report from mongoose that involves an icc core dump ...
>>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=mongoose&dt=2007-03-20%2006:30:01
>>>
>>>
>>
>> Maybe a simple compromise would be being able to set up a set of regexes
>> that search the output and set a flag if that string is found. If you
>> find the string, it gets marked with a flag, which means that when you
>> look at mongoose, any failures that don't have the flag become easier
>> to spot.
>>
>> It also means that once you've found a common failure, you can create
>> the regex and then any other failures with the same string get tagged
>> also, making unexplained ones easier to spot.
>>
>>
>>
>
> You need to show first that this is an adequate tagging mechanism, both
> in tagging things adequately and in not picking up false positives,
> which would make things worse, not better. And even then you need
> someone to do the analysis to create the regex.
>
> The buildfarm works because it leverages our strength, namely automating
> things. But all the tagging suggestions I've seen will involve regular,
> repetitive and possibly boring work, precisely the thing we are not good
> at as a group.

This is probably true - however, as a buildfarm admin I have occasionally
wished I had a way to invalidate reports generated from my boxes, to
prevent someone wasting time investigating them (like errors caused by
system upgrades, configuration problems or other local issues).

But I agree that it might be difficult to make that "manual tagging"
process scalable and reliable enough so that it really is an improvement
over what we have now.

Stefan


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 13:55:07
Message-ID: 45FFE7BB.9040602@dunslane.net
Lists: pgsql-hackers

Stefan Kaltenbrunner wrote:
> however, as a buildfarm admin I have occasionally wished I had a way to
> invalidate reports generated from my boxes, to prevent someone wasting
> time investigating them (like errors caused by system
> upgrades, configuration problems or other local issues).
>

It would be extremely simple to provide a 'revoke report' API and
client. Good idea.

But that's quite different from what we have been discussing.

cheers

andrew


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 14:04:59
Message-ID: 20070320140459.GS24234@alvh.no-ip.org
Lists: pgsql-hackers

Andrew Dunstan wrote:

> The buildfarm works because it leverages our strength, namely automating
> things. But all the tagging suggestions I've seen will involve regular,
> repetitive and possibly boring work, precisely the thing we are not good
> at as a group.

You may be forgetting that Martijn and others tagged the
scan.coverity.com database. Now, there are some untagged errors, but
I'd say that that's because we don't control the tool, so we cannot fix
it if there are false positives. We do control the buildfarm however,
so we can develop systematic solutions for widespread problems (instead
of forcing us to check and tag every single occurrence of
widespread problems).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 14:09:08
Message-ID: 45FFEB04.6050108@dunslane.net
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Andrew Dunstan wrote:
>
>
>> The buildfarm works because it leverages our strength, namely automating
>> things. But all the tagging suggestions I've seen will involve regular,
>> repetitive and possibly boring work, precisely the thing we are not good
>> at as a group.
>>
>
> You may be forgetting that Martijn and others tagged the
> scan.coverity.com database. Now, there are some untagged errors, but
> I'd say that that's because we don't control the tool, so we cannot fix
> it if there are false positives. We do control the buildfarm however,
> so we can develop systematic solutions for widespread problems (instead
> of forcing us to check and tag every single occurrence of
> widespread problems).
>
>

Well, I'm sure we can provide appropriate access or data for anyone who
wants to do research in this area and prove me wrong.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 14:40:23
Message-ID: 13514.1174401623@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Martijn van Oosterhout wrote:
>> Maybe a simple compromise would be being able to set up a set of regexes
>> that search the output and set a flag if that string is found. If you
>> find the string, it gets marked with a flag, which means that when you
>> look at mongoose, any failures that don't have the flag become easier
>> to spot.
>>
>> It also means that once you've found a common failure, you can create
>> the regex and then any other failures with the same string get tagged
>> also, making unexplained ones easier to spot.

> You need to show first that this is an adequate tagging mechanism, both
> in tagging things adequately and in not picking up false positives,
> which would make things worse, not better. And even then you need
> someone to do the analysis to create the regex.

Well, my experiment over the weekend with doing exactly that convinced
me that regexes could be used successfully to identify common-mode
failures. So I think Martijn has a fine idea here. And I don't see a
problem with lack of motivation, at least for those of us who try to pay
attention to buildfarm results --- once you've looked at a couple of
reports of the same issue, you really don't want to have to repeat the
analysis over and over. But just assuming that every report on a
particular day reflects the same breakage is exactly the risk I wish
we didn't have to take.

For a lot of cases there is not a need for an ongoing filter: we break
something, we get a pile of reports, we fix it, and then we want to tag
all the reports of that something so that we can see if anything else
happened in the same interval. So for this, something based on an
interactive search API would work fine. You could even use that for
repetitive problems such as buildfarm misconfigurations, though having
to repeat the search every so often would get old in the end. The main
thing though is for the database to remember the tags once made.
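
To illustrate the "search an interval, tag the hits, keep the tags"
workflow, here is a sketch using an in-memory SQLite database from
Python. The table layout and function names are invented for the
example, and it uses plain substring search because SQLite has no regex
operator enabled by default; the buildfarm database itself is
PostgreSQL and may be organized quite differently.

```python
import sqlite3
from typing import List, Tuple


def open_store() -> sqlite3.Connection:
    """Create a throwaway store; a NULL tag means 'not yet explained'."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE failures (sysname TEXT, snapshot TEXT, log TEXT, tag TEXT)"
    )
    return db


def tag_matching(db: sqlite3.Connection, needle: str, tag: str,
                 start: str, end: str) -> int:
    """Tag untagged failures in [start, end] whose log contains needle."""
    cur = db.execute(
        "UPDATE failures SET tag = ? "
        "WHERE tag IS NULL AND snapshot BETWEEN ? AND ? AND instr(log, ?) > 0",
        (tag, start, end, needle),
    )
    db.commit()
    return cur.rowcount


def untagged(db: sqlite3.Connection, start: str, end: str) -> List[Tuple[str, str]]:
    """Failures in the interval that are still unexplained."""
    rows = db.execute(
        "SELECT sysname, snapshot FROM failures "
        "WHERE tag IS NULL AND snapshot BETWEEN ? AND ? ORDER BY snapshot",
        (start, end),
    )
    return rows.fetchall()
```

Because the tag is stored with the row, the search only has to be run
once per breakage; afterwards `untagged` shows what else happened in
the same interval.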

> The buildfarm works because it leverages our strength, namely automating
> things. But all the tagging suggestions I've seen will involve regular,
> repetitive and possibly boring work, precisely the thing we are not good
> at as a group.

Well, responding to bug reports could be called regular and repetitive
work --- in reality I don't find it so, because every bug is different.
The point I think you are missing is that having something like this
will *eliminate* repetitive, boring work, namely recognizing multiple
reports of the same problem. The buildfarm has gotten big enough that
some way of dealing with that is desperately needed, else our ability
to spot infrequently-reported issues will disappear entirely.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 15:36:09
Message-ID: 45FFFF69.8050608@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> The point I think you are missing is that having something like this
> will *eliminate* repetitive, boring work, namely recognizing multiple
> reports of the same problem. The buildfarm has gotten big enough that
> some way of dealing with that is desperately needed, else our ability
> to spot infrequently-reported issues will disappear entirely.
>
>

OK. How about if we have a table of <branch, failure_stage, regex, tag,
description, start_date> plus some webby transactions for approved users
to edit this?
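
One possible reading of that table, sketched in Python. The field names
follow the list above; the `Failure` record, the exact-match rule on
branch and stage, and the interpretation of start_date as "applies to
reports on or after this date" are all assumptions.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class TagRule:
    branch: str
    failure_stage: str
    regex: str
    tag: str
    description: str
    start_date: datetime


@dataclass(frozen=True)
class Failure:
    """Hypothetical shape of one failed build report."""
    branch: str
    stage: str
    snapshot: datetime
    log: str


def tags_for(failure: Failure, rules: List[TagRule]) -> List[str]:
    """Return the tags of every rule that applies to this failure."""
    return [
        rule.tag
        for rule in rules
        if rule.branch == failure.branch
        and rule.failure_stage == failure.stage
        and failure.snapshot >= rule.start_date
        and re.search(rule.regex, failure.log)
    ]
```

Restricting each rule by branch, stage and start date narrows where a
too-loose regex can fire, which limits the damage a false positive can
do.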

The wrinkle is that applying the tags on the fly is probably not a great
idea - the status page query is already in desperate need of overhauling
because it's too slow. So we'd need a daemon to set up the tags in the
background. But that's an implementation detail. Screen real estate on
the dashboard page is also in very short supply. Maybe we could play
with the background colour, so that a tagged failure had, say, a blue
background, as opposed to the red/pink/yellow we use for failures now.
Again - an implementation detail.

My biggest worry apart from maintenance (which doesn't matter that much
- if people don't enter the regexes they don't get the tags they want)
is that the regexes will not be specific enough, and so give false
positives on the tags. Then if you're looking for things that aren't
tagged you'd be even more likely than today to miss the outliers. Lord
knows that regexes are hard to get right - I've been using them for a
couple of decades and they've earned me lots of money, and I still get
them wrong regularly (including several cases on the buildfarm). But
maybe we need to take the plunge and see how it works.

This would be a fine SOC project - I at least won't have time to develop
it for quite some time.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 15:59:01
Message-ID: 14527.1174406341@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> The wrinkle is that applying the tags on the fly is probably not a great
> idea - the status page query is already in desperate need of overhauling
> because it's too slow. So we'd need a daemon to set up the tags in the
> background. But that's an implementation detail. Screen real estate on
> the dashboard page is also in very short supply. Maybe we could play
> with the background colour, so that a tagged failure had, say, a blue
> background, as opposed to the red/pink/yellow we use for failures now.
> Again - an implementation detail.

I'm not sure that the current status dashboard needs to pay any attention
to the tags. The view that I would like to have of "recent failures
across all machines in a branch" is the one that needs to be tag-aware,
and perhaps also the existing display of a given machine's branch history.

> My biggest worry apart from maintenance (which doesn't matter that much
> - if people don't enter the regexes they don't get the tags they want)
> is that the regexes will not be specific enough, and so give false
> positives on the tags.

True. I strongly suggest that we want an interactive search-and-tag
capability *before* worrying about automatic tagging --- one of the
reasons for that is to provide a way to test a regex that you might
then consider adding to the automatic filter for future reports.

> This would be a fine SOC project - I at least won't have time to develop
> it for quite some time.

Agreed. Who's maintaining the SOC project list page?

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 17:20:43
Message-ID: 20070320172043.GB24108@svana.org
Lists: pgsql-hackers

On Tue, Mar 20, 2007 at 11:36:09AM -0400, Andrew Dunstan wrote:
> My biggest worry apart from maintenance (which doesn't matter that much
> - if people don't enter the regexes they don't get the tags they want)
> is that the regexes will not be specific enough, and so give false
> positives on the tags. Then if you're looking for things that aren't
> tagged you'd be even more likely than today to miss the outliers. Lord

I think you could solve that by displaying the text that matched the
regex. If it starts matching odd things it'd be visible.
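
For instance, a helper along these lines (Python, names hypothetical)
could return the matched text with a little surrounding context, so
that whoever reviews a tag sees exactly what triggered it:

```python
import re
from typing import Optional


def matched_context(pattern: str, log_text: str, width: int = 30) -> Optional[str]:
    """Return the first match, bracketed, with up to `width` chars either side."""
    m = re.search(pattern, log_text)
    if m is None:
        return None
    start = max(m.start() - width, 0)
    end = min(m.end() + width, len(log_text))
    # Mark the matched text with [[ ]] so an odd match stands out.
    return (log_text[start:m.start()]
            + "[[" + m.group(0) + "]]"
            + log_text[m.end():end])
```

Only the first match is shown here; a real display might list every
match, which is a design choice this sketch leaves open.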

But I'm just spouting ideas here, the proof is in the pudding. If the
logs are easily available (or a subset of, say the last month) then
people could play with that and see what happens...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 17:37:26
Message-ID: 15461.1174412246@sss.pgh.pa.us
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> But I'm just sprouting ideas here, the proof is in the pudding. If the
> logs are easily available (or a subset of, say the last month) then
> people could play with that and see what happens...

Anyone who wants to play around can replicate what I did, which was to
download the table that Andrew made available upthread, and then pull
the log files matching interesting rows. I used the attached functions
to generate URLs for the failing stage logs, and then a shell script
looping over lwp-download ...

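-- Build the relative URL of the last stage log for one mfailures row.
CREATE FUNCTION lastfile(mfailures) RETURNS text
AS $$
  select replace(
    'show_stage_log.pl?nm=' || $1.sysname || '&dt=' || $1.snapshot ||
    '&stg=' ||
    -- last entry of log_archive_filenames, minus its '.log' suffix
    replace($1.log_archive_filenames[array_upper($1.log_archive_filenames, 1)],
            '.log', ''),
    ' ', '%20')  -- escape any spaces so the result is usable as a URL
$$
LANGUAGE sql;

-- Prefix the buildfarm CGI base to get the full URL.
CREATE FUNCTION lastlog(mfailures) RETURNS text
AS $$
  select 'http://www.pgbuildfarm.org/cgi-bin/' || lastfile($1)
$$
LANGUAGE sql;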
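
For anyone working from an exported copy of the rows rather than inside the
database, the same URL construction can be sketched in Python. This is only a
port of the two functions above under assumptions: the row fields are passed
as plain arguments, the snapshot is already a string, and fetch_logs is a
hypothetical stand-in for the lwp-download loop (its output file naming is
made up for illustration).

```python
import os
import urllib.request

BASE_URL = "http://www.pgbuildfarm.org/cgi-bin/"


def lastfile(sysname, snapshot, log_archive_filenames):
    """Relative URL of the last stage log, or None if there are no log files."""
    if not log_archive_filenames:
        return None  # mirrors the SQL version returning NULL for an empty array
    last_stage = log_archive_filenames[-1].replace(".log", "")
    path = f"show_stage_log.pl?nm={sysname}&dt={snapshot}&stg={last_stage}"
    return path.replace(" ", "%20")  # same space escaping as the SQL replace()


def lastlog(sysname, snapshot, log_archive_filenames):
    """Full URL of the last stage log, or None if there are no log files."""
    rel = lastfile(sysname, snapshot, log_archive_filenames)
    return None if rel is None else BASE_URL + rel


def fetch_logs(urls, dest_dir):
    """Download each URL into dest_dir; hypothetical stand-in for lwp-download."""
    os.makedirs(dest_dir, exist_ok=True)
    for i, url in enumerate(urls):
        urllib.request.urlretrieve(url, os.path.join(dest_dir, f"stage_{i}.log"))


if __name__ == "__main__":
    # Hypothetical row values, for illustration only.
    print(lastlog("examplehost", "2007-03-15 10:30:01",
                  ["configure.log", "make.log", "check.log"]))
```

With the hypothetical row shown, this prints
http://www.pgbuildfarm.org/cgi-bin/show_stage_log.pl?nm=examplehost&dt=2007-03-15%2010:30:01&stg=check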

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 18:46:01
Message-ID: 46002BE9.2070202@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>
>> But I'm just sprouting ideas here, the proof is in the pudding. If the
>> logs are easily available (or a subset of, say the last month) then
>> people could play with that and see what happens...
>>
>
> Anyone who wants to play around can replicate what I did, which was to
> download the table that Andrew made available upthread, and then pull
> the log files matching interesting rows.
>
[snip]

To save people this trouble, I have made an extract for the last 3
months, augmented by a log field, which is pretty much the last stage log.
The dump is 27 MB and can be got at

http://www.pgbuildfarm.org/tfailures.dmp

cheers

andrew


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 19:00:15
Message-ID: 46002F3F.3030501@commandprompt.com
Lists: pgsql-hackers

Andrew Dunstan wrote:
> Tom Lane wrote:
>> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>>
>>> But I'm just sprouting ideas here, the proof is in the pudding. If the
>>> logs are easily available (or a subset of, say the last month) then
>>> people could play with that and see what happens...
>>>
>>
>> Anyone who wants to play around can replicate what I did, which was to
>> download the table that Andrew made available upthread, and then pull
>> the log files matching interesting rows.
> [snip]
>
>
> To save people this trouble, I have made an extract for the last 3
> months, augmented by log field, which is pretty much the last stage log.
> The dump is 27Mb and can be got at
>
> http://www.pgbuildfarm.org/tfailures.dmp

Should we just automate this and make it a weekly?

>
> cheers
>
> andrew
>
>
>

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm feature request: some way to track/classify failures
Date: 2007-03-20 19:16:13
Message-ID: 460032FD.8040503@dunslane.net
Lists: pgsql-hackers

Joshua D. Drake wrote:
> Andrew Dunstan wrote:
>
>> Tom Lane wrote:
>>
>>> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>>>
>>>
>>>> But I'm just sprouting ideas here, the proof is in the pudding. If the
>>>> logs are easily available (or a subset of, say the last month) then
>>>> people could play with that and see what happens...
>>>>
>>>>
>>> Anyone who wants to play around can replicate what I did, which was to
>>> download the table that Andrew made available upthread, and then pull
>>> the log files matching interesting rows.
>>>
>> [snip]
>>
>>
>> To save people this trouble, I have made an extract for the last 3
>> months, augmented by log field, which is pretty much the last stage log.
>> The dump is 27Mb and can be got at
>>
>> http://www.pgbuildfarm.org/tfailures.dmp
>>
>
> Should we just automate this and make it a weekly?
>
>

Sure. Talk to me offline about it - very simple to do.

cheers

andrew