BUG #4692: VACUUM: write to WAL gets very slow and seems redundant

From: "Peter Much" <pmc(at)citylink(dot)dinoex(dot)sub(dot)org>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #4692: VACUUM: write to WAL gets very slow and seems redundant
Date: 2009-03-04 16:28:41
Message-ID: 200903041628.n24GSfHL064596@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


The following bug has been logged online:

Bug reference: 4692
Logged by: Peter Much
Email address: pmc(at)citylink(dot)dinoex(dot)sub(dot)org
PostgreSQL version: 8.2.7
Operating system: FreeBSD 6.3
Description: VACUUM: write to WAL gets very slow and seems redundant
Details:

While most of the time the system does run very fine, sometimes (i perceived
it only during VACUUM) the throughput breaks down dramatically: while
normally the VACUUM fills multiple WAL logs per minute, it then takes 10
minutes to fill each WAL log, while the system is rather idle:

Normal:
Mar 4 03:18:29 <local0.info> edge postgres[1781]: [11-1] :[] LOG: archived
transaction log file "000000020000004200000097"
Mar 4 03:19:09 <local0.info> edge postgres[1781]: [12-1] :[] LOG: archived
transaction log file "000000020000004200000098"
Mar 4 03:20:26 <local0.info> edge postgres[1781]: [13-1] :[] LOG: archived
transaction log file "000000020000004200000099"
Mar 4 03:21:11 <local0.info> edge postgres[1781]: [14-1] :[] LOG: archived
transaction log file "00000002000000420000009A"

Defect:
Mar 4 10:16:16 <local0.info> edge postgres[1781]: [84-1] :[] LOG: archived
transaction log file "0000000200000042000000E0"
Mar 4 10:25:19 <local0.info> edge postgres[1781]: [85-1] :[] LOG: archived
transaction log file "0000000200000042000000E1"
Mar 4 10:37:20 <local0.info> edge postgres[1781]: [86-1] :[] LOG: archived
transaction log file "0000000200000042000000E2"
Mar 4 10:44:04 <local0.info> edge postgres[1781]: [87-1] :[] LOG: archived
transaction log file "0000000200000042000000E3"

Looking closer at this.
1.) The only process being active on the system is the VACUUM (FULL) task:
6322 ?? Ds 17:16.52 postgres: pgsql bacula 192.168.98.6(50283) VACUUM
(po

2.) the system load is near idle.

3.) "top" shows the vacuum process waiting in "bo_waa" (or "biowr") nearly
all the time:
PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
6322 pgsql 1 -5 0 76596K 44324K bo_wwa 17:22 1.03% postgres

4.) The disk with the WAL logs is 98% busy, and the only file that is read
or written there is the current WAL log:
[root(at)edge /var/pgsql/data]# find . -type f -exec ls -lu {} \; | grep
"Mar 4 12:4"
[root(at)edge /var/pgsql/data]# find . -type f -exec ls -l {} \; | grep "Mar
4 12:4"
-rw------- 1 pgsql pgsql 16777216 Mar 4 12:41
./pg_xlog/0000000200000042000000F0

5.) lsof shows the file-pointer in the WAL log proceeding:
[root(at)edge /var/pgsql/data]# lsof -o pg_xlog/0000000200000042000000F0 ;
sleep 10; lsof -o pg_xlog/0000000200000042000000F0
COMMAND PID USER FD TYPE DEVICE OFFSET NODE NAME
postgres 1780 pgsql 9u VREG 0,167 0t8585216 14344
pg_xlog/0000000200000042000000F0
postgres 6322 pgsql 11u VREG 0,167 0t14090240 14344
pg_xlog/0000000200000042000000F0
COMMAND PID USER FD TYPE DEVICE OFFSET NODE NAME
postgres 1780 pgsql 9u VREG 0,167 0t8585216 14344
pg_xlog/0000000200000042000000F0
postgres 6322 pgsql 11u VREG 0,167 0t14524416 14344
pg_xlog/0000000200000042000000F0

So, during 10 seconds the 6322 process has proceeded 434176 bytes, giving 43
kBps.

6.) iostat shows the WAL disk processing ~60 accesses per second, 16kB each
(while the database disk does nearly nothing):
WAL
KB/t tps MB/s
16.70 69 1.12
16.76 63 1.03
16.00 45 0.70
16.00 58 0.90
16.71 68 1.10

This shows about 1000 kBps.
iostat does not show me which process does this activity, but since the
VACUUM process 6322 is the only active process, and since this process is in
"biowr" state all the time, the activity could hardly come from an other
process.

Evaluating, we have ~1000 kBps thruput but only 43 kBps write advance on the
WAL log, this gives about factor 1:20. And from the logs we see: about
30secs per WAL log in normal operation, and about 10mins in the defect
state, this is also 1:20.

So it seems I am looking at the right thing, and there is no problem on the
system. The only question is: what the **** is this VACUUM process doing?

When killing the 6322 process, the next task behaves normal:
WAL DB
KB/t tps MB/s KB/t tps MB/s
82.67 3 0.24 22.79 197 4.39
31.25 8 0.24 21.16 154 3.19
26.00 8 0.20 21.43 164 3.44
44.80 5 0.22 24.44 143 3.42
22.40 5 0.11 23.85 162 3.78
24.00 4 0.09 22.45 195 4.28

The behaviour occurs only occasionally; as we can see from the postgres log
above, at 3AM the VACUUM was running fine, only later it started with this
behaviour - for no apparent reason.

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2009-03-04 16:49:17 Re: BUG #4690: an select query is not using the index
Previous Message Gregory Clark 2009-03-04 15:42:50 BUG #4691: Installation error