Re: WAL Rate Limiting

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL Rate Limiting
Date: 2014-02-19 16:04:19
Message-ID: CA+Tgmob6Ha+sBHaXEbR8+Y67i=aTyG1FaV-WEtJ7Pv+td8DtnA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 19, 2014 at 8:28 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Mon, Jan 20, 2014 at 5:37 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
>> Agreed; that was the original plan, but implementation delays
>> prevented the whole vision/discussion/implementation. Requirements
>> from various areas include WAL rate limiting for replication, I/O rate
>> limiting, hard CPU and I/O limits for security and mixed workload
>> coexistence.
>>
>> I'd still like to get something on this in 9.4 that alleviates the
>> replication issues, leaving wider changes for later releases.
>
> My first reaction was that we should just have generic I/O resource
> throttling. I was only convinced this was a reasonable idea by the
> replication use case. It would help me to understand the specific
> situations where replication breaks down due to WAL bandwidth
> starvation. Heroku has had some problems with slaves falling behind,
> though the immediate problem that causes is the slave filling up its
> disk, which we could solve more directly by switching to archive mode
> rather than slowing down the master.
>
> But I would suggest you focus on a specific use case that's
> problematic so we can judge better if the implementation is really
> fixing it.
>
>> The vacuum_* parameters don't allow any control over WAL production,
>> which is often the limiting factor. I could, for example, introduce a
>> new parameter for vacuum_cost_delay that provides a weighting for each
>> new BLCKSZ chunk of WAL, then rename all parameters to a more general
>> form. Or I could forget that and just press ahead with the patch as
>> is, providing a cleaner interface in next release.
>>
>>> It's also interesting to wonder about the relationship to
>>> CHECK_FOR_INTERRUPTS --- although I think that currently, we assume
>>> that that's *cheap* (1 test and branch) as long as nothing is pending.
>>> I don't want to see a bunch of arithmetic added to it.
>>
>> Good point.
>
> I think it should be possible to actually merge it into
> CHECK_FOR_INTERRUPTS. Have a single global flag
> io_done_since_check_for_interrupts which is set to 0 after each
> CHECK_FOR_INTERRUPTS and set to 1 whenever any WAL is written. Then
> CHECK_FOR_INTERRUPTS turns into two tests and branches instead of one
> in the normal case.
>
> In fact you could do all the arithmetic when you do the WAL write.
> Only set the flag if the bandwidth consumed is above the budget. Then
> the flag should only ever be set when you're about to sleep.
>
> I would dearly love to see generic I/O bandwidth limits, so it would
> be nice to see a nicely general pattern here that could be extended
> even if we only target WAL this release.
>
> I'm going to read the existing patch now. Do you think it's ready to
> go or did you want to do more work based on the feedback?
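
For reference, the flag scheme described in the quoted text above could look
roughly like the following. This is only a minimal sketch under stated
assumptions, not code from the patch or from PostgreSQL: every identifier
(interrupt_pending, wal_throttle_pending, account_wal_write,
CHECK_FOR_INTERRUPTS_SKETCH, the 8192-byte budget) is hypothetical, and the
real sleep is replaced by a counter so the logic can be exercised in
isolation. All arithmetic happens at WAL-write time; the check itself is two
tests and branches.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical illustrations, not PostgreSQL symbols. */

/* Stand-in for the existing "something is pending" test. */
static volatile bool interrupt_pending = false;
/* The proposed second flag: set only once the WAL budget is exceeded. */
static volatile bool wal_throttle_pending = false;

static uint64_t wal_bytes_since_sleep = 0;  /* WAL written since last throttle */
static uint64_t wal_budget_bytes = 8192;    /* assumed budget per interval */
static int      throttle_sleeps = 0;        /* counts where a sleep would occur */

/* Called at WAL-write time: do the arithmetic here, not in the check. */
static void
account_wal_write(uint64_t nbytes)
{
    wal_bytes_since_sleep += nbytes;
    if (wal_bytes_since_sleep > wal_budget_bytes)
        wal_throttle_pending = true;
}

/* Slow path: a real implementation would sleep here; we only count. */
static void
process_throttle(void)
{
    throttle_sleeps++;
    wal_bytes_since_sleep = 0;
    wal_throttle_pending = false;
}

/* Slow path for ordinary interrupts; just clears the flag in this sketch. */
static void
process_interrupts(void)
{
    interrupt_pending = false;
}

/* Fast path: two tests and branches when nothing is pending. */
#define CHECK_FOR_INTERRUPTS_SKETCH() \
    do { \
        if (interrupt_pending) \
            process_interrupts(); \
        if (wal_throttle_pending) \
            process_throttle(); \
    } while (0)
```

With these assumed values, writing 4096 bytes leaves the flag clear, and a
further 8192 bytes pushes the total over budget, so the next check takes the
slow path once and resets the counter.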

Well, *I* don't think this is ready to go. A WAL rate limit that only
limits WAL sometimes still doesn't impress me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company