From: Mel Gorman <mgorman(at)suse(dot)de>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>
Subject: Re: Linux kernel impact on PostgreSQL performance
Date: 2014-01-15 10:08:44
Message-ID: 20140115100844.GG4963@suse.de
Lists: pgsql-hackers

On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote:
> > > What's not so simple, is figuring out what policy to use. Remember,
> > > you cannot tell the kernel to put some page in its page cache without
> > > reading it or writing it. So, once you make the kernel forget a page,
> > > evicting it from shared buffers becomes quite expensive.
> >
> > posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
> > forcing readahead.
>
>
> But telling the kernel to forget a page, then telling it to read it in
> again from disk because it might be needed again in the near future is
> itself very expensive. We would need to hand the page to the kernel so it
> has it without needing to go to disk to get it.
>

Yes, this is the unnecessary IO cost I was thinking of.
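
To make that cost concrete, here is a minimal sketch of the round trip
using the real posix_fadvise() interface (the path and length are just
placeholders): dropping a range with POSIX_FADV_DONTNEED and then asking
for it back with POSIX_FADV_WILLNEED turns what could have been a cache
hit into real read IO.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/tmp/datafile", O_RDONLY);  /* placeholder path */

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Drop the first 1MB from the page cache.  Only clean pages are
         * discarded; dirty pages must be written back first. */
        posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_DONTNEED);

        /* If the eviction turns out to have been premature, this WILLNEED
         * readahead must go back to disk for data that was resident a
         * moment ago -- the unnecessary IO cost in question. */
        posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_WILLNEED);

        close(fd);
        return 0;
}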

>
> > If you evict it prematurely then you do get kinda
> > screwed because you pay the IO cost to read it back in again even if you
> > had enough memory to cache it. Maybe this is the type of kernel-postgres
> > interaction that is annoying you.
> >
> > If you don't evict, the kernel eventually steps in and evicts the wrong
> > thing. If you do evict and it was unnecessary, you pay an IO cost.
> >
> > That could be something we look at. There are cases buried deep in the
> > VM where pages get shuffled to the end of the LRU and get tagged for
> > reclaim as soon as possible. Maybe you need access to something like
> > that via posix_fadvise to say "reclaim this page if you need memory but
> > leave it resident if there is no memory pressure" or something similar.
> > Not exactly sure what that interface would look like or offhand how it
> > could be reliably implemented.
> >
>
> I think the "reclaim this page if you need memory but leave it resident if
> there is no memory pressure" hint would be more useful for temporary
> working files than for what was being discussed above (shared buffers).
> When I do work that needs large temporary files, I often see physical
> write IO spike but physical read IO does not. I interpret that to mean
> that the temporary data is being written to disk to satisfy either
> dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
> cache and so disk reads are not needed to satisfy it. So a hint that says
> "this file will never be fsynced so please ignore dirty_*bytes and
> dirty_expire_centisecs.

It would be good to know if dirty_expire_centisecs or dirty ratio|bytes
were the problem here. An interface that forces a dirty page to stay dirty
regardless of global system state would be a major hazard. It potentially
allows the creator of the temporary file to stall all other processes
dirtying pages for an unbounded period of time. I proposed in another part
of the thread a hint for open inodes to have the background writer thread
ignore dirty pages belonging to that inode. Dirty limits and fsync would
still be obeyed. It might also be workable for temporary files but the
proposal could be full of holes.
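
For illustration only, a sketch of what such a hint might look like from
userspace. F_SET_NOWRITEBACK is entirely hypothetical -- no such fcntl
command exists in any kernel -- it just gives the proposal above a
concrete shape:

#include <fcntl.h>

/* HYPOTHETICAL: no kernel implements this command.  The intent is that
 * background writeback skips dirty pages belonging to this inode, while
 * dirty limits and an explicit fsync() are still obeyed. */
#define F_SET_NOWRITEBACK       1024

static void hint_temp_file(int fd)
{
        /* Tell the kernel this file will be deleted, not fsynced, so
         * there is no point writing its pages back in the background. */
        fcntl(fd, F_SET_NOWRITEBACK, 1);
}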

Your alternative here is to create a private anonymous mapping, as such
mappings are not subject to dirty limits. This is only a sensible option
if the temporary data is guaranteed to be relatively small. If the shared
buffers, page cache and your temporary data exceed the size of RAM then
data will get discarded or your temporary data will get pushed to swap
and performance will hit the floor.
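
A minimal sketch of that alternative (the 64MB size is an arbitrary
stand-in for "relatively small"):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64UL << 20;    /* arbitrary size for illustration */

        /* Private anonymous memory is not subject to the dirty_* limits,
         * but under memory pressure it is pushed to swap rather than
         * cheaply discarded, so this only pays off while everything
         * still fits in RAM. */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* ... use buf as scratch space for the temporary data ... */

        munmap(buf, len);
        return 0;
}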

FWIW, the performance of some IO "benchmarks" used to depend on whether they
could create, write and delete files before any of the data actually hit
the disk -- pretty much exactly the type of behaviour you are looking for.

--
Mel Gorman
SUSE Labs
