Re: Why we are going to have to go DirectIO

Lists: pgsql-hackers
From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Why we are going to have to go DirectIO
Date: 2013-12-03 18:44:15
Message-ID: 529E267F.4050700@agliodbs.com

All,

https://lkml.org/lkml/2013/11/24/133

What this means for us:

http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data

It seems clear that Kernel.org, since 2.6, has been in the business of
pushing major, hackish, changes to the IO stack without testing them or
even thinking too hard about what the side-effects might be. This is
perhaps unsurprising given that two of the largest sponsors of the
Kernel -- who, incidentally, do 100% of the performance testing -- don't
use the IO stack.

This says to me that Linux will clearly be an undependable platform in
the future with the potential to destroy PostgreSQL performance without
warning, leaving us scrambling for workarounds. Too bad the
alternatives are so unpopular.

I don't know where we'll get the resources to implement our own storage,
but it's looking like we don't have a choice.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 18:53:11
Message-ID: CA+TgmoaTJdCOSrh_ch9O-Z14KQXFW_gug2CoHvUU3ovaSoo3Rw@mail.gmail.com

On Tue, Dec 3, 2013 at 1:44 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> All,
>
> https://lkml.org/lkml/2013/11/24/133
>
> What this means for us:
>
> http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data
>
> It seems clear that Kernel.org, since 2.6, has been in the business of
> pushing major, hackish, changes to the IO stack without testing them or
> even thinking too hard about what the side-effects might be. This is
> perhaps unsurprising given that two of the largest sponsors of the
> Kernel -- who, incidentally, do 100% of the performance testing -- don't
> use the IO stack.
>
> This says to me that Linux will clearly be an undependable platform in
> the future with the potential to destroy PostgreSQL performance without
> warning, leaving us scrambling for workarounds. Too bad the
> alternatives are so unpopular.
>
> I don't know where we'll get the resources to implement our own storage,
> but it's looking like we don't have a choice.

This seems like a strange reaction to an article that's mostly about
how Linux is now *fixing* a problem that could cause PostgreSQL to
experience performance problems. I agree that we'll probably
eventually need to implement our own storage layer, but this article
isn't evidence of urgency so far as I can determine on first
read-through.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 18:59:38
Message-ID: 529E2A1A.7090202@commandprompt.com


On 12/03/2013 10:44 AM, Josh Berkus wrote:
>
> All,
>
> https://lkml.org/lkml/2013/11/24/133
>
> What this means for us:
>
> http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data
>
> It seems clear that Kernel.org, since 2.6, has been in the business of
> pushing major, hackish, changes to the IO stack without testing them or
> even thinking too hard about what the side-effects might be. This is
> perhaps unsurprising given that two of the largest sponsors of the
> Kernel -- who, incidentally, do 100% of the performance testing -- don't
> use the IO stack.
>
> This says to me that Linux will clearly be an undependable platform in
> the future with the potential to destroy PostgreSQL performance without
> warning, leaving us scrambling for workarounds. Too bad the
> alternatives are so unpopular.
>
> I don't know where we'll get the resources to implement our own storage,
> but it's looking like we don't have a choice.
>

This seems rather half cocked. I read the article. They found a problem,
that really will only affect a reasonably small percentage of users,
created a test case, reported it, and a patch was produced.

Kind of like how we do it.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 19:23:45
Message-ID: 529E2FC1.7030303@agliodbs.com

On 12/03/2013 10:59 AM, Joshua D. Drake wrote:
> This seems rather half cocked. I read the article. They found a problem,
> that really will only affect a reasonably small percentage of users,
> created a test case, reported it, and a patch was produced.

"Users with at least one file bigger than 50% of RAM" is unlikely to be
a small percentage.

>
> Kind of like how we do it.

I like to think we'd have at least researched the existing literature on
2Q algorithms (which is extensive) before implementing our own. Oh,
wait, we *did*. Nor is this the first ill-considered performance hack
pushed into mainline kernels without any real testing. It's not even
the first *that year*.

While I am angry over this -- no matter what Kernel.org fixes now, we're
going to have to live with their mistake for 3 years -- the DirectIO
thing isn't just me; when I've gone to Linux Kernel events to talk about
IO, that's the response I've gotten from most Linux hackers: "you
shouldn't be using the filesystem, use DirectIO and implement your own
storage."

That's why they don't feel that it's a problem to break the IO stack;
they really don't believe that anyone who cares about performance should
be using it.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 20:15:08
Message-ID: 529E3BCC.5080402@kaltenbrunner.cc

On 12/03/2013 08:23 PM, Josh Berkus wrote:
> On 12/03/2013 10:59 AM, Joshua D. Drake wrote:
>> This seems rather half cocked. I read the article. They found a problem,
>> that really will only affect a reasonably small percentage of users,
>> created a test case, reported it, and a patch was produced.
>
> "Users with at least one file bigger than 50% of RAM" is unlikely to be
> a small percentage.
>
>>
>> Kind of like how we do it.
>
> I like to think we'd have at least researched the existing literature on
> 2Q algorithms (which is extensive) before implementing our own. Oh,
> wait, we *did*. Nor is this the first ill-considered performance hack
> pushed into mainline kernels without any real testing. It's not even
> the first *that year*.
>
> While I am angry over this -- no matter what Kernel.org fixes now, we're
> going to have to live with their mistake for 3 years -- the DirectIO
> thing isn't just me; when I've gone to Linux Kernel events to talk about
> IO, that's the response I've gotten from most Linux hackers: "you
> shouldn't be using the filesystem, use DirectIO and implement your own
> storage."
>
> That's why they don't feel that it's a problem to break the IO stack;
> they really don't believe that anyone who cares about performance should
> be using it.

Reading that article, I think this is an overreaction. It is not
kernel.org's fault that distributions exist, and bugs and regressions
happen in all pieces of software.

We are in no way different, and I would like to note that we do not have
any form of sensible performance-related regression testing either.
I would even argue that there is a ton more regression testing (be it
performance or otherwise) going into the Linux kernel (even on a
relative scale) than into PostgreSQL, and we are pointing the finger at
something they are dealing with once noticed.
If we care about our performance on various operating systems, it is
_OUR_ responsibility to track that closely, in an automated fashion, and
report back. Only if that feedback loop fails to work are we actually in a
real position to consider something as drastic as calling a platform
"undependable" or looking into alternatives (like directIO).

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 20:35:06
Message-ID: 27536.1386102906@sss.pgh.pa.us

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> If we care about our performance on various operating systems it is
> _OUR_ responsibility to track that closely and automated and report back
> and only if that feedback loop fails to work we are actually in a real
> position to consider something as drastic as considering a platform
> "undependable" or looking into alternatives (like directIO).

+1. I fail to understand why anyone would think it's a good idea for us
to build our own I/O stack. The resources that would be consumed by that
would probably be enough to sink the project, or at least ensure that we
made no progress on any other aspect of the system for a good long time.
(And I'm just talking development, never mind maintenance.)

Far better to invest some effort in providing decent feedback to the
platforms we depend on.

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 21:42:01
Message-ID: 529E5029.9000407@commandprompt.com


On 12/03/2013 12:35 PM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> If we care about our performance on various operating systems it is
>> _OUR_ responsibility to track that closely and automated and report back
>> and only if that feedback loop fails to work we are actually in a real
>> position to consider something as drastic as considering a platform
>> "undependable" or looking into alternatives (like directIO).
>
> +1. I fail to understand why anyone would think it's a good idea for us
> to build our own I/O stack. The resources that would be consumed by that
> would probably be enough to sink the project, or at least ensure that we
> made no progress on any other aspect of the system for a good long time.
> (And I'm just talking development, never mind maintenance.)
>
> Far better to invest some effort in providing decent feedback to the
> platforms we depend on.

Although I am on the same page as Tom and Stefan here, I can certainly
understand Josh's frustration. When you see things like the ext4 bugs or
the recent long slew of performance-related issues involving
pdflush, it is enough to make consultants very frustrated with the likes
of Ubuntu and Debian. I would say Red Hat too except they learned their
lesson back in the kernel 2.4 days.

Sincerely,

Joshua D. Drake

>
> regards, tom lane
>
>

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 22:44:13
Message-ID: 529E5EBD.1060701@agliodbs.com

On 12/03/2013 12:15 PM, Stefan Kaltenbrunner wrote:
> We are in no way different and I would like to note that we do not have
> any form of sensible performance related regression testing either.
> I would even argue that there is ton more regression testing (be it
> performance or otherwise) going into the linux kernel (even on a
> relative scale) than we do and pointing the finger at something they are
> dealing with once noticed.
> If we care about our performance on various operating systems it is
> _OUR_ responsibility to track that closely and automated and report back
> and only if that feedback loop fails to work we are actually in a real
> position to consider something as drastic as considering a platform
> "undependable" or looking into alternatives (like directIO).

Would certainly be nice. Realistically, getting good automated
performance tests will require paying someone like Greg S., Mark or me
for 6 solid months to develop them, since worthwhile open source
performance test platforms currently don't exist. That money has never
been available; maybe I should do a kickstarter.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 23:02:12
Message-ID: CABUevEy3cU54rSUSwCO+cZrqB8kthEz7Gbnmyo1-S_739qqvWA@mail.gmail.com

On Tue, Dec 3, 2013 at 11:44 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> On 12/03/2013 12:15 PM, Stefan Kaltenbrunner wrote:
> > We are in no way different and I would like to note that we do not have
> > any form of sensible performance related regression testing either.
> > I would even argue that there is ton more regression testing (be it
> > performance or otherwise) going into the linux kernel (even on a
> > relative scale) than we do and pointing the finger at something they are
> > dealing with once noticed.
> > If we care about our performance on various operating systems it is
> > _OUR_ responsibility to track that closely and automated and report back
> > and only if that feedback loop fails to work we are actually in a real
> > position to consider something as drastic as considering a platform
> > "undependable" or looking into alternatives (like directIO).
>
> Would certainly be nice. Realistically, getting good automated
> performance tests will require paying someone like Greg S., Mark or me
> for 6 solid months to develop them, since worthwhile open source
> performance test platforms currently don't exist. That money has never
> been available; maybe I should do a kickstarter.
>
>
So in order to get *testing* we need to pay somebody. But to build a great
database server, we can rely on volunteer efforts or sponsorship from
companies who are interested in moving the project forward? That hardly
seems right... Either it's just not high enough on people's priority lists
(in which case you're not likely to get anybody to actually pay for it
either), or there is some other reason why people just don't care. Figuring
that out would probably be a pre-requisite to get it done. But sure - I'm
all for trying a kickstarter. Did anybody ever try that for an actual
postgres feature? Didn't JD and/or cmd and/or pgus at some point try
something like that?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 23:08:11
Message-ID: 31495.1386112091@sss.pgh.pa.us

Magnus Hagander <magnus(at)hagander(dot)net> writes:
> On Tue, Dec 3, 2013 at 11:44 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Would certainly be nice. Realistically, getting good automated
>> performance tests will require paying someone like Greg S., Mark or me
>> for 6 solid months to develop them, since worthwhile open source
>> performance test platforms currently don't exist. That money has never
>> been available; maybe I should do a kickstarter.

> So in order to get *testing* we need to pay somebody. But to build a great
> database server, we can rely on volunteer efforts or sponsorship from
> companies who are interested in moving the project forward?

And even more to the point, volunteers to reinvent the kernel I/O stack
can be found on every street corner? And those volunteers won't need any
test scaffolding to be sure that *their* version never has performance
regressions? (Well, no, they won't, because no such thing will ever be
built. But we do need better test scaffolding for real problems.)

regards, tom lane


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 23:14:07
Message-ID: 20131203231407.GK5158@eldon.alvh.no-ip.org

Magnus Hagander wrote:
> On Tue, Dec 3, 2013 at 11:44 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> > Would certainly be nice. Realistically, getting good automated
> > performance tests will require paying someone like Greg S., Mark or me
> > for 6 solid months to develop them, since worthwhile open source
> > performance test platforms currently don't exist. That money has never
> > been available; maybe I should do a kickstarter.
>
> So in order to get *testing* we need to pay somebody. But to build a great
> database server, we can rely on volunteer efforts or sponsorship from
> companies who are interested in moving the project forward?

The reason for this is obvious. You cannot just give the responsibility
of creating a good testing framework to any random guy you just found on
the internets. It has to be an expert, you see.

> But sure - I'm all for trying a kickstarter. Did anybody ever try that
> for an actual postgres feature? Didn't JD and/or cmd and/or pgus at
> some point try something like that?

Hmm, I vaguely recall at CMD there was an attempt to work on a feature
paid via crowd-funding, but I don't recall if we got there or not. In a
way, the foreign key locking patch was done that way, but it wasn't true
crowd-funding but multiple companies sponsoring (at least partly,
because we spent so much more on it than we initially thought we would).

Not quite the same thing, I guess; but then if you get several companies
to each put larger amounts of money than an individual would, I guess
your kickstarter might also succeed.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 23:15:00
Message-ID: 529E65F4.3020706@agliodbs.com

Magnus,

> So in order to get *testing* we need to pay somebody. But to build a great
> database server, we can rely on volunteer efforts or sponsorship from
> companies who are interested in moving the project forward? That hardly
> seems right... Either it's just not high enough on people's priority lists
> (in which case you're not likely to get anybody to actually pay for it
> either), or there is some other reason why people just don't care.

It's *always* much easier to get money for features than for other
things. Earlier this year I was really hoping that our new corporate
community members, who seemed to be interested in testing, would put
some serious resources behind this. When pressed, however, they did
what everyone does -- pass and hope that someone else will pay for it.
Huawei staff at least did add a bunch of regression tests, which was
great, but it's a fraction of the work we need for a more comprehensive
testing infrastructure. I got this pretty quickly when Andrew and I led
the session at the unconference. Everybody wanted better testing, but
they all wanted someone else to foot the bill.

We also have the issue that many folks on this list think that testing
isn't important, which further discourages anyone from committing their
own time. But even if the enthusiasm for testing was universal, I think
that we'd need to find money for someone.

I don't think this is prohibitive, though; we do very little fundraising
in this community, and if a testing project had official community
endorsement, I think it would be relatively easy to raise money for it.
Provided that we avoid bikeshedding it to death, of course.

> Figuring
> that out would probably be a pre-requisite to get it done. But sure - I'm
> all for trying a kickstarter. Did anybody ever try that for an actual
> postgres feature? Didn't JD and/or cmd and/or pgus at some point try
> something like that?

CMD raised money for the FK locks feature. 2Q has raised money for
several features. So has PGX.

I'd rather do the testing thing as a community thing, though, which
means raising non-profit money and having an open bid process for the
person to do the work. I think we could raise more money that way, and
are likely to get a better end result.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-03 23:37:12
Message-ID: 529E6B28.9060209@commandprompt.com


On 12/03/2013 03:02 PM, Magnus Hagander wrote:
> On Tue, Dec 3, 2013 at 11:44 PM, Josh Berkus <josh(at)agliodbs(dot)com

> Would certainly be nice. Realistically, getting good automated
> performance tests will require paying someone like Greg S., Mark or me
> for 6 solid months to develop them, since worthwhile open source
> performance test platforms currently don't exist. That money has never
> been available; maybe I should do a kickstarter.
>
>
> So in order to get *testing* we need to pay somebody. But to build a
> great database server, we can rely on volunteer efforts or sponsorship
> from companies who are interested in moving the project forward? That
> hardly seems right... Either it's just not high enough on people's
> priority lists (in which case you're not likely to get anybody to
> actually pay for it either), or there is some other reason why people
> just don't care. Figuring that out would probably be a pre-requisite to
> get it done. But sure - I'm all for trying a kickstarter. Did anybody
> ever try that for an actual postgres feature? Didn't JD and/or cmd
> and/or pgus at some point try something like that?

We launched our own effort, called FOSSExperts. It wasn't
that successful, but it is what started the foreign keys project. That
said, something like KickStarter (or one of the other ones) would have a
lot more exposure.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 00:05:52
Message-ID: 529E71E0.4050002@commandprompt.com


On 12/03/2013 03:15 PM, Josh Berkus wrote:

> It's *always* much easier to get money for features than for other
> things. Earlier this year I was really hoping that our new corporate
> community members, who seemed to be interested in testing, would put
> some serious resources behind this. When pressed, however, they did
> what everyone does -- pass and hope that someone else will pay for it.
> Huawei staff at least did add a bunch of regression tests, which was
> great, but it's a fraction of the work we need for a more comprehensive
> testing infrastructure. I got this pretty quickly when Andrew and I led
> the session at the unconference. Everybody wanted better testing, but
> they all wanted someone else to foot the bill.

+1

I have talked to many a customer and community member about various
things like testing and the overwhelming response is: **crickets** .

It isn't hard to get someone to pay for something that has very little
tangible return, such as a conference sponsorship, but to pay for
something that helps the entire community? **crickets**

There are exceptions of course, but by and large that is my experience.

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: magnus(at)hagander(dot)net, josh(at)agliodbs(dot)com, stefan(at)kaltenbrunner(dot)cc, jd(at)commandprompt(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 02:28:29
Message-ID: 20131204.112829.2284665416860121658.t-ishii@sraoss.co.jp

> Magnus Hagander <magnus(at)hagander(dot)net> writes:
>> On Tue, Dec 3, 2013 at 11:44 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> Would certainly be nice. Realistically, getting good automated
>>> performance tests will require paying someone like Greg S., Mark or me
>>> for 6 solid months to develop them, since worthwhile open source
>>> performance test platforms currently don't exist. That money has never
>>> been available; maybe I should do a kickstarter.
>
>> So in order to get *testing* we need to pay somebody. But to build a great
>> database server, we can rely on volunteer efforts or sponsorship from
>> companies who are interested in moving the project forward?
>
> And even more to the point, volunteers to reinvent the kernel I/O stack
> can be found on every street corner? And those volunteers won't need any
> test scaffolding to be sure that *their* version never has performance
> regressions? (Well, no, they won't, because no such thing will ever be
> built. But we do need better test scaffolding for real problems.)

Can we avoid the Linux kernel problem by simply increasing our shared
buffer size, say up to 80% of memory?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>, tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: magnus(at)hagander(dot)net, josh(at)agliodbs(dot)com, stefan(at)kaltenbrunner(dot)cc, jd(at)commandprompt(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 04:08:38
Message-ID: 529EAAC6.1010506@lab.ntt.co.jp

(2013/12/04 11:28), Tatsuo Ishii wrote:
>> Magnus Hagander <magnus(at)hagander(dot)net> writes:
>>> On Tue, Dec 3, 2013 at 11:44 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>>> Would certainly be nice. Realistically, getting good automated
>>>> performance tests will require paying someone like Greg S., Mark or me
>>>> for 6 solid months to develop them, since worthwhile open source
>>>> performance test platforms currently don't exist. That money has never
>>>> been available; maybe I should do a kickstarter.
>>
>>> So in order to get *testing* we need to pay somebody. But to build a great
>>> database server, we can rely on volunteer efforts or sponsorship from
>>> companies who are interested in moving the project forward?
>>
>> And even more to the point, volunteers to reinvent the kernel I/O stack
>> can be found on every street corner? And those volunteers won't need any
>> test scaffolding to be sure that *their* version never has performance
>> regressions? (Well, no, they won't, because no such thing will ever be
>> built. But we do need better test scaffolding for real problems.)
>
> Can we avoid the Linux kernel problem by simply increasing our shared
> buffer size, say up to 80% of memory?
That would make it swap more easily.

I think that we should use the newer Linux system calls, such as
posix_fadvise(), fallocate() and sync_file_range(), when we use Linux
buffered IO. However, PostgreSQL doesn't use these system calls much.
In particular, I think the checkpoint algorithm is very ugly.
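
A minimal sketch of the kind of kernel hinting meant above, written in Python for brevity rather than in PostgreSQL's C. The helper names and the 8192-byte block size are assumptions made for illustration, and this is not how PostgreSQL actually issues I/O. Python's os module wraps posix_fadvise() and posix_fallocate() but has no wrapper for sync_file_range(), so that call is left out here.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed page size, for illustration only


def preallocate_and_hint(path, nblocks):
    """Reserve space for nblocks pages and hint that reads will be sequential."""
    size = nblocks * BLOCK_SIZE
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the filesystem to allocate the blocks up front.
        os.posix_fallocate(fd, 0, size)
        # Advise the kernel about the expected access pattern.
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_SEQUENTIAL)
    finally:
        os.close(fd)
    return os.path.getsize(path)


def write_and_drop_cache(path, data):
    """Write data, flush it to disk, then hint that the cached pages are not needed."""
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.write(fd, data)
        os.fsync(fd)
        # Length 0 means "from offset to end of file". This is only a hint;
        # the kernel is free to ignore it.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        demo_path = os.path.join(tmpdir, "relation_file")
        print(preallocate_and_hint(demo_path, 4))
        print(write_and_drop_cache(demo_path, b"x" * BLOCK_SIZE))
```

These calls are advisory: whether they help depends on the kernel version and filesystem, which is exactly the kind of thing that would need measuring.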

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 04:40:08
Message-ID: 1386132008.27399.3.camel@vanquo.pezone.net

On Tue, 2013-12-03 at 14:44 -0800, Josh Berkus wrote:
> Would certainly be nice. Realistically, getting good automated
> performance tests will require paying someone like Greg S., Mark or me
> for 6 solid months to develop them, since worthwhile open source
> performance test platforms currently don't exist. That money has
> never been available; maybe I should do a kickstarter.

I think the problem is, it's not even clear what the deliverable might
be. Benchmarking tools exist, and running them on a regular schedule
shouldn't be difficult. But that doesn't find regressions between
kernel versions, for example, or regressions in particular queries
(unless they happen to be included in the benchmark).

The first step here should be to work out the minimum viable product,
and then see what it would take to get that done.


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 07:14:41
Message-ID: 529ED661.8060901@kaltenbrunner.cc
Lists: pgsql-hackers

On 12/04/2013 05:40 AM, Peter Eisentraut wrote:
> On Tue, 2013-12-03 at 14:44 -0800, Josh Berkus wrote:
>> Would certainly be nice. Realistically, getting good automated
>> performance tests will require paying someone like Greg S., Mark or me
>> for 6 solid months to develop them, since worthwhile open source
>> performance test platforms currently don't exist. That money has
>> never been available; maybe I should do a kickstarter.
>
> I think the problem is, it's not even clear what the deliverable might
> be. Benchmarking tools exist, and running them on a regular schedule
> shouldn't be difficult. But that doesn't find regressions between
> kernel versions, for example, or regressions in particular queries
> (unless they happen to be included in the benchmark).

I agree on the problem of specifying an exact deliverable - however,
simply using some of the existing benchmark tools and maybe augmenting them
with the myriad of simple "micro" level regressions we have in the form of
sql queries in the archives would be a sensible start. It might not help
for all cases but it can help for some and we learn something that might
help us build the next iteration of it. Adding, say, some operating
systems to the mix if we have the above would be fairly easy - running a
few kvm instances that get bootstrapped automatically is something that
is a solved problem.
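
As a rough illustration of what a first iteration could check, here is a
minimal sketch that compares per-benchmark throughput between a baseline
run and a candidate run (a new kernel, a new PostgreSQL version, or a
patch). The function name, the 5% tolerance and the input layout are all
assumptions made for this example; they are not part of any existing tool.

```python
from statistics import median


def find_regressions(baseline, candidate, tolerance=0.05):
    """Return names of benchmarks whose median throughput dropped by more than `tolerance`.

    `baseline` and `candidate` map a benchmark name to a list of throughput
    samples (for example, transactions per second from repeated runs).
    """
    regressed = []
    for name, base_runs in baseline.items():
        cand_runs = candidate.get(name)
        if not cand_runs:
            continue  # benchmark was not run on the candidate; nothing to compare
        base = median(base_runs)
        cand = median(cand_runs)
        # Flag only drops larger than the tolerance, to ride out run-to-run noise.
        if cand < base * (1.0 - tolerance):
            regressed.append(name)
    return sorted(regressed)
```

Using the median of several runs is a deliberate choice here: a single
slow run should not raise an alarm. Whether 5% is a sensible threshold
would have to be worked out from the actual noise on the test machines.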

>
> The first step here should be to work out the minimum viable product,
> and then see what it would take to get that done.

yeah we need to start somewhere and see what we can learn.

Stefan


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp
Cc: ishii(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us, magnus(at)hagander(dot)net, josh(at)agliodbs(dot)com, stefan(at)kaltenbrunner(dot)cc, jd(at)commandprompt(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 07:28:19
Message-ID: 20131204.162819.1459835641271977076.t-ishii@sraoss.co.jp
Lists: pgsql-hackers

>> Can we avoid the Linux kernel problem by simply increasing our shared
>> buffer size, say up to 80% of memory?
> It will swap more easily.

Is that the case? If the system does not have enough memory, the kernel
buffer will be used for other purposes, and the kernel cache will not
work very well anyway. In my understanding, the problem is, even if
there's enough memory, the kernel's cache does not work as expected.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 07:39:23
Message-ID: CAGTBQpZXLbuQMuVg1xwC=MWDqR1X84p9450RaDDn7fiqbauqEg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 4, 2013 at 4:28 AM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>>> Can we avoid the Linux kernel problem by simply increasing our shared
>>> buffer size, say up to 80% of memory?
>> It will swap more easily.
>
> Is that the case? If the system does not have enough memory, the kernel
> buffer will be used for other purposes, and the kernel cache will not
> work very well anyway. In my understanding, the problem is, even if
> there's enough memory, the kernel's cache does not work as expected.

Problem is, Postgres relies on a working kernel cache for checkpoints.
Checkpoint logic would have to be heavily reworked to account for an
impaired kernel cache.

Really, there's no difference between fixing the I/O problems in the
kernel(s) vs in postgres. The only difference is, in the kernel(s),
everyone profits, and you've got a huge head start.

Communicating more with the kernel (through posix_fadvise, fallocate,
aio, iovec, etc...) would probably be good, but it does expose more
kernel issues. posix_fadvise, for instance, is a double-edged sword
ATM. I do believe, however, that exposing those issues and prompting a
fix is far preferable to silently working around them.
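
For readers who have not used these calls, here is a minimal sketch of
what "communicating more with the kernel" can look like, using Python's
os module as a thin wrapper over posix_fallocate() and posix_fadvise().
The function name and the file layout are hypothetical, and this is not
how PostgreSQL itself manages its files. The hints are advisory: the
kernel is free to ignore them.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default page size; used here only for illustration


def preallocate_and_hint(path, nblocks):
    """Reserve space for a file, then tell the kernel how it will be accessed.

    Returns the resulting file size in bytes.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Reserve the blocks up front so later writes do not have to extend the file.
        os.posix_fallocate(fd, 0, nblocks * BLOCK_SIZE)
        # Advisory: expect random access over the whole file (length 0 means "to the end").
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        # Advisory: ask the kernel to start reading the first block into its cache.
        os.posix_fadvise(fd, 0, BLOCK_SIZE, os.POSIX_FADV_WILLNEED)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)
```

These functions are only available on Unix-like systems, and how much the
hints help depends on the kernel version and filesystem, which is exactly
the double-edged aspect mentioned above.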


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 08:47:24
Message-ID: 529EEC1C.2040207@vmware.com
Lists: pgsql-hackers

On 12/04/2013 01:08 AM, Tom Lane wrote:
> Magnus Hagander <magnus(at)hagander(dot)net> writes:
>> On Tue, Dec 3, 2013 at 11:44 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> Would certainly be nice. Realistically, getting good automated
>>> performance tests will require paying someone like Greg S., Mark or me
>>> for 6 solid months to develop them, since worthwhile open source
>>> performance test platforms currently don't exist. That money has never
>>> been available; maybe I should do a kickstarter.
>
>> So in order to get *testing* we need to pay somebody. But to build a great
>> database server, we can rely on volunteer efforts or sponsorship from
>> companies who are interested in moving the project forward?
>
> And even more to the point, volunteers to reinvent the kernel I/O stack
> can be found on every street corner?

Actually, yes, I think so. That's a lot more exciting to work on than a
regression test suite.

> And those volunteers won't need any
> test scaffolding to be sure that *their* version never has performance
> regressions? (Well, no, they won't, because no such thing will ever be
> built. But we do need better test scaffolding for real problems.)

Maybe we should lie, and *say* that we want direct I/O, but require that
all submissions come with a test suite to prove that it's a gain. Then
someone might actually write one, as a sidekick of a direct I/O patch.
Then we could toss out the direct I/O stuff and take only the test
framework.

FWIW, I also think that it'd be a folly to reimplement the I/O stack.
The kernel does a lot of things for us. It might not do a great job, but
it's good enough. As one datapoint, before my time, the VMware vPostgres
team actually did use direct I/O in vPostgres. We shipped that in a few
releases. It was a lot of effort to get the code right, and for DBAs, it
made correct tuning of shared_buffers a lot more important - set it too
low and you won't take full advantage of your RAM, set it too high and
you won't have memory available for other things. To be a good VM
citizen, they also had to implement a memory ballooning module inside
Postgres, to release shared buffers if the system hosting the VM is
under memory pressure. What did we gain by doing all that, compared to
just letting the kernel handle it? Some extra performance in some use
cases, and a loss in others. Not worth the trouble.

- Heikki


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 15:30:51
Message-ID: 529F4AAB.80505@gmx.net
Lists: pgsql-hackers

On 12/4/13, 2:14 AM, Stefan Kaltenbrunner wrote:
> running a
> few kvm instances that get bootstrapped automatically is something that
> is a solved problem.

Is it sound to run performance tests on kvm?


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 15:32:41
Message-ID: 529F4B19.1050705@kaltenbrunner.cc
Lists: pgsql-hackers

On 12/04/2013 04:30 PM, Peter Eisentraut wrote:
> On 12/4/13, 2:14 AM, Stefan Kaltenbrunner wrote:
>> running a
>> few kvm instances that get bootstrapped automatically is something that
>> is a solved problem.
>
> Is it sound to run performance tests on kvm?

as sound as on any other platform imho, the performance characteristics
will differ between bare metal or other virtualisation platforms but the
future is virtual and that is what a lot of stuff runs on...

Stefan


From: Jonathan Corbet <corbet(at)lwn(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 15:33:45
Message-ID: 20131204083345.31c60dd1@lwn.net
Lists: pgsql-hackers

On Tue, 03 Dec 2013 10:44:15 -0800
Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> It seems clear that Kernel.org, since 2.6, has been in the business of
> pushing major, hackish, changes to the IO stack without testing them or
> even thinking too hard about what the side-effects might be. This is
> perhaps unsurprising given that two of the largest sponsors of the
> Kernel -- who, incidentally, do 100% of the performance testing -- don't
> use the IO stack.
>
> This says to me that Linux will clearly be an undependable platform in
> the future with the potential to destroy PostgreSQL performance without
> warning, leaving us scrambling for workarounds. Too bad the
> alternatives are so unpopular.

Wow, Josh, I'm surprised to hear this from you.

The active/inactive list mechanism works great for the vast majority of
users. The second-use algorithm prevents a lot of pathological behavior,
like wiping out your entire cache by copying a big file or running a
backup. We *need* that kind of logic in the kernel.
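
To make the second-use idea concrete, here is a toy model, not the
kernel's actual implementation: a page seen once sits on an inactive
list, and only a second access promotes it to the active list. The class
name, list sizes and page numbers are all made up for this sketch.

```python
from collections import OrderedDict


class TwoListCache:
    """Toy page cache with an inactive (seen once) and an active (seen again) list."""

    def __init__(self, inactive_size, active_size):
        self.inactive_size = inactive_size
        self.active_size = active_size
        self.inactive = OrderedDict()
        self.active = OrderedDict()

    def access(self, page):
        """Touch a page; return True on a cache hit."""
        if page in self.active:
            self.active.move_to_end(page)
            return True
        if page in self.inactive:
            # Second use: promote to the active list.
            del self.inactive[page]
            self.active[page] = None
            if len(self.active) > self.active_size:
                # Demote the coldest active page rather than dropping it outright.
                demoted = self.active.popitem(last=False)[0]
                self._add_inactive(demoted)
            return True
        # First use: the page only earns a slot on the inactive list.
        self._add_inactive(page)
        return False

    def _add_inactive(self, page):
        self.inactive[page] = None
        if len(self.inactive) > self.inactive_size:
            self.inactive.popitem(last=False)  # evict the oldest inactive page


def hot_set_survives_scan():
    """Touch a small hot set twice, run a large one-pass scan, then re-check the hot set."""
    cache = TwoListCache(inactive_size=8, active_size=8)
    hot = [0, 1, 2, 3]
    for _round in range(2):
        for page in hot:
            cache.access(page)
    for page in range(1000, 2000):  # the "big file copy": every page is used once
        cache.access(page)
    return all(cache.access(page) for page in hot)
```

In this model the scan pages only ever churn through the inactive list,
so the hot pages on the active list are still cached afterwards. A plain
LRU of the same total size would have lost them.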

Now, back in 2012, Johannes (working for one of those big contributors)
hit upon an issue where second-use falls down. So he set out to fix it:

https://lwn.net/Articles/495543/

This code has been a bit slow getting into the mainline for a few reasons,
but one of the chief ones is this: nobody is saying from the sidelines
that they need it! If somebody were saying "Postgres would work a lot
better with this code in place" and had some numbers to demonstrate that,
we'd be far more likely to see it get into an upcoming release.

In the end, Linux is quite responsive to the people who participate in its
development, even as testers and bug reporters. It responds rather less
well to people who find problems in enterprise kernels years later,
granted.

The amount of automated testing, including performance testing, has
increased markedly in the last couple of years. I bet that it would not
be hard at all to get somebody like Fengguang Wu to add some
Postgres-oriented I/O tests to his automatic suite:

https://lwn.net/Articles/571991/

Then we would all have a much better idea of how kernel releases are
affecting one of our most important applications; developers would pay
attention to that information.

Or you could go off and do your own thing, but I believe that would leave
us all poorer.

Thanks,

jon


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Jonathan Corbet <corbet(at)lwn(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 17:45:22
Message-ID: 529F6A32.50901@kaltenbrunner.cc
Lists: pgsql-hackers

On 12/04/2013 04:33 PM, Jonathan Corbet wrote:
> On Tue, 03 Dec 2013 10:44:15 -0800
> Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> It seems clear that Kernel.org, since 2.6, has been in the business of
>> pushing major, hackish, changes to the IO stack without testing them or
>> even thinking too hard about what the side-effects might be. This is
>> perhaps unsurprising given that two of the largest sponsors of the
>> Kernel -- who, incidentally, do 100% of the performance testing -- don't
>> use the IO stack.
>>
>> This says to me that Linux will clearly be an undependable platform in
>> the future with the potential to destroy PostgreSQL performance without
>> warning, leaving us scrambling for workarounds. Too bad the
>> alternatives are so unpopular.
>
> Wow, Josh, I'm surprised to hear this from you.
>
> The active/inactive list mechanism works great for the vast majority of
> users. The second-use algorithm prevents a lot of pathological behavior,
> like wiping out your entire cache by copying a big file or running a
> backup. We *need* that kind of logic in the kernel.
>
> Now, back in 2012, Johannes (working for one of those big contributors)
> hit upon an issue where second-use falls down. So he set out to fix it:
>
> https://lwn.net/Articles/495543/
>
> This code has been a bit slow getting into the mainline for a few reasons,
> but one of the chief ones is this: nobody is saying from the sidelines
> that they need it! If somebody were saying "Postgres would work a lot
> better with this code in place" and had some numbers to demonstrate that,
> we'd be far more likely to see it get into an upcoming release.
>
> In the end, Linux is quite responsive to the people who participate in its
> development, even as testers and bug reporters. It responds rather less
> well to people who find problems in enterprise kernels years later,
> granted.
>
> The amount of automated testing, including performance testing, has
> increased markedly in the last couple of years. I bet that it would not
> be hard at all to get somebody like Fengguang Wu to add some
> Postgres-oriented I/O tests to his automatic suite:
>
> https://lwn.net/Articles/571991/
>
> Then we would all have a much better idea of how kernel releases are
> affecting one of our most important applications; developers would pay
> attention to that information.

hmm interesting tool, I can see how that would be very useful for "early
warning" style detection on the kernel development side using a small
set of postgresql "benchmarks". That would basically help with part of
what Josh complained about: that it will take ages for regressions to be
detected. From postgresql's pov we would also need additional long term and
more complex testing spanning different postgresql versions on various
distribution platforms (because that is what people deploy in
production, hand built git-fetched kernels are rare) using tests that
both might have extended runtimes and/or require external infrastructure.

>
> Or you could go off and do your own thing, but I believe that would leave
> us all poorer.

fully agreed

Stefan


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 18:30:46
Message-ID: 529F74D6.8030009@commandprompt.com
Lists: pgsql-hackers


On 12/04/2013 07:32 AM, Stefan Kaltenbrunner wrote:
>
> On 12/04/2013 04:30 PM, Peter Eisentraut wrote:
>> On 12/4/13, 2:14 AM, Stefan Kaltenbrunner wrote:
>>> running a
>>> few kvm instances that get bootstrapped automatically is something that
>>> is a solved problem.
>>
>> Is it sound to run performance tests on kvm?
>
> as sound as on any other platform imho, the performance characteristics
> will differ between bare metal or other virtualisation platforms but the
> future is virtual and that is what a lot of stuff runs on...

In actuality you need both. We need to know what the kernel is going to
do on bare metal. For example, 3.2 to 3.8 are total crap for random IO
access. We will only catch that properly from bare metal tests or at
least, we will only catch it easily on bare metal tests.

If we know the standard bare metal tests are working then the next step
up would be to test virtual.

BTW: Virtualization is only one future and it is still a long way off
from serving the needs that bare metal serves at the same level
(speaking PostgreSQL specifically).

JD

>
>
> Stefan
>
>

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 18:35:22
Message-ID: 529F75EA.4070608@kaltenbrunner.cc
Lists: pgsql-hackers

On 12/04/2013 07:30 PM, Joshua D. Drake wrote:
>
> On 12/04/2013 07:32 AM, Stefan Kaltenbrunner wrote:
>>
>> On 12/04/2013 04:30 PM, Peter Eisentraut wrote:
>>> On 12/4/13, 2:14 AM, Stefan Kaltenbrunner wrote:
>>>> running a
>>>> few kvm instances that get bootstrapped automatically is something that
>>>> is a solved problem.
>>>
>>> Is it sound to run performance tests on kvm?
>>
>> as sound as on any other platform imho, the performance characteristics
>> will differ between bare metal or other virtualisation platforms but the
>> future is virtual and that is what a lot of stuff runs on...
>
> In actuality you need both. We need to know what the kernel is going to
> do on bare metal. For example, 3.2 to 3.8 are total crap for random IO
> access. We will only catch that properly from bare metal tests or at
> least, we will only catch it easily on bare metal tests.
>
> If we know the standard bare metal tests are working then the next step
> up would be to test virtual.
>
> BTW: Virtualization is only one future and it is still a long way off
> from serving the needs that bare metal serves at the same level
> (speaking PostgreSQL specifically).

we need to get that off the ground - and whatever makes it easier to get
off the ground will help. and if we solve the automation for
virtualisation, bare metal is just a small step away (or the other way
round). Getting comparable performance levels between either different
postgresql versions (or patches) or different operating systems with
various workloads is probably more valuable now than getting absolute
peak performance levels under specific tests long term.

Stefan


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Jonathan Corbet <corbet(at)lwn(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 19:07:04
Message-ID: 529F7D58.1060301@agliodbs.com
Lists: pgsql-hackers

On 12/04/2013 07:33 AM, Jonathan Corbet wrote:
> Wow, Josh, I'm surprised to hear this from you.

Well, I figured it was too angry to propose for an LWN article. ;-)

> The active/inactive list mechanism works great for the vast majority of
> users. The second-use algorithm prevents a lot of pathological behavior,
> like wiping out your entire cache by copying a big file or running a
> backup. We *need* that kind of logic in the kernel.

There's a large body of research on 2Q algorithms going back to the 80s,
which is what this is. As far as I can tell, the modification was
performed without any reading of this research, since that would have
easily shown that 50/50 was unlikely to be a good division, and that in
fact there is nothing which would work except a tunable setting, because
workloads are different. Certainly the "what happens if a single file
is larger than the entire recency bucket" question is addressed and debated.
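
To illustrate the "file larger than the recency bucket" case, here is a
small simulation under simplifying assumptions: a two-list cache of a
fixed total size, a tunable fraction given to the first-use (recency)
list, and a single file read front to back several times. It is a toy
model, not the Linux algorithm and not any published 2Q variant; the
function and parameter names are invented for the example.

```python
from collections import OrderedDict


def cyclic_hit_rate(total_pages, inactive_fraction, file_pages, passes=5):
    """Hit rate when a file of `file_pages` pages is read front to back `passes` times."""
    inactive_cap = max(1, int(total_pages * inactive_fraction))
    active_cap = max(1, total_pages - inactive_cap)
    inactive = OrderedDict()  # pages seen once (recency list)
    active = OrderedDict()    # pages seen at least twice (frequency list)
    hits = 0
    for _pass in range(passes):
        for page in range(file_pages):
            if page in active:
                active.move_to_end(page)
                hits += 1
            elif page in inactive:
                hits += 1
                # Second use: promote, demoting the coldest active page if needed.
                del inactive[page]
                active[page] = None
                if len(active) > active_cap:
                    demoted = active.popitem(last=False)[0]
                    inactive[demoted] = None
            else:
                inactive[page] = None  # first use
            while len(inactive) > inactive_cap:
                inactive.popitem(last=False)  # evict the oldest first-use page
    return hits / (passes * file_pages)
```

With a 100-page cache and an 80-page file, cyclic_hit_rate(100, 0.5, 80)
never gets a hit in this model, because every page is evicted from the
50-page recency list before its second use, even though the whole file
would fit in the cache. cyclic_hit_rate(100, 0.9, 80) hits on every pass
after the first. Other workloads would favor a different split, which is
the argument for making it adjustable.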

As an example, PostgreSQL would want to shrink the frequency list to 0%,
because we already implement our own frequency list, and we already
demonstrated back in version 8.1 that a 3-list system was ineffective.

I can save Johannes some time: don't implement ARC. Not only is it
under IBM patent, it's not effective in real-world situations. Both
Postgres and Apache tried it in the early aughts.

However, this particular issue concerns me less than the general
attitude that it's OK to push in experimental IO changes which can't be
disabled by users into release kernels, as exemplified by several
problematic and inadequately tested IO changes in the 3.X kernels --
most notably the pdflush bug. It speaks of a policy that the Linux IO
stack is not production software, and it's OK to tinker with it in ways
that break things for many users.

I also wasn't exaggerating the reception I got when I tried to talk
about IO and PostgreSQL at LinuxCon and other events. The majority of
Linux hackers I've talked to simply don't want to be bothered with
PostgreSQL's performance needs, and I've heard similar things from my
colleagues at the MySQL variants. Greg KH was the only real exception.

Heck, I went to a meeting of filesystem geeks at LinuxCon and the main
feedback I received, from Linux FS developers (Chris and Ted), was
"PostgreSQL should implement its own storage and use DirectIO, we don't
know why you're even trying to use the Linux IO stack." That's why I
gave up on working through community channels; I face enough uphill
battles in *this* project.

> This code has been a bit slow getting into the mainline for a few reasons,
> but one of the chief ones is this: nobody is saying from the sidelines
> that they need it! If somebody were saying "Postgres would work a lot
> better with this code in place" and had some numbers to demonstrate that,
> we'd be far more likely to see it get into an upcoming release.

Well, Citus did that; do you need more evidence?

> In the end, Linux is quite responsive to the people who participate in its
> development, even as testers and bug reporters. It responds rather less
> well to people who find problems in enterprise kernels years later,
> granted.

All infrastructure software, including Postgres, has the issue that most
enterprise users are using a version which was released years ago. As a
result, some performance issues simply aren't going to be found until
that version has been out for a couple of years. This leads to a
Catch-22: enterprise users are reluctant to upgrade because of potential
performance regressions, and as a result the median "enterprise" version
gets further and further behind current development, and as a result the
performance regressions are never fixed.

We encounter this in PostgreSQL (I have customers who are still on 8.4
or 9.1 because of specific regressions), and it's even worse in the
Linux world, where RHEL is still on 2.6. We work really hard to avoid
performance regressions in Postgres versions, because we know we can't
test for them adequately, and often can't fix them in release versions
after the fact.

But you know what? 2.6, overall, still performs better than any kernel
in the 3.X series, at least for Postgres.

> The amount of automated testing, including performance testing, has
> increased markedly in the last couple of years. I bet that it would not
> be hard at all to get somebody like Fengguang Wu to add some
> Postgres-oriented I/O tests to his automatic suite:
>
> https://lwn.net/Articles/571991/
>
> Then we would all have a much better idea of how kernel releases are
> affecting one of our most important applications; developers would pay
> attention to that information.

Oh, good! I was working with Greg on having an automated pgBench run,
but doing it on Wu's testing platform would be even better. I still
need to get some automated stats digestion, since I want to at least
make sure that the tests would show the three major issues which we
encountered in recent Linux kernels so far. Of course, I have a "free
time" issue, which is being discussed on the other fork of this thread.

In addition to testing, though, I have yet to find a way to learn about
new changes to IO or memory performance in the Linux Kernel without
reading all of the traffic on LKML and all Linux commit messages and
filtering them myself. If there were a better way to look for this
information, Linux would be more likely to get feedback in a timely
fashion. And yeah, I know that Postgres has the same issue.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Jonathan Corbet <corbet(at)lwn(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 19:40:56
Message-ID: 529F8548.6010206@commandprompt.com
Lists: pgsql-hackers


On 12/04/2013 07:33 AM, Jonathan Corbet wrote:

>
> Wow, Josh, I'm surprised to hear this from you.
>
> The active/inactive list mechanism works great for the vast majority of
> users. The second-use algorithm prevents a lot of pathological behavior,
> like wiping out your entire cache by copying a big file or running a
> backup. We *need* that kind of logic in the kernel.
>
>
> The amount of automated testing, including performance testing, has
> increased markedly in the last couple of years. I bet that it would not
> be hard at all to get somebody like Fengguang Wu to add some
> Postgres-oriented I/O tests to his automatic suite:
>
> https://lwn.net/Articles/571991/
>
> Then we would all have a much better idea of how kernel releases are
> affecting one of our most important applications; developers would pay
> attention to that information.
>
> Or you could go off and do your own thing, but I believe that would leave
> us all poorer.

Thank you for your very well thought out, and knowledgeable response.
This is certainly helpful and highlights what a lot of us were already
stating.

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats


From: Jonathan Corbet <corbet(at)lwn(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 20:31:39
Message-ID: 20131204133139.5dad25c9@lwn.net
Lists: pgsql-hackers

On Wed, 04 Dec 2013 11:07:04 -0800
Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> On 12/04/2013 07:33 AM, Jonathan Corbet wrote:
> > Wow, Josh, I'm surprised to hear this from you.
>
> Well, I figured it was too angry to propose for an LWN article. ;-)

So you're going to make us write it for you :)

> > The active/inactive list mechanism works great for the vast majority of
> > users. The second-use algorithm prevents a lot of pathological behavior,
> > like wiping out your entire cache by copying a big file or running a
> > backup. We *need* that kind of logic in the kernel.
>
> There's a large body of research on 2Q algorithms going back to the 80s,
> which is what this is. As far as I can tell, the modification was
> performed without any reading of this research, since that would have
> easily shown that 50/50 was unlikely to be a good division, and that in
> fact there is nothing which would work except a tunable setting, because
> workloads are different.

In general, the movement of useful information between academia and
real-world programming seems to be minimal at best. Neither side seems to
find much that is useful or interesting in what the other is doing.
Unfortunate.

For those interested in the details... (1) It's not quite 50/50, that's one
bound for how the balance is allowed to go. (2) Anybody trying to add
tunables to the kernel tends to run into resistance. Exposing thousands of
knobs tends to lead to a situation where you *have* to be an expert on all
those knobs to get decent behavior out of your system. So there is a big
emphasis on having the kernel tune itself whenever possible. Here is a
situation where that is not always happening, but a fix (which introduces
no knob) is in the works.

As an example, I've never done much with the PostgreSQL knobs on the LWN
server. I just don't have the time to mess with it, and things Work Well
Enough.

</irrelevant_aside>

> However, this particular issue concerns me less than the general
> attitude that it's OK to push in experimental IO changes which can't be
> disabled by users into release kernels, as exemplified by several
> problematic and inadequately tested IO changes in the 3.X kernels --
> most notably the pdflush bug. It speaks of a policy that the Linux IO
> stack is not production software, and it's OK to tinker with it in ways
> that break things for many users.

Bugs and regressions happen, and I won't say that we do a good enough job
in that regard. There has been some concern recently that we're accepting
too much marginal stuff. We have problems getting enough people to
adequately review code — I think I've heard of another project or two with
similar issues :). But nobody sees the kernel as experimental or feels
that the introduction of bugs is an acceptable thing.

> I also wasn't exaggerating the reception I got when I tried to talk
> about IO and PostgreSQL at LinuxCon and other events. The majority of
> Linux hackers I've talked to simply don't want to be bothered with
> PostgreSQL's performance needs, and I've heard similar things from my
> colleagues at the MySQL variants. Greg KH was the only real exception.
>
> Heck, I went to a meeting of filesystem geeks at LinuxCon and the main
> feedback I received, from Linux FS developers (Chris and Ted), was
> "PostgreSQL should implement its own storage and use DirectIO, we don't
> know why you're even trying to use the Linux IO stack."

I think you're talking to the wrong people. Nothing you've described is a
filesystem problem; you're contending with memory management problems.
Chris and Ted weren't helpful because there's actually little they can do
to help you. I would be happy to introduce you to some people who would be
more likely to take your problems to heart.

Mel Gorman, for example, is working on putting together a set of MM
benchmarks in the hopes of quantifying changes and catching regressions
before new code is merged. He's one of the people who has to deal with
performance regressions when they show up in enterprise kernels, and I get
the sense he'd rather do less of that.

Perhaps even better: the next filesystem, storage, and memory management
summit is March 24-25. A session on your pain points there would bring in
a substantial portion of the relevant developers at all levels. LSFMM
is arguably the most productive kernel event I see over the course of a
year; it's where I would go first to make progress on this issue. I'm not
an LSFMM organizer, but I would be happy to work to make such a session
happen if somebody from the PostgreSQL community wanted to be there.

> > This code has been a bit slow getting into the mainline for a few reasons,
> > but one of the chief ones is this: nobody is saying from the sidelines
> > that they need it! If somebody were saying "Postgres would work a lot
> > better with this code in place" and had some numbers to demonstrate that,
> > we'd be far more likely to see it get into an upcoming release.
>
> Well, Citus did that; do you need more evidence?

Yes, they did that — one week ago. This patch has been in the works for
almost two years. And Citus has not taken anything to the kernel
community, so somebody else will have to do that for them. I might be able
to help in that regard.

> In addition to testing, though, I have yet to find a way to learn about
> new changes to IO or memory performance in the Linux Kernel without
> reading all of the traffic on LKML and all Linux commit messages and
> filtering them myself. If there were a better way to look for this
> information, Linux would be more likely to get feedback in a timely
> fashion. And yeah, I know that Postgres has the same issue.

Gee, if only there were a web site where one could read about changes to
the Linux kernel :)

Seriously, though, one of the best things to do would be to make a point of
picking up a kernel around -rc3 (right around now, say, for 3.13) and
running a few benchmarks on it. If you report a performance regression at
that stage, it will get attention.

Thanks,

jon


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Jonathan Corbet <corbet(at)lwn(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 20:47:30
Message-ID: CABUevEwJdxyJe8nAquN26nxLZD2f7UqvvzByZvpz+uf44PKQYA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 4, 2013 at 9:31 PM, Jonathan Corbet <corbet(at)lwn(dot)net> wrote:

> > I also wasn't exaggerating the reception I got when I tried to talk
> > about IO and PostgreSQL at LinuxCon and other events. The majority of
> > Linux hackers I've talked to simply don't want to be bothered with
> > PostgreSQL's performance needs, and I've heard similar things from my
> > colleagues at the MySQL variants. Greg KH was the only real exception.
> >
> > Heck, I went to a meeting of filesystem geeks at LinuxCon and the main
> > feedback I received, from Linux FS developers (Chris and Ted), was
> > "PostgreSQL should implement its own storage and use DirectIO, we don't
> > know why you're even trying to use the Linux IO stack."
>
> I think you're talking to the wrong people. Nothing you've described is a
> filesystem problem; you're contending with memory management problems.
> Chris and Ted weren't helpful because there's actually little they can do
> to help you. I would be happy to introduce you to some people who would be
> more likely to take your problems to heart.
>
> Mel Gorman, for example, is working on putting together a set of MM
> benchmarks in the hopes of quantifying changes and catching regressions
> before new code is merged. He's one of the people who has to deal with
> performance regressions when they show up in enterprise kernels, and I get
> the sense he'd rather do less of that.
>
> Perhaps even better: the next filesystem, storage, and memory management
> summit is March 24-25. A session on your pain points there would bring in
> a substantial portion of the relevant developers at all levels. LSFMM
> is arguably the most productive kernel event I see over the course of a
> year; it's where I would go first to make progress on this issue. I'm not
> an LSFMM organizer, but I would be happy to work to make such a session
> happen if somebody from the PostgreSQL community wanted to be there.
>

I think that's an excellent idea. If one of our developers could find the
time to attend that, I think that could be very productive. While I'm not
on the funds team, I'd definitely vote for funding such participation out
of community funds if said developer can't do it on his own.

But it should definitely be a developer with interest and skills in that
particular area as well of course :) So don't think I'm proposing myself, I
definitely am not :)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Jonathan Corbet <corbet(at)lwn(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 20:58:51
Message-ID: 20131204205851.GX17272@tamriel.snowman.net
Lists: pgsql-hackers

* Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> I think that's an excellent idea. If one of our developers could find the
> time to attend that, I think that could be very productive. While I'm not
> on the funds team, I'd definitely vote for funding such participation out
> of community funds if said developer can't do it on his own.
>
> But it should definitely be a developer with interest and skills in that
> particular area as well of course :) So don't think I'm proposing myself, I
> definitely am not :)

For my part, I'm definitely interested and those dates currently look
like they'd work for me. Not sure if I really meet Magnus'
qualifications above, but I'd be happy to try. ;) Stark and I were
having a pretty good discussion with Ted Ts'o at pgconf.eu and he
certainly seemed interested and willing to at least discuss things with
us.

Thanks,

Stephen


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 21:01:37
Message-ID: 529F9831.60301@agliodbs.com
Lists: pgsql-hackers

Jonathan,

> For those interested in the details... (1) It's not quite 50/50, that's one
> bound for how the balance is allowed to go. (2) Anybody trying to add
> tunables to the kernel tends to run into resistance. Exposing thousands of
> knobs tends to lead to a situation where you *have* to be an expert on all
> those knobs to get decent behavior out of your system. So there is a big
> emphasis on having the kernel tune itself whenever possible. Here is a
> situation where that is not always happening, but a fix (which introduces
> no knob) is in the works.

Yeah, we get into this argument all the time. The problem is when you
run into situations where there is no optimal (or even acceptable)
setting for all, or even most, users. And I'll say in advance that 2Q
is one of those situations.

> As an example, I've never done much with the PostgreSQL knobs on the LWN
> server. I just don't have the time to mess with it, and things Work Well
> Enough.

Sure, and even when I teach fiddling with the knobs, there's only 12-20
knobs 95% of users need to have any interest in. But we have ~220
settings for the other 5%, and those users would be screwed without them.

> Bugs and regressions happen, and I won't say that we do a good enough job
> in that regard. There has been some concern recently that we're accepting
> too much marginal stuff. We have problems getting enough people to
> adequately review code — I think I've heard of another project or two with
> similar issues :). But nobody sees the kernel as experimental or feels
> that the introduction of bugs is an acceptable thing.

OK. The chain of events over the pdflush bug really felt like what I
said earlier, especially since problems *were* reported shortly after
kernel release and ignored.

> I think you're talking to the wrong people.

Quite possibly.

> Perhaps even better: the next filesystem, storage, and memory management
> summit is March 24-25.

Link? I can't find anything Googling by that name. I'm pretty sure we
can get at least one person there.

> Gee, if only there were a web site where one could read about changes to
> the Linux kernel :)

Even you don't cover 100% of performance-changing commits. And I'll
admit to missing issues of LWN when I'm travelling.

> Seriously, though, one of the best things to do would be to make a point of
> picking up a kernel around -rc3 (right around now, say, for 3.13) and
> running a few benchmarks on it. If you report a performance regression at
> that stage, it will get attention.

Yeah, back to the "we need resources for good benchmarks" discussion
fork ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jonathan Corbet <corbet(at)lwn(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 21:08:13
Message-ID: CAHyXU0weQ8XAKNQG+M4nz3c7kTF+J5r4zforZt3aZKVJb_zTtg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 4, 2013 at 2:31 PM, Jonathan Corbet <corbet(at)lwn(dot)net> wrote:
> For those interested in the details... (1) It's not quite 50/50, that's one
> bound for how the balance is allowed to go. (2) Anybody trying to add
> tunables to the kernel tends to run into resistance. Exposing thousands of
> knobs tends to lead to a situation where you *have* to be an expert on all
> those knobs to get decent behavior out of your system. So there is a big
> emphasis on having the kernel tune itself whenever possible. Here is a
> situation where that is not always happening, but a fix (which introduces
> no knob) is in the works.

I think there are interesting parallels here with the 'query plan
hints' debate. In both cases I think the conservative voices are
correct: better not to go crazy adding knobs.

merlin


From: Jonathan Corbet <corbet(at)lwn(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 21:09:54
Message-ID: 20131204140954.44c1563d@lwn.net
Lists: pgsql-hackers

On Wed, 04 Dec 2013 13:01:37 -0800
Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> > Perhaps even better: the next filesystem, storage, and memory management
> > summit is March 24-25.
>
> Link? I can't find anything Googling by that name. I'm pretty sure we
> can get at least one person there.

It looks like the page for the 2014 event isn't up yet. It will be
attached (as usual) to the LF collaboration summit:

http://events.linuxfoundation.org/events/collaboration-summit

I'll make a personal note to send something here when the planning process
begins and the CFP goes out.

Napa Valley...one can do worse...:)

jon


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 00:49:48
Message-ID: 20131205004948.GB8935@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

On 2013-12-03 10:44:15 -0800, Josh Berkus wrote:
> I don't know where we'll get the resources to implement our own storage,
> but it's looking like we don't have a choice.

As long as our storage layer is as suboptimal as it is today, I think
it's purely a distraction to primarily blame the kernel.

We
* cannot deal with large shared_buffers; the dirty-buffer scanning is far too
expensive. The amount of memory required for locks is pretty big, and
every backend carries around a pretty huge private array for the
buffer pins.
* do not have scalability in pretty damn central data structures like
buffer mapping.
* Our background eviction mechanism doesn't do anything in lots of
workloads but increase contention on important data structures.
* Due to the missing efficient eviction, we synchronously write out data
when acquiring a victim buffer most of the time. That's already bad if
you have a kernel buffering your writes, but if you don't...
* Due to the frequency of buffer pins in almost all workloads, our
tracking of the importance of individual buffers is far, far too
volatile.
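
To make the last two points easier to picture, here is a toy clock-sweep
buffer pool in Python. It is only a sketch of the general technique, not
PostgreSQL's buffer manager: the class name, the method names and the cap
of 5 on the usage counter are assumptions made for illustration.

```python
# Toy clock-sweep buffer pool. All names here are illustrative; this is
# not PostgreSQL's bufmgr code, only a model of the general technique.
MAX_USAGE = 5  # cap on the per-buffer usage counter (assumed value)


class ClockSweepPool:
    def __init__(self, nbuffers):
        self.pages = [None] * nbuffers   # page id held by each buffer
        self.usage = [0] * nbuffers      # usage counter per buffer
        self.lookup = {}                 # page id -> buffer index
        self.hand = 0                    # clock hand position

    def pin(self, page):
        """Return the buffer index for page, evicting a victim on a miss."""
        idx = self.lookup.get(page)
        if idx is None:
            idx = self._find_victim()
            old = self.pages[idx]
            if old is not None:
                del self.lookup[old]
            self.pages[idx] = page
            self.lookup[page] = idx
            self.usage[idx] = 0
        # Every pin bumps the counter, so a handful of quick pins is enough
        # to saturate it, whatever the page's long-term importance.
        self.usage[idx] = min(self.usage[idx] + 1, MAX_USAGE)
        return idx

    def _find_victim(self):
        """Advance the hand, decrementing counters until one reaches zero."""
        while True:
            idx = self.hand
            self.hand = (self.hand + 1) % len(self.pages)
            if self.usage[idx] == 0:
                return idx
            self.usage[idx] -= 1


# Usage: one page pinned in a burst, then two pages touched once each.
pool = ClockSweepPool(2)
for _ in range(10):
    pool.pin("a")
pool.pin("b")
pool.pin("c")
```

In this model the burst of pins on "a" saturates its counter, so the sweep
has to pass over it repeatedly and ends up evicting "b" instead. The victim
search also runs synchronously inside pin(), which is the situation the
fourth bullet complains about.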

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 08:35:31
Message-ID: 52A03AD3.6000606@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/12/04 16:39), Claudio Freire wrote:
> On Wed, Dec 4, 2013 at 4:28 AM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>>>> Can we avoid the Linux kernel problem by simply increasing our shared
>>>> buffer size, say up to 80% of memory?
>>> It will swap more easily.
>>
>> Is that the case? If the system has not enough memory, the kernel
>> buffer will be used for other purpose, and the kernel cache will not
>> work very well anyway. In my understanding, the problem is, even if
>> there's enough memory, the kernel's cache does not work as expected.
>
> Problem is, Postgres relies on a working kernel cache for checkpoints.
> Checkpoint logic would have to be heavily reworked to account for an
> impaired kernel cache.
>
> Really, there's no difference between fixing the I/O problems in the
> kernel(s) vs in postgres. The only difference is, in the kernel(s),
> everyone profits, and you've got a huge head start.
Yes. And using DirectIO efficiently is more difficult than using BufferedIO.
If we simply switch PostgreSQL's write() calls to direct IO, it will issue the
ugliest kind of random IO.

> Communicating more with the kernel (through posix_fadvise, fallocate,
> aio, iovec, etc...) would probably be good, but it does expose more
> kernel issues. posix_fadvise, for instance, is a double-edged sword
> ATM. I do believe, however, that exposing those issues and prompting a
> fix is far preferable than silently working around them.
Agreed. And I believe that well-controlled BufferedIO is faster and easier
than perfectly controlled DirectIO. In fact, Oracle Database uses BufferedIO
to access small datasets and DirectIO to access big datasets, because that
uses the OS file cache more efficiently.
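
The hinting interface Claudio mentions can be tried from a few lines of
Python. This is a minimal sketch, assuming Linux and Python 3; the helper
name advise_then_read and the scratch file are made up for illustration,
and the 8 kB block size is just an example value.

```python
import os
import tempfile


def advise_then_read(path, offset, length):
    """Hint an upcoming read to the kernel, then do an ordinary buffered read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # POSIX_FADV_WILLNEED is only a hint: the kernel may start
        # readahead for this range, or may ignore the request entirely.
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


# Usage: write a scratch file of two 8 kB blocks and read the second one.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"a" * 8192 + b"b" * 8192)
    scratch = f.name
block = advise_then_read(scratch, 8192, 8192)
os.unlink(scratch)
```

The read still goes through the page cache; the hint only gives the kernel
a chance to start the I/O early.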

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Jonathan Corbet <corbet(at)lwn(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 08:41:31
Message-ID: CAM3SWZSE2q0LhPb0V5qrUmJ7u048Rp7AoKYVvC4uxLVfddjbqw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 4, 2013 at 11:07 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> I also wasn't exaggerating the reception I got when I tried to talk
> about IO and PostgreSQL at LinuxCon and other events. The majority of
> Linux hackers I've talked to simply don't want to be bothered with
> PostgreSQL's performance needs, and I've heard similar things from my
> colleagues at the MySQL variants. Greg KH was the only real exception.

If so, he is a fairly major exception. But there is at least one other
major exception: I met Theodore Ts'o at pgConf.EU (he was in town for
some Google thing), and he seemed pretty interested in what we had to
say, and encouraged us to reach out to the Kernel development
community. I suspect that we simply haven't gone about it the right
way.

> But you know what? 2.6, overall, still performs better than any kernel
> in the 3.X series, at least for Postgres.

What about the fseek() scalability issue?

--
Peter Geoghegan


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Jonathan Corbet <corbet(at)lwn(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 13:48:42
Message-ID: 20131205134841.GZ17272@tamriel.snowman.net
Lists: pgsql-hackers

* Peter Geoghegan (pg(at)heroku(dot)com) wrote:
> On Wed, Dec 4, 2013 at 11:07 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> > But you know what? 2.6, overall, still performs better than any kernel
> > in the 3.X series, at least for Postgres.
>
> What about the fseek() scalability issue?

Not to mention that the 2.6 which I suspect you're referring to (RHEL)
isn't exactly "2.6"...

Thanks,

Stephen


From: Greg Stark <stark(at)mit(dot)edu>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 14:42:29
Message-ID: CAM-w4HMWf4J8ZKKBFhMy2EntXdKiGOhDKtdi0YDxggh-YY6fxQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 5, 2013 at 8:35 AM, KONDO Mitsumasa
<kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Yes. And using DirectIO efficiently is more difficult than using
> BufferedIO.
> If we simply switch PostgreSQL's write() calls to direct IO, it will issue
> the ugliest kind of random IO.

Using DirectIO presumes you're using libaio or threads to implement
prefetching and asynchronous I/O scheduling.
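
As a rough illustration of the threads variant, the sketch below reads a
set of blocks concurrently through O_DIRECT, falling back to buffered
reads where the filesystem refuses the flag. It assumes Linux and
Python 3; open_direct, read_block, prefetch and the 8 kB BLOCK size are
hypothetical names and values chosen for the example, not anything taken
from PostgreSQL.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 8192  # illustrative block size; O_DIRECT needs aligned offsets and sizes


def open_direct(path):
    """Open with O_DIRECT where supported, else fall back to buffered I/O."""
    flag = getattr(os, "O_DIRECT", 0)  # O_DIRECT is not defined on every platform
    try:
        return os.open(path, os.O_RDONLY | flag), bool(flag)
    except OSError:
        # Some filesystems reject O_DIRECT at open time.
        return os.open(path, os.O_RDONLY), False


def read_block(fd, blockno):
    """Read one block into a page-aligned buffer, as O_DIRECT requires."""
    buf = mmap.mmap(-1, BLOCK)  # anonymous mappings are page-aligned
    try:
        n = os.preadv(fd, [buf], blockno * BLOCK)
        return buf[:n]
    finally:
        buf.close()


def prefetch(path, blocknos, workers=4):
    """Issue the block reads concurrently; return {blockno: data}."""
    fd, _direct = open_direct(path)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            data = pool.map(lambda b: read_block(fd, b), blocknos)
            return dict(zip(blocknos, data))
    finally:
        os.close(fd)


# Usage: four blocks filled with "A".."D"; fetch blocks 3 and 1 in parallel.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(4):
        f.write(bytes([65 + i]) * BLOCK)
    scratch = f.name
blocks = prefetch(scratch, [3, 1])
os.unlink(scratch)
```

Without the page cache nobody does readahead on the application's behalf,
so deciding which blocks to request, and when, becomes the application's
job.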

I think in the long term there are only two ways to go here. Either a)
we use DirectIO and implement an I/O scheduler in Postgres or b) We
use mmap and use new system calls to give the kernel all the
information Postgres has available to it to control the I/O scheduler.

(a) is by far the lower risk option as it's well trodden and doesn't
depend on other projects to do anything. The most that would be
valuable is if the kernel provided an interface to learn about the
hardware properties such as the raid geometry and queue depth for
different parts of the devices.

(b) is the way more interesting research project though. I don't think
anyone's tried it and the kernel interface to provide the kinds of
information Postgres needs requires a lot of thought. If it's done
right then Postgres wouldn't need a buffer cache manager at all. It
would just mmap the entire database and tell the kernel when it's safe
to flush buffers and let the kernel decide when based on when it's
convenient for the hardware.

I don't think it's tenable in the long run to have Postgres manage
buffers that are then copied to another buffer in memory which are
then flushed to disk based on another scheduler. That it works at all
is a testament to the quality of the code in Postgres and Linux but
it's implausibly inefficient.

--
greg


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 14:54:18
Message-ID: CAGTBQpb8-LR0QBTK_BNJZdZ_gi0ZdSmmhzq9h5Gjnh3pDN=cMA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 5, 2013 at 11:42 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
> (b) is the way more interesting research project though. I don't think
> anyone's tried it and the kernel interface to provide the kinds of
> information Postgres needs requires a lot of thought. If it's done
> right then Postgres wouldn't need a buffer cache manager at all. It
> would just mmap the entire database and tell the kernel when it's safe
> to flush buffers and let the kernel decide when based on when it's
> convenient for the hardware.

That's a bad idea in the current state of affairs. MM files haven't
been designed for that usage, and getting stable performance out of
that will be way too difficult.

systemd's journal is finding that out the hard way. It uses mmap too.

Having the buffer manager mmap buffers into its shared address space,
however, might be an interesting idea to pursue. However, one must not
forget that the kernel has similar scalability issues when the number
of memory mappings increases arbitrarily.


From: Greg Stark <stark(at)mit(dot)edu>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 15:40:58
Message-ID: CAM-w4HM++8QgNjjRZLTCPSKHBF1+awKxGJAd3csKR4Mj-+1uiQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 5, 2013 at 2:54 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> That's a bad idea in the current state of affairs. MM files haven't
> been designed for that usage, and getting stable performance out of
> that will be way too difficult.

I'm talking about long-term goals here. Either of these two routes
would require whole new kernel interfaces to work effectively. Without
those new kernel interfaces our current approach is possibly the best
we can get.

I think the way to use mmap would be to mmap very large chunks,
possibly whole tables. We would need some way to control page flushes
that doesn't involve splitting mappings and can be efficiently
controlled without having the kernel storing arbitrarily large tags on
page tables or searching through all the page tables to mark pages
flushable.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 15:59:49
Message-ID: 16493.1386259189@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Stark <stark(at)mit(dot)edu> writes:
> I think the way to use mmap would be to mmap very large chunks,
> possibly whole tables. We would need some way to control page flushes
> that doesn't involve splitting mappings and can be efficiently
> controlled without having the kernel storing arbitrarily large tags on
> page tables or searching through all the page tables to mark pages
> flushable.

I might be missing something, but AFAICS mmap's API is just fundamentally
wrong for this. The kernel is allowed to write-back a modified mmap'd
page to the underlying file at any time, and will do so if say it's under
memory pressure. You can tell the kernel to sync now, but you can't tell
it *not* to sync. I suppose you are thinking that some wart could be
grafted onto that API to reverse that, but I wouldn't have a lot of
confidence in it. Any VM bug that caused the kernel to sometimes write
too soon would result in nigh unfindable data consistency hazards.
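
The asymmetry is easy to see from user space. A small sketch, assuming a
Unix system and Python 3, with a scratch file standing in for a table
file; Python's mmap.flush() wraps msync(), and the variable names are
illustrative only.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Scratch file of two zeroed pages standing in for a table file.
fd, scratch = tempfile.mkstemp()
os.write(fd, b"\0" * (2 * PAGE))

m = mmap.mmap(fd, 2 * PAGE)   # shared, writable mapping by default on Unix
m[0:5] = b"dirty"             # page 0 is now dirty in memory

# From this point the kernel is free to write page 0 back whenever it
# likes; as noted above, the API offers no call to hold that write back.

# What the process can do is ask for writeback now: flush() wraps msync().
m.flush(0, PAGE)

on_file = os.pread(fd, 5, 0)
m.close()
os.close(fd)
os.unlink(scratch)
```

There is a call for "sync this range now" but no counterpart for "do not
write this page until I say so", which is what write-ahead logging would
need from such an interface.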

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 17:49:48
Message-ID: 52A0BCBC.3000608@agliodbs.com
Lists: pgsql-hackers

On 12/05/2013 07:40 AM, Greg Stark wrote:
> On Thu, Dec 5, 2013 at 2:54 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> That's a bad idea in the current state of affairs. MM files haven't
>> been designed for that usage, and getting stable performance out of
>> that will be way too difficult.
>
> I'm talking about long-term goals here. Either of these two routes
> would require whole new kernel interfaces to work effectively. Without
> those new kernel interfaces our current approach is possibly the best
> we can get.

Well, in the long run we'll probably be using persistent RAM. And the
geeks who manage that have already said that MMAP is a bad interface for
persistent RAM. They haven't defined a good one, though.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Jonathan Corbet <corbet(at)lwn(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 17:54:36
Message-ID: 52A0BDDC.3000503@agliodbs.com
Lists: pgsql-hackers

On 12/05/2013 05:48 AM, Stephen Frost wrote:
> * Peter Geoghegan (pg(at)heroku(dot)com) wrote:
>> On Wed, Dec 4, 2013 at 11:07 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> But you know what? 2.6, overall, still performs better than any kernel
>>> in the 3.X series, at least for Postgres.
>>
>> What about the fseek() scalability issue?
>
> Not to mention that the 2.6 which I suspect you're referring to (RHEL)
> isn't exactly "2.6"..

Actually, I've been able to do 35K TPS on commodity hardware on Ubuntu
10.04. I have yet to go above 15K on any Ubuntu running a 3.X Kernel.
The CPU scheduling on 2.6 just seems to be far better tuned, aside from
the IO issues; at 35K TPS, the CPU workload is evenly distributed across
cores, whereas on 3.X it lurches from core to core like a drunk in a
cathedral. However, the hardware is not identical, and this is on
proprietary, not benchmark, workloads, which is why I haven't published
anything.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Peter Geoghegan <pg(at)heroku(dot)com>, Jonathan Corbet <corbet(at)lwn(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 20:41:44
Message-ID: CA+TgmoYzUHBnqHNeGO0jRvUY0wtySnbYRMw312kpcXrxMtRAEQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 5, 2013 at 12:54 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Actually, I've been able to do 35K TPS on commodity hardware on Ubuntu
> 10.04. I have yet to go above 15K on any Ubuntu running a 3.X Kernel.
> The CPU scheduling on 2.6 just seems to be far better tuned, aside from
> the IO issues; at 35K TPS, the CPU workload is evenly distributed across
> cores, whereas on 3.X it lurches from core to core like a drunk in a
> cathedral.

Do drunks lurch differently in cathedrals than they do elsewhere?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Peter Geoghegan <pg(at)heroku(dot)com>, Jonathan Corbet <corbet(at)lwn(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 20:43:47
Message-ID: 52A0E583.9020009@agliodbs.com
Lists: pgsql-hackers

On 12/05/2013 12:41 PM, Robert Haas wrote:
> On Thu, Dec 5, 2013 at 12:54 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Actually, I've been able to do 35K TPS on commodity hardware on Ubuntu
>> 10.04. I have yet to go above 15K on any Ubuntu running a 3.X Kernel.
>> The CPU scheduling on 2.6 just seems to be far better tuned, aside from
>> the IO issues; at 35K TPS, the CPU workload is evenly distributed across
>> cores, whereas on 3.X it lurches from core to core like a drunk in a
>> cathedral.
>
> Do drunks lurch differently in cathedrals than they do elsewhere?

Yeah, because they lurch from one column to another. It's a visual
metaphor. ;-)

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: bricklen <bricklen(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-05 21:14:09
Message-ID: CAGrpgQ9pxAxTvM8u1miMgHucHgSC_kb1V+VY+Xba0ZFwubSx_Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 5, 2013 at 12:43 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> On 12/05/2013 12:41 PM, Robert Haas wrote:
> > Do drunks lurch differently in cathedrals than they do elsewhere?
>
> Yeah, because they lurch from one column to another.
>

Row by row?


From: Jim Nasby <jim(at)nasby(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-08 21:13:25
Message-ID: 52A4E0F5.1090008@nasby.net
Lists: pgsql-hackers

On 12/5/13 9:59 AM, Tom Lane wrote:
> Greg Stark <stark(at)mit(dot)edu> writes:
>> I think the way to use mmap would be to mmap very large chunks,
>> possibly whole tables. We would need some way to control page flushes
>> that doesn't involve splitting mappings and can be efficiently
>> controlled without having the kernel storing arbitrarily large tags on
>> page tables or searching through all the page tables to mark pages
>> flushable.
>
> I might be missing something, but AFAICS mmap's API is just fundamentally
> wrong for this. The kernel is allowed to write-back a modified mmap'd
> page to the underlying file at any time, and will do so if say it's under
> memory pressure. You can tell the kernel to sync now, but you can't tell
> it *not* to sync. I suppose you are thinking that some wart could be
> grafted onto that API to reverse that, but I wouldn't have a lot of
> confidence in it. Any VM bug that caused the kernel to sometimes write
> too soon would result in nigh unfindable data consistency hazards.

Something else to ponder... a Seagate researcher gave a talk on upcoming hard drive technology at RICON East this spring. The interesting bit is that 1 or 2 generations down the road HDs will start using "shingling": the write head has to be bigger than the read head, so they're going to set it up so that you cannot modify a range of tracks after they've been written. They'll do this by keeping a journal inside the HD. This is somewhat similar to how SSDs work too (you can only erase large pages of data; you can't update individual bytes/sectors/filesystem blocks).

So long-term, random access updates to permanent storage will be less efficient than today. (Of course, non-volatile memory could turn all this on its head...)
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-09 06:04:55
Message-ID: 52A55D87.4040700@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/12/05 23:42), Greg Stark wrote:
> On Thu, Dec 5, 2013 at 8:35 AM, KONDO Mitsumasa
> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> Yes. And using something efficiently DirectIO is more difficult than
>> BufferedIO.
>> If we change write() flag with direct IO in PostgreSQL, it will execute
>> hardest ugly randomIO.
>
> Using DirectIO presumes you're using libaio or threads to implement
> prefetching and asynchronous I/O scheduling.
>
> I think in the long term there are only two ways to go here. Either a)
> we use DirectIO and implement an I/O scheduler in Postgres or b) We
> use mmap and use new system calls to give the kernel all the
> information Postgres has available to it to control the I/O scheduler.
I agree with part of method (b). As others have said, I don't think the mmap
API is meant for controlling I/O. I think posix_fadvise(), sync_file_range() and
fallocate() are an easier way to realize a better I/O scheduler in Postgres. These
system calls cannot cause data corruption, and we can just use the existing
implementation. They affect only performance.

Here is my survey of posix_fadvise() and sync_file_range(). The rules are simple.
# Almost all of this explanation comes from the Linux man pages :-)

* Optimize readahead in the OS [ posix_fadvise() ]
These options are mainly for read performance.

- POSIX_FADV_SEQUENTIAL flag
-> The OS readahead window is enlarged.
- POSIX_FADV_RANDOM flag
-> Readahead is disabled in the OS. This can make use of the file cache
more efficient.
- POSIX_FADV_NORMAL
-> The OS adjusts readahead dynamically for each situation. If you cannot
decide on a disk access strategy, you can select this option. It should
work well in most cases.

* Control dirty or clean buffers in the OS [ posix_fadvise() and sync_file_range() ]
These options control write and read performance in the OS file cache.

- POSIX_FADV_DONTNEED
-> Drops the file cache. If a page is dirty, it is written to disk and then
dropped. If it isn't dirty, it is simply dropped from the OS file cache.
- sync_file_range()
-> If you want to write dirty buffers to disk but keep the file cache in the OS,
you can use this system call. It can also control how much is written.
- POSIX_FADV_NOREUSE
-> If you think the file cache will not be needed again, you can set this
option. The file cache will be dropped soon.
- POSIX_FADV_WILLNEED
-> If you think the file cache will be important, you can set this
option. The file cache will tend to remain in the OS file cache.

That's all.
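
To make the readahead hints above concrete, here is a minimal sketch in Python, which exposes the same call as os.posix_fadvise(). The helper name read_with_hint, the temporary file standing in for a relation file, and the 8 kB chunk size are assumptions for illustration only; nothing here is PostgreSQL code. The point it shows is that the hint is purely advisory: it may change what the kernel reads ahead, but the bytes returned are the same either way.

```python
import os
import tempfile


def read_with_hint(path, advice, chunk_size=8192):
    """Read a whole file after giving the kernel an access-pattern hint.

    `advice` is one of the os.POSIX_FADV_* constants. Returns the number
    of bytes read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            # offset=0, len=0 means "the whole file".
            os.posix_fadvise(fd, 0, 0, advice)
        except OSError:
            pass  # the hint is advisory; carry on without it
        total = 0
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        return total
    finally:
        os.close(fd)


def demo():
    # Hypothetical stand-in for a relation file: 16 blocks of 8 kB.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * 8192 * 16)
        path = f.name
    try:
        seq = read_with_hint(path, os.POSIX_FADV_SEQUENTIAL)
        rnd = read_with_hint(path, os.POSIX_FADV_RANDOM)
        return seq, rnd
    finally:
        os.unlink(path)
```

Both calls in demo() read the same number of bytes; any difference would show up only in timing and in how much the kernel prefetches, which this sketch does not measure.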

The kernel cannot perfectly predict the I/O pattern of each piece of middleware,
so it is optimized with general heuristic algorithms. I think that is the right
approach. However, PostgreSQL can predict the I/O pattern in parts of the planner,
executor and checkpointer, so we had better set the optimal posix_fadvise() flag
or call sync_file_range() before/after executing the ordinary I/O system calls. I
think this would be a good way to control and schedule I/O without unreliable
implementations.
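
A rough sketch of the "hint before/after ordinary I/O" pattern on the write side follows. Python's os module does not expose sync_file_range(), so this sketch swaps in os.fdatasync() as a coarser stand-in (it flushes the whole file, not a range) and then uses POSIX_FADV_DONTNEED on the range just flushed. The function name write_then_release, the flush_every parameter, and the 8 kB block size are assumptions for illustration, not anything from PostgreSQL.

```python
import os
import tempfile

BLOCK = 8192  # assumed block size for this sketch


def write_then_release(path, blocks, flush_every=4):
    """Write blocks sequentially; every `flush_every` blocks, flush them
    to disk and tell the kernel their cached pages are no longer needed.

    Returns the number of periodic flushes performed.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    flushes = 0
    flushed_upto = 0  # byte offset already flushed and released
    try:
        for i, data in enumerate(blocks, start=1):
            os.write(fd, data)
            if i % flush_every == 0:
                # Stand-in for sync_file_range(): flush dirty data first,
                # so the pages are clean when we ask for them to be dropped.
                os.fdatasync(fd)
                try:
                    os.posix_fadvise(fd, flushed_upto,
                                     i * BLOCK - flushed_upto,
                                     os.POSIX_FADV_DONTNEED)
                except OSError:
                    pass  # advisory only
                flushed_upto = i * BLOCK
                flushes += 1
        os.fdatasync(fd)  # make the tail durable as well
    finally:
        os.close(fd)
    return flushes


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        blocks = [bytes([i]) * BLOCK for i in range(10)]
        flushes = write_then_release(path, blocks, flush_every=4)
        return flushes, os.path.getsize(path)
    finally:
        os.unlink(path)
```

Flushing before POSIX_FADV_DONTNEED follows the advice in the Linux man page, which notes that pages not yet written out may be unaffected by the hint. Whether this pattern helps or hurts a given workload is exactly the kind of thing that would need benchmarking.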

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: Jim Nasby <jim(at)nasby(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-10 23:38:35
Message-ID: 52A7A5FB.8080800@nasby.net
Lists: pgsql-hackers

Just to add a data point (and sorry, I can't find where someone was talking about numbers in the thread)...

For a while earlier this year we were running a 3.x kernel and saw a very modest (1-2%) improvement in overall performance. This would be on a server with 512G RAM running ext4.
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-11 00:22:20
Message-ID: CAMkU=1wwhJ9aYxwj53bGFsMeC0HnGtWZgA5UAaJoA7_jAdsYqg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 3, 2013 at 11:39 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:

> On Wed, Dec 4, 2013 at 4:28 AM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
> >>> Can we avoid the Linux kernel problem by simply increasing our shared
> >>> buffer size, say up to 80% of memory?
> >> It will be swap more easier.
> >
> > Is that the case? If the system has not enough memory, the kernel
> > buffer will be used for other purpose, and the kernel cache will not
> > work very well anyway. In my understanding, the problem is, even if
> > there's enough memory, the kernel's cache does not work as expected.
>
>
> Problem is, Postgres relies on a working kernel cache for checkpoints.
> Checkpoint logic would have to be heavily reworked to account for an
> impaired kernel cache.
>

I don't think it would need anything more than a sorted checkpoint. There
are patches around for doing those. I can dig one up again and rebase it
to HEAD if anyone cares. What else would be needed checkpoint-wise?
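
For readers unfamiliar with the idea, a sorted checkpoint can be sketched roughly as follows. This is an illustration only, not any of the patches mentioned above: the function name plan_checkpoint_writes and the (file, block) tuple representation are hypothetical. Note that it sorts by logical file and block number, which is only a proxy for where the blocks actually sit on disk.

```python
def plan_checkpoint_writes(dirty):
    """Order dirty buffers by (file, block) and merge adjacent blocks.

    `dirty` is an iterable of (file, block_number) pairs in arbitrary
    order. Returns a list of (file, first_block, n_blocks) runs, so that
    each file is written in ascending block order and neighbouring blocks
    can be issued as one larger write.
    """
    runs = []
    for file, block in sorted(set(dirty)):
        if runs and runs[-1][0] == file and runs[-1][1] + runs[-1][2] == block:
            # This block directly follows the previous run: extend it.
            f, first, n = runs[-1]
            runs[-1] = (f, first, n + 1)
        else:
            runs.append((file, block, 1))
    return runs


# Example: six dirty buffers scattered over two hypothetical files.
example = plan_checkpoint_writes(
    [("b", 7), ("a", 3), ("a", 1), ("a", 2), ("b", 8), ("a", 9)]
)
```

Here the six scattered buffers collapse into three runs: blocks 1-3 of "a", block 9 of "a", and blocks 7-8 of "b".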

As far as I can tell, the main problem with large shared_buffers is some
poorly characterized locking issues related to either the buffer mapping or
the freelist. And those locking issues seem to trigger even more poorly
characterized scheduling issues in the kernel, at least in some kernels.

But note that if we did this and just cranked up shared_buffers so that it
takes up 95% of RAM, our own ring buffer access strategy would be even worse
for the case which started this thread than the kernel policy being
complained of. That strategy is only acceptable because it normally sits
on top of a substantial cache at the kernel level.

>
> Really, there's no difference between fixing the I/O problems in the
> kernel(s) vs in postgres. The only difference is, in the kernel(s),
> everyone profits, and you've got a huge head start.
>

That assumes the type of problem the kernel faces is the same as the ones a
database does, which I kind of doubt. Even if the changes were absolute
improvements with no trade-offs, we would need to convince a much larger
community of that fact.

>
> Communicating more with the kernel (through posix_fadvise, fallocate,
> aio, iovec, etc...) would probably be good, but it does expose more
> kernel issues. posix_fadvise, for instance, is a double-edged sword
> ATM. I do believe, however, that exposing those issues and prompting a
> fix is far preferable than silently working around them.
>

Getting the kernel to improve those things so PostgreSQL can be changed to
use them more aggressively seems almost hopeless to me. PostgreSQL would
have to be coded to take advantage of the improved versions, while
defending itself from the pre-improved versions. And my understanding is
that different distributions of Linux cherry pick changes to the kernel
back and forth into their code, so just looking at the kernel version
number without also looking at the distribution doesn't mean very much
about whether we have the improved feature or not. Or am I misinformed
about that?

If we can point out to the kernel hackers things that would be
absolute improvements, where PostgreSQL and everything else just magically
start working better if that improvement makes it in, that is great. But
if both systems have to be changed in sync to derive any benefit, how do we
coordinate that?

Cheers,

Jeff


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-11 00:34:44
Message-ID: 20131211003444.GA2935@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-12-04 05:39:23 -0200, Claudio Freire wrote:
> Problem is, Postgres relies on a working kernel cache for checkpoints.
> Checkpoint logic would have to be heavily reworked to account for an
> impaired kernel cache.

I don't think checkpoints are the critical problem with that, they are
nicely in the background and we could easily add sorting.

Rather I think it would be the writeout of a dirty victim buffer when
acquiring a new buffer.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-11 01:09:19
Message-ID: CAGTBQpajeB6w7o2ZpGrfubAR+Hd5gyfUGKyAYp2xuSW-1UX3qQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 9:22 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> Communicating more with the kernel (through posix_fadvise, fallocate,
>> aio, iovec, etc...) would probably be good, but it does expose more
>> kernel issues. posix_fadvise, for instance, is a double-edged sword
>> ATM. I do believe, however, that exposing those issues and prompting a
>> fix is far preferable than silently working around them.
>
>
> Getting the kernel to improve those things so PostgreSQL can be changed to
> use them more aggressively seems almost hopeless to me. PostgreSQL would
> have to be coded to take advantage of the improved versions, while defending
> itself from the pre-improved versions. And my understanding is that
> different distributions of Linux cherry pick changes to the kernel back and
> forth into their code, so just looking at the kernel version number without
> also looking at the distribution doesn't mean very much about whether we
> have the improved feature or not. Or am I misinformed about that?
>
> If we can point things out to the kernel hackers things that would be
> absolute improvements, where PostgreSQL and everything else just magically
> start working better if that improvement makes it in, that is great. Both if
> both systems have to be changed in sync to derive any benefit, how do we
> coordinate that?

Well, posix_fadvise is one such thing. It's a cheap form of AIO used
by more than a few programs that want I/O performance, and in its
current form it is sub-optimal. The fix is rather simple; it just needs a
lot of testing.

But my report on LKML[0] spurred little actual work. So it's possible
this kind of thing will need patches attached.

On Tue, Dec 10, 2013 at 9:34 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-12-04 05:39:23 -0200, Claudio Freire wrote:
>> Problem is, Postgres relies on a working kernel cache for checkpoints.
>> Checkpoint logic would have to be heavily reworked to account for an
>> impaired kernel cache.
>
> I don't think checkpoints are the critical problem with that, they are
> nicely in the background and we could easily add sorting.

Problem is, with DirectIO, they won't be so background.

Currently, checkpoints assume there's a background process catching
all I/O requests, sorting them, and flushing them as optimally as
possible. This makes the checkpoint's slow-paced write pattern
benignly background, since it will be scheduled opportunistically by
the kernel.

If you use DirectIO, however, a write will pretty much physically move
the writing head (when it reaches the queue's head at least) of
rotating media, causing delays on all other pending I/O requests.
That's quite un-backgroundly of it.

A few blocks per second like that can pretty much kill sequential
scans (I've seen that effect happen with fadvise).

[0] https://lkml.org/lkml/2012/11/9/353


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-11 01:25:04
Message-ID: 15481.1386725104@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> On Tue, Dec 3, 2013 at 11:39 PM, Claudio Freire <klaussfreire(at)gmail(dot)com>wrote:
>> Problem is, Postgres relies on a working kernel cache for checkpoints.
>> Checkpoint logic would have to be heavily reworked to account for an
>> impaired kernel cache.

> I don't think it would need anything more than a sorted checkpoint.

Nonsense. We don't have access to the physical-disk-layout information
needed to do reasonable sorting; to say nothing of doing something
intelligent in a multi-spindle environment, or whenever any other I/O
is going on concurrently.

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-11 02:33:46
Message-ID: CAMkU=1xu-1g6a7Kv3TmNXieA15sP+t0v4UXzFgkL=QsCged7Kg@mail.gmail.com
Lists: pgsql-hackers

On Tuesday, December 10, 2013, Tom Lane wrote:

> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> > On Tue, Dec 3, 2013 at 11:39 PM, Claudio Freire <klaussfreire(at)gmail(dot)com>
> > wrote:
> >> Problem is, Postgres relies on a working kernel cache for checkpoints.
> >> Checkpoint logic would have to be heavily reworked to account for an
> >> impaired kernel cache.
>
> > I don't think it would need anything more than a sorted checkpoint.
>
> Nonsense. We don't have access to the physical-disk-layout information
> needed to do reasonable sorting; to say nothing of doing something
> intelligent in a multi-spindle environment, or whenever any other I/O
> is going on concurrently.
>

The proposal I was responding to was simply to increase shared_buffers to
80% of RAM *instead of* implementing directIO.

Cheers,

Jeff


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-11 03:31:56
Message-ID: CAGTBQpZbcTWzDGN99zF+Dt5giTMspBh_8_ukbVV5aeQ82V5UbA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 11:33 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Tuesday, December 10, 2013, Tom Lane wrote:
>>
>> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
>> > On Tue, Dec 3, 2013 at 11:39 PM, Claudio Freire
>> > <klaussfreire(at)gmail(dot)com>wrote:
>> >> Problem is, Postgres relies on a working kernel cache for checkpoints.
>> >> Checkpoint logic would have to be heavily reworked to account for an
>> >> impaired kernel cache.
>>
>> > I don't think it would need anything more than a sorted checkpoint.
>>
>> Nonsense. We don't have access to the physical-disk-layout information
>> needed to do reasonable sorting; to say nothing of doing something
>> intelligent in a multi-spindle environment, or whenever any other I/O
>> is going on concurrently.
>
>
> The proposal I was responding to was simply to increase shared_buffers to
> 80% of RAM *instead of* implementing directIO.

If you do not leave a reasonable amount of RAM, writes will be direct
and synchronous.


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-11 04:00:50
Message-ID: 52A7E372.7060409@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/12/11 10:25), Tom Lane wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
>> On Tue, Dec 3, 2013 at 11:39 PM, Claudio Freire <klaussfreire(at)gmail(dot)com>wrote:
>>> Problem is, Postgres relies on a working kernel cache for checkpoints.
>>> Checkpoint logic would have to be heavily reworked to account for an
>>> impaired kernel cache.
>
>> I don't think it would need anything more than a sorted checkpoint.
>
> Nonsense. We don't have access to the physical-disk-layout information
> needed to do reasonable sorting;
The OS knows the physical disk layout, as shown below.
> [mitsu-ko(at)ssd ~]$ filefrag -v .bashrc
> Filesystem type is: ef53
> File size of .bashrc is 124 (1 block, blocksize 4096)
>  ext logical physical expected length flags
>    0       0 15761410               1 eof
> .bashrc: 1 extent found
If we need this information, we can get the physical disk layout at any time.

> to say nothing of doing something
> intelligent in a multi-spindle environment, or whenever any other I/O
> is going on concurrently.
The I/O scheduler in the OS knows this best. So I think buffered I/O is faster
than direct I/O in general situations.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center