Re: fstat vs. lseek

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: fstat vs. lseek
Date: 2011-08-08 14:30:38
Message-ID: CA+TgmoawRfpan35wzvgHkSJ0+i-W=VkJpKnRxK2kTDR+HsanWA@mail.gmail.com

In response to my blog post on lseek contention, someone posted a
comment wherein they proposed using fstat() rather than lseek() to get
file sizes.

http://rhaas.blogspot.com/2011/08/linux-and-glibc-scalability.html

I tried that on a RHEL 6.1 machine with 64 cores running
2.6.32-131.6.1.el6.x86_64, and it's pretty clear that the locking
characteristics are completely different. At 1 client, the lseek
method appears to be slightly faster, although it's not beyond belief
that the difference could be in the noise. Above 40 cores, however,
the fstat method thumps the lseek method up one side and down the
other.
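
For readers unfamiliar with the two calls, here is a minimal sketch of the two ways to ask the kernel for a file's size. It is not the attached filesize.patch; the function names (size_via_lseek, size_via_fstat, sizes_agree) and the temp-file path are made up for illustration. One difference worth noting is that lseek() moves the descriptor's file position as a side effect, while fstat() does not.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* File size via lseek(): seek to end-of-file and return the resulting
 * offset.  Side effect: the descriptor's file position is now at EOF.
 * Returns -1 on error. */
off_t
size_via_lseek(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/* File size via fstat(): read st_size from the file's metadata.
 * The file position is left untouched.  Returns -1 on error. */
off_t
size_via_fstat(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return st.st_size;
}

/* Hypothetical self-check: write nbytes of zeros to an unlinked temp
 * file and return 1 if both methods report exactly nbytes, else 0. */
int
sizes_agree(size_t nbytes)
{
    char  path[] = "/tmp/fsizeXXXXXX";   /* illustrative location only */
    int   fd = mkstemp(path);
    char *buf;
    int   ok;

    if (fd < 0)
        return 0;
    unlink(path);                        /* file vanishes when fd closes */

    buf = calloc(1, nbytes ? nbytes : 1);
    if (buf == NULL)
    {
        close(fd);
        return 0;
    }

    ok = write(fd, buf, nbytes) == (ssize_t) nbytes
        && size_via_lseek(fd) == (off_t) nbytes
        && size_via_fstat(fd) == (off_t) nbytes;

    free(buf);
    close(fd);
    return ok;
}
```

Both functions return the same number for a regular file; what differs, per the results above, is how the kernel serializes concurrent callers.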

Patch and test results are attached. Test runs are 5-minute runs with
scale factor 100 and shared_buffers=8GB.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
filesize.patch application/octet-stream 1.9 KB
pgbench -S at 32 cores - fstat patch comparison.ods application/vnd.oasis.opendocument.spreadsheet 12.9 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fstat vs. lseek
Date: 2011-08-08 14:45:22
Message-ID: 25420.1312814722@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> In response to my blog post on lseek contention, someone posted a
> comment wherein they proposed using fstat() rather than lseek() to get
> file sizes.
> Patch and test results are attached. Test runs are 5-minute runs with
> scale factor 100 and shared_buffers=8GB.

> Thoughts?

I'm a bit concerned by the fact that you've only tested this on one
operating system, and thus the performance characteristics could be
quite different elsewhere. The comment in mdextend also points out
a way in which this might not be a win --- did you test anything besides
read-only scenarios?

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fstat vs. lseek
Date: 2011-08-08 14:49:01
Message-ID: 16623238.t9h9Udve74@alap2

On Monday, August 08, 2011 10:30:38 Robert Haas wrote:
> In response to my blog post on lseek contention, someone posted a
> comment wherein they proposed using fstat() rather than lseek() to get
> file sizes.
>
> Thoughts?
I don't think it's a good idea to replace lseek with fstat in the long run. The
likelihood that the lockless generic_file_llseek will get included seems rather
high to me. In contrast, fstat will always be more expensive than that,
as it's going through a security check and then the fs' getattr implementation
(which actually takes a lock on some fs).
On the other hand, fstat is currently lockless if the security subsystem is compiled
out (i.e. no selinux et al) for some common fs (ext3/4, xfs).

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fstat vs. lseek
Date: 2011-08-08 15:33:29
Message-ID: CA+TgmoaA8WwdZRtNNK-RrjxQ2Anww+YgfENMfqyYPD4xhGnp7A@mail.gmail.com

On Mon, Aug 8, 2011 at 10:45 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I'm a bit concerned by the fact that you've only tested this on one
> operating system, and thus the performance characteristics could be
> quite different elsewhere.  The comment in mdextend also points out
> a way in which this might not be a win --- did you test anything besides
> read-only scenarios?

Nope.

On Mon, Aug 8, 2011 at 10:49 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I don't think it's a good idea to replace lseek with fstat in the long run. The
> likelihood that the lockless generic_file_llseek will get included seems rather
> high to me. In contrast, fstat will always be more expensive than that,
> as it's going through a security check and then the fs' getattr implementation
> (which actually takes a lock on some fs).

*scratches head* I understand that stat() would need a security
check, but why would fstat()?

I think both of you raise good points. I wasn't too enthusiastic
about this approach either. It's not very appealing to adopt an
approach where the right performance decision is going to depend on
operating system, file system, kernel version, core count, and
workload. We could add a GUC, but it would be pretty annoying to have
a setting that won't matter for most people at all, except
occasionally when it makes a huge difference.

I wasn't aware that there was any current activity around this on the Linux
side. But Andres' comments made me Google it again, and now I see
this:

https://lkml.org/lkml/2011/6/16/800

Andres, any idea what the status of that patch is? I'm not clear on
how Linux works in terms of things getting upstreamed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Kohei Kaigai <Kohei(dot)Kaigai(at)emea(dot)nec(dot)com>
Subject: Re: fstat vs. lseek
Date: 2011-08-08 17:10:05
Message-ID: 2366521.2k2cV9r50e@alap2

On Monday, August 08, 2011 11:33:29 Robert Haas wrote:

> On Mon, Aug 8, 2011 at 10:49 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I don't think it's a good idea to replace lseek with fstat in the long
> > run. The likelihood that the lockless generic_file_llseek will get
> > included seems rather high to me. In contrast, fstat will always
> > be more expensive than that, as it's going through a security check and
> > then the fs' getattr implementation (which actually takes a lock on
> > some fs).
> *scratches head* I understand that stat() would need a security
> check, but why would fstat()?
That I am not totally sure of either. I guess Kaigai might know more about
that.
I guess it might be that a forked process possibly is not allowed anymore to
access the information from an inherited file handle? Also I think a process
can change its permissions during runtime.

> I think both of you raise good points. I wasn't too enthusiastic
> about this approach either. It's not very appealing to adopt an
> approach where the right performance decision is going to depend on
> operating system, file system, kernel version, core count, and
> workload. We could add a GUC, but it would be pretty annoying to have
> a setting that won't matter for most people at all, except
> occasionally when it makes a huge difference.
>
> I wasn't aware that there was any current activity around this on the Linux
> side. But Andres' comments made me Google it again, and now I see
> this:
>
> https://lkml.org/lkml/2011/6/16/800
>
> Andres, any idea what the status of that patch is? I'm not clear on
> how Linux works in terms of things getting upstreamed.
There doesn't seem to have been any activity to include it in 3.1. The merge
window for 3.1 just ended. The next one will open for about a week after the
release.
It's also not yet included in linux-next, which is a "preview" for the currently
worked on release + 1. A release takes roughly 3 months.

For upstreaming, somebody needs to be persistent enough to convince one of the
maintainers of the particular area to include the code so that Linus can then
pull it.
I guess citing your numbers would go a long way in that direction. Naturally
it would be even better to include results with the patch applied.
The largest machine I can reboot often enough to test such a thing has only two
sockets (4-core E5520). I guess you cannot reboot your loaned machine with a
new kernel easily?

Greetings,
Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Kohei Kaigai <Kohei(dot)Kaigai(at)emea(dot)nec(dot)com>
Subject: Re: fstat vs. lseek
Date: 2011-08-08 17:19:13
Message-ID: CA+TgmobqTwOpMPUYx1P698S=+7HY+JbXnpHvjFTop3ARRrVLJQ@mail.gmail.com

On Mon, Aug 8, 2011 at 1:10 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> There doesn't seem to have been any activity to include it in 3.1. The merge
> window for 3.1 just ended. The next one will open for about a week after the
> release.
> It's also not yet included in linux-next, which is a "preview" for the currently
> worked on release + 1. A release takes roughly 3 months.

OK. If it doesn't get into Linux 3.2 we had better start thinking
hard about a workaround on our side. I am not too concerned about
people hitting this with PostgreSQL 9.1 or prior, because you'd
basically need a workload targeted to exercise the problem, which
workload is not that similar to the way people actually do things in
real life. However, in PostgreSQL 9.2devel, it's going to be much
more of a real-world problem, so I'd hate to wait until after our
feature freeze and then decide we've got a problem we have to fix.

> For upstreaming, somebody needs to be persistent enough to convince one of the
> maintainers of the particular area to include the code so that Linus can then
> pull it.
> I guess citing your numbers would go a long way in that direction. Naturally
> it would be even better to include results with the patch applied.
> The largest machine I can reboot often enough to test such a thing has only two
> sockets (4-core E5520). I guess you cannot reboot your loaned machine with a
> new kernel easily?

Not really. I do have root access to a 64-core box at the moment, and
I could probably get permission to reboot it, but if it didn't come
back on-line that would be awkward.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Kohei Kaigai <Kohei(dot)Kaigai(at)emea(dot)nec(dot)com>
Subject: Re: fstat vs. lseek
Date: 2011-08-08 17:29:27
Message-ID: 28447.1312824567@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Not really. I do have root access to a 64-core box at the moment, and
> I could probably get permission to reboot it, but if it didn't come
> back on-line that would be awkward.

Red Hat has some test hardware that I can use (... pokes around ...)
Hmm, this one looks promising:

Memory      NUMA Nodes
64348 MB    4

Cpu
Vendor        GenuineIntel
Model Name    Intel(R) Xeon(R) CPU E7- 4860 @ 2.27GHz
Family        6
Model         47
Stepping      2
Speed         1064.0
Processors    80
Cores         40
Sockets       4
Hyper         True

If you can wrap something up to the point where someone else can
run it, I'll give it a shot.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Kohei Kaigai <Kohei(dot)Kaigai(at)emea(dot)nec(dot)com>
Subject: Re: fstat vs. lseek
Date: 2011-08-08 17:31:35
Message-ID: 3363559.QMgrzrjf7W@alap2

On Monday, August 08, 2011 13:19:13 Robert Haas wrote:
> On Mon, Aug 8, 2011 at 1:10 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > There doesn't seem to have been any activity to include it in 3.1. The
> > merge window for 3.1 just ended. The next one will open for about a
> > week after the release.
> > It's also not yet included in linux-next, which is a "preview" for the
> > currently worked on release + 1. A release takes roughly 3 months.
>
> OK. If it doesn't get into Linux 3.2 we had better start thinking
> hard about a workaround on our side.
If it's OK, I will write a mail to lkml referencing this thread and your numbers
inline (with attribution, obviously).
I don't think it will be that hard to convince them. But I constantly surprise
myself with my naivety, so I may be wrong.

> > The largest machine I can reboot often enough to test such a thing has only
> > two sockets (4-core E5520). I guess you cannot reboot your loaned machine
> > with a new kernel easily?
> Not really. I do have root access to a 64-core box at the moment, and
> I could probably get permission to reboot it, but if it didn't come
> back on-line that would be awkward.
As I feared. Any chance that the person lending you the machine can give you a
hand?
Although, after reading the code, I don't know how that could be, it would be
disappointing to wait for 3.2 and for the llseek fixes to appear in $distribution,
just to notice fstat is still faster for $unobvious_reason...

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Kohei Kaigai <Kohei(dot)Kaigai(at)emea(dot)nec(dot)com>
Subject: Re: fstat vs. lseek
Date: 2011-08-08 17:50:38
Message-ID: CA+TgmoYAo=btCyKmJSgeFnMsb9ekbqAJvs3jhoJFEBZ_KS_xEw@mail.gmail.com

On Mon, Aug 8, 2011 at 1:31 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> If it's OK, I will write a mail to lkml referencing this thread and your numbers
> inline (with attribution, obviously).

That would be great. Please go ahead.

> I don't think it will be that hard to convince them. But I constantly surprise
> myself with my naivety, so I may be wrong.

Heh, heh, open source is fun.

>> > The largest machine I can reboot often enough to test such a thing has only
>> > two sockets (4-core E5520). I guess you cannot reboot your loaned machine
>> > with a new kernel easily?
>> Not really.  I do have root access to a 64-core box at the moment, and
>> I could probably get permission to reboot it, but if it didn't come
>> back on-line that would be awkward.
> As I feared. Any chance that the person lending you the machine can give you a
> hand?

Uh, maybe, but considering my relative inexperience in compiling the
Linux kernel, I'd be a little worried about having to iterate too many
times.

> Although, after reading the code, I don't know how that could be, it would be
> disappointing to wait for 3.2 and for the llseek fixes to appear in $distribution,
> just to notice fstat is still faster for $unobvious_reason...

Well, the good thing here is that we are really only concerned with
gross effects. It's pretty obvious from the numbers I posted upthread
that the problem is related to lock contention. If that gets fixed,
and lseek is still 20% slower under some set of circumstances, it's
not clear that we're really gonna care. I mean, maybe it would be
nice to avoid going to the kernel at all here just so we're immune to
possible inefficiencies in other operating systems (it would be nice
if someone could repeat these tests on a big SMP box running Windows
and/or one of the BSD systems) and to save the overhead of a system call,
but those effects are pretty tiny. We could spend a lot of time
optimizing other things before that one percolated up to the top of
the heap, at least based on what I've seen so far.
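
For anyone who wants to look for the gross effect on another operating system without a full pgbench setup, the following is a rough synthetic sketch, not the test used upthread. It forks a number of workers that all hammer the same inherited descriptor with either lseek(SEEK_END) or fstat() and reports elapsed wall-clock time. The function name time_size_probes, the enum, and the temp-file path are hypothetical, and whether any difference shows up will depend on kernel version, file system, and core count.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Which system call each worker issues in its loop. */
enum probe { PROBE_LSEEK, PROBE_FSTAT };

/*
 * Fork nworkers children that each issue `iters` size probes against the
 * same inherited file descriptor.  Returns elapsed wall-clock seconds,
 * or -1.0 if setup, fork, or any probe failed.
 */
double
time_size_probes(enum probe which, int nworkers, long iters)
{
    char            path[] = "/tmp/probeXXXXXX";  /* illustrative only */
    int             fd = mkstemp(path);
    int             spawned = 0;
    int             failed = 0;
    struct timespec t0, t1;

    if (fd < 0)
        return -1.0;
    unlink(path);

    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < nworkers; i++)
    {
        pid_t pid = fork();

        if (pid < 0)
        {
            failed = 1;
            break;
        }
        if (pid == 0)
        {
            /* Child: issue the chosen call in a tight loop, then exit. */
            struct stat st;

            for (long n = 0; n < iters; n++)
            {
                if (which == PROBE_LSEEK)
                {
                    if (lseek(fd, 0, SEEK_END) < 0)
                        _exit(1);
                }
                else if (fstat(fd, &st) < 0)
                    _exit(1);
            }
            _exit(0);
        }
        spawned++;
    }

    /* Parent: reap exactly the children we started. */
    for (int i = 0; i < spawned; i++)
    {
        int status;

        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (failed)
        return -1.0;
    return (double) (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Comparing time_size_probes(PROBE_LSEEK, n, iters) against time_size_probes(PROBE_FSTAT, n, iters) for increasing n should show whether one call degrades with concurrency on a given system. Fork overhead is included in the timing, so iters needs to be large enough to dominate it.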

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andrea Suisani <sickpig(at)opinioni(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Kohei Kaigai <Kohei(dot)Kaigai(at)emea(dot)nec(dot)com>
Subject: Re: fstat vs. lseek
Date: 2011-09-16 13:19:07
Message-ID: 4E734CCB.6090802@opinioni.net

hi

On 08/08/2011 07:50 PM, Robert Haas wrote:
> On Mon, Aug 8, 2011 at 1:31 PM, Andres Freund<andres(at)anarazel(dot)de> wrote:
>> If it's OK, I will write a mail to lkml referencing this thread and your numbers
>> inline (with attribution, obviously).
>
> That would be great. Please go ahead.

I've just stumbled across this thread on lkml [1]
"Improve lseek scalability v3".

and I thought to ping pgsql hackers list
just in case, more to the point they're
asking "are there any real workloads which care
[Make generic lseek lockless safe]"

maybe I've got it wrong but it seems somewhat
related to what has been discussed here and
also in Robert Haas's "Linux and glibc Scalability"
blog post [2].

[cut]

Andrea

[1] https://lkml.org/lkml/2011/9/15/399
[2] http://rhaas.blogspot.com/2011/08/linux-and-glibc-scalability.html


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Andrea Suisani <sickpig(at)opinioni(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kohei Kaigai <Kohei(dot)Kaigai(at)emea(dot)nec(dot)com>
Subject: Re: fstat vs. lseek
Date: 2011-09-16 13:30:30
Message-ID: 201109161530.30603.andres@anarazel.de

On Friday 16 Sep 2011 15:19:07 Andrea Suisani wrote:
> hi
>
> On 08/08/2011 07:50 PM, Robert Haas wrote:
> > On Mon, Aug 8, 2011 at 1:31 PM, Andres Freund<andres(at)anarazel(dot)de> wrote:
> >> If it's OK, I will write a mail to lkml referencing this thread and your
> >> numbers inline (with attribution, obviously).
> >
> > That would be great. Please go ahead.
>
> I've just stumbled across this thread on lkml [1]
> "Improve lseek scalability v3".
>
> and I thought to ping pgsql hackers list
> just in case, more to the point they're
> asking "are there any real workloads which care
> [Make generic lseek lockless safe]"
I wrote them a mail some time ago (some weeks) regarding an earlier version of
the patch... Can't find it right now though.

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: fstat vs. lseek
Date: 2011-10-28 19:33:16
Message-ID: 201110282133.18125.andres@anarazel.de

Hi All,

The lseek patches just got included in Linus' tree.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fstat vs. lseek
Date: 2011-10-28 19:40:51
Message-ID: CA+TgmoamXb9XLcGZvf=cG61+xbNhmDjzDmn4fq=u580A2hbUfw@mail.gmail.com

On Fri, Oct 28, 2011 at 3:33 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> The lseek patches just got included in Linus' tree.

Excellent, thanks for the update!

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=ef3d0fd27e90f67e35da516dafc1482c82939a60

So I guess this will be in Linux 3.2?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fstat vs. lseek
Date: 2011-10-28 20:22:01
Message-ID: 201110282222.01614.andres@anarazel.de

Hi,

On Friday, October 28, 2011 09:40:51 PM Robert Haas wrote:
> On Fri, Oct 28, 2011 at 3:33 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > The lseek patches just got included in Linus' tree.
>
> Excellent, thanks for the update!
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=ef3d0fd27e90f67e35da516dafc1482c82939a60
>
> So I guess this will be in Linux 3.2?
Unless they get reverted for some reason, yes.

Andres