Re: Improve lseek scalability v3

Lists: pgsql-hackers
From: Matthew Wilcox <matthew(at)wil(dot)cx>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Andi Kleen <andi(at)firstfloor(dot)org>, viro(at)zeniv(dot)linux(dot)org(dot)uk, linux-fsdevel(at)vger(dot)kernel(dot)org, linux-kernel(at)vger(dot)kernel(dot)org, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improve lseek scalability v3
Date: 2011-09-16 15:36:20
Message-ID: 20110916153620.GA9913@parisc-linux.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 16, 2011 at 04:16:49PM +0200, Andres Freund wrote:
> I sent an email containing benchmarks from Robert Haas regarding the Subject.
> Looking at lkml.org I can't see it right now, Will recheck when I am at home.
>
> He replaced lseek(SEEK_END) with fstat() and got speedups up to 8.7 times the
> lseek performance.
> The workload was 64 clients hammering postgres with a simple readonly workload
> (pgbench -S).

Yay! Data!

> For reference see the thread in the postgres archives which also links to
> performance data: http://archives.postgresql.org/message-
> id/CA+TgmoawRfpan35wzvgHkSJ0+i-W=VkJpKnRxK2kTDR+HsanWA(at)mail(dot)gmail(dot)com

So both fstat and lseek do more work than postgres wants. lseek modifies
the file pointer while fstat copies all kinds of unnecessary information
into userspace. I imagine this is the source of the slowdown seen in
the 1-client case.

There have been various proposals to make the amount of information returned
by fstat limited to the 'cheap' (for various definitions of 'cheap') fields.

I'd like to dig into the requirement for knowing the file size a little
better. According to the blog entry it's used for "the query planner".
Does the query planner need to know the exact number of bytes in the file,
or is it after an order-of-magnitude? Or to-the-nearest-gigabyte?

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."


From: Andres Freund <andres(at)anarazel(dot)de>
To: Matthew Wilcox <matthew(at)wil(dot)cx>
Cc: Andi Kleen <andi(at)firstfloor(dot)org>, viro(at)zeniv(dot)linux(dot)org(dot)uk, linux-fsdevel(at)vger(dot)kernel(dot)org, linux-kernel(at)vger(dot)kernel(dot)org, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improve lseek scalability v3
Date: 2011-09-16 17:27:33
Message-ID: 201109161927.34472.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,
On Friday 16 Sep 2011 17:36:20 Matthew Wilcox wrote:
> On Fri, Sep 16, 2011 at 04:16:49PM +0200, Andres Freund wrote:
> > I sent an email containing benchmarks from Robert Haas regarding the
> > Subject. Looking at lkml.org I can't see it right now, Will recheck when
> > I am at home.
> >
> > He replaced lseek(SEEK_END) with fstat() and got speedups up to 8.7 times
> > the lseek performance.
> > The workload was 64 clients hammering postgres with a simple readonly
> > workload (pgbench -S).
> Yay! Data!

> > For reference see the thread in the postgres archives which also links to
> > performance data: http://archives.postgresql.org/message-
> > id/CA+TgmoawRfpan35wzvgHkSJ0+i-W=VkJpKnRxK2kTDR+HsanWA(at)mail(dot)gmail(dot)com
> So both fstat and lseek do more work than postgres wants. lseek modifies
> the file pointer while fstat copies all kinds of unnecessary information
> into userspace. I imagine this is the source of the slowdown seen in
> the 1-client case.
Yes, that was my theory as well.

> I'd like to dig into the requirement for knowing the file size a little
> better. According to the blog entry it's used for "the query planner".
Its used for multiple things - one of which is the query planner.
The query planner needs to know how many tuples a table has to produce a
sensible plan. For that is has stats which tell 1. how big is the table 2. how
many tuples does the table have. Those statistics are only updated every now
and then though.
So it uses those old stats to check how many tuples are normally stored on a
page and then uses that to extrapolate the number of tuples from the current
nr of pages (which is computed by lseek(SEEK_END) over the 1GB segements of a
table).

I am not sure how interested you are on the relevant postgres internals?

> Does the query planner need to know the exact number of bytes in the file,
> or is it after an order-of-magnitude? Or to-the-nearest-gigabyte?
It depends on where the information is used. For some of the uses it needs to
be exact (the assumed size is rechecked after acquiring a lock preventing
extension) at other places I guess it would be ok if the accuracy got lower
with bigger files (those files won't ever get bigger than 1GB).
But I have a hard time seeing an implementation where the approximate size
would be faster to get than just the filesize?

Andres


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Matthew Wilcox <matthew(at)wil(dot)cx>, Andi Kleen <andi(at)firstfloor(dot)org>, viro <viro(at)zeniv(dot)linux(dot)org(dot)uk>, linux-fsdevel <linux-fsdevel(at)vger(dot)kernel(dot)org>, linux-kernel <linux-kernel(at)vger(dot)kernel(dot)org>, robertmhaas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve lseek scalability v3
Date: 2011-09-16 17:39:50
Message-ID: 1316194619-sup-2946@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Andres Freund's message of vie sep 16 14:27:33 -0300 2011:
> Hi,
> On Friday 16 Sep 2011 17:36:20 Matthew Wilcox wrote:

> > Does the query planner need to know the exact number of bytes in the file,
> > or is it after an order-of-magnitude? Or to-the-nearest-gigabyte?
> It depends on where the information is used. For some of the uses it needs to
> be exact (the assumed size is rechecked after acquiring a lock preventing
> extension) at other places I guess it would be ok if the accuracy got lower
> with bigger files (those files won't ever get bigger than 1GB).

One other thing we're interested in is portability. I mean, even if
Linux were to introduce a new hypothetical syscall that was able to
return the file size at a ridiculously low cost, we probably wouldn't
use it because it'd be Linux-specific. So an improvement of lseek()
seems to be the best option.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Andi Kleen <andi(at)firstfloor(dot)org>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Matthew Wilcox <matthew(at)wil(dot)cx>, Andi Kleen <andi(at)firstfloor(dot)org>, viro <viro(at)zeniv(dot)linux(dot)org(dot)uk>, linux-fsdevel <linux-fsdevel(at)vger(dot)kernel(dot)org>, linux-kernel <linux-kernel(at)vger(dot)kernel(dot)org>, robertmhaas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve lseek scalability v3
Date: 2011-09-16 17:50:27
Message-ID: 20110916175027.GF7761@one.firstfloor.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> One other thing we're interested in is portability. I mean, even if
> Linux were to introduce a new hypothetical syscall that was able to
> return the file size at a ridiculously low cost, we probably wouldn't
> use it because it'd be Linux-specific. So an improvement of lseek()
> seems to be the best option.

Fully agreed. It doesn't make any sense at all to implement special
syscalls just to workaround a basic system call not scaling.

-Andi


From: Benjamin LaHaise <bcrl(at)kvack(dot)org>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Matthew Wilcox <matthew(at)wil(dot)cx>, Andi Kleen <andi(at)firstfloor(dot)org>, viro(at)zeniv(dot)linux(dot)org(dot)uk, linux-fsdevel(at)vger(dot)kernel(dot)org, linux-kernel(at)vger(dot)kernel(dot)org, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improve lseek scalability v3
Date: 2011-09-16 20:08:17
Message-ID: 20110916200817.GD28519@kvack.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 16, 2011 at 07:27:33PM +0200, Andres Freund wrote:
> many tuples does the table have. Those statistics are only updated every now
> and then though.
> So it uses those old stats to check how many tuples are normally stored on a
> page and then uses that to extrapolate the number of tuples from the current
> nr of pages (which is computed by lseek(SEEK_END) over the 1GB segements of a
> table).
>
> I am not sure how interested you are on the relevant postgres internals?

For such tables, can't Postgres track the size of the file internally? I'm
assuming it's keeping file descriptors open on the tables it manages, in
which case when it writes to a file to extend it, the internally stored size
could be updated. Not making a syscall at all would scale far better than
even a modified lseek() will perform.

Granted, I've not looked at the Postgres code at all.

-ben


From: Andres Freund <andres(at)anarazel(dot)de>
To: Benjamin LaHaise <bcrl(at)kvack(dot)org>
Cc: Matthew Wilcox <matthew(at)wil(dot)cx>, Andi Kleen <andi(at)firstfloor(dot)org>, viro(at)zeniv(dot)linux(dot)org(dot)uk, linux-fsdevel(at)vger(dot)kernel(dot)org, linux-kernel(at)vger(dot)kernel(dot)org, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improve lseek scalability v3
Date: 2011-09-16 21:02:38
Message-ID: 201109162302.38780.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday, September 16, 2011 10:08:17 PM Benjamin LaHaise wrote:
> On Fri, Sep 16, 2011 at 07:27:33PM +0200, Andres Freund wrote:
> > many tuples does the table have. Those statistics are only updated every
> > now and then though.
> > So it uses those old stats to check how many tuples are normally stored
> > on a page and then uses that to extrapolate the number of tuples from
> > the current nr of pages (which is computed by lseek(SEEK_END) over the
> > 1GB segements of a table).
> >
> > I am not sure how interested you are on the relevant postgres internals?
>
> For such tables, can't Postgres track the size of the file internally? I'm
> assuming it's keeping file descriptors open on the tables it manages, in
> which case when it writes to a file to extend it, the internally stored
> size could be updated. Not making a syscall at all would scale far better
> than even a modified lseek() will perform.
Yes, it tracks the fds internally. The problem is that postgres is process
based so those tables are not reachable by all processes. It could start
tracking those in shared memory but the synchronization overhead for that
would likely be more expensive than the syscall overhead (Given that the
fdsets are possibly (and realistically) disjunct between the individual
backends you would have to reserve enough shared memory for a fully seperate
fds between each process... Which would complicate efficient lookup).

Also with fstat() instead of lseek() there was no bottleneck anymore, so I
don't think the benefits would warrant that.

Greetings,

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Benjamin LaHaise <bcrl(at)kvack(dot)org>, Matthew Wilcox <matthew(at)wil(dot)cx>, Andi Kleen <andi(at)firstfloor(dot)org>, viro(at)zeniv(dot)linux(dot)org(dot)uk, linux-fsdevel(at)vger(dot)kernel(dot)org, linux-kernel(at)vger(dot)kernel(dot)org, robertmhaas(at)gmail(dot)com
Subject: Re: Improve lseek scalability v3
Date: 2011-09-16 21:05:18
Message-ID: 201109162305.18696.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday, September 16, 2011 11:02:38 PM Andres Freund wrote:
> Also with fstat() instead of lseek() there was no bottleneck anymore, so I
> don't think the benefits would warrant that.
At least thats what I observed on a 4 x 6 machine without the patch applied
(can't reboot it). That shouldn't be concurrency relevant so...

Andres


From: Greg Stark <stark(at)mit(dot)edu>
To: Benjamin LaHaise <bcrl(at)kvack(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Matthew Wilcox <matthew(at)wil(dot)cx>, Andi Kleen <andi(at)firstfloor(dot)org>, viro(at)zeniv(dot)linux(dot)org(dot)uk, linux-fsdevel(at)vger(dot)kernel(dot)org, linux-kernel(at)vger(dot)kernel(dot)org, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improve lseek scalability v3
Date: 2011-09-16 22:44:59
Message-ID: CAM-w4HNAJVRx5Vj87hXjL9JDjwbUoDiso_NZcfomk7wpd2zshw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 16, 2011 at 9:08 PM, Benjamin LaHaise <bcrl(at)kvack(dot)org> wrote:
> For such tables, can't Postgres track the size of the file internally?  I'm
> assuming it's keeping file descriptors open on the tables it manages, in
> which case when it writes to a file to extend it, the internally stored size
> could be updated.  Not making a syscall at all would scale far better than
> even a modified lseek() will perform.

There's no hardwired limit on how many tables you can have in a
database, it's not limited by the number of file descriptors. Postgres
would have to keep some kind of LRU for recently opened files and
their sizes or something like that. There would probably still be a
lot of lseeks/fstats going on.

Generally keeping a Postgres cached value for the size would then have
a reliability issue. It's much safer to have a single authoritative
value -- the actual length of the file -- than have the same value
stored in two locations and then need to worry about them getting out
of sync. If a write fails when extending the file due to a filesystem
running out of space then Postgres might not know how to update its
internal cached state accurately for example.

There's no question it could be done but it's not clear it would
necessarily be much faster than a lock-free lseek/fstat.

On Fri, Sep 16, 2011 at 6:27 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> It depends on where the information is used. For some of the uses it needs to
> be exact (the assumed size is rechecked after acquiring a lock preventing
> extension)

Fwiw this might give the wrong impression. I don't believe scans
acquire a lock preventing extension -- that is another process can be
concurrently extending the file at the same time as the scan is
proceeding. The scan only locks out truncation (vacuum). Any blocks
added by another process are ignored by the scan because they can only
contain records invisible to that transaction. This does depend on the
lseek/fstat being done after the transaction snapshot is taken which
is possibly "rechecking" the value taken by the query planner but
they're really two independent things.

--
greg


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Benjamin LaHaise <bcrl(at)kvack(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Matthew Wilcox <matthew(at)wil(dot)cx>, Andi Kleen <andi(at)firstfloor(dot)org>, viro(at)zeniv(dot)linux(dot)org(dot)uk, linux-fsdevel(at)vger(dot)kernel(dot)org, linux-kernel(at)vger(dot)kernel(dot)org, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improve lseek scalability v3
Date: 2011-09-19 12:31:00
Message-ID: 20110919123100.GJ12765@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Benjamin LaHaise (bcrl(at)kvack(dot)org) wrote:
> For such tables, can't Postgres track the size of the file internally? I'm
> assuming it's keeping file descriptors open on the tables it manages, in
> which case when it writes to a file to extend it, the internally stored size
> could be updated. Not making a syscall at all would scale far better than
> even a modified lseek() will perform.

We'd have to have it in shared memory and have a lock around it, it
wouldn't be cheap at all.

Thanks,

Stephen


From: Matthew Wilcox <matthew(at)wil(dot)cx>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Benjamin LaHaise <bcrl(at)kvack(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Andi Kleen <andi(at)firstfloor(dot)org>, viro(at)zeniv(dot)linux(dot)org(dot)uk, linux-fsdevel(at)vger(dot)kernel(dot)org, linux-kernel(at)vger(dot)kernel(dot)org, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improve lseek scalability v3
Date: 2011-09-19 13:25:00
Message-ID: 20110919132500.GA16740@parisc-linux.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 19, 2011 at 08:31:00AM -0400, Stephen Frost wrote:
> * Benjamin LaHaise (bcrl(at)kvack(dot)org) wrote:
> > For such tables, can't Postgres track the size of the file internally? I'm
> > assuming it's keeping file descriptors open on the tables it manages, in
> > which case when it writes to a file to extend it, the internally stored size
> > could be updated. Not making a syscall at all would scale far better than
> > even a modified lseek() will perform.
>
> We'd have to have it in shared memory and have a lock around it, it
> wouldn't be cheap at all.

Yep, that makes perfect sense. After all, the kernel does basically the
same thing to maintain this information; why should we have userspace
duplicating the same infrastructure?

I must admit, I'd never heard of this usage of lseek to get the current
size of a file before; I'd assumed everybody used fstat. Given this
legitimate reason for a high-frequency calling of lseek, I withdraw my
earlier objection to the patch series.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Benjamin LaHaise <bcrl(at)kvack(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Matthew Wilcox <matthew(at)wil(dot)cx>, Andi Kleen <andi(at)firstfloor(dot)org>, viro(at)zeniv(dot)linux(dot)org(dot)uk, linux-fsdevel(at)vger(dot)kernel(dot)org, linux-kernel(at)vger(dot)kernel(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improve lseek scalability v3
Date: 2011-09-19 13:30:22
Message-ID: CA+Tgmoa6CP7uwEAcu+d1vVfapj0ZhpYh2UPLZwXHg-VRnbc9QQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 19, 2011 at 8:31 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> * Benjamin LaHaise (bcrl(at)kvack(dot)org) wrote:
>> For such tables, can't Postgres track the size of the file internally?  I'm
>> assuming it's keeping file descriptors open on the tables it manages, in
>> which case when it writes to a file to extend it, the internally stored size
>> could be updated.  Not making a syscall at all would scale far better than
>> even a modified lseek() will perform.
>
> We'd have to have it in shared memory and have a lock around it, it
> wouldn't be cheap at all.

In theory, we could implement a lock-free cache. But I still think it
would be better to see this fixed on the kernel side. If we had some
evidence that all of those lseek() calls were a performance problem
even when the i_mutex is not seriously contended, then that would be a
good argument for doing this in user-space, but I haven't seen any
such evidence. On the other hand, the numbers I posted show that when
i_mutex IS contended, it can cause a throughput regression of up to
90%. That seems worth fixing. If it turns out that lseek() is too
expensive even in the uncontended case or with the i_mutex contention
removed (or if the Linux community is unwilling to accept the proposed
fix), then we can (and should) look at further optimizing it within
PostgreSQL. My guess, though, is that an unlocked lseek will be fast
enough that we won't need to worry about installing our own caching
infrastructure (or at least, there will be plenty of more significant
performance problems to hunt down first).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Marco Stornelli <marco(dot)stornelli(at)gmail(dot)com>
To: Matthew Wilcox <matthew(at)wil(dot)cx>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Benjamin LaHaise <bcrl(at)kvack(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Andi Kleen <andi(at)firstfloor(dot)org>, viro(at)zeniv(dot)linux(dot)org(dot)uk, linux-fsdevel(at)vger(dot)kernel(dot)org, linux-kernel(at)vger(dot)kernel(dot)org, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improve lseek scalability v3
Date: 2011-09-20 07:18:56
Message-ID: CANGUGtDCLkKMa5vmgYtDDfh5p32aUGAtYOGFMrT6YK6oVYLBCQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2011/9/19 Matthew Wilcox <matthew(at)wil(dot)cx>:
> On Mon, Sep 19, 2011 at 08:31:00AM -0400, Stephen Frost wrote:
>> * Benjamin LaHaise (bcrl(at)kvack(dot)org) wrote:
>> > For such tables, can't Postgres track the size of the file internally?  I'm
>> > assuming it's keeping file descriptors open on the tables it manages, in
>> > which case when it writes to a file to extend it, the internally stored size
>> > could be updated.  Not making a syscall at all would scale far better than
>> > even a modified lseek() will perform.
>>
>> We'd have to have it in shared memory and have a lock around it, it
>> wouldn't be cheap at all.
>
> Yep, that makes perfect sense.  After all, the kernel does basically the
> same thing to maintain this information; why should we have userspace
> duplicating the same infrastructure?
>
> I must admit, I'd never heard of this usage of lseek to get the current
> size of a file before; I'd assumed everybody used fstat.  Given this
> legitimate reason for a high-frequency calling of lseek, I withdraw my
> earlier objection to the patch series.
>
> --
> Matthew Wilcox                          Intel Open Source Technology Centre
> "Bill, look, we understand that you're interested in selling us this
> operating system, but compare it to ours.  We can't possibly take such
> a retrograde step."
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo(at)vger(dot)kernel(dot)org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

I really don't understand the approach here. An improvement is an
improvement, do we need a use case to add an improvement to the
kernel? We are not talking about to add a new syscall or to do an ABI
change in this case. So my absolute ack to these patches.

Marco