Re: posix_fadvise v22

Lists: pgsql-hackers
From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2008-12-11 15:04:13
Message-ID: FDDBA24E-FF4D-4654-BA75-692B3BA71B97@enterprisedb.com

I'll send another patch with at least 1 and 3 fixed and hunt around
again for a header file to put this GUC into.

On 10 Dec 2008, at 04:22, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp
> wrote:

> Hello,
>
> Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
>> Here's an update to eliminate two small bitrot conflicts.
>
> I read your patch with interest, but found some trivial bad manners.
>
> * LET_OS_MANAGE_FILESIZE is already obsoleted.
> You don't have to cope with the option.

Huh I didn't realize that. I guess the idea is that users just
configure a very large segment size to get the old behaviour.

>
> * Type mismatch in prefetch_pages
> A variable prefetch_pages is defined as "unsigned" or "int"
> in some places. Why don't you define it only once in a header
> and include the header in source files?

Just... Which header?

> * Assignment to prefetch_pages
> What does "+0.99" mean here?
> [assign_io_concurrency()]
> + prefetch_pages = new_prefetch_pages+0.99;
> You want to do as follows, right?
> + prefetch_pages = (int) ceil(new_prefetch_pages);

Sure


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Gregory Stark <stark(at)enterprisedb(dot)com>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2008-12-11 16:29:30
Message-ID: 8347.1229012970@sss.pgh.pa.us

Greg Stark <greg(dot)stark(at)enterprisedb(dot)com> writes:
>> A variable prefetch_pages is defined as "unsigned" or "int"
>> in some places. Why don't you define it only once in a header
>> and include the header in source files?

> Just... Which header?

MHO: the header that goes with the source file that is most concerned with
implementing the variable's behavior (which is also the file that should
have the actual variable definition).

regards, tom lane


From: "Greg Stark" <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2008-12-11 17:11:06
Message-ID: 4136ffa0812110911x4914a951o6ca502879e3dede7@mail.gmail.com

On Thu, Dec 11, 2008 at 4:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Greg Stark <greg(dot)stark(at)enterprisedb(dot)com> writes:
>>> A variable prefetch_pages is defined as "unsigned" or "int"
>>> in some places. Why don't you define it only once in a header
>>> and include the header in source files?
>
>> Just... Which header?
>
> MHO: the header that goes with the source file that is most concerned with
> implementing the variable's behavior (which is also the file that should
> have the actual variable definition).

Well, the trick here is that the variable actually affects how many
PrefetchBuffer() calls *callers* should make. The callers are various
places which are doing lots of ReadBuffer calls and know what buffers
they'll need in the future. The main places are in
nodeBitmapHeapscan.c and nbtsearch.c. Neither of those is a remotely
appropriate home for it.

I think I'm settling on putting it in the same place as the
PrefetchBuffer() prototype, since anyone who needs prefetch_pages
will need that as well (except for guc.c). So I'll put it in bufmgr.h
for now.

--
greg


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2008-12-11 21:35:12
Message-ID: 87k5a6tmy7.fsf@oxford.xeocode.com


Here's the update

I also skimmed through and cleaned up a couple of other things. There's
*still* one function prototype for which I don't see what header file to
put it in: the one for port/posix_fadvise.c, which contains one function
with one caller, guc.c.

Attachment Content-Type Size
posix_fadvise_v23.diff.gz application/octet-stream 13.1 KB

From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-01 20:30:26
Message-ID: 603c8f070901011230t1afa6644racb715b975078054@mail.gmail.com

I tried this on my laptop running FC9, and because I forgot to run
autoconf, I got this error message when I tried to turn on
posix_fadvise.

rhaas=# set effective_io_concurrency to 3;
ERROR: could not determine if this system has a working posix_fadvise
DETAIL: Check configure.log produced by configure for more information

Am I correct in thinking that the only thing we're really checking for
here is whether a trivial posix_fadvise() call returns success? If
so, is this test really worth doing? It seems to me that since users
can always switch off the feature by leaving effective_io_concurrency
set to the default value of 1, there is not much value in having a
configure test that forcibly disables it. If the user has a broken
posix_fadvise() and later fixes it, they shouldn't have to recompile
PostgreSQL to use this feature, especially these days, when the build
system and the run system are often different machines. A user who somehow
ends up with RPMs that generate this error message will be utterly at
a loss as to what to do about it.

One minor nit: If we're going to keep this test, we should change the
detail string to say config.log rather than configure.log, as that is
the actual file name.

...Robert

On Thu, Dec 11, 2008 at 4:35 PM, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> Here's the update
>
> I also skimmed through and cleaned up a couple of other things. There's *still*
> one function prototype for which I don't see what header file to put it in: the
> one for port/posix_fadvise.c, which contains one function with one caller, guc.c.
>
>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com
> Ask me about EnterpriseDB's 24x7 Postgres support!
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-01 20:55:15
Message-ID: 10658.1230843315@sss.pgh.pa.us

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
> Am I correct in thinking that the only thing we're really checking for
> here is whether a trivial posix_fadvise() call returns success? If
> so, is this test really worth doing?

Runtime tests performed during configure are generally a bad idea to
start with --- it's impossible to do any such thing in a
cross-compilation scenario, for example.

regards, tom lane


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 02:43:29
Message-ID: 603c8f070901011843p121e40afu35c4a7bdc51c051c@mail.gmail.com

On Thu, Jan 1, 2009 at 3:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
>> Am I correct in thinking that the only thing we're really checking for
>> here is whether a trivial posix_fadvise() call returns success? If
>> so, is this test really worth doing?
>
> Runtime tests performed during configure are generally a bad idea to
> start with --- it's impossible to do any such thing in a
> cross-compilation scenario, for example.

OK, here's an update of Greg's patch with the runtime configure test
ripped out, some minor documentation tweaks, and a few unnecessary
whitespace diff hunks quashed. I think this is about ready for
committer review. The only thing I haven't been able to do is
demonstrate that this change actually produces a performance
improvement. Either I'm testing the wrong thing, or it just doesn't
provide any benefit on a single-spindle system. However, I believe
that Greg has previously posted some fairly impressive performance
results, so I'm not sure that my shortcomings in this area should be a
bar to having a committer pick this one up. If more testing is
needed, it would at least be helpful to have a committer specify what
areas they are concerned about.

...Robert

Attachment Content-Type Size
posix_fadvise_v23_rh1.diff.gz application/x-gzip 11.8 KB

From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 03:53:54
Message-ID: 9DCB7B9A-50E0-477E-972F-01754B3EE85F@enterprisedb.com

In theory there should be no benefit on a single-spindle system. There
could be a slight benefit due to reordering of I/O, but only on a RAID
array would you see a significant speedup -- which should be about
equal to the number of spindles.

What would be interesting is whether you see a noticeable speed
*decrease* from having prefetching enabled when it isn't helping,
either due to having everything fit in shared buffers or everything
fit in the filesystem cache (the latter should be more of a hit).

Even if there is, it doesn't really worry me. By default the feature is
disabled, and you should only really turn it on if you do have a RAID
array and want an individual query to make use of it.

Now that there's an actual run-time sysconf check for the buggy glibc
called by the guc function we arguably don't need the autoconf
check_run check anymore anyways.

--
Greg

On 1 Jan 2009, at 21:43, "Robert Haas" <robertmhaas(at)gmail(dot)com> wrote:

> On Thu, Jan 1, 2009 at 3:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> "Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
>>> Am I correct in thinking that the only thing we're really checking
>>> for
>>> here is whether a trivial posix_fadvise() call returns success? If
>>> so, is this test really worth doing?
>>
>> Runtime tests performed during configure are generally a bad idea to
>> start with --- it's impossible to do any such thing in a
>> cross-compilation scenario, for example.
>
> OK, here's an update of Greg's patch with the runtime configure test
> ripped out, some minor documentation tweaks, and a few unnecessary
> whitespace diff hunks quashed. I think this is about ready for
> committer review. The only thing I haven't been able to do is
> demonstrate that this change actually produces a performance
> improvement. Either I'm testing the wrong thing, or it just doesn't
> provide any benefit on a single-spindle system. However, I believe
> that Greg has previously posted some fairly impressive performance
> results, so I'm not sure that my shortcomings in this area should be a
> bar to having a committer pick this one up. If more testing is
> needed, it would at least be helpful to have a committer specify what
> areas they are concerned about.
>
> ...Robert
> <posix_fadvise_v23_rh1.diff.gz>


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 04:11:33
Message-ID: 603c8f070901012011hca2c36flf527a5cd01695205@mail.gmail.com

> Now that there's an actual run-time sysconf check for the buggy glibc called
> by the guc function we arguably don't need the autoconf check_run check
> anymore anyways.

Isn't that the check I just removed for you, or are you talking about
some other check that can also be removed?

...Robert


From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 04:49:57
Message-ID: 262184F8-1206-4E2C-B50F-9FF84E63C1B6@enterprisedb.com

Sorry for top-posting -- phone mail client sucks.

I thought the autoconf ac_run_check was the test that people were
questioning. That calls posix_fadvise to see if it crashes at
configure time.

The guc run-time check is checking for known-buggy versions of glibc
using sysconf to check what version of glibc you have.

--
Greg

On 1 Jan 2009, at 23:11, "Robert Haas" <robertmhaas(at)gmail(dot)com> wrote:

>> Now that there's an actual run-time sysconf check for the buggy
>> glibc called
>> by the guc function we arguably don't need the autoconf check_run
>> check
>> anymore anyways.
>
> Isn't that the check I just removed for you, or are you talking about
> some other check that can also be removed?
>
> ...Robert


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 05:12:50
Message-ID: 603c8f070901012112s3994b32cy217054db133383ad@mail.gmail.com

On Thu, Jan 1, 2009 at 11:49 PM, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com> wrote:
> Sorry for top-posting -- phone mail client sucks.
>
> I thought the autoconf ac_run_check was the test that people were
> questioning. That calls posix_fadvise to see if it crashes at configure
> time.

Yes, that's what I removed.

> The guc run-time check is checking for known-buggy versions of glibc using
> sysconf to check what version of glibc you have.

Right - that check is still in my updated patch.

I think the confusion may stem from the fact that Tom and I used the
word "runtime" to refer to the ac_run_check thing, because it is
checking something about the runtime environment (namely, whether
posix_fadvise works or not) at configure-time.

In any event, it seems as though we are all on the same page.

...Robert


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 10:08:13
Message-ID: Pine.GSO.4.64.0901020335150.16680@westnet.com

On Thu, 1 Jan 2009, Robert Haas wrote:

> The only thing I haven't been able to do is demonstrate that this change
> actually produces a performance improvement. Either I'm testing the
> wrong thing, or it just doesn't provide any benefit on a single-spindle
> system.

When I did a round of testing on the earlier prefetch test program Greg
Stark put together, one of my single-spindle Linux systems didn't show any
real benefit. So as long as you didn't see performance degrade, your not
seeing any improvement isn't bad news.

I've got a stack of hardware I can do performance testing of this patch
on, what I haven't been able to find time for is setting up any sort of
test harness right now. If you or Greg have any benchmark or test program
you could suggest that should show off the improvements here, I'd be glad
to run it on a bunch of systems and report back--I've already got a stack
of candidate ones I ran the earlier tests on to compare results against.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 15:33:25
Message-ID: 24481.1230910405@sss.pgh.pa.us

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> On Thu, 1 Jan 2009, Robert Haas wrote:
>> The only thing I haven't been able to do is demonstrate that this change
>> actually produces a performance improvement. Either I'm testing the
>> wrong thing, or it just doesn't provide any benefit on a single-spindle
>> system.

> When I did a round of testing on the earlier prefetch test program Greg
> Stark put together, one of my single-spindle Linux system didn't show any
> real benefit. So as long as you didn't see performance degrade, your not
> seeing any improvement isn't bad news.

ISTM that you *should* be able to see an improvement on even
single-spindle systems, due to better overlapping of CPU and I/O effort.
If the test case is either 100% CPU-bound or 100% I/O-bound then no,
but for anything in between there ought to be improvement.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 19:25:47
Message-ID: Pine.GSO.4.64.0901021416310.15834@westnet.com

On Fri, 2 Jan 2009, Tom Lane wrote:

> ISTM that you *should* be able to see an improvement on even
> single-spindle systems, due to better overlapping of CPU and I/O effort.

The earlier synthetic tests I did:

http://archives.postgresql.org/pgsql-hackers/2008-09/msg01401.php

Showed a substantial speedup even in the single spindle case on a couple
of systems, but one didn't really seem to benefit. So we could theorize
that Robert's test system is more like that one. If someone can help out
with making a more formal test case showing this in action, I'll dig into
the details of what's different between that system and the others.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 20:01:45
Message-ID: 7537.1230926505@sss.pgh.pa.us

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> On Fri, 2 Jan 2009, Tom Lane wrote:
>> ISTM that you *should* be able to see an improvement on even
>> single-spindle systems, due to better overlapping of CPU and I/O effort.

> The earlier synthetic tests I did:
> http://archives.postgresql.org/pgsql-hackers/2008-09/msg01401.php
> Showed a substantial speedup even in the single spindle case on a couple
> of systems, but one didn't really seem to benefit. So we could theorize
> that Robert's test system is more like that one. If someone can help out
> with making a more formal test case showing this in action, I'll dig into
> the details of what's different between that system and the others.

Well, I claim that if you start with a query that's about 50% CPU and
50% I/O effort, you ought to be able to get something approaching 2X
speedup if this patch really works. Consider something like

create function waste_time(int) returns int as $$
begin
  for i in 1 .. $1 loop
    null;
  end loop;
  return 1;
end $$ language plpgsql;

select count(waste_time(42)) from very_large_table;

In principle you should be able to adjust the constant so that vmstat
shows about 50% CPU busy, and then enabling fadvise should improve
matters significantly.

Now the above proposed test case is too simple because it will generate
a seqscan, and if the kernel is not completely brain-dead it will not
need any fadvise hinting to do read-ahead. But you should be able to
adapt the idea for whatever indexscan-based test case you are really
using.

Note: on a multi-CPU system you need to take vmstat or top numbers with
a grain of salt, since they might consider "one CPU 50% busy" as
"system only 50/N % busy".

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 20:40:49
Message-ID: 878wptv3u6.fsf@oxford.xeocode.com

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> In principle you should be able to adjust the constant so that vmstat
> shows about 50% CPU busy, and then enabling fadvise should improve
> matters significantly.

I think in practice individual queries don't interleave much CPU with I/O
work. A single random page fetch is 5ms, which is an awful lot of CPU cycles
to be sinking somewhere. In practice I think this is going to be single-digit
percentages.

Aside from big sorts and such, which tend not to be interleaved with I/O, the
main time I've seen queries carry a significant CPU load is when the data is
mostly in cache, in which case prefetching would be hard pressed to help at
all.

We could construct a synthetic case, but the main point of this feature is to
make use of RAID arrays that are currently going idle, not to pick up a few
percentage points for single-spindle systems.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 20:43:41
Message-ID: 200901022043.n02Khfa03040@momjian.us

Greg Smith wrote:
> On Fri, 2 Jan 2009, Tom Lane wrote:
>
> > ISTM that you *should* be able to see an improvement on even
> > single-spindle systems, due to better overlapping of CPU and I/O effort.
>
> The earlier synthetic tests I did:
>
> http://archives.postgresql.org/pgsql-hackers/2008-09/msg01401.php
>
> Showed a substantial speedup even in the single spindle case on a couple
> of systems, but one didn't really seem to benefit. So we could theorize
> that Robert's test system is more like that one. If someone can help out
> with making a more formal test case showing this in action, I'll dig into
> the details of what's different between that system and the others.

I think for an I/O-bound workload on a single drive system you would
need a drive that did some kind of tagged queuing (reordering/grouping)
of requests to see a benefit from the patch.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 22:36:31
Message-ID: 603c8f070901021436v71f10188g1d747b493c89f2f5@mail.gmail.com

> I've got a stack of hardware I can do performance testing of this patch on,
> what I haven't been able to find time for is setting up any sort of test
> harness right now. If you or Greg have any benchmark or test program you
> could suggest that should show off the improvements here, I'd be glad to run
> it on a bunch of systems and report back--I've already got a stack of
> candidate ones I ran the earlier tests on to compare results against.

Unfortunately I don't have anything useful. I took the skewed TPC-H
data that Ramon Lawrence posted and created a table based on lineitem
using something along the lines of:

SELECT *, g FROM lineitem, generate_series(1,8) AS g;

Then I made an index on one of the columns and ran some queries that
ended up being planned as bitmap index scans. The disk seemed to be
doing its thing, but it didn't do its thing any faster when I changed
effective_io_concurrency to 4 (in fact there might have been a small
slowdown, but I didn't take the time to measure carefully, so I can't
refute the null hypothesis).

...Robert


From: "Greg Stark" <stark(at)enterprisedb(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-02 23:02:17
Message-ID: 4136ffa0901021502w78f0611dx63929ac0fbbdc2f4@mail.gmail.com

On Fri, Jan 2, 2009 at 5:36 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I've got a stack of hardware I can do performance testing of this patch on,
>> what I haven't been able to find time for is setting up any sort of test
>> harness right now. If you or Greg have any benchmark or test program you
>> could suggest that should show off the improvements here, I'd be glad to run
>> it on a bunch of systems and report back--I've already got a stack of
>> candidate ones I ran the earlier tests on to compare results against.
>
> Then I made an index on one of the columns and ran some queries that
> ended up being planned as bitmap index scans.

Hm, what were those plans? You might want to put the old code back in
explain.c to print the prefetching target to see how well it's doing.

The queries I ran to test it tended to look like this kind of thing,
where r was a column filled with random values. If the table is
clustered there will be hardly any benefit, as even random I/O would
be prefetched by the OS.

select * from h where r = any (array(select
(1+random()*2000000)::integer from generate_series(1,1000)));

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-03 00:51:39
Message-ID: 14488.1230943899@sss.pgh.pa.us

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> In principle you should be able to adjust the constant so that vmstat
>> shows about 50% CPU busy, and then enabling fadvise should improve
>> matters significantly.

> I think in practice individual queries don't interleave much cpu with i/o
> work. A single random page fetch is 5ms which is an awful lot of cpu cycles to
> be sinking somewhere. In practice I think this is going to be single-digit
> percentages.

The point of the suggestion is to prove that the patch works as
advertised. How wide the sweet spot is for this test isn't nearly as
interesting as proving that there *is* a sweet spot. If you can't
find one it suggests that either the patch or the local posix_fadvise
doesn't work.

regards, tom lane


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Greg Stark" <stark(at)enterprisedb(dot)com>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-03 01:42:51
Message-ID: 603c8f070901021742s61dcbaf2r51deff1219196944@mail.gmail.com

> Hm, what were those plans? You might want to put the old code back in
> explain.c to print the prefetching target to see how well it's doing.

Well, bad news. Here's one where prefetching seems to make it WORSE.

rhaas=# explain select sum(1) from enormous where l_shipdate in
('1992-01-01', '1993-01-01', '1994-01-01', '1995-01-01', '1996-01-01',
'1997-01-01', '1998-01-01', '1999-01-01', '2000-01-01', '2001-01-01');

                                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=455072.75..455072.76 rows=1 width=0)
   ->  Bitmap Heap Scan on enormous  (cost=3327.59..454634.09 rows=175464 width=0)
         Recheck Cond: (l_shipdate = ANY ('{1992-01-01,1993-01-01,1994-01-01,1995-01-01,1996-01-01,1997-01-01,1998-01-01,1999-01-01,2000-01-01,2001-01-01}'::date[]))
         ->  Bitmap Index Scan on enormous_l_shipdate  (cost=0.00..3283.72 rows=175464 width=0)
               Index Cond: (l_shipdate = ANY ('{1992-01-01,1993-01-01,1994-01-01,1995-01-01,1996-01-01,1997-01-01,1998-01-01,1999-01-01,2000-01-01,2001-01-01}'::date[]))
(5 rows)

With effective_io_concurrency set to 1, this took 32 s. With
effective_io_concurrency set to 4, it took 50 s. The table was
created like this:

create table enormous as select l.*, l_instance from lineitem l,
generate_series(1, 8) l_instance;
create index enormous_l_shipdate on enormous (l_shipdate);
vacuum analyze enormous;

...where lineitem is from the skewed TPC-H data for the histojoin patch.

...Robert


From: "Greg Stark" <stark(at)enterprisedb(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-03 03:02:32
Message-ID: 4136ffa0901021902m3d9565e8v6d62a75c4a81511a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 2, 2009 at 8:42 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Hm, what were those plans? You might want to put the old code back in
>> explain.c to print the prefetching target to see how well it's doing.
>
> Well, bad news. Here's one where prefetching seems to make it WORSE.
>
> rhaas=# explain select sum(1) from enormous where l_shipdate in
> ('1992-01-01', '1993-01-01', '1994-01-01', '1995-01-01', '1996-01-01',
> '1997-01-01', '1998-01-01', '1999-01-01', '2000-01-01', '2001-01-01');
>
> QUERY PLAN
>
> ------------------------------------------------------------------------------
> Aggregate  (cost=455072.75..455072.76 rows=1 width=0)
>   ->  Bitmap Heap Scan on enormous  (cost=3327.59..454634.09 rows=175464 width=0)

Any chance you could put back the code in explain.c which showed
whether posix_fadvise is actually getting used? Another thing I did
when testing was attaching with strace to see if posix_fadvise (the
syscall on linux was actually fadvise64 iirc) is actually getting
called.

And is this on a system with multiple spindles? How many?

And how much of the data is in shared buffers or in filesystem cache?
Is this consistent for repeated queries? Is it only when you're
repeating a query for dates that you've already selected?

And how fast is it with effective_io_concurrency set to 1,2,3,5,6,7,8,...?

Do you see the same effect if you use a self-contained test case
instead of the TPC-H data so I can try it?

--
greg


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Greg Stark" <stark(at)enterprisedb(dot)com>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-03 04:13:10
Message-ID: 603c8f070901022013u242179fbq4ba97e2f5aea3715@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Any chance you could put back the code in explain.c which showed
> whether posix_fadvise is actually getting used? Another thing I did
> when testing was attaching with strace to see if posix_fadvise (the
> syscall on linux was actually fadvise64 iirc) is actually getting
> called.

I tried changing this:
returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
POSIX_FADV_WILLNEED);
to this:
returnCode = 0;

When I did that, it went back from 50 s to 33 s, which I think means
that posix_fadvise is getting called and that that is what is making
it slower.
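In sketch form, the behavior being toggled amounts to the following (Python
here just for illustration; the names are made up, not the actual fd.c
symbols):

```python
import os

BLCKSZ = 8192  # PostgreSQL's default block size


def prefetch_block(fd, blocknum, enabled=True):
    """Hint the kernel that a block will be needed soon.

    With enabled=False (or on a platform without posix_fadvise) this is
    a no-op, equivalent to replacing the call with "returnCode = 0".
    """
    if not enabled or not hasattr(os, "posix_fadvise"):
        return 0
    os.posix_fadvise(fd, blocknum * BLCKSZ, BLCKSZ, os.POSIX_FADV_WILLNEED)
    return 0
```

So the only difference between the two runs is whether that one hint syscall
is issued per upcoming block.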

> And is this on a system with multiple spindles? How many?

Latitude D830 laptop. Single disk. Fedora 9. kernel-2.6.27.9-73.fc9.x86_64.

> And how much of the data is in shared buffers or in filesystem cache?
> Is this consistent for repeated queries? Is it only when you're
> repeating a query for dates that you've already selected?

I stopped the cluster, dropped the page cache, and restarted the
cluster just before testing. Repeated tests are fast due to caching
effects. shared_buffers is 240MB. System has 2GB RAM, steady state
is about 1GB of page cache.

> And how fast is it with effective_io_concurrency set to 1,2,3,5,6,7,8,...?

I do not currently have this information. :-)

I will try to run some more tests over the weekend, but I'm too tired
now and am starting to make mistakes.

> Do you see the same effect if you use a self-contained test case
> instead of the TPC-H data so I can try it?

Not sure exactly what you have in mind here. If you send a script or
something to reproduce I will try it.

...Robert


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-03 04:15:10
Message-ID: 87mye9t48h.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> The point of the suggestion is to prove that the patch works as
> advertised. How wide the sweet spot is for this test isn't nearly as
> interesting as proving that there *is* a sweet spot. If you can't
> find one it suggests that either the patch or the local posix_fadvise
> doesn't work.

I posted tons of reproducible test cases with graphs of results for various
raid stripe widths a while back. There was a very slight benefit on a single
spindle at some prefetch depths but it wasn't very consistent and it varied
heavily depending on the prefetch depth.

I don't know what to make of this test. I don't know how to reproduce the same
data distribution, I have no idea what raid configuration it's been run on,
what version of what OS it's on, etc. It's quite possible posix_fadvise isn't
working on it, I don't know.

It's also possible the overhead of the extra buffer lookups and syscalls
outweighs any benefit of overlapping i/o and cpu on a single spindle.

Trying to contrive a situation where a single spindle sees a significant
benefit is going to be very tricky. Avoiding caching effects and the
confounding effect of any overhead will make it hard to see a consistent
benefit without a raid array.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: "Greg Stark" <stark(at)enterprisedb(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-03 04:26:41
Message-ID: 4136ffa0901022026yfe11975yb45575259f796a39@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 2, 2009 at 11:13 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> When I did that, it went back from 50 s to 33 s, which I think means
> that posix_fadvise is getting called and that that is what is making
> it slower.
>
>> And is this on a system with multiple spindles? How many?
>
> Latitude D830 laptop. Single disk. Fedora 9. kernel-2.6.27.9-73.fc9.x86_64.
>
>> And how much of the data is in shared buffers or in filesystem cache?
>> Is this consistent for repeated queries? Is it only when you're
>> repeating a query for dates that you've already selected?
>
> I stopped the cluster, dropped the page cache, and restarted the
> cluster just before testing. Repeated tests are fast due to caching
> effects. shared_buffers is 240MB. System has 2GB RAM, steady state
> is about 1GB of page cache.

Ahhh. So this is a test of how much impact the extra syscalls and
buffer lookups have on a system where they're not really helping. I'm
still surprised, a 50% performance penalty is a lot worse than I would
have expected, especially when the buffers aren't in cache. I did one
quick test and saw about 10% performance penalty in that test.
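The raw hint cost can be measured in isolation with something like this rough
microbenchmark sketch (Python for convenience; note the real penalty also
includes the extra shared-buffer lookups, which this doesn't model):

```python
import os
import tempfile
import time


def fadvise_cost(n_calls=1000):
    """Return seconds spent issuing n_calls WILLNEED hints on one page.

    Only measures the bare syscall; in the patch each hint also pays
    for a buffer-table probe, so treat this as a lower bound.
    """
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"\0" * 8192)
        f.flush()
        if not hasattr(os, "posix_fadvise"):
            return 0.0
        start = time.perf_counter()
        for _ in range(n_calls):
            os.posix_fadvise(f.fileno(), 0, 8192, os.POSIX_FADV_WILLNEED)
        return time.perf_counter() - start
```

If the per-call cost is microseconds, it can't explain a 50% slowdown by
itself, which points at something else going wrong.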

--
greg


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: posix_fadvise v22
Date: 2009-01-03 21:42:18
Message-ID: 200901032342.19435.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday 02 January 2009 06:49:57 Greg Stark wrote:
> The guc run-time check is checking for known-buggy versions of glibc  
> using sysconf to check what version of glibc you have.

Could you please mention the bug number in the relevant source code comments?


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: [SPAM] Re: posix_fadvise v22
Date: 2009-01-03 22:47:21
Message-ID: 87iqowt3ba.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:

> On Friday 02 January 2009 06:49:57 Greg Stark wrote:
>> The guc run-time check is checking for known-buggy versions of glibc  
>> using sysconf to check what version of glibc you have.
>
> Could you please mention the bug number in the relevant source code comments?

It's Debian bug# 312406 which was fixed in Debian release 2.3.5-3. So it's
probably one of these but searching for posix_fadvise doesn't find anything in
their bug tracker:

Version 2.3.6

* The following bugs are resolved with this release:

38, 253, 549, 622, 653, 721, 758, 851, 877, 915, 934, 955, 961,
1016, 1037, 1076, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086,
1087, 1088, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098,
1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1109,
1110, 1111, 1112, 1113, 1125, 1137, 1138, 1249, 1250, 1251, 1252,
1253, 1254, 1350, 1358, 1394, 1438, 1498, 1534

Visit <http://sources.redhat.com/bugzilla/> for the details of each bug.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: [SPAM] Re: posix_fadvise v22
Date: 2009-01-05 08:42:52
Message-ID: 4961C80C.6030307@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark wrote:
> Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
>
>> On Friday 02 January 2009 06:49:57 Greg Stark wrote:
>>> The guc run-time check is checking for known-buggy versions of glibc
>>> using sysconf to check what version of glibc you have.
>> Could you please mention the bug number in the relevant source code comments?
>
> It's Debian bug# 312406 which was fixed in Debian release 2.3.5-3. So it's
> probably one of these but searching for posix_fadvise doesn't find anything in
> their bug tracker:

The way I read this is that this was a temporary kernel/libc mismatch in
a development version of Debian 3 years ago that was fixed within 2
months of being reported and was never released to the general public.
So it would be on the same level as any of a million temporary breakages
in Linux distributions under development.

Unless there are other reports of this problem, I wouldn't bother
testing or working around this at all. If people are running PostgreSQL
8.4+ on Debian unstable June 2005 with kernel 2.4, they cannot be helped.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: [SPAM] Re: posix_fadvise v22
Date: 2009-01-10 18:51:05
Message-ID: 24021.1231613465@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> The way I read this is that this was a temporary kernel/libc mismatch in
> a development version of Debian 3 years ago that was fixed within 2
> months of being reported and was never released to the general public.
> So it would be on the same level as any of a million temporary breakages
> in Linux distributions under development.

This is incorrect, as the problem was in fact present on Red Hat and
presumably all other distros as well.

> Unless there are other reports of this problem, I wouldn't bother
> testing or working around this at all. If people are running PostgreSQL
> 8.4+ on Debian unstable June 2005 with kernel 2.4, they cannot be helped.

While I would like to agree with that position, I can't help noticing
lines 2438-2461 of xlog.c, which represent the still-smoking wreckage of
our last attempt to do something with posix_fadvise. It's not that old
either --- we gave up on it less than three years ago:
http://archives.postgresql.org/pgsql-hackers/2006-06/msg01481.php

I think at a minimum there should be a manual configuration switch
(ie something in pg_config_manual.h) to allow the builder to disable
use of posix_fadvise, even if configure thinks it's there. Depending
on buildfarm results we may have to do more than that.

BTW, I intend to un-disable the xlog change when committing the fadvise
patch. In for a penny, in for a pound ...

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: [SPAM] Re: posix_fadvise v22
Date: 2009-01-10 19:05:36
Message-ID: 200901101905.n0AJ5bB21902@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> > The way I read this is that this was a temporary kernel/libc mismatch in
> > a development version of Debian 3 years ago that was fixed within 2
> > months of being reported and was never released to the general public.
> > So it would be on the same level as any of a million temporary breakages
> > in Linux distributions under development.
>
> This is incorrect, as the problem was in fact present on Red Hat and
> presumably all other distros as well.
>
> > Unless there are other reports of this problem, I wouldn't bother
> > testing or working around this at all. If people are running PostgreSQL
> > 8.4+ on Debian unstable June 2005 with kernel 2.4, they cannot be helped.
>
> While I would like to agree with that position, I can't help noticing
> lines 2438-2461 of xlog.c, which represent the still-smoking wreckage of
> our last attempt to do something with posix_fadvise. It's not that old
> either --- we gave up on it less than three years ago:
> http://archives.postgresql.org/pgsql-hackers/2006-06/msg01481.php
>
> I think at a minimum there should be a manual configuration switch
> (ie something in pg_config_manual.h) to allow the builder to disable
> use of posix_fadvise, even if configure thinks it's there. Depending
> on buildfarm results we may have to do more than that.
>
> BTW, I intend to un-disable the xlog change when committing the fadvise
> patch. In for a penny, in for a pound ...

I assumed if effective_io_concurrency < 2 that no posix_fadvise() calls
would be made.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Peter Eisentraut" <peter_e(at)gmx(dot)net>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Greg Stark" <greg(dot)stark(at)enterprisedb(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: [SPAM] Re: posix_fadvise v22
Date: 2009-01-10 19:07:06
Message-ID: 603c8f070901101107m3321d2acu439cea602d6c7a1c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I think at a minimum there should be a manual configuration switch
> (ie something in pg_config_manual.h) to allow the builder to disable
> use of posix_fadvise, even if configure thinks it's there. Depending
> on buildfarm results we may have to do more than that.

This seems pretty pointless, since there is already a GUC to control
the behavior, and the default is off. The only value here would be if
you could demonstrate that even with effective_io_concurrency set to 1
there is a significant performance dropoff. But that would probably
be an argument for rejecting the patch outright, not adding a
compile-time switch.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-11 19:14:50
Message-ID: 1098.1231701290@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
> OK, here's an update of Greg's patch with the runtime configure test
> ripped out, some minor documentation tweaks, and a few unnecessary
> whitespace diff hunks quashed. I think this is about ready for
> committer review.

I've started to look through this, and the only part I seriously don't
like is the nbtsearch.c changes. I've got three complaints about that:

* Doing it inside the index AMs is wrong, or at the very least forces
us to do it over for each index AM (which the patch fails to do).

* As coded, it generates prefetch bursts that are much too large and too
widely spaced to be effective, not to mention that they entirely
ignore the effective_io_concurrency control knob as well as the order
in which the pages will actually be needed. I wonder now whether
Robert's inability to see any benefit came because he was testing
indexscans and not bitmap scans.

* It's only accidental that it's not kicking in during a bitmap
indexscan and bollixing up the much-more-carefully-written
nodeBitmapHeapscan prefetch logic.

What I intend to do over the next day or so is commit the prefetch
infrastructure and the bitmap scan prefetch logic, but I'm bouncing the
indexscan part back for rework. I think that it should be implemented
in or near index_getnext() and pay attention to
effective_io_concurrency.

regards, tom lane


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-11 19:45:26
Message-ID: 603c8f070901111145n720e0abdwb95720d1651d63b0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> * As coded, it generates prefetch bursts that are much too large and too
> widely spaced to be effective, not to mention that they entirely
> ignore the effective_io_concurrency control knob as well as the order
> in which the pages will actually be needed. I wonder now whether
> Robert's inability to see any benefit came because he was testing
> indexscans and not bitmap scans.

Nope, bitmap index scans. For some reason I was having trouble
generating a plan that included an index scan, so I took the easy way
out and tested what was in front of me. :-)

> What I intend to do over the next day or so is commit the prefetch
> infrastructure and the bitmap scan prefetch logic, but I'm bouncing the
> indexscan part back for rework. I think that it should be implemented
> in or near index_getnext() and pay attention to
> effective_io_concurrency.

FWIW, following your first set of commits, I extracted, cleaned up,
and re-posted the uncommitted portion of the patch last night. Based
on this it doesn't sound like there's much point in continuing to do
that, but I'm happy to do so if anyone thinks otherwise.

...Robert


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-11 20:04:17
Message-ID: 87y6xhmwxq.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> "Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
>> OK, here's an update of Greg's patch with the runtime configure test
>> ripped out, some minor documentation tweaks, and a few unnecessary
>> whitespace diff hunks quashed. I think this is about ready for
>> committer review.
>
> I've started to look through this, and the only part I seriously don't
> like is the nbtsearch.c changes. I've got three complaints about that:
>
> * Doing it inside the index AMs is wrong, or at the very least forces
> us to do it over for each index AM (which the patch fails to do).

ok

> * As coded, it generates prefetch bursts that are much too large and too
> widely spaced to be effective, not to mention that they entirely
> ignore the effective_io_concurrency control knob as well as the order
> in which the pages will actually be needed. I wonder now whether
> Robert's inability to see any benefit came because he was testing
> indexscans and not bitmap scans.

Well experiments showed that it was very effective, even more so than for
bitmap scans. So much so that it nearly obliterated bitmap scans' advantage
over index scans.

I originally thought like you that all that logic was integral to the thing
but eventually came around to think the opposite. That logic is to overcome a
fundamental problem with bitmap scans -- that there's no logical group to
prefetch and a potentially unbounded stream of pages. Index scans just don't
have that problem so they don't need that extra logic.

> * It's only accidental that it's not kicking in during a bitmap
> indexscan and bollixing up the much-more-carefully-written
> nodeBitmapHeapscan prefetch logic.

ok.

> What I intend to do over the next day or so is commit the prefetch
> infrastructure and the bitmap scan prefetch logic, but I'm bouncing the
> indexscan part back for rework. I think that it should be implemented
> in or near index_getnext() and pay attention to
> effective_io_concurrency.

So to do that I would have a few questions.

a) ISTM necessary to keep a dynahash of previously prefetched pointers around
to avoid repeatedly prefetching the same pages. That could get quite large
though. Or do you think it would be fine to visit the buffer cache,
essentially using that as the hash, for every index pointer?

b) How would index_getnext keep two read pointers like bitmap heap scans? I
think this would require adding a new index AM api similar to your
tuplestore api where the caller can maintain multiple read pointers in the
scan. And I'm not sure how that would work with mark/restore.

c) I would be afraid that pushing the index scan to reach for the next index
leaf page prematurely (and not just an async prefetch, but a blocking read)
would cause extra random i/o which would slow down the critical path of
reading the current index tuples. So I think we would still want to pause
when we hit the end of the current leaf page. That would require some form
of feedback in the index am api as well.
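Point (c) could be sketched like this (toy Python, not the btree AM; a leaf
page is modeled as just a list of heap block numbers):

```python
def prefetch_targets(leaf_pages, current_leaf, current_offset, max_hints):
    """Heap blocks worth hinting, drawn only from the current leaf page.

    Stopping at the leaf boundary means we never force a blocking read
    of the next index page just to find more prefetch targets.
    """
    remaining = leaf_pages[current_leaf][current_offset + 1:]
    return remaining[:max_hints]
```

When this returns an empty list the scan has hit the end of the current leaf
and would pause prefetching rather than fetch the next index page early.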

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-11 20:41:02
Message-ID: 2083.1231706462@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
> FWIW, following your first set of commits, I extracted, cleaned up,
> and re-posted the uncommitted portion of the patch last night. Based
> on this it doesn't sound like there's much point in continuing to do
> that, but I'm happy to do so if anyone thinks otherwise.

Probably not; I'm busy hacking on the code anyway, fixing minor issues
like its failure to do anything sane with temp tables.

A question about that part of the code: I see that PrefetchBuffer simply
does nothing if it finds the target page already in a buffer. I wonder
whether that's sufficient, or whether we ought to do something to try
to ensure that the buffer doesn't get evicted in the near future (ie,
before we actually get to use it).

Simply advancing the usage_count seems too strong because when you
add in the upcoming ReadBuffer, we'd be advancing usage_count twice
for a single use of the page whenever it happened to get prefetched.
We could alleviate that by only advancing usage_count if it's 0, but
that seems like a band-aid not a real fix.
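A toy model of the bookkeeping (not the actual bufmgr code; the names and the
cap of 5 are stand-ins for the real clock-sweep machinery) shows the
double-count problem:

```python
MAX_USAGE_COUNT = 5  # stand-in for the buffer manager's usage_count cap


class Buffer:
    def __init__(self):
        self.usage_count = 0


def read_buffer(buf):
    """A ReadBuffer hit: bump usage_count, capped at the maximum."""
    buf.usage_count = min(buf.usage_count + 1, MAX_USAGE_COUNT)


def prefetch_buffer(buf, bump_usage):
    """PrefetchBuffer finding the page already resident.

    If it bumps usage_count too, one logical use of the page (prefetch
    followed by the actual read) is counted twice by the clock sweep.
    """
    if bump_usage:
        buf.usage_count = min(buf.usage_count + 1, MAX_USAGE_COUNT)
```

With bump_usage the count ends up at 2 after a single prefetch-then-read;
without it the count is an honest 1, but nothing stops the sweep from
evicting the page between the prefetch and the read.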

I also thought about setting a buffer flag bit showing that prefetch has
occurred, but then you have to worry about when/how to clear it if the
actual read never occurs.

Thoughts?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-12 05:19:42
Message-ID: 13517.1231737582@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
> OK, here's an update of Greg's patch with the runtime configure test
> ripped out, some minor documentation tweaks, and a few unnecessary
> whitespace diff hunks quashed. I think this is about ready for
> committer review.

I've applied most of this, with a couple of notable revisions:

1. The runtime check on whether posix_fadvise works is really gone.
We'll see whether we need to put anything back, but I think our first
try should be under the assumption that it works. (BTW, I was wrong
in my earlier claim that the buildfarm wouldn't exercise posix_fadvise
by default --- the patch defaults to io_concurrency = 1 if fadvise
is available, so it will try to prefetch one page ahead.)

2. I fixed it so that setting effective_io_concurrency to zero disables
prefetching altogether; there was no way to do that in the patch as
submitted.

3. I did not apply the changes in nbtsearch.c, since as I mentioned
earlier I wasn't convinced it's a reasonable approach. We can argue
about that some more, but even if 8.4 ships with concurrency only
for bitmap scans, it's still useful.

The two main loose ends that could stand further work are the
plain-index-scan case and the question of whether PrefetchBuffer
should try to protect an already-existing buffer from being evicted.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-12 06:55:56
Message-ID: 87mydxm2rn.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> "Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
>> OK, here's an update of Greg's patch with the runtime configure test
>> ripped out, some minor documentation tweaks, and a few unnecessary
>> whitespace diff hunks quashed. I think this is about ready for
>> committer review.
>
> I've applied most of this, with a couple of notable revisions:
>
> 1. The runtime check on whether posix_fadvise works is really gone.
> We'll see whether we need to put anything back, but I think our first
> try should be under the assumption that it works. (BTW, I was wrong
> in my earlier claim that the buildfarm wouldn't exercise posix_fadvise
> by default --- the patch defaults to io_concurrency = 1 if fadvise
> is available, so it will try to prefetch one page ahead.)
>
> 2. I fixed it so that setting effective_io_concurrency to zero disables
> prefetching altogether; there was no way to do that in the patch as
> submitted.

Hm. The original intent was that effective_io_concurrency = 1 meant no
prefetching because there was only one drive.

I wonder if that worked earlier and got lost along the way or if I always had
this wrong.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-12 13:02:24
Message-ID: 27519.1231765344@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> 2. I fixed it so that setting effective_io_concurrency to zero disables
>> prefetching altogether; there was no way to do that in the patch as
>> submitted.

> Hm. the original intent was that effective_io_concurrency 1 meant no
> prefetching because there was only one drive.

Well, "no prefetch" is an entirely different behavior from "prefetch one
block ahead". Given the way you've defined the GUC, a setting of one
has to mean the latter. My complaint was basically that with the patch
applied, the code was physically incapable of providing the former.
Which you'd surely want if only for testing/comparison purposes.
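The distinction in sketch form (illustrative Python; the real logic is the
prefetch-target machinery in nodeBitmapHeapscan.c):

```python
def scan_with_prefetch(blocks, effective_io_concurrency):
    """Simulate a scan over block numbers, staying at most
    effective_io_concurrency blocks ahead with prefetch hints.

    Returns the list of blocks for which a hint would be issued:
    0 disables hinting entirely; 1 hints one block ahead.
    """
    hinted = []
    next_hint = 0
    for i in range(len(blocks)):
        while next_hint <= i + effective_io_concurrency and next_hint < len(blocks):
            if next_hint > i:  # never hint the block being read right now
                hinted.append(blocks[next_hint])
            next_hint += 1
    return hinted
```

A setting of 0 issues no hints at all, while 1 issues one hint per block read,
which is exactly the "no prefetch" versus "prefetch one ahead" difference.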

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: posix_fadvise v22
Date: 2009-01-28 12:56:38
Message-ID: 87eiynlhbt.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> What I intend to do over the next day or so is commit the prefetch
> infrastructure and the bitmap scan prefetch logic, but I'm bouncing the
> indexscan part back for rework. I think that it should be implemented
> in or near index_getnext() and pay attention to
> effective_io_concurrency.

The biggest question I have here is about doing it at the index_* abstraction
level. I've looked pretty hard at how to do this and run into some pretty
major problems.

I don't see any way to do it without changing the access method api. The
index_* methods cannot start fiddling with the current scan position without
messing up things like CURRENT OF and mark/restore irrecoverably.

But if we're going to change the index AM API then we lose all of the
advantages of putting the logic in indexam.c in the first place. It won't
help any other index AM without special code in each one.

The best plan I came up with at this level is to add an AM method

<am>getprefetchbitmap(IndexScanDesc scan,
                      ScanDirection direction,
                      TIDBitmap *bitmap,
                      int nblocks)

which returns up to nblocks' worth of bitmap starting from the current scan
position in the specified direction, based on whatever's convenient to the
internal representation. I think it would be best if it stopped at the end
of the current index page, at least if the next index page isn't already in
shared buffers.

Then nodeIndexscan.c would keep track of the value of target_prefetch
just like nodeBitmapHeapScan does, incrementing it whenever the caller
continues the scan and resetting it to zero if the direction changes.
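That ramp-up and reset could look something like the following. This is a simplified sketch under assumed names: `PrefetchState` and `prefetch_advance` are hypothetical, and the real nodeBitmapHeapScan grows its target more aggressively than the linear increment shown here.

```c
/* Hypothetical per-scan prefetch bookkeeping for nodeIndexscan.c. */
typedef struct PrefetchState
{
    int target;   /* current prefetch distance */
    int maximum;  /* cap, i.e. effective_io_concurrency */
    int dir;      /* last scan direction seen */
} PrefetchState;

static void
prefetch_advance(PrefetchState *ps, int direction)
{
    if (direction != ps->dir)
    {
        /* Direction changed: previously issued prefetches are now
         * useless, so start ramping up again from scratch. */
        ps->target = 0;
        ps->dir = direction;
    }
    else if (ps->target < ps->maximum)
        ps->target++;   /* grow gradually as the scan continues */
}
```

Ramping up gradually avoids issuing a burst of wasted fadvise calls for scans that fetch only a few rows.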

However the getprefetchbitmap() call would only remove duplicates from the
upcoming blocks. It wouldn't know which blocks have already been prefetched
from previous calls. So nodeIndexscan would also have to remove duplicates
itself.
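The duplicate removal nodeIndexscan would have to do might be as simple as remembering the last few prefetched block numbers. A toy sketch, with made-up names (`PrefetchHistory`, `prefetch_first_time`) and a small fixed ring buffer standing in for whatever structure the real code would use:

```c
#include <stdbool.h>

#define PREFETCH_HISTORY 8

typedef unsigned int BlockNumber;   /* stand-in for PostgreSQL's typedef */
#define INVALID_BLOCK ((BlockNumber) 0xFFFFFFFF)

typedef struct PrefetchHistory
{
    BlockNumber recent[PREFETCH_HISTORY];
    int next;                        /* next slot to overwrite */
} PrefetchHistory;

static void
prefetch_history_init(PrefetchHistory *h)
{
    for (int i = 0; i < PREFETCH_HISTORY; i++)
        h->recent[i] = INVALID_BLOCK;
    h->next = 0;
}

/* Return true (and record blkno) only if it wasn't prefetched recently,
 * so the caller skips redundant fadvise calls for repeated blocks. */
static bool
prefetch_first_time(PrefetchHistory *h, BlockNumber blkno)
{
    for (int i = 0; i < PREFETCH_HISTORY; i++)
        if (h->recent[i] == blkno)
            return false;
    h->recent[h->next] = blkno;
    h->next = (h->next + 1) % PREFETCH_HISTORY;
    return true;
}
```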

Splitting the work between three layers of abstraction like this is pretty
messy, creates a lot of duplicated work, and doesn't seem to buy us anything.
It *still* wouldn't help any non-btree index types until they add the new
method -- and if they add the new method they might as well just add the
USE_PREFETCH code anyways. I don't see how the new method is useful for
anything else.

I have another plan which would be a lot simpler but takes a much more
brute-force approach. We would add a new scan pointer in addition to the
current position and the mark, along with a slew of new methods like
getnext_prefetch() and reset_prefetch() to reset it to the current position.
We would also add a hash table internal to PrefetchBuffer and have it return
a boolean indicating whether it actually did a prefetch. Then index_prefetch
would reset the prefetch pointer and scan forward using it, calling
PrefetchBuffer on *every* pointer and counting how many true returns it sees
from PrefetchBuffer.*

*[Hm, not quite, we want to count recent prefetches that probably are still
queued separately from old ones that are probably in cache. And we also have
to think about how long to treat prefetches as probably still being in cache.
But with some additional thought I think this could be made to work.]
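The brute-force plan's PrefetchBuffer change could be sketched as below. This is a toy model under stated assumptions: `PrefetchBufferSketch` and the fixed-size open-addressing table are hypothetical stand-ins for what would live in the buffer manager, and, per the footnote above, the real thing would also need some way to age entries out rather than remember blocks forever.

```c
#include <stdbool.h>
#include <string.h>

#define TABLE_SIZE 1024              /* must be a power of two */

typedef unsigned int BlockNumber;
#define INVALID_BLOCK ((BlockNumber) 0xFFFFFFFF)

/* Blocks we have already issued advice for. */
static BlockNumber prefetched[TABLE_SIZE];

static void
prefetch_table_reset(void)
{
    /* 0xFF in every byte yields INVALID_BLOCK in every slot. */
    memset(prefetched, 0xFF, sizeof(prefetched));
}

/* Return true if this call issued a new prefetch, false if the block
 * was already prefetched (and so is probably queued or in cache). */
static bool
PrefetchBufferSketch(BlockNumber blkno)
{
    unsigned int slot = blkno & (TABLE_SIZE - 1);

    while (prefetched[slot] != INVALID_BLOCK)
    {
        if (prefetched[slot] == blkno)
            return false;            /* duplicate: no new I/O issued */
        slot = (slot + 1) & (TABLE_SIZE - 1);
    }
    prefetched[slot] = blkno;
    /* Here the real code would call
     * posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED). */
    return true;
}
```

With this return value, index_prefetch can scan forward calling PrefetchBuffer on every pointer and stop once it has seen enough true returns, without tracking duplicates itself.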

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!