From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Commitfest patches
Date: 2008-03-28 09:08:57
Message-ID: 87wsnnz046.fsf@oxford.xeocode.com
Lists: pgsql-hackers


Bruce, you seem to have removed one of my three patches from the queue. I
would actually prefer you remove the other two and put back that one. It's the
one I most urgently need feedback on to continue.

The patch I'm so interested in receiving feedback on is the patch to preread
pages in bitmap index scans using posix_fadvise. This is basically complete
modulo documentation and autoconf tests.

However, there is a larger patch struggling to get out here. The patch I
submitted just adds a new low-level routine to preread blocks and doesn't
change the buffer manager routines at all. That means we have to make two
trips through the buffer manager: once to check whether the block is present,
and again later to assign a buffer and read it in (from cache) synchronously.
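
To be concrete, the low-level routine is little more than a wrapper around
posix_fadvise(POSIX_FADV_WILLNEED). Something like this sketch (the function
name here is invented for illustration, not the patch's actual interface):

#include <fcntl.h>

/*
 * Illustrative sketch only: ask the kernel to start reading block
 * "blkno" in the background, so that a later synchronous read() will
 * (hopefully) hit the OS cache instead of blocking on disk.  BLCKSZ
 * is Postgres's block size.
 */
static void
PrereadBlock(int fd, long blkno)
{
#ifdef POSIX_FADV_WILLNEED
    (void) posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
#endif
}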

A more invasive form of this patch would be to assign and pin a buffer when
the preread is done. That would mean we would subsequently have a pinned
buffer ready to go and would not need to go back to the buffer manager a
second time. We would instead just "complete" the i/o by issuing a normal
read call.

The up-side would be fewer trips through the buffer manager and the attendant
locking. The down-side would be stealing buffers from shared buffers which
could otherwise be holding other pages.

I don't feel like writing the more invasive patch only to throw it out and
take the one I've already written, but I don't know what other people's
feelings are between the two. I'm leaning towards the more invasive buffer
manager changes myself.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Commitfest patches
Date: 2008-03-28 09:44:42
Message-ID: 1206697482.4285.1518.camel@ebony.site
Lists: pgsql-hackers

On Fri, 2008-03-28 at 09:08 +0000, Gregory Stark wrote:

> A more invasive form of this patch would be to assign and pin a buffer when
> the preread is done. That would mean subsequently we would have a pinned buffer
> ready to go and not need to go back to the buffer manager a second time. We
> would instead just "complete" the i/o by issuing a normal read call.

So if posix_fadvise did nothing or there was a longer than optimal
delay, this would be a net loss.

You'd need reasonable evidence that the posix_fadvise facility was a win
on all platforms and recent release levels before we could agree with
that.

I think we need a more thorough examination of this area before we
commit anything. Maybe you've done this, but I haven't seen the
analysis. Can you say more, please? Or at least say what you don't know,
so other experts listening can fill in the blanks.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Commitfest patches
Date: 2008-03-28 11:26:54
Message-ID: 87od8zytq9.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:

> On Fri, 2008-03-28 at 09:08 +0000, Gregory Stark wrote:
>
>> A more invasive form of this patch would be to assign and pin a buffer when
>> the preread is done. That would mean subsequently we would have a pinned buffer
>> ready to go and not need to go back to the buffer manager a second time. We
>> would instead just "complete" the i/o by issuing a normal read call.
>
> So if posix_fadvise did nothing or there was a longer than optimal
> delay, this would be a net loss.
>
> You'd need reasonable evidence that the posix_fadvise facility was a win
> on all platforms and recent release levels before we could agree with
> that.

I posted a test program and asked for others to test it on various platforms
and various sized raid arrays. matthew(at)flymine(dot)org was the only person who
bothered to run it and he only found a 16x speedup on his hardware.

http://archives.postgresql.org/pgsql-performance/2008-01/msg00285.php

(unfortunately our mail archives seem to have failed to archive his message,
this is my followup to it)

> I think we need a more thorough examination of this area before we
> commit anything. Maybe you've done this, but I haven't seen the
> analysis. Can you say more, please? Or at least say what you don't know,
> so other experts listening can fill in the blanks.

For heaven's sake. I've been posting about this for months. Search the
archives for "RAID arrays and performance", "There's random access and then
there's random access", and "Bitmap index scan preread using posix_fadvise".

I described which interfaces worked on Linux and Solaris based on empirical
tests. I posted source code for synthetic benchmarks so we could test it on a
wide range of hardware. I posted graphs based on empirical results. I posted
mathematical formulas analysing how much prereading would be needed to
exercise a raid array fully. I'm not sure what else I can do to effect a more
thorough examination.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Commitfest patches
Date: 2008-03-28 11:43:50
Message-ID: 1206704630.4285.1587.camel@ebony.site
Lists: pgsql-hackers

On Fri, 2008-03-28 at 11:26 +0000, Gregory Stark wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:

> For heaven's sake. I've been posting about this for months.

Any chance of getting all that together on a single Wiki page, so we can
review everything? We'll need those docs after it's committed anyway.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Prereading using posix_fadvise (was Re: Commitfest patches)
Date: 2008-03-28 13:37:29
Message-ID: 47ECF499.5050205@enterprisedb.com
Lists: pgsql-hackers

Gregory Stark wrote:
> I described which interfaces worked on Linux and Solaris based on empirical
> tests. I posted source code for synthetic benchmarks so we could test it on a
> wide range of hardware. I posted graphs based on empirical results. I posted
> mathematical formulas analysing how much prereading would be needed to
> exercise a raid array fully. I'm not sure what else I can do to effect a more
> thorough examination.

I'm sure posix_fadvise is a win in the case where it's supposed to help:
a scan that does a lot of random reads, on RAID array. And you've posted
results demonstrating that. What we need to make sure is that there's no
significant loss when it's not helping.

It seems that the worst case for this patch is a scan on a table that
doesn't fit in shared_buffers, but is fully cached in the OS cache. In
that case, the posix_fadvise calls would be a certain waste of time.

That could be alleviated by deciding at plan time whether to preread or
not, based on effective_cache_size.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Zeugswetter Andreas OSB SD" <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Prereading using posix_fadvise (was Re: Commitfest patches)
Date: 2008-03-28 15:13:38
Message-ID: E1539E0ED7043848906A8FF995BDA57902EB1583@m0143.s-mxs.net
Lists: pgsql-hackers


Heikki wrote:
> It seems that the worst case for this patch is a scan on a table that
> doesn't fit in shared_buffers, but is fully cached in the OS cache. In
> that case, the posix_fadvise calls would be a certain waste of time.

I think this is a misunderstanding: the fadvise is not issued to read the
whole table and is not issued for table scans at all (and if it were, it
would only advise for the next N pages).

So it has nothing to do with table size. The fadvise calls need to be (and
are) limited to what can be used in the near future, not issued for the
whole statement.

E.g. the next N relevant index pages at a level, or the N relevant heap
pages one index leaf page points at. Maybe in the index case N does not
need to be limited, since we have a natural limit on how many pointers fit
on one page.

In general I think separate reader processes (or threads :-) that read
directly into the buffer pool would be a more portable and efficient
implementation. E.g. they could use scatter/gather I/O. So we should make
sure that the fadvise solution does not obstruct that path, but I think it
does not.

Gregory wrote:
>> A more invasive form of this patch would be to assign and pin a buffer
>> when the preread is done. That would mean subsequently we would have a
>> pinned buffer ready to go and not need to go back to the buffer manager
>> a second time. We would instead just "complete" the i/o by issuing a
>> normal read call.

I guess you would rather need to mark the buffer for use for this page,
but let whichever backend needs it first pin it and issue the read. I
think the fadviser should not pin it in advance, since it cannot guarantee
to actually read the page [soon]; rather, remember the buffer and later
check and pin it for the read. Else you might be blocking the buffer. But
I think doing something like this might be good, since it avoids issuing
duplicate fadvises.

Andreas


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Zeugswetter Andreas OSB SD <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Prereading using posix_fadvise (was Re: Commitfest patches)
Date: 2008-03-28 15:22:25
Message-ID: 47ED0D31.7010206@enterprisedb.com
Lists: pgsql-hackers

Zeugswetter Andreas OSB SD wrote:
> Heikki wrote:
>> It seems that the worst case for this patch is a scan on a table that
>> doesn't fit in shared_buffers, but is fully cached in the OS cache. In
>> that case, the posix_fadvise calls would be a certain waste of time.
>
> I think this is a misunderstanding: the fadvise is not issued to read the
> whole table and is not issued for table scans at all (and if it were, it
> would only advise for the next N pages).
>
> So it has nothing to do with table size. The fadvise calls need to be (and
> are) limited to what can be used in the near future, not issued for the
> whole statement.

Right, I was sloppy. Instead of table size, what matters is the amount
of data the scan needs to access. The point remains that if the data is
already in OS cache, the posix_fadvise calls are a waste of time,
regardless of how many pages ahead you advise.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Zeugswetter Andreas OSB SD <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>, Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Prereading using posix_fadvise (was Re: Commitfest patches)
Date: 2008-03-28 15:41:58
Message-ID: 200803281541.m2SFfwk06208@momjian.us
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> > So it has nothing to do with table size. The fadvise calls need to be
> > (and are) limited to what can be used in the near future, not issued
> > for the whole statement.
>
> Right, I was sloppy. Instead of table size, what matters is the amount
> of data the scan needs to access. The point remains that if the data is
> already in OS cache, the posix_fadvise calls are a waste of time,
> regardless of how many pages ahead you advise.

I now understand what posix_fadvise() is allowing us to do.
posix_fadvise(POSIX_FADV_WILLNEED) allows us to tell the kernel we will
need a certain block in the future --- this seems much cheaper than a
background reader.

We know we will need the blocks, and telling the kernel can't hurt,
except that there is overhead in telling the kernel. Has anyone
measured how much overhead? I would be interested in a test program
that read the same page over and over again from the kernel, with and
without a posix_fadvise() call.
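
Something along these lines is what I have in mind -- an untested sketch,
with error handling omitted:

#define _XOPEN_SOURCE 600       /* for posix_fadvise() and pread() */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define BLCKSZ 8192

/* time n reads of the file's first block, optionally preceded by a
 * posix_fadvise(POSIX_FADV_WILLNEED) call on that same block */
static double
time_reads(int fd, int n, int use_fadvise)
{
    char            buf[BLCKSZ];
    struct timeval  start, end;
    int             i;

    gettimeofday(&start, NULL);
    for (i = 0; i < n; i++)
    {
        if (use_fadvise)
            (void) posix_fadvise(fd, 0, BLCKSZ, POSIX_FADV_WILLNEED);
        (void) pread(fd, buf, BLCKSZ, 0);
    }
    gettimeofday(&end, NULL);
    return (end.tv_sec - start.tv_sec) +
           (end.tv_usec - start.tv_usec) / 1e6;
}

int
main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDONLY);
    int n = 100000;

    printf("plain reads:  %.3f s\n", time_reads(fd, n, 0));
    printf("with fadvise: %.3f s\n", time_reads(fd, n, 1));
    close(fd);
    return 0;
}

The difference between the two numbers, divided by n, would give the
per-call overhead on a cached page.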

Should we consider only telling the kernel X pages ahead, meaning when
we are on page 10 we tell it about page 16?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Commitfest patches
Date: 2008-03-28 15:45:51
Message-ID: 200803281545.m2SFjpL18390@momjian.us
Lists: pgsql-hackers

Gregory Stark wrote:
>
> Bruce, you seem to have removed one of my three patches from the queue. I
> would actually prefer you remove the other two and put back that one. It's the
> one I most urgently need feedback on to continue.

I talked to Greg on IM. The complaint was that his posix_fadvise()
patch was added to the TODO list as a URL, rather than getting him
feedback. He is getting feedback now in another thread.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Zeugswetter Andreas OSB SD <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>, Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Prereading using posix_fadvise (was Re: Commitfest patches)
Date: 2008-03-28 15:58:24
Message-ID: 20080328155824.GD9150@svana.org
Lists: pgsql-hackers

On Fri, Mar 28, 2008 at 11:41:58AM -0400, Bruce Momjian wrote:
> Should we consider only telling the kernel X pages ahead, meaning when
> we are on page 10 we tell it about page 16?

It's not so interesting for sequential reads; the kernel can work that
out for itself. Disk reads are usually in blocks of at least 128K
anyway, so there's a real good chance you have block 16 already.

The interesting case is an index scan, where you do a posix_fadvise() on
the next block *before* returning the items in the current block. Then
by the time you've processed these tuples, the next block will
hopefully have been read in and we can proceed without delay.

Or fadvise all the tuples referred to from an index page at once, so
the kernel can determine the optimal order to fetch them. The read()s
will still be issued in tuple order, but the delay will (hopefully) be
less.
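
In rough, invented pseudo-C (AdviseBuffer being whatever wrapper ends up
issuing the posix_fadvise, ReturnItemsFromBlock a stand-in for the real
work), the first case is just:

/* Sketch only: advise the next leaf block before handing back the
 * items from the current one, so the read overlaps the processing. */
static void
index_scan_page(Relation rel, BlockNumber cur, BlockNumber next)
{
    if (BlockNumberIsValid(next))
        AdviseBuffer(rel, next);        /* kick off the read-ahead */
    ReturnItemsFromBlock(rel, cur);     /* invented: process current block */
}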

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Zeugswetter Andreas OSB SD <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>, Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Prereading using posix_fadvise (was Re: Commitfest patches)
Date: 2008-03-28 15:59:35
Message-ID: 47ED15E7.7020003@enterprisedb.com
Lists: pgsql-hackers

Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>>> So it has nothing to do with table size. The fadvise calls need to be
>>> (and are) limited to what can be used in the near future, not issued
>>> for the whole statement.
>> Right, I was sloppy. Instead of table size, what matters is the amount
>> of data the scan needs to access. The point remains that if the data is
>> already in OS cache, the posix_fadvise calls are a waste of time,
>> regardless of how many pages ahead you advise.
>
> I now understand what posix_fadvise() is allowing us to do.
> posix_fadvise(POSIX_FADV_WILLNEED) allows us to tell the kernel we will
> need a certain block in the future --- this seems much cheaper than a
> background reader.

Yep.

> We know we will need the blocks, and telling the kernel can't hurt,
> except that there is overhead in telling the kernel. Has anyone
> measured how much overhead? I would be interested in a test program
> that read the same page over and over again from the kernel, with and
> without a posix_fadvise() call.

Agreed, that needs to be benchmarked next. There's also some overhead in
doing the buffer manager hash table lookup to check whether the page is
in shared_buffers. We could reduce that by the more complex approach
Greg mentioned of allocating a buffer in shared_buffers when we do
posix_fadvise.

> Should we consider only telling the kernel X pages ahead, meaning when
> we are on page 10 we tell it about page 16?

Yes. You don't want to fire off thousands of posix_fadvise calls
upfront. That'll just flood the kernel, and it will most likely ignore
any advise after the first few hundred or so. I'm not sure what the
appropriate amount of read ahead would be, though. Probably depends a
lot on the OS and hardware, and needs to be adjustable.

In some cases we can't easily read ahead more than a certain number of
pages. For example, in a regular index scan, we can easily fire off
posix_fadvise calls for all the heap pages referenced by a single index
page, but reading ahead more than that becomes much more complex.
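
To illustrate Bruce's fixed-distance idea over a known list of target
blocks (a sketch with invented names: ProcessBlock stands in for the real
work, AdviseBuffer for the patch's fadvise wrapper):

/* Keep the kernel a fixed distance ahead of the scan: while we work
 * on blocks[i], blocks[i + prefetch_distance] is already being read. */
static void
scan_with_readahead(Relation rel, BlockNumber *blocks, int nblocks,
                    int prefetch_distance)
{
    int i;

    for (i = 0; i < nblocks; i++)
    {
        if (i + prefetch_distance < nblocks)
            AdviseBuffer(rel, blocks[i + prefetch_distance]);
        ProcessBlock(rel, blocks[i]);
    }
}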

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Zeugswetter Andreas OSB SD" <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Prereading using posix_fadvise
Date: 2008-03-28 16:00:18
Message-ID: 87zlsiyh2l.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:

> Right, I was sloppy. Instead of table size, what matters is the amount of data
> the scan needs to access. The point remains that if the data is already in OS
> cache, the posix_fadvise calls are a waste of time, regardless of how many
> pages ahead you advise.

I'm not sure that's really fair to the scan. The typical bitmap index scan
will probably be accessing far fewer pages than what fits in cache. And
prereading them will usually help unless you're always reading the same pages.
The usual case will be bitmap scans for different pages each time.

I'm picturing something like our mail archive search. Each user who uses it
retrieves just 10 hits or so. But there's no reason to think that because
those 10 hits fit in effective_cache_size that they'll actually be found
there.

It might make more sense to compare the table size. If the table size fits in
cache then any random bunch of pages is likely to be in cache somewhere.
Except we have no idea how much of the database this table represents. If the
user has a schema like ones we've seen posted to the lists many times with
hundreds of tables then deductions based on the size of a single table will be
questionable.

I'm also leery about scaling back the prereading for systems with large
effective_cache_size since those are precisely the systems which are likely to
have raid arrays and be helped the most by this on queries where it's helpful.

I feel like we're probably stuck optimizing for the case where i/o is
happening and hoping that the filesystem cache is efficient. If the entire
database fits in cache then the user has to adjust random_page_cost and should
probably set preread_pages to 0 (or effective_spindle_count to 1 depending on
which version of the patch you're reading) as well.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Zeugswetter Andreas OSB SD <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>, Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Prereading using posix_fadvise (was Re: Commitfest patches)
Date: 2008-03-28 16:25:56
Message-ID: 200803281625.m2SGPuA16626@momjian.us
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> > Should we consider only telling the kernel X pages ahead, meaning when
> > we are on page 10 we tell it about page 16?
>
> Yes. You don't want to fire off thousands of posix_fadvise calls
> upfront. That'll just flood the kernel, and it will most likely ignore
> any advise after the first few hundred or so. I'm not sure what the
> appropriate amount of read ahead would be, though. Probably depends a
> lot on the OS and hardware, and needs to be adjustable.
>
> In some cases we can't easily read ahead more than a certain number of
> pages. For example, in a regular index scan, we can easily fire off
> posix_fadvise calls for all the heap pages referenced by a single index
> page, but reading ahead more than that becomes much more complex.

And if you read-ahead too far the pages might get pushed out of the
kernel cache before you ask to read them.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Zeugswetter Andreas OSB SD" <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Prereading using posix_fadvise
Date: 2008-03-28 17:24:15
Message-ID: 87ve36yd6o.fsf@oxford.xeocode.com
Lists: pgsql-hackers

Someone wrote:
>>>
>>> Should we consider only telling the kernel X pages ahead, meaning when
>>> we are on page 10 we tell it about page 16?

The patch I posted specifically handles bitmap heap scans. It does in fact
prefetch only a limited number of pages from the bitmap stream based on a guc,
but it tries to be a bit clever about ramping up gradually.

The real danger here, imho, is doing read-ahead for blocks the client never
ends up reading. By ramping up the read-ahead gradually as the client reads
records we protect against that.

> Heikki Linnakangas wrote:
>>
>> Yes. You don't want to fire off thousands of posix_fadvise calls
>> upfront. That'll just flood the kernel, and it will most likely ignore
>> any advise after the first few hundred or so. I'm not sure what the
>> appropriate amount of read ahead would be, though. Probably depends a
>> lot on the OS and hardware, and needs to be adjustable.

"Bruce Momjian" <bruce(at)momjian(dot)us> writes:
>
> And if you read-ahead too far the pages might get pushed out of the
> kernel cache before you ask to read them.

While these concerns aren't entirely baseless, the actual experiments seem to
show the point of diminishing returns is pretty far out there. Look at the
graphs below, keeping in mind that the X axis is the number of blocks
prefetched.

http://archives.postgresql.org/pgsql-hackers/2007-12/msg00088.php

The pink case is analogous to a bitmap index scan where the blocks are read in
order. In that case the point of diminishing returns is reached around 64
pages. But performance doesn't actually dip until around 512 pages. And even
when prefetching 8,192 blocks, the negative impact on performance is still
much less severe than using a smaller-than-optimal prefetch size.

This is on a piddly little 3-way raid. On a larger raid you would want even
larger prefetch sizes.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Commitfest patches
Date: 2008-03-28 17:34:30
Message-ID: 87r6duycpl.fsf@oxford.xeocode.com
Lists: pgsql-hackers


"Bruce Momjian" <bruce(at)momjian(dot)us> writes:

> Gregory Stark wrote:
>>
>> Bruce, you seem to have removed one of my three patches from the queue. I
>> would actually prefer you remove the other two and put back that one. It's the
>> one I most urgently need feedback on to continue.
>
> I talked to Greg on IM. The complaint was that his posix_fadvise()
> patch was added to the TODO list as a URL, rather than getting him
> feedback. He is getting feedback now in another thread.

Sort of. The feedback I'm getting is from people who want to discuss the
"easy" parameters of the patch like how much prefetching to do and when
without actually reading to see what's already been done. I'm happy to discuss
them as I find these things interesting too.

But what I really need is someone to read the patch and say "looks good" or
point out things they don't like. In particular, what I really, really want is
some guidance on the singular key question I asked.

I want to know if we're interested in the more invasive patch restructuring
the buffer manager. My feeling is that we probably are eventually. But I
wonder if people wouldn't feel more comfortable taking baby steps at first
which will have less impact in cases where it's not being heavily used.

It's a lot more work to do the invasive patch and there's not much point in
doing it if we're not interested in taking it when it's ready.

Here's an updated patch. It's mainly just a CVS update but it also includes
some window dressing: an autoconf test, some documentation, and some fancy
arithmetic to auto-tune the amount of prefetching based on a more real-world
parameter "effective_spindle_count". It also includes printing out how much
the prefetching target got ramped up to in the explain output -- I'm not
sure we really want that in the tree, but it's nice for performance
testing.

Attachment Content-Type Size
bitmap-preread-v9.diff.gz application/octet-stream 9.1 KB

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Commitfest patches
Date: 2008-03-28 18:15:46
Message-ID: 47ED35D2.4000201@enterprisedb.com
Lists: pgsql-hackers

Gregory Stark wrote:
> I want to know if we're interested in the more invasive patch restructuring
> the buffer manager. My feeling is that we probably are eventually. But I
> wonder if people wouldn't feel more comfortable taking baby steps at first
> which will have less impact in cases where it's not being heavily used.
>
> It's a lot more work to do the invasive patch and there's not much point in
> doing it if we're not interested in taking it when it's ready.

I like the simplicity of this patch. My biggest worry is the impact on
performance when the posix_fadvise calls are not helping. I'd like to
see some testing of that. How expensive are the extra bufmgr hash table
lookups, compared to all the other stuff that's involved in a bitmap
heap scan?

Another thing I wonder is whether this approach can easily be extended
to other types of scans than bitmap heap scans. Normal index scans seem
most interesting; they can generate a lot of random I/O. I don't see any
big problems there, except again the worst-case performance: If you
issue AdviseBuffer calls for all the heap tuples in an index page, in
the naive way, you can issue hundreds of posix_fadvise calls. But what
if they're all actually on the same page? Surely you only want to go
through the buffer manager once (like we do in the scan, thanks to
ReleaseAndReadBuffer()) in that case, and only call posix_fadvise once.
But again, maybe you can convince me that it's a non-issue by some
benchmarking.
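
Something like this is what I have in mind for the duplicate case -- a
sketch with invented helper names around the AdviseBuffer call; sorting a
scratch copy means the order in which tuples are actually fetched is
unaffected:

#include <stdlib.h>
#include <string.h>

static int
blkcmp(const void *a, const void *b)
{
    BlockNumber x = *(const BlockNumber *) a;
    BlockNumber y = *(const BlockNumber *) b;

    return (x > y) - (x < y);
}

/* advise each distinct heap block referenced by one index page once */
static void
AdviseDistinctHeapBlocks(Relation rel, const BlockNumber *blks, int n)
{
    BlockNumber *copy = malloc(n * sizeof(BlockNumber));
    int          i;

    memcpy(copy, blks, n * sizeof(BlockNumber));
    qsort(copy, n, sizeof(BlockNumber), blkcmp);
    for (i = 0; i < n; i++)
        if (i == 0 || copy[i] != copy[i - 1])
            AdviseBuffer(rel, copy[i]); /* one posix_fadvise per page */
    free(copy);
}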

> Here's an updated patch. It's mainly just a CVS update but it also includes
> some window dressing: an autoconf test, some documentation, and some fancy
> arithmetic to auto-tune the amount of prefetching based on a more real-world
> parameter "effective_spindle_count".

I like the "effective_spindle_count". That's very intuitive.

The formula to turn that into preread_pages looks complex, though. I can
see the reasoning behind it, but I doubt things behave that beautifully
in the real world. (That's not important right now, we have plenty of
time to tweak that after more testing.)

> It also includes printing out how much
> the prefetching target got ramped up to in the explain output -- I'm not
> sure we really want that in the tree, but it's nice for performance
> testing.

I don't understand the ramp up logic. Can you explain what that's for
and how it works?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Commitfest patches
Date: 2008-03-28 18:48:47
Message-ID: 20080328184847.GE9150@svana.org
Lists: pgsql-hackers

On Fri, Mar 28, 2008 at 05:34:30PM +0000, Gregory Stark wrote:
> But what I really need is someone to read the patch and say "looks good" or
> point out things they don't like. In particular, what I really, really want is
> some guidance on the singular key question I asked.

I was going to write all sorts of stuff, till I noticed Heikki said
basically everything I was going to say:

- I think normal index scans could benefit from this (it was measurable
when I was playing with AIO in postgres a few years back).
- The integration with the bitmap scan is good, neat even
- I think the default preread_count is far too small, given you get a
benefit even if you only have one spindle.
- I don't understand the ramp-up method either.

People spend a lot of time worrying about hundreds of posix_fadvise()
calls but you don't need anywhere near that much to be effective. With
AIO I limited the number of outstanding requests to a dozen and it was
still useful. You lose nothing by capping the number of requests at any
point.

> I want to know if we're interested in the more invasive patch restructuring
> the buffer manager. My feeling is that we probably are eventually. But I
> wonder if people wouldn't feel more comfortable taking baby steps at first
> which will have less impact in cases where it's not being heavily used.

I think the way it is now is neat and simple and enough for now.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Commitfest patches
Date: 2008-03-30 05:02:53
Message-ID: Pine.GSO.4.64.0803281350110.2521@westnet.com
Lists: pgsql-hackers

On Fri, 28 Mar 2008, Gregory Stark wrote:

> I described which interfaces worked on Linux and Solaris based on empirical
> tests. I posted source code for synthetic benchmarks so we could test it on a
> wide range of hardware. I posted graphs based on empirical results.

Is it possible to post whatever script generates the graph (gnuplot?)
so people can compare the results they get to yours? It's when I realized
I didn't have that and would have to recreate one myself that my intention
to just run some quick results and ship them to you lost momentum and I
never circled back to it. It would be nice if you made it easier for
people to generate fancy results here as immediate gratification for them
(while of course just asking them to send the raw data).

I can run some tests on smaller Linux/Solaris systems to see if they don't
show a regression, that was my main concern about this experiment. Some
of the discussion that followed your original request for tests was kind
of confusing as far as how to interpret the results as well; I think I
know what to look for but certainly wouldn't mind some more guidance
there, too.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Commitfest patches
Date: 2008-04-05 02:21:05
Message-ID: 8763uxdotq.fsf@oxford.xeocode.com
Lists: pgsql-hackers


"Martijn van Oosterhout" <kleptog(at)svana(dot)org> writes:

> - I think normal index scans could benefit from this (it was measurable
> when I was playing with AIO in postgres a few years back).

I don't want to torture any existing code paths to get prefetching to work.
Heikki suggested I take advantage of the page-at-a-time index scanning,
though, which does sound like it could be convenient.

> - I think the number of preread_count is far too small, given you get a
> benefit even if you only have one spindle.
> - I don't understand the ramp-up method either.

So the idea is that I was deathly afraid of being accused of doing unnecessary
additional I/O. So I was worried about the user doing something like SELECT
... LIMIT 1. Or worse, just starting a select and discarding it after only
selecting a handful of records.

I also didn't want to start the bitmap scan by doing hundreds of syscalls
before we even return the first record.

So I figured it's safer to read only the first block at first. Only if the
user goes on to the next record do we try prefetching the next block. For
each record the user reads, we bump the prefetch amount up by one until we
hit the goal value for the size of raid array we're using.

That also nicely spreads out the syscalls so we get one prefetch between each
record returned.

We also don't know how densely packed the records are on the pages. If they're
densely packed then we'll have prefetched a whole bunch of records in time for
the second page. But if there's only one record on a page and we're already on
the second page I figured that indicated we would be reading many pages with
sparsely distributed records. So we ramp up the prefetch exponentially by
multiplying by two each time we move to the next page.

The net result is that if you do a bitmap scan which is processing lots of
pointers it'll quickly reach the point where it's prefetching the number of
blocks based on effective_spindle_count. If you're just processing the first
few tuples it'll only read a small number of extra pages. And if you're only
processing one record it won't prefetch at all.
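
In rough pseudo-C -- this is a paraphrase of the logic, not the patch's
actual code -- the ramp-up is just:

/* how many blocks to keep advised ahead of the scan */
static int  preread_target = 1;

/* called for each tuple returned to the executor; "goal" is the
 * target derived from effective_spindle_count */
static void
ramp_up_per_tuple(int goal)
{
    if (preread_target < goal)
        preread_target++;       /* grow linearly while tuples keep flowing */
}

/* called when the scan moves on to a new heap page */
static void
ramp_up_per_page(int goal)
{
    preread_target *= 2;        /* sparse pages: grow exponentially */
    if (preread_target > goal)
        preread_target = goal;
}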

> People spend a lot of time worrying about hundreds of posix_fadvise()
> calls but you don't need anywhere near that much to be effective. With
> AIO I limited the number of outstanding requests to a dozen and it was
> still useful. You lose nothing by capping the number of requests at any
> point.

Well, you leave money on the table. But yes, I'm trying to be conservative
about how much to prefetch when we don't know that it's in our favour.

>> I want to know if we're interested in the more invasive patch restructuring
>> the buffer manager. My feeling is that we probably are eventually. But I
>> wonder if people wouldn't feel more comfortable taking baby steps at first
>> which will have less impact in cases where it's not being heavily used.
>
> I think the way it is now is neat and simple and enough for now.

Thanks.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Commitfest patches
Date: 2008-04-05 02:25:57
Message-ID: 871w5ldolm.fsf@oxford.xeocode.com
Lists: pgsql-hackers


"Greg Smith" <gsmith(at)gregsmith(dot)com> writes:

> On Fri, 28 Mar 2008, Gregory Stark wrote:
>
>> I described which interfaces worked on Linux and Solaris based on empirical
>> tests. I posted source code for synthetic benchmarks so we could test it on a
>> wide range of hardware. I posted graphs based on empirical results.
>
> Is it possible to post whatever script that generates the graph (gnuplot?) so
> people can compare the results they get to yours?

Unfortunately I couldn't figure out how to get gnuplot to do this. I ended up
doing it in gnumeric. I'll look into fixing the script up to be more
automatic, and maybe I'll look at gnuplot again.

> I can run some tests on smaller Linux/Solaris systems to see if they don't show
> a regression, that was my main concern about this experiment.

Heikki suggested some things to test for regressions. I'll look at that.

What I'm more curious about, and hoped to get data from others on, is whether
posix_fadvise(WILLNEED) does anything useful on various versions of FreeBSD,
OSX, etc. Afaict the syscall doesn't exist at all on Solaris though, which
surprises me :(

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!