Re: Seq scans roadmap

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Simon Riggs <simon(at)enterprisedb(dot)com>
Subject: Seq scans roadmap
Date: 2007-05-08 10:40:19
Message-ID: 46405393.507@enterprisedb.com
Lists: pgsql-hackers

Here's my roadmap for the "scan-resistant buffer cache" and
"synchronized scans" patches.

1. Fix the current vacuum behavior of throwing dirty buffers to the
freelist, forcing a lot of WAL flushes. Instead, use a backend-private
ring of shared buffers that are recycled. This is what Simon's
"scan-resistant buffer manager" did.

The theory here is that if a page is read in by vacuum, it's unlikely to
be accessed in the near future, and therefore it should be recycled. If
vacuum doesn't dirty the page, it's best to reuse the buffer immediately
for the next page. However, if the buffer is dirty (and not just because
we set hint bits), we ought to delay writing it to disk until the
corresponding WAL record has been flushed to disk.
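
To make the ordering rule concrete, the recycling decision I have in mind
looks roughly like the sketch below. This is only an illustration of the
WAL-before-data check, not actual bufmgr code; the types and the
GetFlushedUpToLSN/FlushWAL/WritePage helpers are made-up stand-ins for the
real interfaces.

/* Illustrative sketch only: simplified stand-ins for bufmgr structures. */
typedef unsigned long long XLogRecPtr;      /* WAL position (LSN) */

typedef struct BufferSlot
{
    int         buf_id;     /* which shared buffer this ring slot holds */
    int         is_dirty;   /* page modified (beyond hint bits)? */
    XLogRecPtr  page_lsn;   /* LSN of the last WAL record touching the page */
} BufferSlot;

extern XLogRecPtr GetFlushedUpToLSN(void);   /* assumed: how far WAL is on disk */
extern void       FlushWAL(XLogRecPtr upto); /* assumed: force WAL out to disk */
extern void       WritePage(int buf_id);     /* assumed: write the data page */

/*
 * Decide what to do with a ring slot before recycling it for the next page.
 * Clean pages (or pages dirtied only by hint bits) are reused immediately;
 * a dirty page must not hit disk before its WAL record does.
 */
static void
recycle_slot(BufferSlot *slot)
{
    if (slot->is_dirty)
    {
        if (slot->page_lsn > GetFlushedUpToLSN())
            FlushWAL(slot->page_lsn);   /* the expensive case vacuum hits today */
        WritePage(slot->buf_id);
    }
    /* the slot can now be reused for the next page of the scan */
}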

Simon's patch used a fixed size ring of buffers that are recycled, but I
think the ring should be dynamically sized. Start with a small ring, and
whenever you need to do a WAL flush to write a dirty buffer, increase
the ring size. On every full iteration through the ring, decrease its
size to trim down an unnecessarily large ring.
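
Just to illustrate the sizing policy (the constants and function names here
are invented for the example, nothing more):

/* Toy model of the proposed self-tuning ring; names and constants invented. */
typedef struct BufferRing
{
    int     size;       /* current number of slots in the ring */
    int     min_size;
    int     max_size;
} BufferRing;

/* Called whenever recycling a slot forced a WAL flush: the ring is too small. */
static void
ring_note_wal_flush(BufferRing *ring)
{
    if (ring->size < ring->max_size)
        ring->size *= 2;
}

/* Called once per full trip around the ring: try to shrink back down. */
static void
ring_note_full_cycle(BufferRing *ring)
{
    if (ring->size > ring->min_size)
        ring->size--;
}

Whether growth should be multiplicative and shrinking linear is exactly the
kind of thing the vacuum tests will have to settle.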

This only alters the behavior of vacuums, and it's pretty safe to say it
won't get worse than what we have now. In the future, we can use the
buffer ring for seqscans as well; more on that in step 3.

2. Implement the list/table of last/ongoing seq scan positions. This is
Jeff's "synchronized scans" patch. When a seq scan starts on a table
larger than some threshold, it starts from where the previous seq scan
is currently, or where it ended. This will synchronize the scans so that
for two concurrent scans the total I/O is halved in the best case. There
should be no other effect on performance.

If you have a partitioned table, a union of multiple tables, or any
other plan where multiple seq scans are performed in arbitrary order,
this change won't alter the order in which the partitions are scanned,
and therefore won't ensure they are synchronized.
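
For step 2, the shared data structure can be pretty minimal. A rough sketch
of what I have in mind, with placeholder names and no locking shown (the
real thing would live in shared memory and need a lock or two):

/* Minimal sketch of a shared table of scan positions; placeholder names. */
#define SYNC_SCAN_SLOTS 64

typedef unsigned int Oid;
typedef unsigned int BlockNumber;

typedef struct ScanPosSlot
{
    Oid         relid;      /* which table */
    BlockNumber blockno;    /* last page reported by a scan on it */
} ScanPosSlot;

static ScanPosSlot scan_positions[SYNC_SCAN_SLOTS];

/* A starting scan asks where to begin: the previous position, or block 0. */
static BlockNumber
sync_scan_start_block(Oid relid)
{
    int     i = relid % SYNC_SCAN_SLOTS;

    if (scan_positions[i].relid == relid)
        return scan_positions[i].blockno;
    return 0;
}

/* A running scan reports its current position every now and then. */
static void
sync_scan_report(Oid relid, BlockNumber blockno)
{
    int     i = relid % SYNC_SCAN_SLOTS;

    scan_positions[i].relid = relid;
    scan_positions[i].blockno = blockno;
}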

Now that we have both pieces of the puzzle in place, it's time to
consider what more we can do with them:

3A. To take advantage of the "cache trail" of a previous seq scan, scan
backwards from where the previous seq scan ended, until you hit a
buffer that's not in cache.

This will allow taking advantage of the buffer cache even if the table
doesn't fit completely in RAM. That can make a big difference if the
table size is just slightly bigger than RAM, and can avoid the nasty
surprise when a table grows beyond RAM size and queries start taking
minutes instead of seconds.
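
The start-position logic for 3A would be little more than a backwards probe
of the buffer cache. Roughly (BufferIsCached is a hypothetical check; the
real code would consult the buffer mapping table without actually reading
anything):

typedef unsigned int BlockNumber;

/* Hypothetical: true if this block of the relation is in shared buffers. */
extern int BufferIsCached(void *rel, BlockNumber blkno);

/*
 * Sketch of 3A: walk backwards from where the previous scan ended and
 * start this scan at the oldest block that is still cached.
 */
static BlockNumber
cache_trail_start(void *rel, BlockNumber prev_end)
{
    BlockNumber start = prev_end;

    while (start > 0 && BufferIsCached(rel, start - 1))
        start--;

    return start;
}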

This should be a non-controversial change on its own from a performance
point of view. No query should get slower, and some will become faster.
But see step 3B:

3B. Currently, sequential scans on a large table spoil the buffer cache
by evicting other pages from the cache. In CVS HEAD, as soon as the
table is larger than shared_buffers, the pages in the buffer won't be
used to speed up running the same query again, and there's no reason to
believe the pages read in would be more useful than any other page in
the database, and in particular the pages that were in the buffer cache
before the huge seq scan. If the table being scanned is > 5 *
shared_buffers, the scan will evict every other page from the cache if
there's no other activity in the database (max usage_count is 5).

If the table is much larger than shared_buffers, say 10 times as large,
even with the change in 3A to read the pages that are in cache first, using
all of shared_buffers to cache the table will only speed up the query by
10%. We should not spoil the cache for such a small gain, and should use the
local buffer ring strategy instead. It's better to make queries that are
slow anyway a little bit slower than to make queries that are normally
really fast slow.

As you may notice, 3A and 3B are at odds with each other. We can
implement both, but you can't use both strategies in the same scan.
Therefore we need to have decision logic of some kind to figure out
which strategy is optimal.

A simple heuristic is to decide based on the table size:

< 0.1*shared_buffers -> start from page 0, keep in cache (like we do now)
< 5 * shared_buffers -> start from last read page, keep in cache
> 5 * shared_buffers -> start from last read page, use buffer ring

I'm not sure about the constants; we might need to make them GUC
variables as Simon argued, but that would be the general approach.
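
In code, the decision would amount to something like this (thresholds
hard-coded only for illustration; they could just as well be the GUCs):

/* Sketch of the per-scan strategy decision; thresholds are illustrative. */
typedef enum
{
    SCAN_FROM_START_CACHE_ALL,  /* small table: behave as today */
    SCAN_SYNC_CACHE_ALL,        /* medium: start from last position, keep in cache */
    SCAN_SYNC_RING              /* large: start from last position, use buffer ring */
} ScanStrategy;

static ScanStrategy
choose_scan_strategy(double table_blocks, double shared_buffers)
{
    if (table_blocks < 0.1 * shared_buffers)
        return SCAN_FROM_START_CACHE_ALL;
    if (table_blocks < 5.0 * shared_buffers)
        return SCAN_SYNC_CACHE_ALL;
    return SCAN_SYNC_RING;
}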

In the future, I'm envisioning a smarter algorithm to size the local
buffer ring. Take all buffers with usage_count=0, plus a few with
usage_count=1 from the clock sweep. That way, if there are a lot of buffers
in the buffer cache that are seldom used, we'll use more buffers to
cache the large scan, and vice versa. And no matter how large the scan,
it wouldn't blow all buffers from the cache. But if you execute the same
query again, the buffers left in the cache from the last scan were
apparently useful, so we use a bigger ring this time.
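
Hand-waving a bit, the sizing pass might look like the sketch below. The
clock sweep is reduced to a plain array walk over usage counts and the
limits are invented, so take it only as an illustration of the counting
idea:

/* Toy sizing pass over the buffer pool's usage counts; purely illustrative. */
static int
size_ring_from_usage_counts(const unsigned char *usage_count, int nbuffers)
{
    int     ring_size = 0;
    int     i;

    for (i = 0; i < nbuffers; i++)
    {
        if (usage_count[i] == 0)
            ring_size++;                /* take every unused buffer */
        else if (usage_count[i] == 1 && (i % 8) == 0)
            ring_size++;                /* plus a few lightly-used ones */
    }

    /* never let the ring swallow the whole cache, and never make it tiny */
    if (ring_size > nbuffers / 2)
        ring_size = nbuffers / 2;
    if (ring_size < 16)
        ring_size = 16;
    return ring_size;
}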

I'm going to do this incrementally, and we'll see how far we get for
8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up
Simon's patch (step 1), run some performance tests with vacuum, and
submit a patch for that. Then I'll move to Jeff's patch (step 2).

Thoughts? Everyone happy with the roadmap?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Cc: "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-08 11:47:18
Message-ID: C3E62232E3BCF24CBA20D72BFDCB6BF803EB6A3F@MI8NYCMAIL08.Mi8.com
Lists: pgsql-hackers

Heikki,

On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
implement dynamic I/O caching. The experiments have shown that the benefit
of re-using the PG buffer cache on large sequential scans is vanishingly
small when the buffer cache size is small compared to the system memory.
Since this is a normal and recommended situation (OS I/O cache is
auto-tuning and easy to administer, etc), IMO the effort to optimize
buffer cache reuse for seq scans > 1 x buffer cache size is not
worthwhile.

On 3B: The scenario described is "multiple readers seq scanning large
table and sharing bufcache", but in practice this is not a common
situation. The common situation is "multiple queries joining several
small tables to one or more large tables that are >> 1 x bufcache". In
the common scenario, the dominant factor is the ability to keep the
small tables in bufcache (or I/O cache for that matter) while running
the I/O bound large table scans as fast as possible.

To that point - an important factor in achieving max I/O rate for large
tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
This is commonly in the range of 512KB -> 2MB, which is only important
when considering a bound on the size of the ring buffer. The effect has
been demonstrated to be significant - in the 20%+ range. Another thing
to consider is the use of readahead inside the heapscan, in which case
sizes >= 32KB are very effective.

We've implemented both ideas (ring buffer, readahead) and see very
significant improvements in I/O and query speeds on DSS workloads. I
would expect benefits on OLTP as well.

The modifications you suggest here may not have the following
properties:
- don't pollute bufcache for seqscan of tables > 1 x bufcache
- for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
minimize L2 cache pollution

- Luke

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of
> Heikki Linnakangas
> Sent: Tuesday, May 08, 2007 3:40 AM
> To: PostgreSQL-development
> Cc: Jeff Davis; Simon Riggs
> Subject: [HACKERS] Seq scans roadmap
>
> Here's my roadmap for the "scan-resistant buffer cache" and
> "synchronized scans" patches.
>
> 1. Fix the current vacuum behavior of throwing dirty buffers
> to the freelist, forcing a lot of WAL flushes. Instead, use a
> backend-private ring of shared buffers that are recycled.
> This is what Simon's "scan-resistant buffer manager" did.
>
> The theory here is that if a page is read in by vacuum, it's
> unlikely to be accessed in the near future, therefore it
> should be recycled. If vacuum doesn't dirty the page, it's
> best to reuse the buffer immediately for the next page.
> However, if the buffer is dirty (and not just because we set
> hint bits), we ought to delay writing it to disk until the
> corresponding WAL record has been flushed to disk.
>
> Simon's patch used a fixed size ring of buffers that are
> recycled, but I think the ring should be dynamically sized.
> Start with a small ring, and whenever you need to do a WAL
> flush to write a dirty buffer, increase the ring size. On
> every full iteration through the ring, decrease its size to
> trim down an unnecessarily large ring.
>
> This only alters the behavior of vacuums, and it's pretty
> safe to say it won't get worse than what we have now. In the
> future, we can use the buffer ring for seqscans as well; more
> on that on step 3.
>
> 2. Implement the list/table of last/ongoing seq scan
> positions. This is Jeff's "synchronized scans" patch. When a
> seq scan starts on a table larger than some threshold, it
> starts from where the previous seq scan is currently, or
> where it ended. This will synchronize the scans so that for
> two concurrent scans the total I/O is halved in the best
> case. There should be no other effect on performance.
>
> If you have a partitioned table, or union of multiple tables
> or any other plan where multiple seq scans are performed in
> arbitrary order, this change won't change the order the
> partitions are scanned and won't therefore ensure they will
> be synchronized.
>
>
> Now that we have both pieces of the puzzle in place, it's
> time to consider what more we can do with them:
>
>
> 3A. To take advantage of the "cache trail" of a previous seq
> scan, scan
> backwards from where the previous seq scan ended, until you hit a
> buffer that's not in cache.
>
> This will allow taking advantage of the buffer cache even if
> the table
> doesn't fit completely in RAM. That can make a big difference if the
> table size is just slightly bigger than RAM, and can avoid the nasty
> surprise when a table grows beyond RAM size and queries start taking
> minutes instead of seconds.
>
> This should be a non-controversial change on its own from performance
> point of view. No query should get slower, and some will
> become faster.
> But see step 3B:
>
> 3B. Currently, sequential scans on a large table spoils the
> buffer cache
> by evicting other pages from the cache. In CVS HEAD, as soon as the
> table is larger than shared_buffers, the pages in the buffer won't be
> used to speed up running the same query again, and there's no
> reason to
> believe the pages read in would be more useful than any other page in
> the database, and in particular the pages that were in the
> buffer cache
> before the huge seq scan. If the table being scanned is > 5 *
> shared_buffers, the scan will evict every other page from the
> cache if
> there's no other activity in the database (max usage_count is 5).
>
> If the table is much larger than shared_buffers, say 10 times
> as large,
> even with the change 3B to read the pages that are in cache
> first, using
> all shared_buffers to cache the table will only speed up the query by
> 10%. We should not spoil the cache for such a small gain, and use the
> local buffer ring strategy instead. It's better to make
> queries that are
> slow anyway a little bit slower, than making queries that are
> normally
> really fast, slow.
>
>
> As you may notice, 3A and 3B are at odds with each other. We can
> implement both, but you can't use both strategies in the same scan.
> Therefore we need to have decision logic of some kind to figure out
> which strategy is optimal.
>
> A simple heuristic is to decide based on the table size:
>
> < 0.1*shared_buffers -> start from page 0, keep in cache
> (like we do now)
> < 5 * shared_buffers -> start from last read page, keep in cache
> > 5 * shared_buffers -> start from last read page, use buffer ring
>
> I'm not sure about the constants, we might need to make them GUC
> variables as Simon argued, but that would be the general approach.
>
> In the future, I'm envisioning a smarter algorithm to size the local
> buffer ring. Take all buffers with usage_count=0, plus a few with
> usage_count=1 from the clock sweep. That way if there's a lot
> of buffers
> in the buffer cache that are seldomly used, we'll use more buffers to
> cache the large scan, and vice versa. And no matter how large
> the scan,
> it wouldn't blow all buffers from the cache. But if you
> execute the same
> query again, the buffers left in the cache from the last scan were
> apparently useful, so we use a bigger ring this time.
>
> I'm going to do this incrementally, and we'll see how far we get for
> 8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up
> Simon's patch (step 1), run some performance tests with vacuum, and
> submit a patch for that. Then I'll move to Jeff's patch (step 2).
>
> Thoughts? Everyone happy with the roadmap?
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>
>


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Luke Lonergan <LLonergan(at)greenplum(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Simon Riggs <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-08 12:28:39
Message-ID: 46406CF7.9020607@enterprisedb.com
Lists: pgsql-hackers

Luke Lonergan wrote:
> On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
> implement dynamic I/O caching. The experiments have shown that benefit
> of re-using PG buffer cache on large sequential scans is vanishingly
> small when the buffer cache size is small compared to the system memory.
> Since this is a normal and recommended situation (OS I/O cache is
> auto-tuning and easy to administer, etc), IMO the effort to optimize
> buffer cache reuse for seq scans > 1 x buffer cache size is not
> worthwhile.

That's interesting. Care to share the results of the experiments you
ran? I was thinking of running tests of my own with varying table sizes.

The main motivation here is to avoid the sudden drop in performance when
a table grows big enough not to fit in RAM. See attached diagram for
what I mean. Maybe you're right and the effect isn't that bad in practice.

I'm thinking of attacking 3B first anyway, because it seems much simpler
to implement.

> On 3B: The scenario described is "multiple readers seq scanning large
> table and sharing bufcache", but in practice this is not a common
> situation. The common situation is "multiple queries joining several
> small tables to one or more large tables that are >> 1 x bufcache". In
> the common scenario, the dominant factor is the ability to keep the
> small tables in bufcache (or I/O cache for that matter) while running
> the I/O bound large table scans as fast as possible.

How is that different from what I described?

> To that point - an important factor in achieving max I/O rate for large
> tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
> This is commonly in the range of 512KB -> 2MB, which is only important
> when considering a bound on the size of the ring buffer. The effect has
> been demonstrated to be significant - in the 20%+ range. Another thing
> to consider is the use of readahead inside the heapscan, in which case
> sizes >= 32KB are very effective.

Yeah I remember the discussion on the L2 cache a while back.

What do you mean with using readahead inside the heapscan? Starting an
async read request?

> The modifications you suggest here may not have the following
> properties:
> - don't pollute bufcache for seqscan of tables > 1 x bufcache
> - for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
> minimize L2 cache pollution

So the difference is that you don't want 3A (the take advantage of pages
already in buffer cache) strategy at all, and want the buffer ring
strategy to kick in earlier instead. Am I reading you correctly?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment: image/png (7.4 KB)

From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-08 12:44:11
Message-ID: C3E62232E3BCF24CBA20D72BFDCB6BF803EB6A65@MI8NYCMAIL08.Mi8.com
Lists: pgsql-hackers

Heikki,

> That's interesting. Care to share the results of the
> experiments you ran? I was thinking of running tests of my
> own with varying table sizes.

Yah - it may take a while - you might get there faster.

There are some interesting effects to look at between I/O cache
performance and PG bufcache, and at those speeds the only tool I've
found that actually measures scan rate in PG is VACUUM. "SELECT
COUNT(*)" measures CPU consumption in the aggregation node, not scan
rate.

Note that the copy from I/O cache to PG bufcache is where the L2 effect
is seen.

> The main motivation here is to avoid the sudden drop in
> performance when a table grows big enough not to fit in RAM.
> See attached diagram for what I mean. Maybe you're right and
> the effect isn't that bad in practice.

There are going to be two performance drops: the first when the table
doesn't fit into the PG bufcache, the second when it doesn't fit in bufcache
+ I/O cache. The second is severe, the first is almost insignificant
(for common queries).

> How is that different from what I described?

My impression of your descriptions is that they overvalue the case where
there are multiple scanners of a large (> 1x bufcache) table such that
they can share the "first load" of the bufcache, e.g. your 10% benefit
for table = 10x bufcache argument. I think this is an uncommon
workload; rather, there are normally many small tables and several large
tables such that sharing the PG bufcache is irrelevant to the query
speed.

> Yeah I remember the discussion on the L2 cache a while back.
>
> What do you mean with using readahead inside the heapscan?
> Starting an async read request?

Nope - just reading N buffers ahead for seqscans. Subsequent calls use
previously read pages. The objective is to issue contiguous reads to
the OS in sizes greater than the PG page size (which is much smaller
than what is needed for fast sequential I/O).
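
To give a feel for the shape of it (this is not our actual code, just a
standalone sketch with made-up sizes): read a whole strip of pages with one
call into a private staging area, then hand pages out of the strip until it
runs dry.

/* Standalone sketch of strip-at-a-time readahead; sizes and names made up. */
#include <unistd.h>
#include <sys/types.h>

#define PAGE_SZ     8192
#define STRIP_PAGES 32                      /* 32 x 8K = 256K per read call */

typedef struct ReadAheadState
{
    int     fd;
    char    strip[STRIP_PAGES * PAGE_SZ];   /* private staging area */
    off_t   strip_start;                    /* file offset of strip[0] */
    int     pages_valid;                    /* pages the last read returned */
} ReadAheadState;

/* Return a pointer to page 'pageno', refilling the strip when we run past it. */
static const char *
readahead_get_page(ReadAheadState *st, long pageno)
{
    off_t   off = (off_t) pageno * PAGE_SZ;

    if (off < st->strip_start ||
        off >= st->strip_start + (off_t) st->pages_valid * PAGE_SZ)
    {
        ssize_t n = pread(st->fd, st->strip, sizeof(st->strip), off);

        if (n < PAGE_SZ)
            return NULL;                    /* EOF or error */
        st->strip_start = off;
        st->pages_valid = (int) (n / PAGE_SZ);
    }
    return st->strip + (off - st->strip_start);
}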

> > The modifications you suggest here may not have the following
> > properties:
> > - don't pollute bufcache for seqscan of tables > 1 x bufcache
> > - for tables > 1 x bufcache use a ringbuffer for I/O that
> is ~ 32KB to
> > minimize L2 cache pollution
>
> So the difference is that you don't want 3A (the take
> advantage of pages already in buffer cache) strategy at all,
> and want the buffer ring strategy to kick in earlier instead.
> Am I reading you correctly?

Yes, I think the ring buffer strategy should be used when the table size
is > 1 x bufcache and the ring buffer should be of a fixed size smaller
than L2 cache (32KB - 128KB seems to work well).

- Luke


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Luke Lonergan <LLonergan(at)greenplum(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Simon Riggs <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-08 12:57:27
Message-ID: 464073B7.4070304@enterprisedb.com
Lists: pgsql-hackers

Luke Lonergan wrote:
>> What do you mean with using readahead inside the heapscan?
>> Starting an async read request?
>
> Nope - just reading N buffers ahead for seqscans. Subsequent calls use
> previously read pages. The objective is to issue contiguous reads to
> the OS in sizes greater than the PG page size (which is much smaller
> than what is needed for fast sequential I/O).

Are you filling multiple buffers in the buffer cache with a single
read-call? The OS should be doing readahead for us anyway, so I don't
see how just issuing multiple ReadBuffers one after each other helps.

> Yes, I think the ring buffer strategy should be used when the table size
> is > 1 x bufcache and the ring buffer should be of a fixed size smaller
> than L2 cache (32KB - 128KB seems to work well).

I think we want to let the ring grow larger than that for updating
transactions and vacuums, though, to avoid the WAL flush problem.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-08 15:09:04
Message-ID: E1539E0ED7043848906A8FF995BDA57901FD9725@m0143.s-mxs.net
Lists: pgsql-hackers


> Nope - just reading N buffers ahead for seqscans. Subsequent
> calls use previously read pages. The objective is to issue
> contiguous reads to the OS in sizes greater than the PG page
> size (which is much smaller than what is needed for fast
> sequential I/O).

The problem here is that either you issue the large read into throwaway
private memory and hope that when you later read 8k you get the page
from the OS buffercache, or you need ScatterGather IO and a way to grab 32
buffers at once.
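
The second option, in plain POSIX terms, would look roughly like this
(standalone sketch, error handling and buffer management omitted): grab the
buffers first, then fill them all with one readv() call.

/* Sketch: fill several already-grabbed 8K buffers with one readv() call. */
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define PAGE_SZ 8192
#define MAX_VEC 32

/* Returns the number of whole pages actually read, or -1 on error. */
static int
read_pages_vectored(int fd, off_t start_offset, char *pages[], int npages)
{
    struct iovec iov[MAX_VEC];
    ssize_t      n;
    int          i;

    if (npages > MAX_VEC)
        npages = MAX_VEC;
    for (i = 0; i < npages; i++)
    {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = PAGE_SZ;
    }

    /* readv() has no file offset argument, so position the descriptor first */
    if (lseek(fd, start_offset, SEEK_SET) < 0)
        return -1;
    n = readv(fd, iov, npages);
    if (n < 0)
        return -1;
    return (int) (n / PAGE_SZ);
}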

> Yes, I think the ring buffer strategy should be used when the
> table size is > 1 x bufcache and the ring buffer should be of
> a fixed size smaller than L2 cache (32KB - 128KB seems to work well).

How do you merge those two objectives? It seems the ring buffer needs to
be at least as large as the contiguous read size, so you would need at
least a 256k ring buffer. Better yet, make the ring buffer twice the
IO size, so two sessions can alternately take the
lead while the other session still holds a previous page. Modern L2 cache
is 8 MB, so 512k seems like no problem?

Andreas


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-08 15:13:37
Message-ID: E1539E0ED7043848906A8FF995BDA57901FD9726@m0143.s-mxs.net
Lists: pgsql-hackers


> >> What do you mean with using readahead inside the heapscan?
> >> Starting an async read request?
> >
> > Nope - just reading N buffers ahead for seqscans. Subsequent calls
> > use previously read pages. The objective is to issue
> contiguous reads
> > to the OS in sizes greater than the PG page size (which is much
> > smaller than what is needed for fast sequential I/O).
>
> Are you filling multiple buffers in the buffer cache with a
> single read-call?

yes, needs vector or ScatterGather IO.

> The OS should be doing readahead for us
> anyway, so I don't see how just issuing multiple ReadBuffers
> one after each other helps.

Last time I looked, OS readahead was only comparable to 32k blocked reads;
256k blocked reads still perform way better. Also, when the OS is confronted
with an IO storm, the 256k reads perform way better than OS readahead.

Andreas


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-08 16:11:57
Message-ID: 87bqgvz44y.fsf@oxford.xeocode.com
Lists: pgsql-hackers


"Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at> writes:

>> Are you filling multiple buffers in the buffer cache with a
>> single read-call?
>
> yes, needs vector or ScatterGather IO.

I would expect that to get only moderate improvement. To get the full benefit
I would think you would want to either fire off a separate thread to do the
read-ahead, use libaio, or funnel the read-ahead requests to a separate thread
like our bgwriter only it would be a bgreader or something like that.

>> The OS should be doing readahead for us
>> anyway, so I don't see how just issuing multiple ReadBuffers
>> one after each other helps.
>
> Last time I looked OS readahead was only comparable to 32k blocked reads.
> 256k blocked reads still perform way better. Also when the OS is confronted
> with an IO storm the 256k reads perform way better than OS readahead.

Well that's going to depend on the OS. Last I checked Linux's readahead logic
is pretty straightforward and doesn't try to do any better than 32k readahead
and is easily fooled. However I wouldn't be surprised if that's changed.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-08 20:56:36
Message-ID: 1178657796.24902.49.camel@dogma.v10.wvs
Lists: pgsql-hackers

On Tue, 2007-05-08 at 11:40 +0100, Heikki Linnakangas wrote:
> I'm going to do this incrementally, and we'll see how far we get for
> 8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up
> Simon's patch (step 1), run some performance tests with vacuum, and
> submit a patch for that. Then I'll move to Jeff's patch (step 2).

I think it's best to postpone 3A (one aspect of my patch). There are a
couple of reasons:

1) The results didn't show enough benefit yet.
2) Complex interactions with Simon's patch.

I think it's an area of research for the future, but for now I just want
to concentrate on the most important aspect of my patch: the
synchronized scanning (#2 in your list).

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Luke Lonergan <LLonergan(at)greenplum(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-08 21:27:25
Message-ID: 1178659645.24902.62.camel@dogma.v10.wvs
Lists: pgsql-hackers

On Tue, 2007-05-08 at 07:47 -0400, Luke Lonergan wrote:
> Heikki,
>
> On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
> implement dynamic I/O caching. The experiments have shown that benefit
> of re-using PG buffer cache on large sequential scans is vanishingly
> small when the buffer cache size is small compared to the system memory.
> Since this is a normal and recommended situation (OS I/O cache is
> auto-tuning and easy to administer, etc), IMO the effort to optimize
> buffer cache reuse for seq scans > 1 x buffer cache size is not
> worthwhile.
>

I think 3A is still an interesting idea, but I agree that it needs some
results to back it up. Let's save 3A for the next release so that we
have more time to see.

> To that point - an important factor in achieving max I/O rate for large
> tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
> This is commonly in the range of 512KB -> 2MB, which is only important
> when considering a bound on the size of the ring buffer. The effect has
> been demonstrated to be significant - in the 20%+ range. Another thing
> to consider is the use of readahead inside the heapscan, in which case
> sizes >= 32KB are very effective.

One thing I'd like us to keep in mind is to have a reasonable number of
buffers active for a sequential scan. If the number is too small, my
sync scans might not even work. Right now my patch only reports every 16
pages, so 32KB (=4 pages) is not going to work for sync scans (I suppose
only testing will tell).
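
For reference, the reporting side of my patch is essentially just a modulus
check on the block number, something like the fragment below (names
simplified, not the literal patch); that's why a very small ring worries me,
a follower could easily fall off the reported trail.

/* Simplified fragment of the reporting logic; placeholder names. */
#define SYNC_SCAN_REPORT_INTERVAL 16        /* report every 16 pages read */

extern void report_scan_position(unsigned int relid, unsigned int blockno);

static void
maybe_report(unsigned int relid, unsigned int blockno)
{
    if ((blockno % SYNC_SCAN_REPORT_INTERVAL) == 0)
        report_scan_position(relid, blockno);
}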

Regards,
Jeff Davis


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-09 08:48:00
Message-ID: E1539E0ED7043848906A8FF995BDA57901FD9793@m0143.s-mxs.net
Lists: pgsql-hackers


> >> Are you filling multiple buffers in the buffer cache with a single
> >> read-call?
> >
> > yes, needs vector or ScatterGather IO.
>
> I would expect that to get only moderate improvement.

The vast improvement comes from 256k blocksize.

> To get
> the full benefit I would think you would want to either fire
> off a separate thread to do the read-ahead, use libaio, or
> funnel the read-ahead requests to a separate thread like our
> bgwriter only it would be a bgreader or something like that.

I like bgreader :-) But that looks even more difficult than grabbing 32
[scattered or contiguous] buffers at once.
Especially in a situation where there is no concurrent load, it would be
nice to do CPU work while waiting for the next readahead IO. If there
is enough parallel CPU load it is actually not so important. So I'd say
that on a high-load server you get nearly all of the benefit without any
sort of aio.

> >> The OS should be doing readahead for us anyway, so I don't see how
> >> just issuing multiple ReadBuffers one after each other helps.
> >
> > Last time I looked OS readahead was only comparable to 32k blocked reads.
> > 256k blocked reads still perform way better. Also when the OS is
> > confronted with an IO storm the 256k reads perform way better than
> > OS readahead.
>
> Well that's going to depend on the OS. Last I checked Linux's
> readahead logic is pretty straightforward and doesn't try to
> do any better than 32k readahead and is easily fooled.
> However I wouldn't be surprised if that's changed.

My test was on AIX; 32 or 64k seems quite common, at least as a default
setting.
Also, on some OS's (like HPUX) the OS readahead and writebehind strategy
changes with large IO blocksizes, imho beneficially.

Andreas


From: "Simon Riggs" <simon(at)enterprisedb(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-09 09:27:59
Message-ID: 1178702879.10861.29.camel@silverbirch.site
Lists: pgsql-hackers

On Tue, 2007-05-08 at 11:40 +0100, Heikki Linnakangas wrote:
> Here's my roadmap for the "scan-resistant buffer cache" and
> "synchronized scans" patches.
>
> 1. Fix the current vacuum behavior of throwing dirty buffers to the
> freelist, forcing a lot of WAL flushes. Instead, use a backend-private
> ring of shared buffers that are recycled. This is what Simon's
> "scan-resistant buffer manager" did.
>
> The theory here is that if a page is read in by vacuum, it's unlikely to
> be accessed in the near future, therefore it should be recycled. If
> vacuum doesn't dirty the page, it's best to reuse the buffer immediately
> for the next page. However, if the buffer is dirty (and not just because
> we set hint bits), we ought to delay writing it to disk until the
> corresponding WAL record has been flushed to disk.
>
> Simon's patch used a fixed size ring of buffers that are recycled, but I
> think the ring should be dynamically sized. Start with a small ring, and
> whenever you need to do a WAL flush to write a dirty buffer, increase
> the ring size. On every full iteration through the ring, decrease its
> size to trim down an unnecessarily large ring.
>
> This only alters the behavior of vacuums, and it's pretty safe to say it
> won't get worse than what we have now.

I think that's too much code; why not just leave it as it is? Would a
dynamic buffer be substantially better? If not, why bother?

> In the future, we can use the
> buffer ring for seqscans as well; more on that on step 3.

There was a clear benefit from that. You sound like you are suggesting
removing the behaviour for Seq Scans, which wouldn't make much sense?

> 2. Implement the list/table of last/ongoing seq scan positions. This is
> Jeff's "synchronized scans" patch. When a seq scan starts on a table
> larger than some threshold, it starts from where the previous seq scan
> is currently, or where it ended. This will synchronize the scans so that
> for two concurrent scans the total I/O is halved in the best case. There
> should be no other effect on performance.
>
> If you have a partitioned table, or union of multiple tables or any
> other plan where multiple seq scans are performed in arbitrary order,
> this change won't change the order the partitions are scanned and won't
> therefore ensure they will be synchronized.
>
>
> Now that we have both pieces of the puzzle in place, it's time to
> consider what more we can do with them:
>
>
> 3A. To take advantage of the "cache trail" of a previous seq scan, scan
> backwards from where the previous seq scan ended, until you hit a
> buffer that's not in cache.
>
> This will allow taking advantage of the buffer cache even if the table
> doesn't fit completely in RAM. That can make a big difference if the
> table size is just slightly bigger than RAM, and can avoid the nasty
> surprise when a table grows beyond RAM size and queries start taking
> minutes instead of seconds.
>
> This should be a non-controversial change on its own from performance
> point of view. No query should get slower, and some will become faster.
> But see step 3B:
>
> 3B. Currently, sequential scans on a large table spoils the buffer cache
> by evicting other pages from the cache. In CVS HEAD, as soon as the
> table is larger than shared_buffers, the pages in the buffer won't be
> used to speed up running the same query again, and there's no reason to
> believe the pages read in would be more useful than any other page in
> the database, and in particular the pages that were in the buffer cache
> before the huge seq scan. If the table being scanned is > 5 *
> shared_buffers, the scan will evict every other page from the cache if
> there's no other activity in the database (max usage_count is 5).
>
> If the table is much larger than shared_buffers, say 10 times as large,
> even with the change 3B to read the pages that are in cache first, using
> all shared_buffers to cache the table will only speed up the query by
> 10%. We should not spoil the cache for such a small gain, and use the
> local buffer ring strategy instead. It's better to make queries that are
> slow anyway a little bit slower, than making queries that are normally
> really fast, slow.
>
>
> As you may notice, 3A and 3B are at odds with each other. We can
> implement both, but you can't use both strategies in the same scan.

Not sure I've seen any evidence of that.

Most scans will be solo and so should use the ring buffer, since there
is clear evidence of that. If there were evidence to suggest the two
patches conflict then we should turn off the ring buffer only when
concurrent scans are in progress (while remembering that concurrent
scans will not typically overlap as much as the synch scan tests show
and so for much of their execution they too will be solo).

> Therefore we need to have decision logic of some kind to figure out
> which strategy is optimal.
>
> A simple heuristic is to decide based on the table size:
>
> < 0.1*shared_buffers -> start from page 0, keep in cache (like we do now)
> < 5 * shared_buffers -> start from last read page, keep in cache
> > 5 * shared_buffers -> start from last read page, use buffer ring
>
> I'm not sure about the constants, we might need to make them GUC
> variables as Simon argued, but that would be the general approach.

If you want to hardcode it, I'd say use the ring buffer on scans of 1000
blocks or more, or we have a GUC. Sizing things to shared_buffers isn't
appropriate because of the effects of partitioning, as I argued in my
last post, which I think is still valid.

> Thoughts? Everyone happy with the roadmap?

I think separating the patches is now the best way forward, though both
are good.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "CK Tan" <cktan(at)greenplum(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-10 03:52:24
Message-ID: 30E8D12C-C5C1-48DA-BF06-08353C398C35@greenplum.com
Lists: pgsql-hackers

Hi,

In reference to the seq scans roadmap, I have just submitted a patch
that addresses some of the concerns.

The patch does this:

1. for small relations (smaller than 60% of the bufferpool), use the
current logic
2. for big relations (see the sketch below):
   - use a ring buffer in the heap scan
   - pin the first 12 pages when the scan starts
   - on consumption of every 4 pages, read and pin the next 4 pages
   - invalidate used pages in the scan so they do not force out
     other useful pages

4 files changed:
bufmgr.c, bufmgr.h, heapam.c, relscan.h
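
In outline, the scan-side loop behaves like the sketch below (not the
literal patch, helper names are placeholders): 12 pages stay pinned ahead of
the scan while 4 are being consumed, i.e. a 16-slot ring, and every 4 pages
consumed we drop those 4 and pin the next 4.

/* Outline of the ring's pin-ahead behaviour; placeholder names, not the patch. */
#define PREFETCH_UNIT   4       /* pages read/pinned and dropped per step */
#define PINNED_AHEAD   12       /* pages kept pinned ahead of the scan */

extern void pin_and_read_pages(unsigned int first_block, int npages);
extern void invalidate_pages(unsigned int first_block, int npages);

/* Called after the scan has consumed 'consumed' pages in total. */
static void
advance_ring(unsigned int consumed)
{
    if (consumed == 0 || consumed % PREFETCH_UNIT != 0)
        return;

    /* drop the pages just consumed so they cannot evict useful pages */
    invalidate_pages(consumed - PREFETCH_UNIT, PREFETCH_UNIT);
    /* and keep PINNED_AHEAD pages pinned ahead of the scan position */
    pin_and_read_pages(consumed + PINNED_AHEAD - PREFETCH_UNIT, PREFETCH_UNIT);
}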

If there is interest, I can submit another scan patch that returns
N tuples at a time, instead of the current one-at-a-time interface. This
improves code locality and further improves performance by another
10-20%.

For TPCH 1G tables, we are seeing more than 20% improvement in scans
on the same hardware.

-------------------------------------------------------------------------
----- PATCHED VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
count
---------
6001215
(1 row)

Time: 2117.025 ms

-------------------------------------------------------------------------
----- ORIGINAL CVS HEAD VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
count
---------
6001215
(1 row)

Time: 2722.441 ms

Suggestions for improvement are welcome.

Regards,
-cktan
Greenplum, Inc.

On May 8, 2007, at 5:57 AM, Heikki Linnakangas wrote:

> Luke Lonergan wrote:
>>> What do you mean with using readahead inside the heapscan?
>>> Starting an async read request?
>> Nope - just reading N buffers ahead for seqscans. Subsequent
>> calls use
>> previously read pages. The objective is to issue contiguous reads to
>> the OS in sizes greater than the PG page size (which is much smaller
>> than what is needed for fast sequential I/O).
>
> Are you filling multiple buffers in the buffer cache with a single
> read-call? The OS should be doing readahead for us anyway, so I
> don't see how just issuing multiple ReadBuffers one after each
> other helps.
>
>> Yes, I think the ring buffer strategy should be used when the
>> table size
>> is > 1 x bufcache and the ring buffer should be of a fixed size
>> smaller
>> than L2 cache (32KB - 128KB seems to work well).
>
> I think we want to let the ring grow larger than that for updating
> transactions and vacuums, though, to avoid the WAL flush problem.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 6: explain analyze is your friend
>


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "CK Tan" <cktan(at)greenplum(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-10 10:14:11
Message-ID: E1539E0ED7043848906A8FF995BDA57901FD995A@m0143.s-mxs.net
Lists: pgsql-hackers


> In reference to the seq scans roadmap, I have just submitted
> a patch that addresses some of the concerns.
>
> The patch does this:
>
> 1. for small relation (smaller than 60% of bufferpool), use
> the current logic 2. for big relation:
> - use a ring buffer in heap scan
> - pin first 12 pages when scan starts
> - on consumption of every 4-page, read and pin the next 4-page
> - invalidate used pages of in the scan so they do not
> force out other useful pages

A few comments regarding the effects:

I do not see how this speedup could be caused by readahead, so what are
the effects?
(It should make no difference to do the CPU work for count(*) in between
reading each block when the pages are not dirtied.)
Is the improvement solely reduced CPU, because no search for a free
buffer is needed, and/or L2 cache locality?

What effect does the advance pinning have? Avoiding vacuum?

A 16 x 8k page ring is too small to allow the needed IO blocksize of
256k. The readahead is done 4 x one page at a time (= 32k).
What is the reasoning behind 1/4 of the ring for readahead (why not 1/2)?
Is 3/4 the trail for followers and the bgwriter?

I think, in anticipation of doing a single IO call for more than one
page, the KillAndReadBuffer function should be split into two parts: one
that does the killing for n pages, and one that does the reading for n
pages.
Killing n before reading n would also have the positive effect of
grouping perhaps-needed writes (not interleaving them with the reads).

I think the 60% of NBuffers threshold is a very good starting point. I
would only introduce a GUC when we see evidence that it is needed (I agree
with Simon's partitioning comments, but I'd still wait and see).

Andreas


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "CK Tan" <cktan(at)greenplum(dot)com>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-10 10:51:55
Message-ID: E1539E0ED7043848906A8FF995BDA57901FD9968@m0143.s-mxs.net
Lists: pgsql-hackers


> Also, that patch doesn't address the VACUUM issue at all. And
> using a small fixed size ring with scans that do updates can
> be devastating. I'm experimenting with different ring sizes
> for COPY at the moment. Too small ring leads to a lot of WAL
> flushes, it's basically the same problem we have with VACUUM
> in CVS HEAD.

My first take on that would be to simply abandon any dirty (and actually
also any still pinned) buffer from the ring and replace the ring slot
with a buffer from the freelist.
If the freelist is empty and LSN allows writing the buffer, write it
(and maybe try to group these).
If the LSN does not allow the write, replace the slot with a buffer from
LRU.
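
In rough pseudo-C, the per-slot decision I have in mind (placeholder
helpers, locking and the write grouping left out):

/* Rough per-slot decision; placeholder helpers, locking and grouping omitted. */
typedef unsigned long long XLogRecPtr;

extern int        buffer_from_freelist(void);       /* -1 if the freelist is empty */
extern int        buffer_from_lru(void);
extern int        wal_flushed_past(XLogRecPtr lsn);
extern void       write_buffer(int buf_id);
extern XLogRecPtr buffer_lsn(int buf_id);

/* Returns the buffer id to put back into this ring slot. */
static int
replace_ring_slot(int buf_id, int is_dirty, int is_pinned)
{
    int     replacement;

    if (!is_dirty && !is_pinned)
        return buf_id;              /* clean and unpinned: reuse the slot as-is */

    replacement = buffer_from_freelist();
    if (replacement >= 0)
        return replacement;         /* abandon the dirty/pinned buffer, take a fresh one */

    if (wal_flushed_past(buffer_lsn(buf_id)))
    {
        write_buffer(buf_id);       /* the LSN allows the write: keep the slot */
        return buf_id;
    }
    return buffer_from_lru();       /* otherwise take a replacement from the LRU */
}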

Andreas


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>
Cc: CK Tan <cktan(at)greenplum(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Simon Riggs <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-10 12:22:01
Message-ID: 46430E69.7030400@enterprisedb.com
Lists: pgsql-hackers

Zeugswetter Andreas ADI SD wrote:
>> In reference to the seq scans roadmap, I have just submitted
>> a patch that addresses some of the concerns.
>>
>> The patch does this:
>>
>> 1. for small relation (smaller than 60% of bufferpool), use
>> the current logic 2. for big relation:
>> - use a ring buffer in heap scan
>> - pin first 12 pages when scan starts
>> - on consumption of every 4-page, read and pin the next 4-page
>> - invalidate used pages of in the scan so they do not
>> force out other useful pages
>
> A few comments regarding the effects:
>
> I do not see how this speedup could be caused by readahead, so what are
> the effects ?

I was wondering that as well. We'd really need to test all the changes
separately to see where the improvements are really coming from.

Also, that patch doesn't address the VACUUM issue at all. And using a
small fixed size ring with scans that do updates can be devastating. I'm
experimenting with different ring sizes for COPY at the moment. Too
small a ring leads to a lot of WAL flushes; it's basically the same
problem we have with VACUUM in CVS HEAD.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>
Cc: CK Tan <cktan(at)greenplum(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Simon Riggs <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-10 13:33:05
Message-ID: 46431F11.9020303@enterprisedb.com
Lists: pgsql-hackers

Zeugswetter Andreas ADI SD wrote:
>> Also, that patch doesn't address the VACUUM issue at all. And
>> using a small fixed size ring with scans that do updates can
>> be devastating. I'm experimenting with different ring sizes
>> for COPY at the moment. Too small ring leads to a lot of WAL
>> flushes, it's basically the same problem we have with VACUUM
>> in CVS HEAD.
>
> My first take on that would be to simply abandon any dirty (and actually
> also any still pinned) buffer from the ring and replace the ring slot
> with a buffer from the freelist.
> If the freelist is empty and LSN allows writing the buffer, write it
> (and maybe try to group these).
> If the LSN does not allow the write, replace the slot with a buffer from
> LRU.

That would effectively disable the ring for COPY and the 2nd phase of
VACUUM.

One problem with looking at the LSN is that you need the content lock to
read it, and I wouldn't want to add any new locking. It could be done
inside FlushBuffer when we hold the lock anyway, but I'm afraid the
changes would be pretty invasive.

I'm struggling to get a grip on what the optimal ring size is under
various circumstances. Some thoughts I have so far:
- a small ring gives better L2 cache behavior
- for read-only queries, and for queries that just set hint bits, 1 buffer
is enough
- a small ring with a query that writes WAL (COPY, mass updates, 2nd phase
of VACUUM) leads to a lot of WAL flushes, which can become a bottleneck.

But all these assumptions need to be validated. I'm setting up tests
with different ring sizes and queries to get a clear picture of this:
- VACUUM on a clean table
- VACUUM on a table with 1 dead tuple per page
- read-only scan, large table
- read-only scan, table fits in OS cache
- COPY

In addition, I'm going to run VACUUM in a DBT-2 test to see the effect
on other queries running concurrently.

I think a ring that grows when WAL flushes occur covers all the use
cases reasonably well, but I need to do the testing...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, CK Tan <cktan(at)greenplum(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Simon Riggs <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-10 17:33:28
Message-ID: 46435768.9020909@enterprisedb.com
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> However, it caught me by total surprise that the performance with 1
> buffer is so horrible. Using 2 buffers is enough to avoid whatever the
> issue is with just 1 buffer. I have no idea what's causing that. There
> must be some interaction that I don't understand.

Ok, I found the reason for that. I was using this query for the selects:
SELECT COUNT(*) FROM (SELECT 1 FROM stock_copytest LIMIT 10000000) AS a;

Stock_copytest is larger than RAM size; that's why I used the LIMIT to
make the result set memory resident. That had the side effect that
apparently the limit node kept the single buffer pinned, which defeated
the buffer ring completely. To avoid issues like that, we apparently want
to use 2-4 buffers instead of just 1.

I'll review my test methodology and keep testing...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, CK Tan <cktan(at)greenplum(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Simon Riggs <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-10 18:02:37
Message-ID: 46435E3D.206@enterprisedb.com
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> But all these assumptions need to be validated. I'm setting up tests
> with different ring sizes and queries to get a clear picture of this:
> - VACUUM on a clean table
> - VACUUM on a table with 1 dead tuple per page
> - read-only scan, large table
> - read-only scan, table fits in OS cache
> - COPY

Just to keep you guys informed, here are my results for a read-only scan on
a table bigger than shared_buffers but smaller than RAM:

select-1 | 00:00:10.853831
select-1 | 00:00:10.380667
select-1 | 00:00:11.530528
select-2 | 00:00:08.634105
select-2 | 00:00:02.674084
select-4 | 00:00:02.65664
select-8 | 00:00:02.662922
select-16 | 00:00:02.682475
select-32 | 00:00:02.693163
select-64 | 00:00:02.722031
select-128 | 00:00:02.873645
select-256 | 00:00:03.185586
select-512 | 00:00:03.534285
select-1024 | 00:00:03.741867

The lshw utility tells me that this server has 32KB of L1 cache and 4MB of
L2 cache. The performance starts to drop between 64 and 128 buffers, which
is 512 - 1024 KB, so I'm not sure how it's related to the cache size, but
using a small number of buffers is clearly better than using a large number.

However, it caught me by total surprise that the performance with 1
buffer is so horrible. Using 2 buffers is enough to avoid whatever the
issue is with just 1 buffer. I have no idea what's causing that. There
must be some interaction that I don't understand.

All the numbers are quite repeatable; I ran the same test script many
times. The runtime of the first select-2 test, however, varied between
3-10 seconds; somehow the bad karma from using just 1 buffer in the
earlier test carries over to the next test.

I'm not sure what to think about this, but I'll set up more test
scenarios with VACUUM and COPY.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "CK Tan" <cktan(at)greenplum(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-10 18:12:23
Message-ID: AD9630BA-9A0F-408C-A7E1-D799722986AC@greenplum.com
Lists: pgsql-hackers

The patch has no effect on scans that do updates. The
KillAndReadBuffer routine does not force out a buffer if the dirty
bit is set. So updated pages revert to the current performance
characteristics.

-cktan
GreenPlum, Inc.

On May 10, 2007, at 5:22 AM, Heikki Linnakangas wrote:

> Zeugswetter Andreas ADI SD wrote:
>>> In reference to the seq scans roadmap, I have just submitted a
>>> patch that addresses some of the concerns.
>>>
>>> The patch does this:
>>>
>>> 1. for small relation (smaller than 60% of bufferpool), use the
>>> current logic 2. for big relation:
>>> - use a ring buffer in heap scan
>>> - pin first 12 pages when scan starts
>>> - on consumption of every 4-page, read and pin the next 4-page
>>> - invalidate used pages of in the scan so they do not force out
>>> other useful pages
>> A few comments regarding the effects:
>> I do not see how this speedup could be caused by readahead, so
>> what are
>> the effects ?
>
> I was wondering that as well. We'd really need to test all the
> changes separately to see where the improvements are really coming
> from.
>
> Also, that patch doesn't address the VACUUM issue at all. And using
> a small fixed size ring with scans that do updates can be
> devastating. I'm experimenting with different ring sizes for COPY
> at the moment. Too small ring leads to a lot of WAL flushes, it's
> basically the same problem we have with VACUUM in CVS HEAD.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>


From: "CK Tan" <cktan(at)greenplum(dot)com>
To: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-10 18:27:00
Message-ID: 833BC7B3-048A-4CFC-89C5-119725FA4773@greenplum.com
Lists: pgsql-hackers

Sorry, a 16 x 8K page ring is too small indeed. The reason we selected 16
is that Greenplum DB runs with a 32K page size, so we are indeed
reading 128K at a time. The number of pages in the ring should be made
relative to the page size, so you achieve 128K per read.

Also agree that KillAndReadBuffer could be split into a
KillPinDontRead(), and ReadThesePinnedPages() functions. However, we
are thinking of AIO and would rather see a ReadNPagesAsync() function.

-cktan
Greenplum, Inc.

On May 10, 2007, at 3:14 AM, Zeugswetter Andreas ADI SD wrote:

>
>> In reference to the seq scans roadmap, I have just submitted
>> a patch that addresses some of the concerns.
>>
>> The patch does this:
>>
>> 1. for small relation (smaller than 60% of bufferpool), use
>> the current logic 2. for big relation:
>> - use a ring buffer in heap scan
>> - pin first 12 pages when scan starts
>> - on consumption of every 4-page, read and pin the next 4-page
>> - invalidate used pages of in the scan so they do not
>> force out other useful pages
>
> A few comments regarding the effects:
>
> I do not see how this speedup could be caused by readahead, so what
> are the effects?
> (It should make no difference to do the CPU work for count(*) in
> between reading each block when the pages are not dirtied.)
> Is the improvement solely reduced CPU, because no search for a free
> buffer is needed, and/or L2 cache locality?
>
> What effect does the advance pinning have, avoid vacuum?
>
> A 16 x 8k page ring is too small to allow the needed IO blocksize of
> 256k.
> The readahead is done 4 x one page at a time (=32k).
> What is the reasoning behind 1/4 ring for readahead (why not 1/2), is
> 3/4 the trail for followers and bgwriter ?
>
> I think in anticipation of doing a single IO call for more than one
> page, the KillAndReadBuffer function should be split into two parts:
> one that does the killing for n pages, and one that does the reading
> for n pages. Killing n before reading n would also have the positive
> effect of grouping any needed writes (not interleaving them with the
> reads).
>
> I think the 60% Nbuffers is a very good starting point. I would only
> introduce a GUC when we see evidence that it is needed (I agree with
> Simon's partitioning comments, but I'd still wait and see).
>
> Andreas
>


From: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>
To: "CK Tan" <cktan(at)greenplum(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-11 10:47:24
Message-ID: E1539E0ED7043848906A8FF995BDA5790211EA6A@m0143.s-mxs.net
Lists: pgsql-hackers


> Sorry, 16x8K page ring is too small indeed. The reason we
> selected 16 is because greenplum db runs on 32K page size, so
> we are indeed reading 128K at a time. The #pages in the ring
> should be made relative to the page size, so you achieve 128K
> per read.

Ah, ok. New disks here also have a peak at 128k with no other concurrent
IO.
Writes benefit from larger blocksizes though, 512k and more.
Reads with other concurrent IO might also benefit from larger
blocksizes.

Comment to all: to test optimal blocksizes make sure you have other
concurrent IO on the disk.

> Also agree that KillAndReadBuffer could be split into a
> KillPinDontRead(), and ReadThesePinnedPages() functions.
> However, we are thinking of AIO and would rather see a
> ReadNPagesAsync() function.

Yes, you could start the aio and return an already-read buffer to allow
concurrent CPU work.
However, you would still want to do blocked aio_readv calls to make sure
the physical read uses the large blocksize.
So I'd say aio would benefit from the same split.
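
For example, with plain POSIX AIO the split could look roughly like this (a standalone sketch, not PostgreSQL code; POSIX has no aio_readv, so a single contiguous read per chunk is used here to keep the physical read at the full blocksize):

/* Standalone sketch (error handling omitted): start one large asynchronous
 * read for the next 128K chunk, process the current chunk, then wait for
 * the started read to complete before switching to it. */
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define CHUNK_BYTES (128 * 1024)

static struct aiocb chunk_cb;

/* Kick off the read of the next chunk; returns immediately. */
static void
start_chunk_read(int fd, off_t offset, char *buf)
{
    memset(&chunk_cb, 0, sizeof(chunk_cb));
    chunk_cb.aio_fildes = fd;
    chunk_cb.aio_offset = offset;
    chunk_cb.aio_buf    = buf;
    chunk_cb.aio_nbytes = CHUNK_BYTES;
    (void) aio_read(&chunk_cb);
}

/* Block until the previously started chunk has arrived; returns bytes read. */
static ssize_t
wait_chunk_read(void)
{
    const struct aiocb *list[1] = { &chunk_cb };

    while (aio_error(&chunk_cb) == EINPROGRESS)
        (void) aio_suspend(list, 1, NULL);
    return aio_return(&chunk_cb);
}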

In another posting you wrote:
> The patch has no effect on scans that do updates.
> The KillAndReadBuffer routine does not force out a buffer if
> the dirty bit is set. So updated pages revert to the current
> performance characteristics.

Yes I see, the ring slot is replaced by a standard ReadBuffer in that
case, looks good.

I still think it would be better to write out the buffers and keep them
in the ring when possible, but that seems to need locks and some sort of
synchronization with the new walwriter, so it looks like a nice project
for after 8.3.

Andreas


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, CK Tan <cktan(at)greenplum(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Simon Riggs <simon(at)enterprisedb(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-11 21:59:59
Message-ID: 4644E75F.1090306@enterprisedb.com
Lists: pgsql-hackers

I wrote:
> I'll review my test methodology and keep testing...

I ran a set of tests on a 100-warehouse TPC-C stock table that is ~3.2
GB in size, on a server with 4 GB of memory. IOW the table fits in the
OS cache, but not in shared_buffers (set at 1 GB).

copy - COPY from a file
select - SELECT COUNT(*) FROM stock
vacuum - VACUUM on a clean table, effectively a read-only operation
vacuum_hintbits - VACUUM on a table with no dead tuples, but hint bits
need to be set on every page
vacuum_dirty - VACUUM with exactly 1 dead tuple per page.

The number after the test name is the ring size used.

There were no indexes on the table, which means that the vacuum tests
only had to do one pass. The 1st vacuum phase of a real-world table is
like a mixture of the vacuum and vacuum_hintbits tests, and the 2nd
phase is like the vacuum_dirty test.

copy-1 | 00:31:47.042365
copy-2 | 00:17:57.630772
copy-4 | 00:17:55.041794
copy-8 | 00:08:31.014009
copy-16 | 00:05:38.39848
copy-32 | 00:05:52.295512
copy-64 | 00:06:08.404646
copy-128 | 00:05:05.032448
copy-256 | 00:05:48.573146
copy-512 | 00:04:56.098752
copy-1024 | 00:05:27.05316
select-4 | 00:00:04.344873
select-4 | 00:00:02.2498
select-1 | 00:00:08.754011
select-1 | 00:00:10.521174
select-1 | 00:00:10.819376
select-1 | 00:00:14.818831
select-1 | 00:00:14.893562
select-1 | 00:00:16.973934
select-2 | 00:00:15.722776
select-2 | 00:00:02.291078
select-2 | 00:00:02.230167
select-4 | 00:00:02.232935
select-8 | 00:00:02.238791
select-16 | 00:00:02.245566
select-32 | 00:00:02.267158
select-64 | 00:00:02.311878
select-128 | 00:00:02.487086
select-256 | 00:00:02.764085
select-512 | 00:00:03.161025
select-1024 | 00:00:03.387246
vacuum-1 | 00:00:01.843337
vacuum-2 | 00:00:01.612738
vacuum-4 | 00:00:01.6304
vacuum-8 | 00:00:01.655126
vacuum-16 | 00:00:01.641808
vacuum-32 | 00:00:01.664108
vacuum-64 | 00:00:01.729106
vacuum-128 | 00:00:01.879023
vacuum-256 | 00:00:02.218303
vacuum-512 | 00:00:02.569571
vacuum-1024 | 00:00:02.791995
vacuum_dirty-1 | 00:24:15.424337
vacuum_dirty-2 | 00:13:26.981835
vacuum_dirty-4 | 00:08:07.260113
vacuum_dirty-8 | 00:05:24.1476
vacuum_dirty-16 | 00:03:52.690336
vacuum_dirty-32 | 00:02:40.759203
vacuum_dirty-64 | 00:02:45.14425
vacuum_dirty-128 | 00:02:46.718922
vacuum_dirty-256 | 00:02:43.797785
vacuum_dirty-512 | 00:02:36.363763
vacuum_dirty-1024 | 00:02:32.767481
vacuum_hintbits-1 | 00:00:37.847935
vacuum_hintbits-2 | 00:00:38.788662
vacuum_hintbits-4 | 00:00:43.554029
vacuum_hintbits-8 | 00:00:42.040379
vacuum_hintbits-16 | 00:00:44.187508
vacuum_hintbits-32 | 00:00:38.252052
vacuum_hintbits-64 | 00:00:37.920379
vacuum_hintbits-128 | 00:00:38.463007
vacuum_hintbits-256 | 00:00:38.157724
vacuum_hintbits-512 | 00:00:38.309285
vacuum_hintbits-1024 | 00:00:39.178738

I ran some of the select tests multiple times because the behavior
changed when the test was repeated. I don't know what's going on in the
select-1 test; it looks like the same effect I saw with the more complex
query involving a LIMIT node, but this time I'm just doing a plain
SELECT COUNT(*). I ran the test script multiple times; the results shown
above are copy-pasted from one particular run, but the numbers didn't
change much from run to run. In particular, the run times for the
select-1 test really do increase as you repeat the test many times. The
copy results seem to vary quite a bit, though.

For comparison, here's the test results with vanilla CVS HEAD:

copy-head | 00:06:21.533137
copy-head | 00:05:54.141285
select-head | 00:00:16.213693
select-head | 00:00:18.500792
vacuum-head | 00:00:12.843479
vacuum-head | 00:00:08.719845
vacuum_dirty-head | 00:22:02.533553
vacuum_dirty-head | 00:22:02.852786
vacuum_hintbits-head | 00:00:38.278701
vacuum_hintbits-head | 00:00:35.226191

Looking at the results, it seems that using a fixed-size ring of 32
pages hits the sweet spot on all tests. I wonder if that holds on other
hardware.

The test scripts I used are attached. I used a modified DBT-2 schema and
dump file, so you'll need to replace that with some other large table to
run it. I would appreciate it if others would repeat the tests on other
hardware to get a bigger sample.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
testscript.sql text/x-sql 17.0 KB
testscript-head.sql text/x-sql 3.1 KB

From: "Simon Riggs" <simon(at)enterprisedb(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>, "CK Tan" <cktan(at)greenplum(dot)com>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-12 07:35:27
Message-ID: 1178955327.10861.394.camel@silverbirch.site
Lists: pgsql-hackers

On Fri, 2007-05-11 at 22:59 +0100, Heikki Linnakangas wrote:
> For comparison, here's the test results with vanilla CVS HEAD:
>
> copy-head | 00:06:21.533137
> copy-head | 00:05:54.141285

I'm slightly worried that the results for COPY aren't anywhere near as
good as the SELECT and VACUUM results. It isn't clear from those numbers
that the benefit really is significant.

Are you thinking that having COPY avoid cache spoiling is a benefit just
of itself? Or do you see a pattern of benefit from your other runs?

(BTW what was wal_buffers set to? At least twice the ring buffer size,
hopefully).

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Simon Riggs" <simon(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>, "CK Tan" <cktan(at)greenplum(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-12 15:42:40
Message-ID: C26B2E80.2FE95%llonergan@greenplum.com
Lists: pgsql-hackers

Hi Simon,

On 5/12/07 12:35 AM, "Simon Riggs" <simon(at)enterprisedb(dot)com> wrote:

> I'm slightly worried that the results for COPY aren't anywhere near as
> good as the SELECT and VACUUM results. It isn't clear from those numbers
> that the benefit really is significant.

COPY is bottlenecked on datum formation and format translation, which are
very slow, so I don't think we should expect the ring buffer to make much
of a dent.

- Luke


From: "CK Tan" <cktan(at)greenplum(dot)com>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: "Simon Riggs" <simon(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-13 23:41:29
Message-ID: E7E0AC37-B064-417E-B3E6-77BBB25EDA9E@greenplum.com
Lists: pgsql-hackers

Hi All,

COPY/INSERT are also bottlenecked on record-at-a-time insertion into
the heap, and on checking for pre-insert triggers, post-insert triggers,
and constraints.

To speed things up, we really need to special-case insertions without
triggers and constraints [probably still allowing unique constraints],
and make these insertions go into the heap N tuples at a time. With
this change comes the benefit of optimizing the REDO log to log multiple
inserts, or even a whole new heap page that gets filled, in a single
WAL record.

Those with triggers and other constraints would still have to go in
one at a time because of the trigger/constraint semantics.
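
For illustration, the shape of the fast path I mean (everything here is hypothetical; none of these functions exist as such):

/* Hypothetical sketch: when no per-row triggers or constraints have to be
 * checked, buffer N tuples and hand them to the heap in one call, which
 * could then emit a single WAL record for the batch (or the filled page). */
typedef struct RelationData *Relation;
typedef struct HeapTupleData *HeapTuple;

extern int  rel_has_row_triggers_or_constraints(Relation rel);   /* made up */
extern void heap_insert_one(Relation rel, HeapTuple tup);        /* today's path */
extern void heap_insert_batch(Relation rel, HeapTuple *tups, int ntups); /* wished for */

#define BATCH_SIZE 100

static void
copy_insert_tuples(Relation rel, HeapTuple *tups, int ntups)
{
    int     i;

    if (rel_has_row_triggers_or_constraints(rel))
    {
        /* Trigger/constraint semantics force row-at-a-time insertion. */
        for (i = 0; i < ntups; i++)
            heap_insert_one(rel, tups[i]);
        return;
    }

    /* No per-row work: insert BATCH_SIZE tuples per heap call / WAL record. */
    for (i = 0; i < ntups; i += BATCH_SIZE)
    {
        int     n = (ntups - i) < BATCH_SIZE ? (ntups - i) : BATCH_SIZE;

        heap_insert_batch(rel, tups + i, n);
    }
}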

It seems to me that dirty pages should be written out by the bgwriter
instead of circumventing it with the ring buffer. If it is too slow, we
should change the bgwriter.

-cktan

On May 12, 2007, at 8:42 AM, Luke Lonergan wrote:

> Hi Simon,
>
> On 5/12/07 12:35 AM, "Simon Riggs" <simon(at)enterprisedb(dot)com> wrote:
>
>> I'm slightly worried that the results for COPY aren't anywhere
>> near as
>> good as the SELECT and VACUUM results. It isn't clear from those
>> numbers
>> that the benefit really is significant.
>
> COPY is bottlenecked on datum formation and format translation with
> very low
> performance, so I don't think we should expect the ring buffer to
> make much
> of a dent.
>
> - Luke
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "CK Tan" <cktan(at)greenplum(dot)com>
Cc: "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-13 23:54:24
Message-ID: 21639.1179100464@sss.pgh.pa.us
Lists: pgsql-hackers

"CK Tan" <cktan(at)greenplum(dot)com> writes:
> COPY/INSERT are also bottlenecked on record at a time insertion into
> heap, and in checking for pre-insert trigger, post-insert trigger and
> constraints.

> To speed things up, we really need to special case insertions without
> triggers and constraints, [probably allow for unique constraints],

Do you have any profiling data to back up these assertions? I haven't
noticed that firing zero triggers takes any visible percentage of COPY
time.

regards, tom lane


From: "CK Tan" <cktan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-14 02:36:57
Message-ID: AC541650-FB8C-4011-8D3E-5931556DA5C8@greenplum.com
Lists: pgsql-hackers

Sorry, I should have been clearer. I meant that because we need to
check for trigger firing pre/post insertion, and the trigger definitions
expect tuples to be inserted one by one, we cannot insert N tuples at a
time into the heap. Checking for triggers itself is not taking up much
CPU at all. If we could predetermine that there are no triggers for a
relation, inserts into that relation could then follow a different path
that inserts N tuples at a time.

Regards,
-cktan

On May 13, 2007, at 4:54 PM, Tom Lane wrote:

> "CK Tan" <cktan(at)greenplum(dot)com> writes:
>> COPY/INSERT are also bottlenecked on record at a time insertion into
>> heap, and in checking for pre-insert trigger, post-insert trigger and
>> constraints.
>
>> To speed things up, we really need to special case insertions without
>> triggers and constraints, [probably allow for unique constraints],
>
> Do you have any profiling data to back up these assertions? I haven't
> noticed that firing zero triggers takes any visible percentage of COPY
> time.
>
> regards, tom lane
>


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, CK Tan <cktan(at)greenplum(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-14 11:41:45
Message-ID: 46484AF9.9090106@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Fri, 2007-05-11 at 22:59 +0100, Heikki Linnakangas wrote:
>> For comparison, here's the test results with vanilla CVS HEAD:
>>
>> copy-head | 00:06:21.533137
>> copy-head | 00:05:54.141285
>
> I'm slightly worried that the results for COPY aren't anywhere near as
> good as the SELECT and VACUUM results. It isn't clear from those numbers
> that the benefit really is significant.

Agreed, the benefit isn't clear.

> Are you thinking that having COPY avoid cache spoiling is a benefit just
> of itself? Or do you see a pattern of benefit from your other runs?

I think it's worth having just to avoid cache spoiling. I wouldn't
bother otherwise, but since we have the infrastructure for vacuum and
large seqscans, we might as well use it for COPY as well.

> (BTW what was wal_buffers set to? At least twice the ring buffer size,
> hopefully).

Good question. [checks]. wal_buffers was set to 128KB. I tried raising
it to 1MB, but it didn't make any difference.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Simon Riggs <simon(at)enterprisedb(dot)com>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, CK Tan <cktan(at)greenplum(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-15 09:32:20
Message-ID: 46497E24.6060500@enterprisedb.com
Lists: pgsql-hackers

Just to keep you guys informed, I've been busy testing and pondering
over different buffer ring strategies for vacuum, seqscans and copy.
Here's what I'm going to do:

Use a fixed-size ring. Fixed as in it doesn't change after the ring is
initialized; however, different kinds of scans use differently sized rings.

I said earlier that it'd be an invasive change to see if a buffer needs
a WAL flush and choose another victim if that's the case. I looked at it
again and found a pretty clean way of doing that, so I took that
approach for seq scans.

1. For VACUUM, use a ring of 32 buffers. 32 buffers is small enough to
give the L2 cache benefits and keep cache pollution low, but at the same
time it's large enough that it keeps the need to WAL flush reasonable
(1/32 of what we do now).

2. For sequential scans, also use a ring of 32 buffers, but whenever a
buffer in the ring would need a WAL flush to recycle, we throw it out of
the buffer ring instead. On read-only scans (and scans that only update
hint bits) this gives the L2 cache benefits and doesn't pollute the
buffer cache. On bulk updates, it's effectively the current behavior. On
scans that do some updates, it's something in between. In all cases it
should be no worse than what we have now. 32 buffers should be large
enough to leave a "cache trail" for Jeff's synchronized scans to work.

3. For COPY that doesn't write WAL, use the same strategy as for
sequential scans. This keeps the cache pollution low and gives the L2
cache benefits.

4. For COPY that writes WAL, use a large ring of 2048-4096 buffers. We
want to use a ring that can accommodate 1 WAL segment worth of data, to
avoid having to do any extra WAL flushes, and the WAL segment size is
2048 pages in the default configuration.
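
In rough code, the ring sizes and the recycling rule above look like this (names invented; a summary sketch, not the patch itself):

/* Summary sketch of the above, not actual patch code. */
typedef enum
{
    RING_VACUUM,        /* case 1 */
    RING_SEQSCAN,       /* case 2 */
    RING_COPY_NO_WAL,   /* case 3 */
    RING_COPY_WAL       /* case 4 */
} RingKind;

static int
ring_size(RingKind kind)
{
    if (kind == RING_COPY_WAL)
        return 2048;    /* ~one WAL segment worth of data, default config */
    return 32;          /* small enough for L2, large enough for the trail */
}

/*
 * Can this ring slot be recycled for the next page?  "needs_wal_flush"
 * stands for "the buffer is dirty beyond the already-flushed WAL position";
 * hint-bit-only dirtying does not require a WAL flush.
 */
static int
reuse_ring_slot(RingKind kind, int needs_wal_flush)
{
    if (!needs_wal_flush)
        return 1;               /* always reuse: at most a cheap write-out */

    if (kind == RING_SEQSCAN || kind == RING_COPY_NO_WAL)
        return 0;               /* throw the buffer out of the ring instead */

    return 1;                   /* VACUUM / COPY+WAL: flush WAL and reuse */
}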

Some alternatives I considered but rejected:

* Instead of throwing away dirtied buffers in seq scans, accumulate them
in another fixed-size list. When the list gets full, do a WAL flush and
put them to the shared freelist or a backend-private freelist. That
would eliminate the cache pollution of bulk DELETEs and bulk UPDATEs,
and it could be used for vacuum as well. I think this would be the
optimal algorithm, but I don't feel like inventing something that
complicated at this late stage. Maybe for 8.4.

* Using a different sized ring for 1st and 2nd vacuum phase. Decided
that it's not worth the trouble, the above is already an order of
magnitude better than the current behavior.

I'm going to rerun the performance tests I ran earlier with the new
patch, tidy it up a bit, and submit it in the next few days. This turned
out to be an even more laborious patch to review than I thought. While
the patch is short and in the end turned out to be very close to Simon's
original patch, there are many different usage scenarios that need to be
catered for and tested.

I still need to check the interaction with Jeff's patch. This is close
enough to Simon's original patch that I believe the results of the tests
Jeff ran earlier are still valid.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Cc: "Simon Riggs" <simon(at)enterprisedb(dot)com>, "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>, "CK(dot)Tan" <cktan(at)greenplum(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-15 09:39:17
Message-ID: C3E62232E3BCF24CBA20D72BFDCB6BF804066A4A@MI8NYCMAIL08.Mi8.com
Lists: pgsql-hackers

Heikki,

32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
effect.

How about using 256/blocksize?

- Luke

> -----Original Message-----
> From: Heikki Linnakangas [mailto:hlinnaka(at)gmail(dot)com] On
> Behalf Of Heikki Linnakangas
> Sent: Tuesday, May 15, 2007 2:32 AM
> To: PostgreSQL-development
> Cc: Simon Riggs; Zeugswetter Andreas ADI SD; CK.Tan; Luke
> Lonergan; Jeff Davis
> Subject: Re: [HACKERS] Seq scans roadmap
>
> Just to keep you guys informed, I've been busy testing and
> pondering over different buffer ring strategies for vacuum,
> seqscans and copy.
> Here's what I'm going to do:
>
> Use a fixed size ring. Fixed as in doesn't change after the
> ring is initialized, however different kinds of scans use
> differently sized rings.
>
> I said earlier that it'd be invasive change to see if a
> buffer needs a WAL flush and choose another victim if that's
> the case. I looked at it again and found a pretty clean way
> of doing that, so I took that approach for seq scans.
>
> 1. For VACUUM, use a ring of 32 buffers. 32 buffers is small
> enough to give the L2 cache benefits and keep cache pollution
> low, but at the same time it's large enough that it keeps the
> need to WAL flush reasonable
> (1/32 of what we do now).
>
> 2. For sequential scans, also use a ring of 32 buffers, but
> whenever a buffer in the ring would need a WAL flush to
> recycle, we throw it out of the buffer ring instead. On
> read-only scans (and scans that only update hint bit) this
> gives the L2 cache benefits and doesn't pollute the buffer
> cache. On bulk updates, it's effectively the current
> behavior. On scans that do some updates, it's something in
> between. In all cases it should be no worse than what we have
> now. 32 buffers should be large enough to leave a "cache
> trail" for Jeff's synchronized scans to work.
>
> 3. For COPY that doesn't write WAL, use the same strategy as
> for sequential scans. This keeps the cache pollution low and
> gives the L2 cache benefits.
>
> 4. For COPY that writes WAL, use a large ring of 2048-4096
> buffers. We want to use a ring that can accommodate 1 WAL
> segment worth of data, to avoid having to do any extra WAL
> flushes, and the WAL segment size is
> 2048 pages in the default configuration.
>
> Some alternatives I considered but rejected:
>
> * Instead of throwing away dirtied buffers in seq scans,
> accumulate them in another fixed sized list. When the list
> gets full, do a WAL flush and put them to the shared freelist
> or a backend-private freelist. That would eliminate the cache
> pollution of bulk DELETEs and bulk UPDATEs, and it could be
> used for vacuum as well. I think this would be the optimal
> algorithm but I don't feel like inventing something that
> complicated at this stage anymore. Maybe for 8.4.
>
> * Using a different sized ring for 1st and 2nd vacuum phase.
> Decided that it's not worth the trouble, the above is already
> an order of magnitude better than the current behavior.
>
>
> I'm going to rerun the performance tests I ran earlier with
> new patch, tidy it up a bit, and submit it in the next few
> days. This turned out to be even more laborious patch to
> review than I thought. While the patch is short and in the
> end turned out to be very close to Simon's original patch,
> there's many different usage scenarios that need to be
> catered for and tested.
>
> I still need to check the interaction with Jeff's patch. This
> is close enough to Simon's original patch that I believe the
> results of the tests Jeff ran earlier are still valid.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
>


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Luke Lonergan <LLonergan(at)greenplum(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)enterprisedb(dot)com>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, "CK(dot)Tan" <cktan(at)greenplum(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-15 09:42:28
Message-ID: 46498084.5000909@enterprisedb.com
Lists: pgsql-hackers

Luke Lonergan wrote:
> 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
> effect.
>
> How about using 256/blocksize?

Sounds reasonable. We need to check the effect on the synchronized
scans, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Luke Lonergan <LLonergan(at)greenplum(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)enterprisedb(dot)com>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, "CK(dot)Tan" <cktan(at)greenplum(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-15 17:25:35
Message-ID: 1179249935.24902.114.camel@dogma.v10.wvs
Lists: pgsql-hackers

On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
> Luke Lonergan wrote:
> > 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
> > effect.
> >
> > How about using 256/blocksize?
>
> Sounds reasonable. We need to check the effect on the synchronized
> scans, though.
>

I am a little worried that there will be greater differences in position
as the number of scans increases. If we have only 8 buffers and several
scans progressing, will they all be able to stay within a few buffers of
each other at any given time?

Also, with 8 buffers, that means each scan must report every 4 pages at
most (and maybe every page), which increases lock contention (the new
design Heikki and I discussed requires a lock every time a backend
reports its position).
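
A sketch of the reporting step I mean (names made up; not the actual structure):

/* Sketch with invented names: each scan reports its position into a small
 * shared table, taking a lock for every report.  With a tiny ring the
 * report interval shrinks and this lock gets hit much more often. */
typedef unsigned int Oid;
typedef unsigned int BlockNumber;

extern void         lock_scan_positions(void);     /* stand-in for an LWLock */
extern void         unlock_scan_positions(void);
extern BlockNumber *scan_position_slot(Oid relid); /* shared entry per table */

#define REPORT_EVERY 4          /* pages; roughly ring_size / 4 */

static void
report_scan_position(Oid relid, BlockNumber blockno)
{
    if (blockno % REPORT_EVERY != 0)
        return;

    lock_scan_positions();              /* the contention point in question */
    *scan_position_slot(relid) = blockno;
    unlock_scan_positions();
}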

Regards,
Jeff Davis


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)enterprisedb(dot)com>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, "CK(dot)Tan" <cktan(at)greenplum(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-15 22:34:11
Message-ID: 20070515223411.GV20707@nasby.net
Lists: pgsql-hackers

On Tue, May 15, 2007 at 10:25:35AM -0700, Jeff Davis wrote:
> On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
> > Luke Lonergan wrote:
> > > 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
> > > effect.
> > >
> > > How about using 256/blocksize?
> >
> > Sounds reasonable. We need to check the effect on the synchronized
> > scans, though.
> >
>
> I am a little worried that there will be greater differences in position
> as the number of scans increase. If we have only 8 buffers and several
> scans progressing, will they all be able to stay within a few buffers of
> eachother at any given time?
>
> Also, with 8 buffers, that means each scan must report every 4 pages at
> most (and maybe every page), which increases lock contention (the new
> design Heikki and I discussed requires a lock every time a backend
> reports its position).

Given that spoiling the L2 cache is a trivial cost compared to extra
physical IO, ISTM we should go with a largish ring for sync scans. What
do you think would be the ideal size? 32 buffers?
--
Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Luke Lonergan <LLonergan(at)greenplum(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)enterprisedb(dot)com>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, "CK(dot)Tan" <cktan(at)greenplum(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-15 22:46:51
Message-ID: 464A385B.8020107@enterprisedb.com
Lists: pgsql-hackers

Jeff Davis wrote:
> On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
>> Luke Lonergan wrote:
>>> 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
>>> effect.
>>>
>>> How about using 256/blocksize?
>> Sounds reasonable. We need to check the effect on the synchronized
>> scans, though.
>>
>
> I am a little worried that there will be greater differences in position
> as the number of scans increase. If we have only 8 buffers and several
> scans progressing, will they all be able to stay within a few buffers of
> eachother at any given time?

I'm not sure. Needs testing. Assuming the scan leaves behind a cache
trail in the OS cache as well, it might not be that bad if a scan
joining the party starts a little bit behind.

> Also, with 8 buffers, that means each scan must report every 4 pages at
> most (and maybe every page), which increases lock contention (the new
> design Heikki and I discussed requires a lock every time a backend
> reports its position).

Keep in mind that processing a 32K page takes longer than processing an
8K page.

But we'll see..

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Simon Riggs" <simon(at)enterprisedb(dot)com>, "CK(dot)Tan" <cktan(at)greenplum(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-16 08:31:30
Message-ID: E1539E0ED7043848906A8FF995BDA5790211F032@m0143.s-mxs.net
Lists: pgsql-hackers


> > > > 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2
> > > > cache effect.

I'd say in a scenario where 32k pages are indicated you will also want
larger than average L2 caches.

> > > >
> > > > How about using 256/blocksize?

The readahead uses 1/4 of the ring size. To the best of our knowledge,
this 1/4 needs to be 128k for reading.
So I'd say we need 512/blocksize.

Andreas


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Simon Riggs" <simon(at)enterprisedb(dot)com>, "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>, "CK(dot)Tan" <cktan(at)greenplum(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-16 17:31:33
Message-ID: C2708E05.3055D%llonergan@greenplum.com
Lists: pgsql-hackers

I think the analysis on syncscan needs to take the external I/O cache into
account. I believe it is not necessary or desirable to keep the scans in
lock step within the PG bufcache.

The main benefit of a sync scan will be the ability to start the scan where
other scans have already filled the I/O cache with useful blocks. This will
require some knowledge of the size of the I/O cache by the syncscan logic,
but that's where the largest amount of I/O cache memory (by far) is located.
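
To make that concrete, one naive way a scan's start point could use an I/O cache size estimate (entirely hypothetical):

/* Naive sketch: start a new scan far enough behind the reported head of an
 * ongoing scan that the blocks in between are probably still in the OS
 * cache.  "os_cache_blocks" is an invented estimate (think along the lines
 * of effective_cache_size). */
typedef unsigned int BlockNumber;

static BlockNumber
syncscan_start_block(BlockNumber head_block, BlockNumber os_cache_blocks)
{
    if (head_block > os_cache_blocks)
        return head_block - os_cache_blocks;    /* oldest block likely cached */
    return 0;                                   /* whole trail likely cached */
}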

- Luke

On 5/15/07 3:34 PM, "Jim C. Nasby" <decibel(at)decibel(dot)org> wrote:

> On Tue, May 15, 2007 at 10:25:35AM -0700, Jeff Davis wrote:
>> On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
>>> Luke Lonergan wrote:
>>>> 32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
>>>> effect.
>>>>
>>>> How about using 256/blocksize?
>>>
>>> Sounds reasonable. We need to check the effect on the synchronized
>>> scans, though.
>>>
>>
>> I am a little worried that there will be greater differences in position
>> as the number of scans increase. If we have only 8 buffers and several
>> scans progressing, will they all be able to stay within a few buffers of
>> eachother at any given time?
>>
>> Also, with 8 buffers, that means each scan must report every 4 pages at
>> most (and maybe every page), which increases lock contention (the new
>> design Heikki and I discussed requires a lock every time a backend
>> reports its position).
>
> Given that spoiling the L2 cache is a trivial cost compared to extra
> physical IO, ISTM we should go with a largish ring for sync scans. What
> do you think would be the ideal size? 32 buffers?


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)enterprisedb(dot)com>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, "CK(dot)Tan" <cktan(at)greenplum(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-16 23:56:44
Message-ID: 1179359804.24902.198.camel@dogma.v10.wvs
Lists: pgsql-hackers

On Wed, 2007-05-16 at 10:31 -0700, Luke Lonergan wrote:
> I think the analysis on syncscan needs to take the external I/O cache into
> account. I believe it is not necessary or desirable to keep the scans in
> lock step within the PG bufcache.

I partially agree. I don't think we need any huge amount of shared
buffers for the scans to use, and the scans should be able to catch up
by using the OS buffer cache (which is still faster than fetching from
disk).

However, as I said before, I'm worried that, if the ring is too small,
it might not work to my expectations. I will try to test this to
eliminate my doubts.

> The main benefit of a sync scan will be the ability to start the scan where
> other scans have already filled the I/O cache with useful blocks. This will
> require some knowledge of the size of the I/O cache by the syncscan logic,
> but that's where the largest amount of I/O cache memory (by far) is located.
>

I don't think it's necessarily the largest "by far". However, it may be
the largest.

If you mean that the main benefit of sync scans is to make use of blocks
that happen to be in cache before the scan began, I disagree. I think
that's a possible benefit, but I was unable to show any huge benefit in
my tests (someone may be able to on different hardware with different
test cases).

The main benefits that I see are:
(1) reduce total number of blocks read from disk by making use of
blocks as they are read by another concurrent seqscan.
(2) eliminate the need for random I/O on concurrent sequential scans.

I like the idea of using already-in-cache blocks. The code is there and
it works, but I just didn't see the benefits yet. After I get any issues
with this patch resolved and the reviewers are satisfied, I'd like to
work on it.

Regards,
Jeff Davis


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Simon Riggs" <simon(at)enterprisedb(dot)com>, "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>, "CK(dot)Tan" <cktan(at)greenplum(dot)com>
Subject: Re: Seq scans roadmap
Date: 2007-05-17 14:42:27
Message-ID: C271B7E3.30721%llonergan@greenplum.com
Lists: pgsql-hackers

Hi Jeff,

On 5/16/07 4:56 PM, "Jeff Davis" <pgsql(at)j-davis(dot)com> wrote:

>> The main benefit of a sync scan will be the ability to start the scan where
>> other scans have already filled the I/O cache with useful blocks. This will
>> require some knowledge of the size of the I/O cache by the syncscan logic,
>> but that's where the largest amount of I/O cache memory (by far) is located.

> I don't think it's necessarily the largest "by far". However, it may be
> the largest.

Compare the size of a 32-block ring buffer (between 256KB and 1024KB)
with the 16,000,000 KB of RAM on a common server, automatically used to
the maximum extent by the OS's dynamic I/O caching.

> If you mean that the main benefit of sync scans is to make use of blocks
> that happen to be in cache before the scan began, I disagree.

That's not what I meant.

> I think
> that's a possible benefit, but I was unable to show any huge benefit in
> my tests (someone may be able to on different hardware with different
> test cases).

I agree, I don't think this is worth pursuing.

> The main benefits that I see are:
> (1) reduce total number of blocks read from disk by making use of
> blocks as they are read by another concurrent seqscan.
> (2) eliminate the need for random I/O on concurrent sequential scans.

Yes on (1), but with (2), again, the OS prefetch reduces the seeking to a
minimal level.

With (1), we just have to define the meaning of "concurrent" to be "within
the span of the OS I/O cache" and we're describing the same effect.

- Luke