Quick Links

Re: Sequential scans

Lists:	pgsql-hackers

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Sequential scans
Date:	2007-05-02 13:26:39
Message-ID:	4638918F.90805@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

I'm starting to review the "synchronized scans" and "scan-resistant
buffer cache" patches. The patches have complex interactions so I'm
taking a holistic approach.

There's four outstanding issues with the sync scans in particular:

1. The simplistic hash approach. While it's nice to not have a lock, I'm
worried of collisions. If you had a collision every now and then, it
wouldn't be that bad, but because the hash value is computed from the
oid, a collision would be persistent. If you create a database and
happen to have two frequently seqscanned tables that collide, the only
way to get rid of the collision is to drop and recreate a table.
Granted, that'd probably be very rare in practice, but when it happens
it would be next to impossible to figure out what's going on.

Let's use a normal hash table instead, and use a lock to protect it. If
we only update it every 10 pages or so, the overhead should be
negligible. To further reduce contention, we could modify ReadBuffer to
let the caller know if the read resulted in a physical read or not, and
only update the entry when a page is physically read in. That way all
the synchronized scanners wouldn't be updating the same value, just the
one performing the I/O. And while we're at it, let's use the full
relfilenode instead of just the table oid in the hash.

2. Under what circumstances does the patch help and when does it hurt? I
think the patch is safe in that it should never be any worse than what
we have now. But when does it help? That needs to be looked at together
with the other patch.

I need to dig the archives for the performance test results you posted
earlier and try to understand them.

There's six distinct scenarios I've come up with this far that need to
be looked at:
A. A seq scan on a small table
B. A seq scan on a table that's 110% the size of shared_buffers, but
smaller than RAM
C. A seq scan on a table that's 110% the size of RAM
D. A seq scan on a huge table
E. Two simultaneous seq scans on a large table starting at the same time
F. Two simultaneous seq scans on a large table, 2nd one starting when
the 1st one is halfway through

Also, does it change things if you have a bulk update instead of
read-only query? How about bitmap heap scans and large index scans? And
vacuums? And the above scenarios need to be considered both alone, and
in the presence of other OLTP kind of workload.

I realize that we can't have everything, and as long as we get some
significant benefit in some scenarios, and don't hurt others, the patch
is worthwhile. But let's try to cover as much as we reasonably can.

One random idea I had to cover B & C without having the offset variable:
Start scanning *backwards* from the page that's in the shared hash
table, until you hit a page that's not in buffer cache. Then you
continue scanning forwards from the page you started from.

This needs more thought but I think we can come up with a pretty simple
solution that covers the most common cases.

3. By having different backends doing the reads, are we destroying OS
readahead as Tom suggested? I remember you performed some tests on that,
and it was a problem on some systems but not on others. This needs some
thought, there may be some simple way to address that.

4. It fails regression tests. You get an assertion failure on the portal
test. I believe that changing the direction of a scan isn't handled
properly; it's probably pretty easy to fix.

Jeff, could you please fix 1 and 4? I'll give 2 and 3 some more thought,
and take a closer look at the scan-resistant scans patch. Any comments
and ideas are welcome, of course..

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-02 17:52:18
Message-ID:	1178128338.28383.154.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2007-05-02 at 14:26 +0100, Heikki Linnakangas wrote:
> Hi,
>
> I'm starting to review the "synchronized scans" and "scan-resistant
> buffer cache" patches. The patches have complex interactions so I'm
> taking a holistic approach.
>
> There's four outstanding issues with the sync scans in particular:
>
> 1. The simplistic hash approach. While it's nice to not have a lock, I'm
> worried of collisions. If you had a collision every now and then, it
> wouldn't be that bad, but because the hash value is computed from the
> oid, a collision would be persistent. If you create a database and
> happen to have two frequently seqscanned tables that collide, the only
> way to get rid of the collision is to drop and recreate a table.
> Granted, that'd probably be very rare in practice, but when it happens
> it would be next to impossible to figure out what's going on.
>
> Let's use a normal hash table instead, and use a lock to protect it. If
> we only update it every 10 pages or so, the overhead should be
> negligible. To further reduce contention, we could modify ReadBuffer to
> let the caller know if the read resulted in a physical read or not, and
> only update the entry when a page is physically read in. That way all
> the synchronized scanners wouldn't be updating the same value, just the
> one performing the I/O. And while we're at it, let's use the full
> relfilenode instead of just the table oid in the hash.

What should be the maximum size of this hash table? Is there already-
existing hash table code that I should use to be consistent with the
rest of the code?

I'm still trying to understand the effect of using the full relfilenode.
Do you mean using the entire relation _segment_ as the key? That doesn't
make sense to me. Or do you just mean using the relfilenode (without the
segment) as the key?

> 3. By having different backends doing the reads, are we destroying OS
> readahead as Tom suggested? I remember you performed some tests on that,
> and it was a problem on some systems but not on others. This needs some
> thought, there may be some simple way to address that.

Linux with CFQ I/O scheduler performs very poorly and inconsistently
with concurrent sequential scans regardless of whether the scans are
synchronized or not. I suspect the reason for this is that CFQ is
designed to care more about the process issuing the request than any
other factor.

Every other I/O system performed either ideally (no interference between
scans) or had some interference but still much better than current
behavior.

Of course, my tests are limited and there are many possible combinations
of I/O systems that I did not try.

> 4. It fails regression tests. You get an assertion failure on the portal
> test. I believe that changing the direction of a scan isn't handled
> properly; it's probably pretty easy to fix.
>

I will examine the code more carefully. As a first guess, is it possible
that test is failing because of the non-deterministic order in which
tuples are returned?

> Jeff, could you please fix 1 and 4? I'll give 2 and 3 some more thought,
> and take a closer look at the scan-resistant scans patch. Any comments
> and ideas are welcome, of course..
>

Yes. I'll also try to address the other issues in your email. Thanks for
your comments.

Regards,
Jeff Davis

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Jeff Davis" <pgsql(at)j-davis(dot)com>, "Simon Riggs" <simon(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-02 17:54:43
Message-ID:	87ejlz6rh8.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:

> Let's use a normal hash table instead, and use a lock to protect it. If we only
> update it every 10 pages or so, the overhead should be negligible. To further
> reduce contention, we could modify ReadBuffer to let the caller know if the
> read resulted in a physical read or not, and only update the entry when a page
> is physically read in. That way all the synchronized scanners wouldn't be
> updating the same value, just the one performing the I/O. And while we're at
> it, let's use the full relfilenode instead of just the table oid in the hash.

It's probably fine to just do that. But if we find it's a performance
bottleneck we could probably still manage to avoid the lock except when
actually inserting a new hash element. If you just store in the hash an index
into an array stored in global memory then you could get away without a lock
on the element in the array.

It starts to get to be a fair amount of code when you think about how you
would reuse elements of the array. That's why I suggest only looking at this
if down the road we find that it's a bottleneck.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-02 19:02:39
Message-ID:	4638E04F.4040100@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark wrote:
> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
>
>> Let's use a normal hash table instead, and use a lock to protect it. If we only
>> update it every 10 pages or so, the overhead should be negligible. To further
>> reduce contention, we could modify ReadBuffer to let the caller know if the
>> read resulted in a physical read or not, and only update the entry when a page
>> is physically read in. That way all the synchronized scanners wouldn't be
>> updating the same value, just the one performing the I/O. And while we're at
>> it, let's use the full relfilenode instead of just the table oid in the hash.
>
> It's probably fine to just do that. But if we find it's a performance
> bottleneck we could probably still manage to avoid the lock except when
> actually inserting a new hash element. If you just store in the hash an index
> into an array stored in global memory then you could get away without a lock
> on the element in the array.
>
> It starts to get to be a fair amount of code when you think about how you
> would reuse elements of the array. That's why I suggest only looking at this
> if down the road we find that it's a bottleneck.

Another trick you could do is to use acquire the lock conditionally when
updating it. But I doubt it's a problem anyhow, if we put some sane
lower limit in there so that it's not used at all for small tables.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-02 19:58:26
Message-ID:	4638ED62.4090308@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis wrote:
> On Wed, 2007-05-02 at 14:26 +0100, Heikki Linnakangas wrote:
>> Let's use a normal hash table instead, and use a lock to protect it. If
>> we only update it every 10 pages or so, the overhead should be
>> negligible. To further reduce contention, we could modify ReadBuffer to
>> let the caller know if the read resulted in a physical read or not, and
>> only update the entry when a page is physically read in. That way all
>> the synchronized scanners wouldn't be updating the same value, just the
>> one performing the I/O. And while we're at it, let's use the full
>> relfilenode instead of just the table oid in the hash.
>
> What should be the maximum size of this hash table?

Good question. And also, how do you remove entries from it?

I guess the size should somehow be related to number of backends. Each
backend will realistically be doing just 1 or max 2 seq scan at a time.
It also depends on the number of large tables in the databases, but we
don't have that information easily available. How about using just
NBackends? That should be plenty, but wasting a few hundred bytes of
memory won't hurt anyone.

I think you're going to need an LRU list and counter of used entries in
addition to the hash table, and when all entries are in use, remove the
least recently used one.

The thing to keep an eye on is that it doesn't add too much overhead or
lock contention in the typical case when there's no concurrent scans.

For the locking, use a LWLock.

> Is there already-
> existing hash table code that I should use to be consistent with the
> rest of the code?

Yes, see utils/hash/dynahash.c, and ShmemInitHash (in
storage/ipc/shmem.c) since it's in shared memory. There's plenty of
examples that use hash tables, see for example
storage/freespace/freespace.c.

> I'm still trying to understand the effect of using the full relfilenode.
> Do you mean using the entire relation _segment_ as the key? That doesn't
> make sense to me. Or do you just mean using the relfilenode (without the
> segment) as the key?

No, not the segment. RelFileNode consists of tablespace oid, database
oid and relation oid. You can find it in scan->rs_rd->rd_node. The
segmentation works at a lower level.

> Linux with CFQ I/O scheduler performs very poorly and inconsistently
> with concurrent sequential scans regardless of whether the scans are
> synchronized or not. I suspect the reason for this is that CFQ is
> designed to care more about the process issuing the request than any
> other factor.
>
> Every other I/O system performed either ideally (no interference between
> scans) or had some interference but still much better than current
> behavior.

Hmm. Should we care then? CFG is the default on Linux, and an average
sysadmin is unlikely to change it.

What we could do quite easily is:
- when ReadBuffer is called, let the caller know if the read did
physical I/O.
- when the previous ReadBuffer didn't result in physical I/O, assume
that we're not the pack leader. If the next buffer isn't already in
cache, wait a few milliseconds before initiating the read, giving the
pack leader a chance to do it instead.

Needs testing, of course..

>> 4. It fails regression tests. You get an assertion failure on the portal
>> test. I believe that changing the direction of a scan isn't handled
>> properly; it's probably pretty easy to fix.
>>
>
> I will examine the code more carefully. As a first guess, is it possible
> that test is failing because of the non-deterministic order in which
> tuples are returned?

No, it's an assertion failure, not just different output than expected.
But it's probably quite simple to fix..

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-02 21:11:48
Message-ID:	1178140308.28383.159.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2007-05-02 at 20:02 +0100, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> > "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> >
> >> Let's use a normal hash table instead, and use a lock to protect it. If we only
> >> update it every 10 pages or so, the overhead should be negligible. To further
> >> reduce contention, we could modify ReadBuffer to let the caller know if the
> >> read resulted in a physical read or not, and only update the entry when a page
> >> is physically read in. That way all the synchronized scanners wouldn't be
> >> updating the same value, just the one performing the I/O. And while we're at
> >> it, let's use the full relfilenode instead of just the table oid in the hash.
> >
> > It's probably fine to just do that. But if we find it's a performance
> > bottleneck we could probably still manage to avoid the lock except when
> > actually inserting a new hash element. If you just store in the hash an index
> > into an array stored in global memory then you could get away without a lock
> > on the element in the array.
> >
> > It starts to get to be a fair amount of code when you think about how you
> > would reuse elements of the array. That's why I suggest only looking at this
> > if down the road we find that it's a bottleneck.
>
> Another trick you could do is to use acquire the lock conditionally when
> updating it. But I doubt it's a problem anyhow, if we put some sane
> lower limit in there so that it's not used at all for small tables.
>

The more sophisticated the data structure the less able we are to avoid
locking, correct? For instance, if we have an LRU list it might be
tricky or impossible to avoid locking even on just reads.

Regards,
Jeff Davis

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-02 21:28:26
Message-ID:	4639027A.7080809@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis wrote:
> On Wed, 2007-05-02 at 20:02 +0100, Heikki Linnakangas wrote:
>> Gregory Stark wrote:
>>> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
>>>
>>>> Let's use a normal hash table instead, and use a lock to protect it. If we only
>>>> update it every 10 pages or so, the overhead should be negligible. To further
>>>> reduce contention, we could modify ReadBuffer to let the caller know if the
>>>> read resulted in a physical read or not, and only update the entry when a page
>>>> is physically read in. That way all the synchronized scanners wouldn't be
>>>> updating the same value, just the one performing the I/O. And while we're at
>>>> it, let's use the full relfilenode instead of just the table oid in the hash.
>>> It's probably fine to just do that. But if we find it's a performance
>>> bottleneck we could probably still manage to avoid the lock except when
>>> actually inserting a new hash element. If you just store in the hash an index
>>> into an array stored in global memory then you could get away without a lock
>>> on the element in the array.
>>>
>>> It starts to get to be a fair amount of code when you think about how you
>>> would reuse elements of the array. That's why I suggest only looking at this
>>> if down the road we find that it's a bottleneck.
>> Another trick you could do is to use acquire the lock conditionally when
>> updating it. But I doubt it's a problem anyhow, if we put some sane
>> lower limit in there so that it's not used at all for small tables.
>>
>
> The more sophisticated the data structure the less able we are to avoid
> locking, correct? For instance, if we have an LRU list it might be
> tricky or impossible to avoid locking even on just reads.

Agreed. I'm not concerned about reads, though. You only need to read
from the structure once when you start a scan. It's the updates that
cause most of the traffic.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-02 22:37:01
Message-ID:	1178145421.28383.189.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote:
> Jeff Davis wrote:
> > What should be the maximum size of this hash table?
>
> Good question. And also, how do you remove entries from it?
>
> I guess the size should somehow be related to number of backends. Each
> backend will realistically be doing just 1 or max 2 seq scan at a time.
> It also depends on the number of large tables in the databases, but we
> don't have that information easily available. How about using just
> NBackends? That should be plenty, but wasting a few hundred bytes of
> memory won't hurt anyone.

One entry per relation, not per backend, is my current design.

> I think you're going to need an LRU list and counter of used entries in
> addition to the hash table, and when all entries are in use, remove the
> least recently used one.
>
> The thing to keep an eye on is that it doesn't add too much overhead or
> lock contention in the typical case when there's no concurrent scans.
>
> For the locking, use a LWLock.
>

Ok. What would be the potential lock contention in the case of no
concurrent scans?

Also, is it easy to determine the space used by a dynahash with N
entries? I haven't looked at the dynahash code yet, so perhaps this will
be obvious.

> No, not the segment. RelFileNode consists of tablespace oid, database
> oid and relation oid. You can find it in scan->rs_rd->rd_node. The
> segmentation works at a lower level.

Ok, will do.

> Hmm. Should we care then? CFG is the default on Linux, and an average
> sysadmin is unlikely to change it.
>

Keep in mind that concurrent sequential scans with CFQ are *already*
very poor. I think that alone is an interesting fact that's somewhat
independent of Sync Scans.

> - when ReadBuffer is called, let the caller know if the read did
> physical I/O.
> - when the previous ReadBuffer didn't result in physical I/O, assume
> that we're not the pack leader. If the next buffer isn't already in
> cache, wait a few milliseconds before initiating the read, giving the
> pack leader a chance to do it instead.
>
> Needs testing, of course..
>

An interesting idea. I like that the most out of the ideas of
maintaining a "pack leader". That's very similar to what the Linux
anticipatory scheduler does for us.

> >> 4. It fails regression tests. You get an assertion failure on the portal
> >> test. I believe that changing the direction of a scan isn't handled
> >> properly; it's probably pretty easy to fix.
> >>
> >
> > I will examine the code more carefully. As a first guess, is it possible
> > that test is failing because of the non-deterministic order in which
> > tuples are returned?
>
> No, it's an assertion failure, not just different output than expected.
> But it's probably quite simple to fix..
>

Ok, I'll find and correct it then.

Regards,
Jeff Davis

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-02 22:59:51
Message-ID:	463917E7.4000803@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis wrote:
> On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote:
>> Jeff Davis wrote:
>>> What should be the maximum size of this hash table?
>> Good question. And also, how do you remove entries from it?
>>
>> I guess the size should somehow be related to number of backends. Each
>> backend will realistically be doing just 1 or max 2 seq scan at a time.
>> It also depends on the number of large tables in the databases, but we
>> don't have that information easily available. How about using just
>> NBackends? That should be plenty, but wasting a few hundred bytes of
>> memory won't hurt anyone.
>
> One entry per relation, not per backend, is my current design.

Umm, you naturally have just entry per relation, but we were talking
about how many entries the table needs to hold.. You're patch had a
hard-coded value of 1000 which is quite arbitrary.

>> I think you're going to need an LRU list and counter of used entries in
>> addition to the hash table, and when all entries are in use, remove the
>> least recently used one.
>>
>> The thing to keep an eye on is that it doesn't add too much overhead or
>> lock contention in the typical case when there's no concurrent scans.
>>
>> For the locking, use a LWLock.
>
> Ok. What would be the potential lock contention in the case of no
> concurrent scans?

If you only have one seq scan in the system, there's no contention. If
you have more, they will all need to acquire the lock to update the page
number in the hash table, regardless of the table their scanning, so
there's potential for contention.

To put things into perspective, though, any time you need to evict a
buffer from the buffer cache to read in a new buffer, you need to
acquire the BufFreelistLock. If we only update the page number say every
10 pages, the overhead and lock contention of that lock would be roughly
1/10th of that arising from BufFreelistLock. And we can make it a
conditional acquire, in which case the backends won't need to slow down
their scans, but the updates of the entries in the hash table will get
less frequent instead. We could also take just a shared lock to update
the counter, and rely on the atomicity of the write like your patch did.
You'd only need to take an exclusive lock to move entries in the LRU
list or to add or remove an entry.

But let's keep it simple at first, and do some testing..

> Also, is it easy to determine the space used by a dynahash with N
> entries? I haven't looked at the dynahash code yet, so perhaps this will
> be obvious.

Yes, see hash_estimate_size.

BTW: Thanks for the patch!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-03 00:24:18
Message-ID:	1178151858.28383.212.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote:
> Jeff Davis wrote:
> > On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote:
> >> Jeff Davis wrote:
> >>> What should be the maximum size of this hash table?
> >> Good question. And also, how do you remove entries from it?
> >>
> >> I guess the size should somehow be related to number of backends. Each
> >> backend will realistically be doing just 1 or max 2 seq scan at a time.
> >> It also depends on the number of large tables in the databases, but we
> >> don't have that information easily available. How about using just
> >> NBackends? That should be plenty, but wasting a few hundred bytes of
> >> memory won't hurt anyone.
> >
> > One entry per relation, not per backend, is my current design.
>
> Umm, you naturally have just entry per relation, but we were talking
> about how many entries the table needs to hold.. You're patch had a
> hard-coded value of 1000 which is quite arbitrary.

I think I'm missing something from your statement "I guess the size
should somehow be related to number of backends." If there is only one
entry per relation, why do more backends require a larger hash table?

> If you only have one seq scan in the system, there's no contention. If
> you have more, they will all need to acquire the lock to update the page
> number in the hash table, regardless of the table their scanning, so
> there's potential for contention.
>
> To put things into perspective, though, any time you need to evict a
> buffer from the buffer cache to read in a new buffer, you need to
> acquire the BufFreelistLock. If we only update the page number say every
> 10 pages, the overhead and lock contention of that lock would be roughly
> 1/10th of that arising from BufFreelistLock. And we can make it a
> conditional acquire, in which case the backends won't need to slow down
> their scans, but the updates of the entries in the hash table will get
> less frequent instead. We could also take just a shared lock to update
> the counter, and rely on the atomicity of the write like your patch did.
> You'd only need to take an exclusive lock to move entries in the LRU
> list or to add or remove an entry.
>

If I just step back for a second to consider a simpler design:

We will most likely have no more than a few relations larger than the
minimum threshold for Sync Scanning in any database being scanned
concurrently.

What if I just make an LRU that occupies a fixed size? Any reads or
updates can start at the MRU item and work backward, and any evictions
can happen at the LRU item.

Does a hash table really save us cycles if we keep the list small, say,
100 items? Looking at all 100 items would only be required perhaps on a
scan startup. I don't have a good sense of scale here, is following 50
pointers while holding a lock a significant source of contention?

100 is of course arbitrary, and that could grow or shrink automatically
at runtime.

> Yes, see hash_estimate_size.
>

Ok.

Regards,
Jeff Davis

From:	"Simon Riggs" <simon(at)enterprisedb(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Jeff Davis" <pgsql(at)j-davis(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-03 07:01:32
Message-ID:	1178175693.3633.38.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote:

> Umm, you naturally have just entry per relation, but we were talking
> about how many entries the table needs to hold.. You're patch had a
> hard-coded value of 1000 which is quite arbitrary.

We need to think of the interaction with partitioning here. People will
ask whether we would recommend that individual partitions of a large
table should be larger/smaller than a particular size, to allow these
optimizations to kick in.

My thinking is that database designers would attempt to set partition
size larger than the sync scan limit, whatever it is. That means:
- they wouldn't want the limit to vary when cache increases, so we *do*
need a GUC to control the limit. My suggestion now would be
large_scan_threshold, since it effects both caching and synch scans.
- so there will be lots of partitions, so a hardcoded limit of 1000
would not be sufficient. A new GUC, or a link to an existing one, is
probably required.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-03 08:25:29
Message-ID:	46399C79.4050400@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis wrote:
> On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote:
>> Jeff Davis wrote:
>>> On Wed, 2007-05-02 at 20:58 +0100, Heikki Linnakangas wrote:
>>>> Jeff Davis wrote:
>>>>> What should be the maximum size of this hash table?
>>>> Good question. And also, how do you remove entries from it?
>>>>
>>>> I guess the size should somehow be related to number of backends. Each
>>>> backend will realistically be doing just 1 or max 2 seq scan at a time.
>>>> It also depends on the number of large tables in the databases, but we
>>>> don't have that information easily available. How about using just
>>>> NBackends? That should be plenty, but wasting a few hundred bytes of
>>>> memory won't hurt anyone.
>>> One entry per relation, not per backend, is my current design.
>> Umm, you naturally have just entry per relation, but we were talking
>> about how many entries the table needs to hold.. You're patch had a
>> hard-coded value of 1000 which is quite arbitrary.
>
> I think I'm missing something from your statement "I guess the size
> should somehow be related to number of backends." If there is only one
> entry per relation, why do more backends require a larger hash table?

The hash table keeps track of ongoing seq scans. That's presumably
related to number of backends; I can't imagine a plan where a backend is
executing more than one seq scan at a time, so that gives us an upper
bound of NBackends simultaneous seq scans in the system.

Sure, if you only have say 5 tables that are large enough that we try to
keep track of them, there can't be more than 5 entries in the table. But
we don't know how many such tables there is at startup time.

A hard coded constant in the range of 10 - 100 might be just fine in
practice.

> If I just step back for a second to consider a simpler design:
>
> We will most likely have no more than a few relations larger than the
> minimum threshold for Sync Scanning in any database being scanned
> concurrently.

Note that the list would be shared with all databases in the cluster.
Your point is still valid, though.

> What if I just make an LRU that occupies a fixed size? Any reads or
> updates can start at the MRU item and work backward, and any evictions
> can happen at the LRU item.
>
> Does a hash table really save us cycles if we keep the list small, say,
> 100 items? Looking at all 100 items would only be required perhaps on a
> scan startup. I don't have a good sense of scale here, is following 50
> pointers while holding a lock a significant source of contention?

Hmm. If you only need to scan through it when you start the scan, I
suppose it would be ok.

> 100 is of course arbitrary, and that could grow or shrink automatically
> at runtime.

Except that it can't shrink, because we don't have any convenient moment
to remove an entry from the list, other than dropping the least recently
used entry out of the list when inserting a new one. And since it's in
shared mem, the allocated size is fixed at boot up, so we can't grow it
either.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)enterprisedb(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-03 08:54:26
Message-ID:	4639A342.10305@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> We need to think of the interaction with partitioning here. People will
> ask whether we would recommend that individual partitions of a large
> table should be larger/smaller than a particular size, to allow these
> optimizations to kick in.
>
> My thinking is that database designers would attempt to set partition
> size larger than the sync scan limit, whatever it is. That means:
> - they wouldn't want the limit to vary when cache increases, so we *do*
> need a GUC to control the limit. My suggestion now would be
> large_scan_threshold, since it effects both caching and synch scans.

They wouldn't? If you add more memory to your box, so that a table that
didn't fit in memory before does now fit, surely you want to switch your
strategy from "don't pollute the cache because it won't fit anyway" to
"let's keep it in cache, so the next scan won't do I/O".

The basic problem with partitions is that if you have a query like
"SELECT * FROM partitioned_table", so that you seq scan multiple
partitions, the size of each partition alone could be below the
threshold, whatever that is, but since you're scanning them all the net
result is the same as scanning one large table above the threshold. The
same could happen in any plan that does multiple seq scans. It could
even happen at the application level if the application submits multiple
statements like "SELECT * FROM table1", "SELECT * FROM table2" one after
each other.

One way to address that would be to manage the recycle buffer ring size
dynamically. When a backend gets a cache miss, the ring would shrink,
and when you get cache hits, it would grow. That way, if you do a seq
scan on a table that fits in cache repeatedly, that table would get more
buffers from the cache on each iteration until it's completely in cache.
But if the table or tables being scanned are too big to fit in cache,
the ring would stay small and not pollute the cache much.

I'm not going to implement that for now, I'm seeing some scary negative
feedback behavior with that, and it'd need a lot of testing anyway. I'm
thinking of just using shared_buffers as the limit. One could argue for
effective_cache_size as well.

> - so there will be lots of partitions, so a hardcoded limit of 1000
> would not be sufficient. A new GUC, or a link to an existing one, is
> probably required.

No matter how many partitions you have, each backend could still be
scanning only one of them at a time.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-03 17:54:09
Message-ID:	1178214849.28383.264.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2007-05-03 at 09:25 +0100, Heikki Linnakangas wrote:
> The hash table keeps track of ongoing seq scans. That's presumably
> related to number of backends; I can't imagine a plan where a backend is
> executing more than one seq scan at a time, so that gives us an upper
> bound of NBackends simultaneous seq scans in the system.
>

What I was trying to say before is that, in my design, it keeps track of
relations being scanned, not scans happening. So 5 backends scanning the
same table would result in one entry that consists of the most recently-
read block by any of those backends for that relation.

Part of the reason I did this is because, with a Sync Scan, it seems
like there may be possible reasons to do multiple seq scans concurrently
in the same backend. This is completely contrived, but consider:

SELECT * FROM bigtable UNION ALL SELECT * FROM bigtable;

I'm not trying to argue for such a plan to exist right now, but if there
is a possibility that we'd want to create such plans in the future, I
think it's better to track by relation rather than by backend.

There is really no reason that I can see to track 5 positions of 5
different backends scanning the same relation. It introduces a lot of
new questions, such as:

(1) When starting a new scan on relation R, with which backend do we
choose to synchronize?
(2) If 3/5 of those scans have finished (and there is therefore no point
to synchronize with those scans), how do we find the two that are still
moving?

I think storing one entry per relation being scanned, regardless of the
number of backends, nicely avoids both of those questions.

> > What if I just make an LRU that occupies a fixed size? Any reads or
> > updates can start at the MRU item and work backward, and any evictions
> > can happen at the LRU item.
> >
> > Does a hash table really save us cycles if we keep the list small, say,
> > 100 items? Looking at all 100 items would only be required perhaps on a
> > scan startup. I don't have a good sense of scale here, is following 50
> > pointers while holding a lock a significant source of contention?
>
> Hmm. If you only need to scan through it when you start the scan, I
> suppose it would be ok.

The idea is that the list would never grow longer than the total number
of tables that are larger than sync_seqscan_threshold, and would have
some maximum (to fit into fixed-size shared memory). The list would be a
bi-directional LRU list (MRU at head, LRU at tail). When evicting an
relation from the list due to space exhaustion, it would delete the tail
of the list. When a new scan starts and it's already in the list, it
would most likely just need to read a few of the hot elements at the
head of the list before it found the relation in question. When it's not
in the list, the scan may need to read all the way through to see that
it's not there.

> > 100 is of course arbitrary, and that could grow or shrink automatically
> > at runtime.
>
> Except that it can't shrink, because we don't have any convenient moment
> to remove an entry from the list, other than dropping the least recently
> used entry out of the list when inserting a new one. And since it's in
> shared mem, the allocated size is fixed at boot up, so we can't grow it
> either.
>

It can shrink when a new scan looks through the list and passes by
entries that don't need to be in the list.

I don't want to over-complicate things, so if you think having a hash
table is a better, simpler, or more readable design, I'll go with that
instead of the list-only approach. The only reason I suggested the list-
only approach was to see if the hash table was really gaining anything
for us. In practice, it will be quite rare that more than 10 different
big relations are being scanned at once, and I don't know how much of a
gain a hash table is when there are (in all but exceptional cases) only
a few entries.

Regards,
Jeff Davis

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Simon Riggs <simon(at)enterprisedb(dot)com>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-03 18:04:11
Message-ID:	1178215451.28383.272.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2007-05-03 at 08:01 +0100, Simon Riggs wrote:
> On Wed, 2007-05-02 at 23:59 +0100, Heikki Linnakangas wrote:
>
> > Umm, you naturally have just entry per relation, but we were talking
> > about how many entries the table needs to hold.. You're patch had a
> > hard-coded value of 1000 which is quite arbitrary.
>
> We need to think of the interaction with partitioning here. People will
> ask whether we would recommend that individual partitions of a large
> table should be larger/smaller than a particular size, to allow these
> optimizations to kick in.
>
> My thinking is that database designers would attempt to set partition
> size larger than the sync scan limit, whatever it is. That means:
> - they wouldn't want the limit to vary when cache increases, so we *do*
> need a GUC to control the limit. My suggestion now would be
> large_scan_threshold, since it effects both caching and synch scans.
> - so there will be lots of partitions, so a hardcoded limit of 1000
> would not be sufficient. A new GUC, or a link to an existing one, is
> probably required.
>

That's a very good point. I don't know how much we can do to fix it now
though, because that has interactions with the planner too: the planner
"should" choose to UNION ALL the relations in an order dependent on
other concurrent queries. I think this will require more thought.

To address the idea of scaling to more relations being concurrently
scanned I could use Heikki's recommendation of a dynamic hash table.

Regards,
Jeff Davis

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-03 18:27:51
Message-ID:	463A29A7.7080006@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis wrote:
> What I was trying to say before is that, in my design, it keeps track of
> relations being scanned, not scans happening. So 5 backends scanning the
> same table would result in one entry that consists of the most recently-
> read block by any of those backends for that relation.

We're talking across each other...

I understand that the data structure keeps track of relations being
scanned, with one entry per such relation. I think that's very good
design, and I'm not suggesting to change that.

But what's the right size for that? We don't know how many large tables
there is in the database, so we can't use that. A hard-coded constant
maybe? What would it be? I figured we could use NBackends, because there
can't realistically be more than one large seq scan per backend going on
at any point in time. That's an upper bound, in any real world
application there's far less than that.

> Part of the reason I did this is because, with a Sync Scan, it seems
> like there may be possible reasons to do multiple seq scans concurrently
> in the same backend. This is completely contrived, but consider:
>
> SELECT * FROM bigtable UNION ALL SELECT * FROM bigtable;
>
> I'm not trying to argue for such a plan to exist right now, but if there
> is a possibility that we'd want to create such plans in the future, I
> think it's better to track by relation rather than by backend.

Those two seqscans would be executed one after each other, not in
parallel, so you would still have just one seq scan at any given point
in time.

> (1) When starting a new scan on relation R, with which backend do we
> choose to synchronize?
> (2) If 3/5 of those scans have finished (and there is therefore no point
> to synchronize with those scans), how do we find the two that are still
> moving?
>
> I think storing one entry per relation being scanned, regardless of the
> number of backends, nicely avoids both of those questions.

Yeah, agreed. That's a good design.

>>> What if I just make an LRU that occupies a fixed size? Any reads or
>>> updates can start at the MRU item and work backward, and any evictions
>>> can happen at the LRU item.
>>>
>>> Does a hash table really save us cycles if we keep the list small, say,
>>> 100 items? Looking at all 100 items would only be required perhaps on a
>>> scan startup. I don't have a good sense of scale here, is following 50
>>> pointers while holding a lock a significant source of contention?
>> Hmm. If you only need to scan through it when you start the scan, I
>> suppose it would be ok.
>
> The idea is that the list would never grow longer than the total number
> of tables that are larger than sync_seqscan_threshold, and would have
> some maximum (to fit into fixed-size shared memory). The list would be a
> bi-directional LRU list (MRU at head, LRU at tail). When evicting an
> relation from the list due to space exhaustion, it would delete the tail
> of the list. When a new scan starts and it's already in the list, it
> would most likely just need to read a few of the hot elements at the
> head of the list before it found the relation in question. When it's not
> in the list, the scan may need to read all the way through to see that
> it's not there.

Yeah, that works well when there's just a few hot elements in the list.
I'm not sure how large the list would need to grow until scanning it
would become a performance bottleneck, but I suppose a list of 5-10
elements would be perfectly fine. But is that enough?

>>> 100 is of course arbitrary, and that could grow or shrink automatically
>>> at runtime.
>> Except that it can't shrink, because we don't have any convenient moment
>> to remove an entry from the list, other than dropping the least recently
>> used entry out of the list when inserting a new one. And since it's in
>> shared mem, the allocated size is fixed at boot up, so we can't grow it
>> either.
>>
>
> It can shrink when a new scan looks through the list and passes by
> entries that don't need to be in the list.

How do you identify an entry that doesn't need to be there?

> I don't want to over-complicate things, so if you think having a hash
> table is a better, simpler, or more readable design, I'll go with that
> instead of the list-only approach. The only reason I suggested the list-
> only approach was to see if the hash table was really gaining anything
> for us. In practice, it will be quite rare that more than 10 different
> big relations are being scanned at once, and I don't know how much of a
> gain a hash table is when there are (in all but exceptional cases) only
> a few entries.

How about implementing the list-only approach first, since you're going
to need the LRU list anyway, and we can add consider adding the hash
table later?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Sequential scans
Date:	2007-05-03 19:04:47
Message-ID:	1178219087.28383.286.camel@dogma.v10.wvs
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2007-05-03 at 19:27 +0100, Heikki Linnakangas wrote:
> I understand that the data structure keeps track of relations being
> scanned, with one entry per such relation. I think that's very good
> design, and I'm not suggesting to change that.
>
> But what's the right size for that? We don't know how many large tables
> there is in the database, so we can't use that. A hard-coded constant
> maybe? What would it be? I figured we could use NBackends, because there
> can't realistically be more than one large seq scan per backend going on
> at any point in time. That's an upper bound, in any real world
> application there's far less than that.

Ok, I understand you now. We'll choose the maximum size of the
table/list as the max_connections on startup.

> > Part of the reason I did this is because, with a Sync Scan, it seems
> > like there may be possible reasons to do multiple seq scans concurrently
> > in the same backend. This is completely contrived, but consider:
> >
> > SELECT * FROM bigtable UNION ALL SELECT * FROM bigtable;
> >
> > I'm not trying to argue for such a plan to exist right now, but if there
> > is a possibility that we'd want to create such plans in the future, I
> > think it's better to track by relation rather than by backend.
>
> Those two seqscans would be executed one after each other, not in
> parallel, so you would still have just one seq scan at any given point
> in time.

Right, I was just throwing out a hypothetical situation (if highly
contrived) in which it may help to do multiple scans at once. I think
this can be safely ignored for now though, since the planner would never
generate such a plan, and I can't think of a reasonable use case.

> >>> 100 is of course arbitrary, and that could grow or shrink automatically
> >>> at runtime.
> >> Except that it can't shrink, because we don't have any convenient moment
> >> to remove an entry from the list, other than dropping the least recently
> >> used entry out of the list when inserting a new one. And since it's in
> >> shared mem, the allocated size is fixed at boot up, so we can't grow it
> >> either.
> >>
> >
> > It can shrink when a new scan looks through the list and passes by
> > entries that don't need to be in the list.
>
> How do you identify an entry that doesn't need to be there?
>

Some type of clock-like algorithm where a counter was decremented when
an element is passed up and incremented when it's chosen might work.
It's probably easier to just use a hash table though.

> > I don't want to over-complicate things, so if you think having a hash
> > table is a better, simpler, or more readable design, I'll go with that
> > instead of the list-only approach. The only reason I suggested the list-
> > only approach was to see if the hash table was really gaining anything
> > for us. In practice, it will be quite rare that more than 10 different
> > big relations are being scanned at once, and I don't know how much of a
> > gain a hash table is when there are (in all but exceptional cases) only
> > a few entries.
>
> How about implementing the list-only approach first, since you're going
> to need the LRU list anyway, and we can add consider adding the hash
> table later?
>

Sounds good. Will do.

Regards,
Jeff Davis