Re: [DOCS] synchronize_seqscans' description is a bit misleading

From: Gurjeet Singh <gurjeet(at)singh(dot)im>
To: PostgreSQL Docs <pgsql-docs(at)postgresql(dot)org>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: synchronize_seqscans' description is a bit misleading
Date: 2013-04-11 01:57:06
Message-ID: CABwTF4VwxS+jjT2RZSzHny5LArW+jFjFn5uiGH8cTRCXETGNag@mail.gmail.com
Lists: pgsql-docs pgsql-hackers

If I'm reading the code right [1], this GUC does not actually *synchronize*
the scans, but instead just makes sure that a new scan starts from a block
that was reported by some other backend performing a scan on the same
relation.
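
(For anyone who wants to play with it, the parameter can be changed per
session; the table name below is made up for illustration:)

    -- Check the current setting; the default is 'on'.
    SHOW synchronize_seqscans;

    -- Turn it off for just this session; new sequential scans will
    -- then always start from block 0 of the table.
    SET synchronize_seqscans = off;

    SELECT count(*) FROM some_large_table;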

Since the backends scanning the relation may be processing the relation at
different speeds, even though each one took the hint when starting the
scan, they may end up being out of sync with each other. Even in a single
query, there may be different scan nodes scanning different parts of the
same relation, and even they don't synchronize with each other (and for
good reason).

Imagining that all scans on a table are always synchronized may lead some
to wrongly believe that adding more backends scanning the same table will
not incur any extra I/O; that is, that only one stream of blocks will be
read from disk no matter how many backends you add to the mix. I noticed
this when I was creating partition tables, each of which was a CREATE TABLE
AS SELECT FROM original_table (to avoid WAL generation), and running more
than 3 such transactions caused the disk read throughput to behave
unpredictably, sometimes even dipping below 1 MB/s for a few seconds at a
stretch.
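
Each of those transactions looked roughly like the following (the table,
column, and predicate are made up here; the shape is what matters):

    BEGIN;
    -- CREATE TABLE AS can skip WAL for the new table's contents
    -- (when wal_level permits), hence the choice over INSERT ... SELECT.
    CREATE TABLE orders_2013_04 AS
        SELECT *
          FROM original_table
         WHERE created_on >= DATE '2013-04-01'
           AND created_on <  DATE '2013-05-01';
    COMMIT;

Running several of these concurrently means several sequential scans of
original_table racing each other.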

Please note that I am not complaining about the implementation, which I
think is the best we can do without making backends wait for each other.
It's just that the documentation [2] implies that the scans are
synchronized through the entire run, which is clearly not the case. So I'd
like the docs to be improved to reflect that.

How about something like:

<doc>
synchronize_seqscans (boolean)
This allows sequential scans of large tables to start from a point in
the table that is already being read by another backend. This increases the
probability that concurrent scans read the same block at about the same
time and hence share the I/O workload. Note that, because backends may
process the table at different speeds, they may eventually fall out of
sync, and hence stop sharing the I/O workload.

When this is enabled, ... The default is on.
</doc>

Best regards,

[1] src/backend/access/heap/heapam.c
[2]
http://www.postgresql.org/docs/9.2/static/runtime-config-compatible.html#GUC-SYNCHRONIZE-SEQSCANS

--
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gurjeet Singh <gurjeet(at)singh(dot)im>
Cc: PostgreSQL Docs <pgsql-docs(at)postgresql(dot)org>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: synchronize_seqscans' description is a bit misleading
Date: 2013-04-11 03:10:05
Message-ID: 18280.1365649805@sss.pgh.pa.us
Lists: pgsql-docs pgsql-hackers

Gurjeet Singh <gurjeet(at)singh(dot)im> writes:
> If I'm reading the code right [1], this GUC does not actually *synchronize*
> the scans, but instead just makes sure that a new scan starts from a block
> that was reported by some other backend performing a scan on the same
> relation.

Well, that's the only *direct* effect, but ...

> Since the backends scanning the relation may be processing the relation at
> different speeds, even though each one took the hint when starting the
> scan, they may end up being out of sync with each other.

The point you're missing is that the synchronization is self-enforcing:
whichever backend gets ahead of the others will be the one forced to
request (and wait for) the next physical I/O. This will naturally slow
down the lower-CPU-cost-per-page scans. The other ones tend to catch up
during the I/O operation.

The feature is not terribly useful unless I/O costs are high compared to
the CPU cost-per-page. But when that is true, it's actually rather
robust. Backends don't have to have exactly the same per-page
processing cost, because pages stay in shared buffers for a while after
the current scan leader reads them.

> Imagining that all scans on a table are always synchronized may lead some
> to wrongly believe that adding more backends scanning the same table will
> not incur any extra I/O; that is, that only one stream of blocks will be
> read from disk no matter how many backends you add to the mix. I noticed
> this when I was creating partition tables, each of which was a CREATE TABLE
> AS SELECT FROM original_table (to avoid WAL generation), and running more
> than 3 such transactions caused the disk read throughput to behave
> unpredictably, sometimes even dipping below 1 MB/s for a few seconds at a
> stretch.

It's not really the scans that are causing that to be unpredictable; it's
the write I/O from the output side, which is forcing highly
nonsequential behavior (or at least I suspect so ... how many disk units
were involved in this test?)

regards, tom lane


From: Gurjeet Singh <gurjeet(at)singh(dot)im>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Docs <pgsql-docs(at)postgresql(dot)org>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: synchronize_seqscans' description is a bit misleading
Date: 2013-04-11 03:39:38
Message-ID: CABwTF4XZvDki20+edBaHPzT3rvahEtpnoa+W+nwKWgBvqoPG4Q@mail.gmail.com
Lists: pgsql-docs pgsql-hackers

On Wed, Apr 10, 2013 at 11:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Gurjeet Singh <gurjeet(at)singh(dot)im> writes:
> > If I'm reading the code right [1], this GUC does not actually *synchronize*
> > the scans, but instead just makes sure that a new scan starts from a block
> > that was reported by some other backend performing a scan on the same
> > relation.
>
> Well, that's the only *direct* effect, but ...
>
> > Since the backends scanning the relation may be processing the relation at
> > different speeds, even though each one took the hint when starting the
> > scan, they may end up being out of sync with each other.
>
> The point you're missing is that the synchronization is self-enforcing:
> whichever backend gets ahead of the others will be the one forced to
> request (and wait for) the next physical I/O. This will naturally slow
> down the lower-CPU-cost-per-page scans. The other ones tend to catch up
> during the I/O operation.
>

Got it. So far, so good.

Let's consider a pathological case where a scan is performed by a
user-controlled cursor, whose scan speed depends on how fast the user
presses the "Next" button; such a scan is quickly going to fall out of sync
with the other scans. Moreover, if a new scan happens to pick up the block
reported by this slow scan, then that new scan may have to read blocks off
the disk afresh.
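
For instance (table name hypothetical):

    BEGIN;
    DECLARE c CURSOR FOR SELECT * FROM big_table;
    FETCH 100 FROM c;   -- paced by the user pressing "Next"
    -- ... arbitrarily long pause, while other scans race ahead ...
    FETCH 100 FROM c;
    CLOSE c;
    COMMIT;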

So, again, it is not guaranteed that all the scans on a relation will
synchronize with each other. Hence my proposal to include the term
'probability' in the definition.

> The feature is not terribly useful unless I/O costs are high compared to
> the CPU cost-per-page. But when that is true, it's actually rather
> robust. Backends don't have to have exactly the same per-page
> processing cost, because pages stay in shared buffers for a while after
> the current scan leader reads them.
>

Agreed. Even if the buffer has been evicted from shared_buffers, there's a
high likelihood that the scan that's close on the heels of others will
fetch it from FS cache.

>
> > Imagining that all scans on a table are always synchronized may lead some
> > to wrongly believe that adding more backends scanning the same table will
> > not incur any extra I/O; that is, that only one stream of blocks will be
> > read from disk no matter how many backends you add to the mix. I noticed
> > this when I was creating partition tables, each of which was a CREATE
> > TABLE AS SELECT FROM original_table (to avoid WAL generation), and
> > running more than 3 such transactions caused the disk read throughput to
> > behave unpredictably, sometimes even dipping below 1 MB/s for a few
> > seconds at a stretch.
>
> It's not really the scans that are causing that to be unpredictable; it's
> the write I/O from the output side, which is forcing highly
> nonsequential behavior (or at least I suspect so ... how many disk units
> were involved in this test?)
>

You may be right. I don't have access to the system anymore, and I don't
remember the disk layout, but it's quite possible that the write operations
were causing the read throughput to drop. I did try to reproduce the
behaviour on my laptop with up to 6 backends doing pure reads on a table
that was several times the size of the system RAM, but I could not get them
to fall out of sync.
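
(Each backend simply ran a pure sequential read in a loop, along the lines
of the following; the table name is again hypothetical:)

    -- big_table sized at several times the machine's RAM
    SELECT count(*) FROM big_table;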

--
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gurjeet Singh <gurjeet(at)singh(dot)im>
Cc: PostgreSQL Docs <pgsql-docs(at)postgresql(dot)org>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [DOCS] synchronize_seqscans' description is a bit misleading
Date: 2013-04-11 03:56:44
Message-ID: 19284.1365652604@sss.pgh.pa.us
Lists: pgsql-docs pgsql-hackers

Gurjeet Singh <gurjeet(at)singh(dot)im> writes:
> On Wed, Apr 10, 2013 at 11:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The point you're missing is that the synchronization is self-enforcing:

> Let's consider a pathological case where a scan is performed by a
> user-controlled cursor, whose scan speed depends on how fast the user
> presses the "Next" button; such a scan is quickly going to fall out of sync
> with the other scans. Moreover, if a new scan happens to pick up the block
> reported by this slow scan, then that new scan may have to read blocks off
> the disk afresh.

Sure --- if a backend stalls completely, it will fall out of the
synchronized group. And that's a good thing; we'd surely not want to
block the other queries while waiting for a user who just went to lunch.

> So, again, it is not guaranteed that all the scans on a relation will
> synchronize with each other. Hence my proposal to include the term
> 'probability' in the definition.

Yeah, it's definitely not "guaranteed" in any sense. But I don't really
think your proposed wording is an improvement. The existing wording
isn't promising guaranteed sync either, to my eyes.

Perhaps we could compromise on, say, changing "so that concurrent scans
read the same block at about the same time" to "so that concurrent scans
tend to read the same block at about the same time", or something like
that. I don't mind making it sound a bit more uncertain, but I don't
think that we need to emphasize the probability of failure.

regards, tom lane


From: Gurjeet Singh <gurjeet(at)singh(dot)im>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Docs <pgsql-docs(at)postgresql(dot)org>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [DOCS] synchronize_seqscans' description is a bit misleading
Date: 2013-04-11 08:07:41
Message-ID: CABwTF4Vo0CypkG0hAn31tws=7yLgS9GgH9domE6wHK-zorCRLQ@mail.gmail.com
Lists: pgsql-docs pgsql-hackers

On Wed, Apr 10, 2013 at 11:56 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Gurjeet Singh <gurjeet(at)singh(dot)im> writes:
> > So, again, it is not guaranteed that all the scans on a relation will
> > synchronize with each other. Hence my proposal to include the term
> > 'probability' in the definition.
>
> Yeah, it's definitely not "guaranteed" in any sense. But I don't really
> think your proposed wording is an improvement. The existing wording
> isn't promising guaranteed sync either, to my eyes.
>

Given Postgres' track record of delivering what it promises, I expect
casual readers to take that phrase as a definitive guide to what is
happening internally.

>
> Perhaps we could compromise on, say, changing "so that concurrent scans
> read the same block at about the same time" to "so that concurrent scans
> tend to read the same block at about the same time",

Given that, on first read, the word "about" did not deter me from assuming
the best, I don't think adding "tend" would make much difference in a
reader's (mis)understanding. Perhaps we can spare a few more words to make
it clearer.

> or something like
> that. I don't mind making it sound a bit more uncertain, but I don't
> think that we need to emphasize the probability of failure.
>

I agree we don't want to stress the failure case too much, especially since
even in that case performance is no worse than in the absence of the
feature. But we don't want the reader to get the wrong idea either.

In addition to the slight doc improvement being suggested, perhaps a
wiki.postgresql.org entry would allow us to explain the behaviour in more
detail.

--
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.