Re: Bulk Inserts and WAL Inserts

From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Bulk Inserts
Date: 2009-09-14 14:26:53
Message-ID: op.uz83q3tecke6l8@soyouz
Lists: pgsql-hackers


I've done a little experiment with bulk inserts.

=> heap_bulk_insert()

Behaves like heap_insert except it takes an array of tuples (HeapTuple
*tups, int ntups).

- Grabs a page (same as heap_insert)

- While holding exclusive lock, inserts as many tuples as it can on the
page.
- Either the page gets full
- Or we run out of tuples.

- Generate xlog : choice between
- Full Xlog mode :
- if we inserted more than 10 tuples (totally bogus heuristic), log the
entire page
- Else, log individual tuples as heap_insert does
- Light log mode :
- if page was empty, only xlog a "new empty page" record, not page
contents
- else, log fully
- heap_sync() at the end

- Release the page
- If we still have tuples to insert, repeat.
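
To make the flow above concrete, here is a small standalone model of it (not
the actual patch: pages are reduced to byte counters, and the usable page
space, tuple size and 10-tuple threshold are just illustrative numbers):

/* Standalone model of the heap_bulk_insert() control flow described above.
 * Not backend code: it only counts how many full-page vs. per-tuple WAL
 * records the "Full Xlog" heuristic would emit for a given load. */
#include <stdio.h>

#define PAGE_USABLE_BYTES 8100      /* ~8 kB heap page minus headers */
#define FULL_PAGE_THRESHOLD 10      /* the "totally bogus heuristic" */

int main(void)
{
    int  tuples_left   = 100000;
    int  tuple_size    = 40;        /* say, 8 INT columns plus tuple header */
    long pages         = 0;
    long page_images   = 0;         /* WAL records logging a whole page */
    long tuple_records = 0;         /* WAL records logging a single tuple */

    while (tuples_left > 0)
    {
        /* "grab a page" and, under the exclusive lock, insert as many
         * tuples as fit, or until we run out of tuples */
        int free_space = PAGE_USABLE_BYTES;
        int placed = 0;

        while (tuples_left > 0 && free_space >= tuple_size)
        {
            free_space -= tuple_size;
            placed++;
            tuples_left--;
        }

        /* Full Xlog mode: log the whole page if we inserted many tuples,
         * otherwise log the tuples individually as heap_insert would */
        if (placed > FULL_PAGE_THRESHOLD)
            page_images++;
        else
            tuple_records += placed;

        pages++;                    /* release the page, repeat */
    }

    printf("%ld pages filled, %ld full-page WAL records, %ld per-tuple records\n",
           pages, page_images, tuple_records);
    return 0;
}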

Am I right in assuming that :

1)
- If the page was empty,
- and log archiving isn't used,
- and the table is heap_sync()'d at the end,
=> only a "new empty page" record needs to be created, then the page can
be completely filled ?

2)
- If the page isn't empty
- or log archiving is used,
=> logging either the inserted tuples or the entire page is OK to
guarantee persistence ?

(I used kill -9 to test it, recovery seems to work).

Test on a concurrent COPY, 4 threads, on a table with 8 INT columns.

* 8.5 HEAD : Total Time 44 s
* Bulk inserts, Full XLog : Total Time 24 s
* Bulk inserts, Light XLog : Total Time 10 s

Quite a bit faster... I presume with more CPUs it would scale.

I'm not posting the patch because it's quite ugly (especially the part to
store tuples in copy.c and bulk-insert them, I should probably have used a
tuplestore...)
I think the tuples need to be stored and then bulk-inserted because the
exclusive lock on the buffer can't be held for a long time.

Lock stats (from the patch I just posted) :

* 8.5 HEAD : Total Time 44 s

-------- Lock stats for PID 28043
PID Lock ShAcq ShWait ShWaitT ShHeldT ExAcq ExWait ExWaitT ExHeldT Name
28043 7 0 0 0.00 0.00 2500002 804378 23.59 ( 53.11 %) 7.38 ( 16.61 %) WALInsert
28043 8 0 0 0.00 0.00 25775 32 2.91 ( 6.54 %) 0.90 ( 2.02 %) WALWrite

-------- Lock stats for PID 28044
PID Lock ShAcq ShWait ShWaitT ShHeldT ExAcq ExWait ExWaitT ExHeldT Name
28044 7 0 0 0.00 0.00 2500002 802515 22.26 ( 50.11 %) 8.70 ( 19.59 %) WALInsert
28044 8 0 0 0.00 0.00 25620 42 4.00 ( 9.01 %) 1.12 ( 2.52 %) WALWrite

-------- Lock stats for PID 28045
PID Lock ShAcq ShWait ShWaitT ShHeldT ExAcq ExWait ExWaitT ExHeldT Name
28045 7 0 0 0.00 0.00 2500002 799145 22.47 ( 50.32 %) 8.72 ( 19.52 %) WALInsert
28045 8 0 0 0.00 0.00 25725 38 4.08 ( 9.14 %) 1.05 ( 2.35 %) WALWrite

-------- Lock stats for PID 28042
PID Lock ShAcq ShWait ShWaitT ShHeldT ExAcq ExWait ExWaitT ExHeldT Name
28042 7 0 0 0.00 0.00 2500002 809477 23.49 ( 52.44 %) 7.89 ( 17.62 %) WALInsert
28042 8 0 0 0.00 0.00 25601 37 3.27 ( 7.31 %) 1.05 ( 2.34 %) WALWrite

* Bulk inserts, Full XLog : Total Time 24 s

-------- Lock stats for PID 32486
PID Lock ShAcq ShWait ShWaitT ShHeldT ExAcq ExWait ExWaitT ExHeldT Name
32486 7 0 0 0.00 0.00 23765 1128 9.22 ( 38.98 %) 4.05 ( 17.14 %) WALInsert
32486 8 0 0 0.00 0.00 21120 19 2.64 ( 11.17 %) 1.32 ( 5.59 %) WALWrite

-------- Lock stats for PID 32484
PID Lock ShAcq ShWait ShWaitT ShHeldT ExAcq ExWait ExWaitT ExHeldT Name
32484 7 0 0 0.00 0.00 23865 1083 9.87 ( 41.68 %) 2.87 ( 12.11 %) WALInsert
32484 8 0 0 0.00 0.00 21105 11 1.68 ( 7.11 %) 1.09 ( 4.62 %) WALWrite
32484 8508 0 0 0.00 0.00 1 1 0.19 ( 0.81 %) 0.00 ( 0.00 %)
32484 18846 0 0 0.00 0.00 1 1 0.25 ( 1.05 %) 0.00 ( 0.00 %)

-------- Lock stats for PID 32485
PID Lock ShAcq ShWait ShWaitT ShHeldT ExAcq ExWait ExWaitT ExHeldT Name
32485 7 0 0 0.00 0.00 23816 1107 8.94 ( 37.75 %) 4.05 ( 17.09 %) WALInsert
32485 8 0 0 0.00 0.00 21109 21 2.59 ( 10.93 %) 1.36 ( 5.77 %) WALWrite
32485 16618 0 0 0.00 0.00 1 2 0.23 ( 0.98 %) 0.00 ( 0.00 %)

-------- Lock stats for PID 32482
PID Lock ShAcq ShWait ShWaitT ShHeldT ExAcq ExWait ExWaitT ExHeldT Name
32482 7 0 0 0.00 0.00 23813 1053 9.70 ( 40.75 %) 3.41 ( 14.32 %) WALInsert
32482 8 0 0 0.00 0.00 21119 15 2.24 ( 9.43 %) 1.06 ( 4.44 %) WALWrite
32482 6770 0 0 0.00 0.00 3 1 0.17 ( 0.70 %) 0.00 ( 0.00 %)

* Bulk inserts, Light XLog : Total Time 10 s

No Lock stats to show (wait time is < 0.01 s)...


From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Bulk Inserts
Date: 2009-09-14 19:25:41
Message-ID: op.uz9hk3hqcke6l8@soyouz
Lists: pgsql-hackers


Replying to myself...

Jeff suggested building pages in local memory and inserting them into the
table later. This is what's used in CLUSTER, for instance, I believe.

It has some drawbacks though :

- To insert the tuples into indexes, the tuples need TIDs, but if you build
the page in local memory, you don't know which block they will end up on
until the page is allocated, which will probably happen after the page is
built, so it's a bit of a chicken-and-egg problem.

- It only works on new pages. Pages which are not empty, but have free
space, cannot be written in this way.

The little experiment I made yesterday does not have these drawbacks,
since it allocates pages in the standard way; it simply inserts many
tuples in one operation instead of just one. If the page happens to be
empty, it's even better, but it's not necessary. If your table has lots of
free space, it will be used.


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulk Inserts
Date: 2009-09-15 01:55:50
Message-ID: f67928030909141855y2ff8993epe4d967a769cebb56@mail.gmail.com
Lists: pgsql-hackers

2009/9/14 Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>

>
> I've done a little experiment with bulk inserts.
>
> => heap_bulk_insert()
>
> Behaves like heap_insert except it takes an array of tuples (HeapTuple
> *tups, int ntups).
>
> - Grabs a page (same as heap_insert)
>
> - While holding exclusive lock, inserts as many tuples as it can on the
> page.
> - Either the page gets full
> - Or we run out of tuples.
>
> - Generate xlog : choice between
> - Full Xlog mode :
> - if we inserted more than 10 tuples (totaly bogus
> heuristic), log the entire page
> - Else, log individual tuples as heap_insert does
>

Does that heuristic change the timings much? If not, it seems like it would
be better to keep it simple and always do the same thing, like logging the
tuples (if it is done under one WALInsertLock, which I am assuming it is...)

> - Light log mode :
> - if page was empty, only xlog a "new empty page" record,
> not page contents
> - else, log fully
> - heap_sync() at the end
>
> - Release the page
> - If we still have tuples to insert, repeat.
>
> Am I right in assuming that :
>
> 1)
> - If the page was empty,
> - and log archiving isn't used,
> - and the table is heap_sync()'d at the end,
> => only a "new empty page" record needs to be created, then the page can be
> completely filled ?
>

Do you even need the new empty page record? I think a zero page will be
handled correctly the next time it is read into shared buffers, won't it? But
I guess it is needed to avoid problems with partial page writes that would
leave the page in a state that is neither all zeros nor consistent.

> 2)
> - If the page isn't empty
> - or log archiving is used,
> => logging either the inserted tuples or the entire page is OK to guarantee
> persistence ?
>

If the entire page is logged, would it have to be marked as not removable by
the log compression tool? Or can the tool recreate the needed delta?

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulk Inserts
Date: 2009-09-15 02:33:00
Message-ID: f67928030909141933o525b355au4a5bbd1a20faa59b@mail.gmail.com
Lists: pgsql-hackers

2009/9/14 Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>

>
> Replying to myself...
>
> Jeff suggested to build pages in local memory and insert them later in the
> table. This is what's used in CLUSTER for instance, I believe.
>
> It has some drawbacks though :
>
> - To insert the tuples in indexes, the tuples need tids, but if you build
> the page in local memory, you don't know on which page they will be until
> after allocating the page, which will probably be done after the page is
> built, so it's a bit of a chicken and egg problem.
>

Yes, I did not consider that to be a problem because I did not think it
would be used on indexed tables. I figured that the gain from doing bulk
inserts into the table would be so diluted by the still-bottle-necked index
maintenance that it was OK not to use this optimization for indexed tables.

>
> - It only works on new pages. Pages which are not empty, but have free
> space, cannot be written in this way.
>

My original thought was based on the idea of still using heap_insert, but
with a modified form of bistate which would hold the exclusive lock and not
just a pin. If heap_insert is being driven by the unmodified COPY code,
then it can't guarantee that COPY won't stall on a pipe read or something,
and so probably shouldn't hold an exclusive lock while filling the block.
That is why I decided a local buffer would be better, as the exclusive lock
is really a no-op and wouldn't block anyone. But if you are creating a new
heap_bulk_insert and modifying the COPY to go with it, then you can
guarantee it won't stall from the driving end, instead.

Whether any of these approaches will be maintainable enough to be
integrated into the code base is another matter. It seems like there is
already a lot of discussion going on around various permutations of copy
options.

Cheers,

Jeff


From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulk Inserts
Date: 2009-09-15 07:23:05
Message-ID: op.u0aesrf8cke6l8@soyouz
Lists: pgsql-hackers


> Yes, I did not consider that to be a problem because I did not think it
> would be used on indexed tables. I figured that the gain from doing bulk
> inserts into the table would be so diluted by the still-bottle-necked
> index maintenance that it was OK not to use this optimization for
> indexed tables.

I've tested with indexes, and the index update time is much larger than
the insert time. Bulk inserts still provide a little bonus though, and
having a solution that works in all cases is better IMHO.

> My original thought was based on the idea of still using heap_insert, but
> with a modified form of bistate which would hold the exclusive lock and
> not
> just a pin. If heap_insert is being driven by the unmodified COPY code,
> then it can't guarantee that COPY won't stall on a pipe read or
> something,
> and so probably shouldn't hold an exclusive lock while filling the block.

Exactly, that's what I was thinking too, and reached the same conclusion.

> That is why I decided a local buffer would be better, as the exclusive
> lock
> is really a no-op and wouldn't block anyone. But if you are creating a
> new
> heap_bulk_insert and modifying the COPY to go with it, then you can
> guarantee it won't stall from the driving end, instead.

I think it's better, but you have to buffer tuples : at least a full
page's worth, or better, several pages' worth of tuples, in case inline
compression kicks in and shrinks them, since the purpose is to be able to
fill a complete page in one go.

> Whether any of these approaches will be maintainable enough to be
> integrated into the code base is another matter. It seems like there is
> already a lot of discussion going on around various permutations of copy
> options.

It's not really a COPY mod, since it would also be good for big INSERT
INTO SELECT FROM which is wal-bound too (even more so than COPY, since
there is no parsing to do).


From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulk Inserts
Date: 2009-09-15 07:28:29
Message-ID: op.u0ae1rg8cke6l8@soyouz
Lists: pgsql-hackers

> Does that heuristic change the timings much? If not, it seems like it
> would
> better to keep it simple and always do the same thing, like log the
> tuples
> (if it is done under one WALInsertLock, which I am assuming it is..)

It is the logging of whole pages that makes it faster.
If you fill a page with tuples in one operation (while holding exclusive
lock) and then insert WAL records for each tuple, there is no speed gain.

Inserting a full page WAL record (since you just filled the page
completely) :

- only takes WalInsertLock once instead of once per tuple
- reduces wal traffic
- is about 2x faster in my benchmark

And inserting a "clear new page" record (if the page was previously
new/empty and relation is fsync'd at the end) :

- only takes WalInsertLock once instead of once per tuple
- reduces wal traffic a lot
- is about 4x faster in my benchmark

> Do you even need the new empty page record? I think a zero page will be
> handled correctly next time it is read into shared buffers, won't it?

I have no idea ;)

> But I
> guess it is need to avoid problems with partial page writes that would
> leave in a state that is neither all zeros nor consistent.

Plus, empty page records make for very small WAL traffic and I didn't see
any performance difference with or without them.

> If the entire page is logged, would it have to marked as not removable by
> the log compression tool? Or can the tool recreate the needed delta?

No, the tool cannot recreate the data, since the idea is precisely to
replace a lot of "tuple insert" messages with one "entire page" message,
which takes both less space and less time. The warm-standby replicators
that get this WAL need to know the page contents to replicate it... (also,
it will probably be faster for them to redo a page write than redo all the
tuple inserts).

Here is what I'm thinking about now :

* have some kind of BulkInsertState which contains
- info about the relation, indexes, triggers, etc
- a tuple queue.

The tuple queue may be a tuple store, or simply tuple copies in a local
memory context.

You'd have functions to :

- Setup the BulkInsertState
- Add a tuple to the BulkInsertState
- Finish the operation and clear the BulkInsertState

When adding a tuple, it is stored in the queue.
When the queue is full, a bulk insert operation takes place (hopefully
filling an entire page) and returns.
Post-insert triggers and index updates are also handled at this point.

When finished, the function that clears the state also inserts all
remaining tuples in the queue.
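
As a rough sketch of what that interface could look like (names, the struct
layout and the plain-array queue are made up for illustration; this is not
backend code, and the real thing would copy tuples into a memory context or a
tuplestore and do the WAL/index/trigger work inside the flush step):

/* Sketch of the proposed bulk-insert interface; purely illustrative. */
#include <stdlib.h>

typedef struct Tuple Tuple;             /* stand-in for HeapTuple */

typedef struct BulkInsertState
{
    void   *relation;                   /* target relation, indexes, triggers... */
    Tuple **queue;                      /* buffered tuple copies */
    int     nqueued;
    int     queue_size;
} BulkInsertState;

/* placeholder for the real work: fill heap pages, write WAL, update
 * indexes, fire post-insert triggers */
static void bulk_insert_flush(BulkInsertState *state)
{
    /* heap_bulk_insert(state->relation, state->queue, state->nqueued); */
    state->nqueued = 0;
}

BulkInsertState *bulk_insert_begin(void *relation, int queue_size)
{
    BulkInsertState *state = malloc(sizeof(BulkInsertState));

    state->relation = relation;
    state->queue = malloc(queue_size * sizeof(Tuple *));
    state->nqueued = 0;
    state->queue_size = queue_size;
    return state;
}

void bulk_insert_add(BulkInsertState *state, Tuple *tup)
{
    state->queue[state->nqueued++] = tup;
    if (state->nqueued == state->queue_size)
        bulk_insert_flush(state);       /* hopefully fills whole pages */
}

void bulk_insert_end(BulkInsertState *state)
{
    if (state->nqueued > 0)
        bulk_insert_flush(state);       /* insert the remaining tuples */
    free(state->queue);
    free(state);
}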

With this you could also do something *really* interesting : bulk index
updates...

Bulk index updates are probably mutually exclusive with after-row triggers
though.

Another angle of attack would be to make wal-writing more efficient...


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulk Inserts
Date: 2009-09-16 03:04:54
Message-ID: f67928030909152004i103d7c4by3353826ac21aaa5d@mail.gmail.com
Lists: pgsql-hackers

2009/9/15 Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>

> Does that heuristic change the timings much? If not, it seems like it
>> would
>> better to keep it simple and always do the same thing, like log the tuples
>> (if it is done under one WALInsertLock, which I am assuming it is..)
>>
>
> It is the logging of whole pages that makes it faster.
> If you fill a page with tuples in one operation (while holding exclusive
> lock) and then insert WAL records for each tuple, there is no speed gain.
>

> Inserting a full page WAL record (since you just filled the page
> completely) :
>
> - only takes WalInsertLock once instead of once per tuple
>

OK, that makes sense. I thought you had hacked either XLogInsert or the
heap WAL replay code so that you could just accumulate tuples in the rdata
chain and then submit them all under the cover of a single WALInsertLock.
If you haven't done that, then of course doing the bulk insert doesn't help
much if you still do tuple-by-tuple XLogInsert. So in the case that it is
under the limit, you first run through the tuples putting them into the
block, then run through the tuples again doing the XLogInserts?

> - reduces wal traffic
> - is about 2x faster in my benchmark
>
> And inserting a "clear new page" record (if the page was previously
> new/empty and relation is fsync'd at the end) :
>
> - only takes WalInsertLock once instead of once per tuple
> - reduces wal traffic a lot
> - is about 4x faster in my benchmark

Do you have an IO bottleneck even in the absence of fsyncs? My experience
on multi-core machines with decent IO systems has been that the amount of
WAL traffic (by volume) matters rather little, as opposed to the number of
WALInsertLocks taken, which matters quite a bit. Of course this depends
quite a bit on your OS and hardware.

...

Another angle of attack would be to make wal-writing more efficient...
>

If you mean to do this without changing the xlog interfaces, I'm not
optimistic.

If you have to call XLogInsert once per row that is copied (or
insert...select), then my experiments show that simply taking the
WALInsertLock and immediately releasing it, doing absolutely no real work
while it is held, is already a substantial multi-core scalability
bottleneck. Once we accept that this must be done, the next existing
bottleneck is the memcpy of the first byte from the rdata chain into the
shared wal_buffers, presumably because this copy involves fighting the cache
line away from other cores. Once you've copied the first byte, the rest of
them seem to be almost free. (Again, this is probably hardware and
situation dependent).

I've seen some suggestions that the wal_buffer block initiation work be moved
from being done by AdvanceXLInsert to instead be done by XLogWrite.
However, I've not seen any indication that AdvanceXLInsert is a meaningful
bottleneck in the first place. Except when wal_buffers is too small: then
AdvanceXLInsert is a bottleneck, but only because XLogWrite is getting
called from within it, in which case moving work from one to the other is
probably not going to make things better.

Jeff


From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulk Inserts
Date: 2009-09-16 10:15:34
Message-ID: op.u0chf8qwcke6l8@soyouz
Lists: pgsql-hackers


> OK, that makes sense. I thought you had hacked either XLogInsert or the
> heap WAL replay code so that you could just accumulate tuples in the
> rdata
> chain and then submit them all under the cover of a single WALInsertLock.

Ah, no, I did not do that.

This would be difficult to do : rdata chain contains buffer pointers, and
when we are in XLogInsert, we have an exclusive lock on the buffer. If
XLogInsert accumulated xlog records as you say, then later, when it's time
to write them to the xlog, it would no longer hold exclusive lock on the
buffer, so its contents could have changed, and if XLogInsert decides to
do a full page write, the contents will be wrong.

Besides, the LSN needs to be set in the page at every call to heap_insert
(or else WAL will not be "Ahead"), so if XLogInsert accumulated stuff
before writing it, it would need a mechanism to assign a LSN without
having written the xlog data yet.

> If you haven't done that, then of course doing the bulk insert doesn't
> help much if you still do tuple-by-tuple XLogInsert.

Exactly.

> So in the case that it is
> under the limit, you first run through the tuples putting them into the
> block, then run through the tuples again doing the XLogInserts?

Yes, exactly. This isn't really optimal...
I wonder if I could build one rdata chain containing all updates to the
tuples and submit it in one go. Would it work ?...

> Do you have an IO bottleneck even in the absence of fsyncs?

Sometimes, yes :

- When xlog is on a (rather slow) software RAID1, I've found that the
fsyncs which happen when switching to a new xlog segment are significant,
and the amount of log data written is dangerously close to the max
throughput, too.
- When xlog is on a much faster RAID5, there is no such problem. fsync on
raid5 is horrendously slow if you have many pending small writes, but if
all you need to sync is a 16MB file which nicely aligns on raid stripes,
it's not a problem.

So in order to benchmark the right thing, I have :
- all the tables in a big RAMDISK
- xlog on RAID5
- fsync=fdatasync

I've also found that setting wal_buffers to a large value like 128MB gives
a significant speed boost when doing COPY or INSERT INTO SELECT, probably
because it allows the backends to always find space in the buffers even if
the walwriter is a bit busy.

> My experience
> on multi-core machines with decent IO systems has been that the amount of
> WAL traffic (by volume) matters rather little, as opposed to the number
> WALInsertLocks taken, which matter quite a bit. Of course this depends
> quite a bit on your OS and hardware.

Exactly : with the configuration above, iowait is extremely small (<1%) ,
yet all processes still spend 50% of their time waiting on WALInsertLock.

The lock is held for about 15% of the total time ; with 4 cores and 4
threads, if a core spends X time holding the lock, it's logical that the
others spend 3*X time waiting on it.

But consider this :

- 4 processes spend 50% of their time waiting on the lock
- therefore at any time there are average 2 processes waiting
- therefore every time the lock is Released, a process is woken up

=> more than 200,000 context switches per second seen in vmstat

Actually it does 1 context switch every two inserted rows, which is
enormous...

I've done a bit of profiling and found that 70% of the time spent holding
this lock is... computing the header's CRC.
Just for a quick check, I removed all CRC calculations. There was
absolutely no performance gain.

>> Another angle of attack would be to make wal-writing more efficient...
> If you mean to do this without changing the xlog interfaces, I'm not
> optimistic.

Agree : that's why I didn't even try ;)

> If you have to call XLogInsert once per row that is copied (or
> insert...select), then my experiments show that simply taking the
> WALInsertLock and immediately releasing it, doing absolutely no real work
> while it is held, is already a substanial multi-core scalibility
> bottleneck.

Confirmed by context switch issue above...

Having all cores block on the same lock for every row can be OK if it's a
spinlock protecting just a few lines of code... which is not the present
case...

> Once we accept that this must be done, the next existing
> bottleneck is the memcpy of the first byte from the rdata chain into the
> shared wal_buffers, presumably because this copy involves fighting the
> cache
> line away from other cores. Once you've copied the first byte, the rest
> of
> them seem to be almost free. (Again, this is probably hardware and
> situation dependent).

Since I have a single 4-core CPU I probably don't see this effect (a
multi-CPU case would).

> I've seen some suggestions that the wal_buffer block initation work be
> moved
> from being done by AdvanceXLInsert to instead be done by XLogWrite.
> However, I've not seen any indication that AdvanceXLInsert is a
> meaningful
> bottlneck in the first place. Except when wal_buffers is too small: then
> AdvanceXLInsert is a bottleneck, but only because XLogWrite is getting
> called from within it, in which case moving work from one to the other is
> probably not going to make things better.

That correlates with the benefits I saw by increasing wal_buffers to a
very large value...

I'm going to try other things and report back.


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulk Inserts
Date: 2009-09-18 04:52:21
Message-ID: f67928030909172152m9e54850vba8e38a61e96f93f@mail.gmail.com
Lists: pgsql-hackers

2009/9/16 Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>

>
> OK, that makes sense. I thought you had hacked either XLogInsert or the
>> heap WAL replay code so that you could just accumulate tuples in the rdata
>> chain and then submit them all under the cover of a single WALInsertLock.
>>
>
> Ah, no, I did not do that.
>
> This would be difficult to do : rdata chain contains buffer pointers, and
> when we are in XLogInsert, we have an exclusive lock on the buffer. If
> XLogInsert accumulated xlog records as you say, then later, when it's time
> to write them to the xlog, it would no longer hold exclusive lock on the
> buffer, so its contents could have changed, and if XLogInsert decides to do
> a full page write, the contents will be wrong.
>

Yes, I didn't mean to make XLogInsert unilaterally accumulate rdata
(Actually I have done that, purely as a proof of concept. The resulting
database is completely unrecoverable, but as long as you don't bring it down,
it runs fine and lets me test the speed of different concepts without going
to the trouble of implementing them correctly).

heap_bulk_insert would do the accumulation. The hack to XLogInsert would
involve making it insert an extra dummy xlog record header for every tuple
in the rdata chain. Alternatively, the hack to heap replay would involve
making it accept multiple tuples reported under a single WAL record. I
don't know which one would be easier.

>
> Besides, the LSN needs to be set in the page at every call to heap_insert
> (or else WAL will not be "Ahead"), so if XLogInsert accumulated stuff before
> writing it, it would need a mechanism to assign a LSN without having written
> the xlog data yet.
>

Right, XLogInsert would only be able to accumulate as long as it knew it was
going to get called again before the buffer exclusive lock was released.
That is why the accumulation is better done in the heap_bulk_insert,
otherwise it would require an unfortunate amount of communication between
the two.

> If you haven't done that, then of course doing the bulk insert doesn't
>> help much if you still do tuple-by-tuple XLogInsert.
>>
>
> Exactly.
>
> So in the case that it is
>> under the limit, you first run through the tuples putting them into the
>> block, then run through the tuples again doing the XLogInserts?
>>
>
> Yes, exactly. This isn't really optimal...
> I wonder if I could build one rdata chain containing all updates to the
> tuples and submit it in one go. Would it work ?...

I'm sure it could be made to work. I haven't checked the replay code, but I
doubt it would work on this massed record right out of the box. Or we could
insert dummy headers between each tuple's WAL data.

...

> So in order to benchmark the right thing, I have :
> - all the tables in a big RAMDISK
> - xlog on RAID5
> - fsync=fdatasync
>
> I've also found that setting wal_buffers to a large value like 128MB gives
> a significant speed boost when doing COPY or INSERT INTO SELECT, probably
> because it allows the backends to always find space in the buffers even if
> the walwriter is a bit busy.

That seems very large, even for the high-throughput setup you describe. Is
the WAL background writer set at the default interval? (On the other hand,
if 128MB is just a rounding error in your machine's total RAM size, why not
be generous? On the other other hand, I've seen performance start to drop as
wal_buffers gets too large, though I never bothered to chase down the
cause.)

At one point, I think that XLogInsert would synchronously write out the
buffer when it noticed it was over half full, but that got ripped out (if
I'm going to block, I might as well wait until it actually is full to do so).
Now that there is a background writer, maybe it would be a good idea to have
XLogInserters wake it up if the buffer is half full.
...

> Another angle of attack would be to make wal-writing more efficient...
>>>
>> If you mean to do this without changing the xlog interfaces, I'm not
>> optimistic.
>>
>
> Agree : that's why I didn't even try ;)
>
> If you have to call XLogInsert once per row that is copied (or
>> insert...select), then my experiments show that simply taking the
>> WALInsertLock and immediately releasing it, doing absolutely no real work
>> while it is held, is already a substanial multi-core scalibility
>> bottleneck.
>>
>
> Confirmed by context switch issue above...
>
> Having all cores block on the same lock for every row can be OK if it's a
> spinlock protecting just a few lines of code... which is not the present
> case...

Maybe there could be some hybrid approach. You take the spinlock, check
that you could get the lwlock if you wanted to. If you could get the lwlock
and know you have almost no work to do, then just do it while holding the
spinlock. If you discover you have more work to do (xlog file switch,
nontrivial AdvanceXLInsert, etc.) then actually take the lwlock and drop the
spinlock. This *really* breaks the encapsulation of lwlock, so it is
probably not much more than idle speculation.

Jeff


From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulk Inserts and WAL Inserts
Date: 2009-09-25 13:58:36
Message-ID: op.u0tfrys1cke6l8@soyouz
Lists: pgsql-hackers

> heap_bulk_insert would do the accumulation. The hack to XLogInsert would
> involve making it insert an extra dummy xlog record header for every
> tuple

The WAL record header is something like 40 bytes, so if you make lots of
inserts in a page, you'd be better off WALing the whole page: it takes
less space and is much faster, and if you're coming right after a
checkpoint, you're going to do it anyway on the first inserted row, so it
would be even better to do it on the last inserted row...

WAL Insert record is 55 bytes + tuple data
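
For instance (very rough numbers, just to illustrate the point): with ~50-byte
rows, a heap page holds on the order of 150 tuples, so logging them one by one
costs roughly 150 * (55 + 50) bytes, about 16 kB of WAL for that one page,
versus a single full-page image of about 8 kB taken under a single
WALInsertLock acquisition.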

However, you can't hold an exclusive lock on a buffer while going out into
the executor to fetch the next tuple, since that can take an undetermined
amount of time : imagine the record comes from a dblink() and there is a
connection loss... so you'd need a TupleBuffer, something like a
tuplestore that doesn't spill to disk and holds only about 32 kB of
tuples, and :
- fill buffer with tuples coming from the executor (or COPY),
- pass this to heap_bulk_insert(),
- toast tuples (maybe also bulk-insert into the TOAST table)
- take the exclusive lock on the buffer,
- insert tuples quickly,
- log the whole page when it's full, or log individual inserts (in 1
operation)
- release the lock

> in the rdata chain. Alternatively, the hack to heap replay would involve
> making it accept multiple tuples reported under a single WAL record. I
> don't know which one would be easier.

This one is probably easier, since all your WAL records refer to the same
buffer.

> I'm sure it could be made to work. I haven't checked the replay code,
> but I
> doubt it would work on this massed record right out of the box. Or we
> could
> insert dummy headers between the each tuple's WAL data.

For small rows, the header is as large as the row... might as well get rid
of it !

Inserting dummy headers would not work, here's why :

- The critical part (lock-wise) of XLogInsert is generating a LSN
- The LSN calculation needs to verify that the header isn't split between
pages
- Therefore, if you want to insert N records in one operation, then this
critical part is going to take quite some time
- And the lock should be held for as little time as possible...

> That seems very large, even for the high throughput set up you
> describe. Is
> the WAL background writer set at the default interval? (On the other
> hand,
> if 128MB is just a rounding error in your machine's total RAM size, why
> not
> be generous? On the other other hand, I've seen perfomance start to
> drop as
> wal_buffers gets too large, though I never bothered to chased down the
> cause.)
>
> At one point, I think that XLogInsert would synchronously write out the
> buffer when it noticed it was over half full, but that got ripped out (if
> I'm going to block, I might as well wait until it actually is full to
> do).
> Now that there is a background writer, maybe it would be a good idea to
> have
> XLogInserters wake it up if the buffer is half full.
> ...

Here's what I think :

Currently, using big WAL buffers decreases performance (while you'd expect
them to increase it). Here's why.

Case 1) consider those concurrent transactions :
- Transaction 1 : do a big operation (vacuum, COPY, etc)
- Transaction 2 : BEGIN; do a few little UPDATEs... COMMIT

Suppose you have a large wal_buffer like 32MB. By the time transaction 2
wants to commit and sync, transaction 1 has inserted lots of WAL in the
buffer, so transaction 2 finds it needs to write and fsync() something like
2x 16 MB WAL segments. Needless to say, transaction 2's commit is going to be pretty
slow. With small buffers, all the stuff would have been in the OS cache if
not already on the disks.

Case 2) you just do a big COPY, or a big VACUUM that writes lots of xlog.
Suppose you're generating like 50 MB/s of WAL.
You've set your WAL writer delay to something very low like 20 ms, so the
buffers are nicely emptied...
However, with this throughput, you're going through 1 xlog segment every
300 ms, so when WALWrite() tries to fsync() the segment to advance to the
next, the data to write is in the OS cache, but the OS probably hasn't
decided to write it to disk yet, so your fsync call is very expensive, and
it is blocking. This means that if your buffers are smaller than a WAL
segment, the buffers fill up while you wait for the fsync, and everything
stands still.

Here's my current test setup :

- WAL on a dedicated disk (this is really good for performance unless you
have a battery-backed controller)
- fsync=o_dsync
- very aggressive WAL writer delay like 5 ms

Basically, all the WAL disk does is write WAL, so the head hardly moves at
all. The disk takes about 8 ms for a rotation (so wal_writer_delay is
shorter than one rotation). Whenever there is WAL to write, walwriter writes it,
and blocks because of O_DSYNC mode. So, basically, at each disk rotation,
the pending WAL is written. Then, fsync() is a noop. This gives excellent,
and very smooth performance, whereas fsync() gives quite random, and
sometimes pretty high, wait times, since the length of fsync() wait
depends on how much unflushed data sits in the OS buffers.

However there is a problem with O_DSYNC :

- Latency is excellent, but throughput is miserable because you can only
write 1 chunk per disk rotation.
- If all you need to write is a few pages, and you want to sync them, it's
perfect.
- If you want to write at a high throughput, though, you'd better write
data in LARGE chunks, ideally 1 WAL segment at a time.

So, O_DSYNC is much better with large WAL buffers (>16 MB) and small
walwriter delay, ideally delay < disk rotation.

> Maybe there could be some hybrid approach. You take the spinlock, check
> that you could get the lwlock if you wanted to. If you could get the
> lwlock
> and know you have almost no work to do, then just do it while holding the
> spinlock. If you discover you have more work todo (xlog file switch,
> nontrivial AdvanceXLInsert, etc.) then actually take the lwlock and drop
> the
> spinlock. This *really* breaks the encapsulation of lwlock, so it is
> probably not much more than idle speculation.

That's a description of something that could look like a futex ;)

Anyway, I've made a little patch (attached).
It applies to git revision 1384847ef8731718a79a32cd354e31c31c5294a0 of
current postgres (from last week); it probably applies to current HEAD,
but I have to learn how to do a merge in git before I can tell you that ;)

What it does :

I've put extremely detailed comments in xlog.c; this is a brief summary:

Since the critical part of XLogInsert is generating a LSN, not writing the
data, I have created a LSN generator, which is extremely fast, so it is
protected by a spinlock. Using anything other than a spinlock creates a
nice point of serialization and kills all performance gains.

Concurrent XLogInserts get a LSN from the LSN generator, and then they
insert their data in the buffers, concurrently, under a LW_SHARED
WALInsertLock.

The buffer queue logic was entirely redone too.

XLogWrite marks buffers as free as soon as possible, so if you use 32 MB
wal_buffers, when XLogWrite writes data, the buffer space is reused
immediately, while the previous data is being fsync()ed.

To avoid writing partial WAL records to disk, we must be sure that all
records in the buffer are completely written. This is done by taking an
exclusive lock on WALInsertLock, which waits for all memory writes to
finish, then taking a peek at the last written record, and releasing the
lock immediately.
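
To illustrate the scheme, here is a much simplified standalone model (not the
patch: a pthread spinlock stands in for the backend's spinlock, a plain byte
offset stands in for the LSN, the buffer never wraps, and the LW_SHARED /
exclusive WALInsertLock dance around the copy is left out):

/* Simplified model of the "LSN generator" idea: a tiny spinlock-protected
 * reservation step hands out byte ranges in the log, and the data copy into
 * the (pretend) buffer happens outside the lock, concurrently with other
 * inserters. Illustration of the locking scheme only, not the actual patch. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LOGBUF_SIZE (1 << 20)

static pthread_spinlock_t lsn_lock;
static uint64_t insert_lsn = 0;          /* next free byte in the "WAL" */
static char     logbuf[LOGBUF_SIZE];     /* pretend wal_buffers, never wraps */

/* Reserve space for a record: the only serialized step. */
static uint64_t reserve_lsn(size_t len)
{
    uint64_t lsn;

    pthread_spin_lock(&lsn_lock);
    lsn = insert_lsn;
    insert_lsn += len;
    pthread_spin_unlock(&lsn_lock);
    return lsn;
}

/* "XLogInsert": reserve, then copy concurrently with other inserters. */
static uint64_t log_record(const void *data, size_t len)
{
    uint64_t lsn = reserve_lsn(len);

    if (lsn + len <= LOGBUF_SIZE)        /* ignore wraparound in this model */
        memcpy(logbuf + lsn, data, len);
    return lsn;
}

static void *worker(void *arg)
{
    char rec[100];

    memset(rec, (int)(intptr_t)arg, sizeof(rec));
    for (int i = 0; i < 1000; i++)
        log_record(rec, sizeof(rec));
    return NULL;
}

int main(void)
{
    pthread_t th[4];

    pthread_spin_init(&lsn_lock, PTHREAD_PROCESS_PRIVATE);
    for (int i = 0; i < 4; i++)
        pthread_create(&th[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < 4; i++)
        pthread_join(th[i], NULL);
    printf("final insert position: %llu bytes\n",
           (unsigned long long) insert_lsn);
    return 0;
}

The point is that the serialized section is just the reservation of a byte
range; in the patch the copy additionally happens under a LW_SHARED
WALInsertLock, so that whoever needs to flush can take it exclusively and wait
for in-flight copies to finish, which is the part omitted above.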

I've also added an adaptive walwriter delay (depending on load).

I've added a lot of comments in xlog.c, check them out.

Here is the configuration I use for testing :

fsync = on
synchronous_commit = on or off depending on test
wal_sync_method = open_sync
full_page_writes = on
wal_buffers = 32MB
wal_writer_delay = 100ms (adaptive walwriter delay lowers it if needed)

Benchmarks are parallel COPY, and parallel INSERT INTO SELECT. However, be
careful with parallel INSERT INTO SELECT if all your threads insert into
the same table : you'll be benchmarking the FSM more than the WAL itself...

On my setup, 8.5 is bound by the WALInsertLock, but with this patch, my
disks are often maxed out, so I can't really tell. I had to put the tables
on a ramdisk.
I get an almost x2 speedup on parallel COPY of a table with 9 INT fields,
at which point the CPU is maxed out parsing integers...
If you have more than 4 cores and faster disks, I'd be very interested in
your results !

It will print some stuff to the console; if you see lots of ">>W>W>>>W",
it means your WAL-writing throughput is maxed out.
"W" means it had to flush buffers in XLogInsert (bad) instead of somewhere
non-critical like the wal writer.
"<" and ">" mark the start and end of an exclusive lock on WALInsert, taken
because it had to flush.

Attached is a little Python benchmark script. Change the definitions for
where to put the files if you want to use it.

Regards,

Pierre

Attachment Content-Type Size
parallel_test_v5.py text/x-python 8.0 KB
xlog_perf_hack_v1.patch.gz application/x-gzip 29.7 KB