WIP parallel restore patch

Lists: pgsql-hackers
From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: WIP parallel restore patch
Date: 2008-10-29 14:52:28
Message-ID: 490878AC.1@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Attached is my latest parallel restore patch. I think it's functionally
complete for Unix.

Many bugs have been fixed since the last patch, and the hardcoded
limitation to two table dependencies is removed. It seems fairly robust
in my recent testing.

Remaining to be done:

. code cleanup
. better error checking in a few places
. final decision re command line option names/defaults
. documentation
. Windows support.

cheers

andrew

Attachment Content-Type Size
parallel_restore_10.patch.gz application/x-gzip 15.2 KB

From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP parallel restore patch
Date: 2008-11-05 12:40:45
Message-ID: 4911944D.2000003@kaltenbrunner.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan wrote:
>
> Attached is my latest parallel restore patch. I think it's functionally
> complete for Unix.
>
> Many bugs have been fixed since the last patch, and the hardcoded
> limitation to two table dependencies is removed. It seems fairly robust
> in my recent testing.

this version seems to be working much better on my test setup.
This patch results in a fairly nice improvment from ~170min (-m 1
--truncate-before-load) to ~39min (-m 16 --truncate-before-load) on my 8
core test box using a 70GB (uncompressed) dump consisting of 709 tables.

Stefan


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP parallel restore patch
Date: 2008-11-05 15:19:02
Message-ID: 4911B966.3090509@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stefan Kaltenbrunner wrote:
> Andrew Dunstan wrote:
>>
>> Attached is my latest parallel restore patch. I think it's
>> functionally complete for Unix.
>>
>> Many bugs have been fixed since the last patch, and the hardcoded
>> limitation to two table dependencies is removed. It seems fairly
>> robust in my recent testing.
>
> this version seems to be working much better on my test setup.
> This patch results in a fairly nice improvment from ~170min (-m 1
> --truncate-before-load) to ~39min (-m 16 --truncate-before-load) on my
> 8 core test box using a 70GB (uncompressed) dump consisting of 709
> tables.
>
>
>

Great. Thanks for the report.

cheers

andrew


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP parallel restore patch
Date: 2008-11-20 14:24:12
Message-ID: 4925730C.8050302@hagander.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan wrote:
>
> Attached is my latest parallel restore patch. I think it's functionally
> complete for Unix.
>
> Many bugs have been fixed since the last patch, and the hardcoded
> limitation to two table dependencies is removed. It seems fairly robust
> in my recent testing.
>
> Remaining to be done:
>
> . code cleanup
> . better error checking in a few places
> . final decision re command line option names/defaults
> . documentation
> . Windows support.

I've looked around this a bit, and it's fairly clear where the issue
comes in with Windows - we get heap corruption. Most likely because we
have multiple threads working on the same data somewhere.

I notice for example that we're doing a shallow copy of the
ArchiveHandle with a simple memcpy() for each thread. But that struct
contains a number of things like file descriptors and pointers. Have you
verified for each and every one of those that it actually doesn't get
modified anywhere? If not, a deep copy may be needed to make sure of that.

Other than that, are there any global variables that may be addressed
from more than one worker? If so they need to be marked as TLS.

And yes, I got hit by the lack of error checking a couple of times
during my testing - it would probably be a good idea to add that as soon
as possible, it helps a lot with the debugging.

If I run it with just a single thread, it also crashes in PQfinish()
called from die_horribly(), when trying to free conn->pgpass, which has
a value (non-NULL) but is not a valid pointer. This crash happens in the
worker thread, after it has logged that "fseek is required" - that's an
indicator something being passed down to the thread is either wrong or
being scribbled upon after the fact.

I didn't dig into these questions specifically - since you have already
been reading up on this code to do the patch you can probably reach the
answer to them much quicker :-) So I'll stick to the questions.

//Magnus


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP parallel restore patch
Date: 2008-11-20 16:55:51
Message-ID: 49259697.20905@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Magnus Hagander wrote:
> Andrew Dunstan wrote:
>
>> Attached is my latest parallel restore patch. I think it's functionally
>> complete for Unix.
>>
>> Many bugs have been fixed since the last patch, and the hardcoded
>> limitation to two table dependencies is removed. It seems fairly robust
>> in my recent testing.
>>
>> Remaining to be done:
>>
>> . code cleanup
>> . better error checking in a few places
>> . final decision re command line option names/defaults
>> . documentation
>> . Windows support.
>>
>
> I've looked around this a bit, and it's fairly clear where the issue
> comes in with Windows - we get heap corruption. Most likely because we
> have multiple threads working on the same data somewhere.
>
> I notice for example that we're doing a shallow copy of the
> ArchiveHandle with a simple memcpy() for each thread. But that struct
> contains a number of things like file descriptors and pointers. Have you
> verified for each and every one of those that it actually doesn't get
> modified anywhere? If not, a deep copy may be needed to make sure of that.
>
> Other than that, are there any global variables that may be addressed
> from more than one worker? If so they need to be marked as TLS.
>
> And yes, I got hit by the lack of error checking a couple of times
> during my testing - it would probably be a good idea to add that as soon
> as possible, it helps a lot with the debugging.
>
> If I run it with just a single thread, it also crashes in PQfinish()
> called from die_horribly(), when trying to free conn->pgpass, which has
> a value (non-NULL) but is not a valid pointer. This crash happens in the
> worker thread, after it has logged that "fseek is required" - that's an
> indicator something being passed down to the thread is either wrong or
> being scribbled upon after the fact.
>
> I didn't dig into these questions specifically - since you have already
> been reading up on this code to do the patch you can probably reach the
> answer to them much quicker :-) So I'll stick to the questions.
>
>
>

OK, Thanks, this will help. I thought I had caught the ArchiveHandle
things, but there might be one or two I missed.

I'll try to have a new version in a few days.

cheers

andrew