Re: parallel restore fixes

Lists: pgsql-hackers
From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: parallel restore fixes
Date: 2009-03-09 23:09:05
Message-ID: 49B5A191.1000209@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


The attached patch fixes two issues with parallel restore:

* the static buffer problem in dumputils.c::fmtId() on Windows
(solution: use thread-local storage)
* ReopenPtr() is called too often

There is one remaining bug I know of that I can reproduce: we can get
into deadlock when two tables are foreign keyed to each other. So I need
to get a bit more paranoid about dependencies.

I can't reproduce Olivier Prennant's file closing problem on Unixware.
Is it still happening after application of this patch?

cheers

andrew

Attachment Content-Type Size
parallel_fix.patch text/x-patch 5.8 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel restore fixes
Date: 2009-03-09 23:22:54
Message-ID: 20090309232254.GI12932@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan wrote:
>
> The attached patch fixes two issues with parallel restore:
>
> * the static buffer problem in dumputils.c::fmtId() on Windows
> (solution: use thread-local storage)
> * ReopenPtr() is called too often

Hmm, if pg_restore is the only program that's threaded, why are you
calling init_dump_utils on pg_dump and pg_dumpall? It makes me a bit
nervous because there are some other programs that are linking
dumputils.c (psql and some in src/bin/scripts/) and even calling fmtId.

Also I think the fmtId comment needs to be updated.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel restore fixes
Date: 2009-03-09 23:42:59
Message-ID: 17972.1236642179@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> + void
> + init_dump_utils()

This should be
> + void
> + init_dump_utils(void)

please. We don't do K&R C around here. I'd lose the added retval
variable too; that's not contributing anything.

> ! #endif;

Semicolon is bogus here.

Looks pretty sane otherwise.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel restore fixes
Date: 2009-03-09 23:50:34
Message-ID: 18107.1236642634@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Hmm, if pg_restore is the only program that's threaded, why are you
> calling init_dump_utils on pg_dump and pg_dumpall?

... because fmtId will crash on *any* use without that.

> It makes me a bit
> nervous because there are some other programs that are linking
> dumputils.c (psql and some in src/bin/scripts/) and even calling fmtId.

Actually, why bother with init_dump_utils at all? fmtId could be made
to initialize the ID variable for itself on first call, with only one
extra if-test, which is hardly gonna matter.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel restore fixes
Date: 2009-03-10 00:37:06
Message-ID: 49B5B632.4040302@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
>> It makes me a bit
>> nervous because there are some other programs that are linking
>> dumputils.c (psql and some in src/bin/scripts/) and even calling fmtId.
>>
>
> Actually, why bother with init_dump_utils at all? fmtId could be made
> to initialize the ID variable for itself on first call, with only one
> extra if-test, which is hardly gonna matter.
>
>

Well, the Windows reference I have suggests TlsAlloc() needs to be
called early in the initialisation process ... I guess I could force it
with a dummy call to fmtId() in restore_toc_entries_parallel() before it
starts spawning children, so we'd be sure there wasn't a race condition,
and nothing else is going to have threads so it won't matter. We'd need
a long comment to that effect, though ;-)

> I'd lose the added retval
> variable too; that's not contributing anything.

It is, in fact. Until I put that in I was getting constant crashes. I
suspect it's something to do with stuff Windows does under the hood on
function return.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel restore fixes
Date: 2009-03-10 01:16:20
Message-ID: 19101.1236647780@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> Actually, why bother with init_dump_utils at all?

> Well, the Windows reference I have suggests TlsAlloc() needs to be
> called early in the initialisation process ...

How early is early? The proposed call sites for init_dump_utils seem
already long past the point where any libc-level infrastructure would
think it is "initialization" time.

>> I'd lose the added retval
>> variable too; that's not contributing anything.

> It is, in fact. Until I put that in I was getting constant crashes. I
> suspect it's something to do with stuff Windows does under the hood on
> function return.

Pardon me while I retrieve my eyebrows from the ceiling. I think you've
got something going on there you don't understand, and you need to
understand it not just put in a cargo-cult fix. (Especially one that's
not documented and hence likely to be removed by the next person who
touches the code.)

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel restore fixes
Date: 2009-03-10 01:44:50
Message-ID: 49B5C612.90505@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Tom Lane wrote:
>>
>>> Actually, why bother with init_dump_utils at all?
>>>
>
>
>> Well, the Windows reference I have suggests TlsAlloc() needs to be
>> called early in the initialisation process ...
>>
>
> How early is early? The proposed call sites for init_dump_utils seem
> already long past the point where any libc-level infrastructure would
> think it is "initialization" time.
>

Well, I think at least it needs to be done where other threads won't be
calling it at the same time.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel restore fixes
Date: 2009-03-10 02:09:07
Message-ID: 19822.1236650947@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> How early is early? The proposed call sites for init_dump_utils seem
>> already long past the point where any libc-level infrastructure would
>> think it is "initialization" time.

> Well, I think at least it needs to be done where other threads won't be
> calling it at the same time.

Oh, I see, ye olde race condition. Still, it seems like a bad idea
to expect that we will catch every program that might call fmtId;
as Alvaro notes, that's all over our client-side code.

How about this: by default, fmtId uses the same logic as now (one static
PQExpBuffer). If told to by a call of init_parallel_dump_utils(), which
need only be called by pg_restore during its startup, then it switches to
using per-thread storage. init_parallel_dump_utils can be the place
that calls TlsAlloc. This is almost the same as what you suggested a
couple messages back, but perhaps a bit clearer as to what's going on;
and it definitely cuts the number of places we need to touch.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel restore fixes
Date: 2009-03-10 20:39:59
Message-ID: 49B6D01F.6050204@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
>
> How about this: by default, fmtId uses the same logic as now (one static
> PQExpBuffer). If told to by a call of init_parallel_dump_utils(), which
> need only be called by pg_restore during its startup, then it switches to
> using per-thread storage. init_parallel_dump_utils can be the place
> that calls TlsAlloc. This is almost the same as what you suggested a
> couple messages back, but perhaps a bit clearer as to what's going on;
> and it definitely cuts the number of places we need to touch.
>
>
>

OK, here 'tis.

Moving on to the deadlock with crossed FKs issue.

cheers

andrew

Attachment Content-Type Size
parallel_fix.patch text/x-patch 6.1 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel restore fixes
Date: 2009-03-10 21:39:38
Message-ID: 15395.1236721178@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> OK, here 'tis.

Looks fairly reasonable to me, but of course I haven't tested it.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel restore fixes
Date: 2009-03-11 15:19:34
Message-ID: 49B7D686.60906@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> OK, here 'tis.
>
> Looks fairly reasonable to me, but of course I haven't tested it.

Well, I have to do a blitz of parallel restores next week, so hopefully
that will hit any soft spots.

--Josh


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel restore fixes
Date: 2009-03-11 16:47:14
Message-ID: 49B7EB12.6030404@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
> Tom Lane wrote:
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>> OK, here 'tis.
>>
>> Looks fairly reasonable to me, but of course I haven't tested it.
>
> Well, I have to do a blitz of parallel restores next week, so
> hopefully that will hit any soft spots.

I have a known outstanding bug to do with deadlock from FKs that cross
(i.e. A has an FK that references B, and B has an FK that references A).
The solution to this could be mildly complex, but I have an outline of
it in my head. Workaround: recreate the failed FK at the end of the restore.

The only other reported problem is the one on Unixware to do with
closing the archive. I haven't been able to reproduce it on Linux or
Windows, the two platforms I test on, although it might be fixed by the
patch I just applied.

cheers

andrew