Re: PATCH: optimized DROP of multiple tables within a transaction

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: optimized DROP of multiple tables within a transaction
Date: 2012-12-20 01:54:48
Message-ID: 50D26FE8.1040800@fuzzy.cz
Lists: pgsql-hackers

On 19.12.2012 02:18, Andres Freund wrote:
> On 2012-12-17 00:31:00 +0100, Tomas Vondra wrote:
>
> I think except for the temp buffer issue mentioned below it's ready.
>
>> -DropRelFileNodeAllBuffers(RelFileNodeBackend rnode)
>> +DropRelFileNodeAllBuffers(RelFileNodeBackend * rnodes, int nnodes)
>> {
>> - int i;
>> + int i, j;
>> +
>> + /* sort the list of rnodes */
>> + pg_qsort(rnodes, nnodes, sizeof(RelFileNodeBackend), rnode_comparator);
>>
>> /* If it's a local relation, it's localbuf.c's problem. */
>> - if (RelFileNodeBackendIsTemp(rnode))
>> + for (i = 0; i < nnodes; i++)
>> {
>> - if (rnode.backend == MyBackendId)
>> - DropRelFileNodeAllLocalBuffers(rnode.node);
>> - return;
>> + if (RelFileNodeBackendIsTemp(rnodes[i]))
>> + {
>> + if (rnodes[i].backend == MyBackendId)
>> + DropRelFileNodeAllLocalBuffers(rnodes[i].node);
>> + }
>> }
>
> While you deal with local buffers here you don't anymore in the big loop
> over shared buffers. That wasn't needed earlier since we just returned
> after noticing we have local relation, but thats not the case anymore.

Hmm, but that would require us to handle the temp relations explicitly
wherever we call DropRelFileNodeAllBuffers. Currently there are two such
places - smgrdounlink() and smgrdounlinkall().

By placing it in DropRelFileNodeAllBuffers() the code is shared, and I
think that's a good thing.

But that does not mean the code is perfect - it was based on the
assumption that if there's a mix of temp and regular relations, the temp
relations will be handled in the first part and the rest in the second one.

Maybe it'd be better to improve it so that the temp relations are
removed from the array after the first part (and skip the second one if
there are no remaining relations).
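
Something like this, perhaps (just a quick sketch of the idea, untested;
all the names come from the patch quoted above):

	j = 0;
	for (i = 0; i < nnodes; i++)
	{
		/* hand temp relations of this backend over to localbuf.c ... */
		if (RelFileNodeBackendIsTemp(rnodes[i]))
		{
			if (rnodes[i].backend == MyBackendId)
				DropRelFileNodeAllLocalBuffers(rnodes[i].node);
		}
		else
			/* ... and keep only the non-temp relations in the array */
			rnodes[j++] = rnodes[i];
	}

	nnodes = j;

	/* nothing but temp relations - skip the walk through shared buffers */
	if (nnodes == 0)
		return;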

>
>> for (i = 0; i < NBuffers; i++)
>> {
>> + RelFileNodeBackend *rnode = NULL;
>> volatile BufferDesc *bufHdr = &BufferDescriptors[i];
>> -
>> +
>> /*
>> * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
>> * and saves some cycles.
>> */
>> - if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
>> +
>> + /*
>> + * For low number of relations to drop just use a simple walk through,
>> + * to save the bsearch overhead. The BSEARCH_LIMIT is rather a guess
>> + * than an exactly determined value, as it depends on many factors (CPU
>> + * and RAM speeds, amount of shared buffers etc.).
>> + */
>> + if (nnodes <= BSEARCH_LIMIT)
>
> I think thats a sensible plan. It makes sense that for a small number of
> relations a sequential scan of the rnodes array is faster than a bsearch
> and 10 sounds like a good value although I would guess the optimal value
> is slightly higher on most machines. But if it works fine without
> regressions that's pretty good...

I think it's pointless to look for the optimal value in this case, given
how many factors it depends on. We could use 20 instead of 10, but I
probably wouldn't go higher.
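
Just to make the discussion concrete, the limit is used inside the loop
over shared buffers roughly like this (a sketch; the patch may differ in
details):

	RelFileNodeBackend *rnode = NULL;

	if (nnodes <= BSEARCH_LIMIT)
	{
		/* few relations to drop - a plain walk through the array */
		for (j = 0; j < nnodes; j++)
		{
			if (RelFileNodeEquals(bufHdr->tag.rnode, rnodes[j].node))
			{
				rnode = &rnodes[j];
				break;
			}
		}
	}
	else
	{
		/* many relations - bsearch the array sorted by pg_qsort above */
		RelFileNodeBackend key;

		key.node = bufHdr->tag.rnode;	/* comparator ignores key.backend */
		rnode = bsearch(&key, rnodes, nnodes,
						sizeof(RelFileNodeBackend), rnode_comparator);
	}

	/* this buffer belongs to none of the dropped relations - skip it */
	if (rnode == NULL)
		continue;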

>> +
>> +/*
>> + * Used to sort relfilenode array (ordered by [relnode, dbnode, spcnode]), so
>> + * that it's suitable for bsearch.
>> + */
>> +static int
>> +rnode_comparator(const void * p1, const void * p2)
>> +{
>> + RelFileNodeBackend n1 = * (RelFileNodeBackend *) p1;
>> + RelFileNodeBackend n2 = * (RelFileNodeBackend *) p2;
>> +
>> + if (n1.node.relNode < n2.node.relNode)
>> + return -1;
>> + else if (n1.node.relNode > n2.node.relNode)
>> + return 1;
>> +
>> + if (n1.node.dbNode < n2.node.dbNode)
>> + return -1;
>> + else if (n1.node.dbNode > n2.node.dbNode)
>> + return 1;
>> +
>> + if (n1.node.spcNode < n2.node.spcNode)
>> + return -1;
>> + else if (n1.node.spcNode > n2.node.spcNode)
>> + return 1;
>> + else
>> + return 0;
>> +}
>
> Still surprised this is supposed to be faster than a memcmp, but as you
> seem to have measured it earlier..

It surprised me too. These are the numbers with the current patch:

1) one by one
=============
indexes     0    2    4    6    8   10   12   14   16   18   20
---------------------------------------------------------------
current    15   22   28   34   41   75   77   82   92   99  106
memcmp     16   23   29   36   44  122  125  128  153  154  158

Until the number of indexes reaches ~10, the numbers are almost exactly
the same. Then the bsearch branch kicks in and it's clear how much
slower the memcmp comparator is.

2) batches of 100
=================
indexes     0    2    4    6    8   10   12   14   16   18   20
---------------------------------------------------------------
current     3    5    8   10   12   15   17   21   23   27   31
memcmp      4    7   10   13   16   19   22   28   30   32   36

Here the difference is much smaller, but even here the memcmp is
consistently a bit slower.

My theory is that the current comparator starts with the most variable
part (the relation OID), and thus usually stops after comparing the first
4 bytes, while memcmp starts from the other end (tablespace and database)
and therefore needs to compare more data.
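
For reference, a memcmp-based comparator is essentially a one-liner along
these lines (just a sketch to illustrate the point; the function name is
made up):

	static int
	rnode_comparator_memcmp(const void *p1, const void *p2)
	{
		/*
		 * RelFileNode stores spcNode and dbNode before relNode, so memcmp
		 * compares the least variable fields first and typically has to
		 * look at more bytes before it can decide.
		 */
		return memcmp(&((const RelFileNodeBackend *) p1)->node,
					  &((const RelFileNodeBackend *) p2)->node,
					  sizeof(RelFileNode));
	}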

>> +void smgrdounlinkall(SMgrRelation * rels, int nrels, bool isRedo)
>> +{
>> + int i = 0;
>> + RelFileNodeBackend *rnodes;
>> + ForkNumber forknum;
>> +
>> + /* initialize an array which contains all relations to be dropped */
>> + rnodes = palloc(sizeof(RelFileNodeBackend) * nrels);
>> + for (i = 0; i < nrels; i++)
>> + {
>> + RelFileNodeBackend rnode = rels[i]->smgr_rnode;
>> + int which = rels[i]->smgr_which;
>> +
>> + rnodes[i] = rnode;
>> +
>> + /* Close the forks at smgr level */
>> + for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
>> + (*(smgrsw[which].smgr_close)) (rels[i], forknum);
>> + }
>> +
>> + /*
>> + * Get rid of any remaining buffers for the relation. bufmgr will just
>> + * drop them without bothering to write the contents.
>> + */
>> + DropRelFileNodeAllBuffers(rnodes, nrels);
>
> I think it might be a good idea to handle temp relations on/buffers on
> this level instead of trying to put it into
> DropRelFileNodeAllBuffers. Would also allow you to only build an array
> of RelFileNode's instead of RelFileNodeBackend's which might make it
> slightly faster.

Hmmm, sadly that'd require duplication of code because it needs to be
done in smgrdounlink too.

>
>> + /*
>> + * It'd be nice to tell the stats collector to forget it immediately, too.
>> + * But we can't because we don't know the OID (and in cases involving
>> + * relfilenode swaps, it's not always clear which table OID to forget,
>> + * anyway).
>> + */
>
> This and at least one other comment seems to be in too many locations
> now.

I see it in three places - smgrdounlink, smgrdounlinkall and
smgrdounlinkfork. Which ones do you consider superfluous? I think it's
appropriate to keep them in all three places.

Tomas
