Re: Fast insertion indexes: why no developments

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-11 23:35:02
Message-ID: CAMkU=1xtdHLwLQHJ2NOQEaNGPFjw2FtNiAX9MZ2n1i+YtphCrA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 5, 2013 at 9:52 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:

> Jeff Janes wrote
> > Some experiments I did a few years ago showed that applying sorts to the
> > data to be inserted could be helpful even when the sort batch size was as
> > small as one tuple per 5 pages of existing index. Maybe even less.
>
> Cool!!! Do you have any idea/hint on how I could try and replicate that?
> Do you remember how you did it?
>

I can't find my notes but I remember more or less how I did it.

Since we don't yet have an insertion buffer that allows the rows to be
sorted in a different order for each index, I had to simulate it just by
using a table with a single index and hoping that the results would
extrapolate.

create table foo (x bigint);

To speed things up, you may want to prepopulate this with random data so
that the size of the index-to-be will exceed shared_buffers, or physical
RAM, before making the index. Also, the effectiveness might depend on how
much the index has grown since its creation, since leaf pages are
initially correlated between physical order and logical order, but that
correlation decreases over time. So you may want to try different initial
seed sizes.
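
For instance, a minimal seeding step might look like this (a sketch only;
the row count and key range are illustrative, pick them so that the
resulting index outgrows shared_buffers or RAM):

-- hypothetical seed: load ~50M random keys before building the index
insert into foo select (random()*1e8)::bigint from generate_series(1, 50000000);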

create index on foo (x);

Then I use perl to make run-sorted data with different run sizes, and load
that via \copy. I put all the data points in memory up front rather than
generating them per run on the fly, so that perl consumes about the same
amount of memory regardless of the run size. You would want to use more
than 1..1e6 values if you are on a very large RAM machine.

Something like:

for run_size in 1 10 100 1000 10000 100000; do
  # generate 1e6 random keys, sorted in runs of $run_size, and COPY them in
  perl -le 'my @x; push @x, int(rand()*1e8) foreach 1..1e6;
            while (@x) {print foreach sort {$a<=>$b} splice @x,0,'$run_size'; }' \
    | time psql -c '\copy foo from stdin'
done

But you probably want another inner loop, so that the \copy gets executed
multiple times per run_size and each run_size runs for at least a couple
of checkpoint cycles.
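
For example, something like this (a sketch only; the 10 repetitions per
run size are an arbitrary choice, tune them so each run size spans a few
checkpoint_timeout intervals):

for run_size in 1 10 100 1000 10000 100000; do
  # repeat each run size so it covers a couple of checkpoint cycles
  for rep in $(seq 1 10); do
    perl -le 'my @x; push @x, int(rand()*1e8) foreach 1..1e6;
              while (@x) {print foreach sort {$a<=>$b} splice @x,0,'$run_size'; }' \
      | time psql -c '\copy foo from stdin'
  done
done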

Cheers,

Jeff
