Re: Using a COPY...FROM through JDBC?

From: Markus Schaber <schabios(at)logi-track(dot)com>
To: swampler(at)noao(dot)edu
Cc: Postgres-JDBC <pgsql-jdbc(at)postgresql(dot)org>, Postgres-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Using a COPY...FROM through JDBC?
Date: 2004-06-06 22:08:00
Message-ID: 20040607000800.636d7353@kingfisher.intern.logi-track.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-jdbc pgsql-performance

Hi, Steve,

On Sat, 05 Jun 2004 13:12:29 -0700
Steve Wampler <swampler(at)noao(dot)edu> wrote:

> I've got a simple database (no indices, 6 columns) that I need
> to write data quickly into through JDBC connections from
> multiple such connections simultaneously in a distributed
> environment. (This is going to be a message logging service
> for software generated messages.)
> Using a PreparedStatement, I can get about 400/s inserted. If I
> (on the java side) buffer up the entries and dump them in large
> transaction blocks I can push this up to about 1200/s. I'd
> like to go faster. One approach that I think might be
> promising would be to try using a COPY command instead of
> an INSERT, but I don't have a file for input, I have a
> Java collection, so COPY isn't quite right. Is there anyway to
> efficiently use COPY without having to create a file (remember
> that the java apps are distributed on a LAN and aren't running
> on the DB server.) Is this a dead end because of the way
> COPY is implemented to only use a file?

We also found that using the psql frontend, using COPY seems to give a
factor 10 or more speedup. Sadly, as far as I learned, the current JDBC
driver does not support COPY ... FROM STDIN.

As a very bad workaround, it might be acceptable to use Runtime.exec()
to start the psql command line tool, and issue the statement there, or
even add a C-lib via JNI. Of course, the best "workaround" would be to
implement COPY support for the driver, and send the Patches to the
PGJDBC team for inclusion :-)

We also had to do some trickery to get instertion of lots of rows fast.
We dit lots of benchmarks, and currently use the following method:

Our input data is divided into chunks (the optimal size depends on the
machine, and seems to be between 250 and 3000). As the current pgjdbc
preparedStatements implementation just does a text replacement, but we
wantedto get the last bit of speed out of the machine, we issue a
"PREPARE" statement for the insertion on connection setup, and then
addBatch() a "EXECUTE blubb (data, row, values)" statement.

Then we have several concurrent threads, all running essentially a {get
batch, write batch, commit} loop on their own connection. Increasing
the thread number to more than three did not show further substantial
performance improvements. This lead us to the conclusion that
concurrency can compensate for the time the postmaster is forced to wait
while it syncs the WAL to disk, but there's still a concurrency limit
inside of postgres for inserts (I presume they have to lock at some
times, the multiversioning seems not to cover inserts very well).

Also, we surprisingly found that setting the transaction isolation to
"serializable" can speed things remarkably in some cases...

> Is there something else I can do? Ultimately, this will end
> up on a machine running 1+0 RAID, so I expect that will give
> me some performance boost as well, but I'd like to push it
> up as best I can with my current hardware setup.

As any sane setup runs with syncing enabled in the backend, and each
sync (and so each commit) at least has to write at least one block, you
can calculate the theoretical maximum number of commits your machine can
achieve.

If you have 15k rpm disks (AFAIK, the fastest one currently available),
they spin at 250 rotations per second, so you cannot have more than 250
commits per second.

Regarding the fact that your machine has to do some works between the
sync() calls (e. G. processing the whole next batch), it is very likely
that it misses the next turn, so that you're likely to get a factor 2 or
3 number in reality.

One way to overcome this limit is using multiple writer threads, and
(having a highly capable I/O sybsystem) enabling commit delay in your
backend so that you can have more than one commit during the same write
operation.

It might also help to put the WAL log to a different disk (just link or
mount or mount --bind the appropriate subdirectory in your database), or
even put the indices on a third disk (needs ugly trickery) - it's a
shame that postmaster does not really support this techniques which are
considered standard in any professional database.

If you really need much more speed, that you could try to put the WAL
on a Solid State Disk (essentially a battery-backed RAM) so you can
overcome this physical limit, or (if you really trust your hardware and
your power supply) put the WAL into a RAMDISK or switch of syncing in
your postmaster configuration.

One thing you should check is whether I/O or CPU is the limiting factor.
If you have a cpu utilization higher than 90%, than all the tricks I
told you won't help much. (But using COPY still could help a lot as it
cut's down the CPU usage very much.)

We tested with two machines, a single-processor developer machine, and a
2-way 64-Bit Itanium SMP machine. On the desktop machine, a single
thread already utilized 80% CPU, and so only small improvement was
possible using 2 or more threads.

On the SMP machine, we had substantial improvements using 2 or 3
threads, but then going up to 8 threads gave no more remarkable speedup
constantly utilizing about 120% CPU (Remember we have a 2-way machine).
I think that there are some internal postgres locks that prohibit
further concurrency for inserts in the same table.

> Thanks for any advice!

Hope, that helps,
Markus Schaber

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios(at)logi-track(dot)com | www.logi-track.com

In response to

Responses

Browse pgsql-jdbc by date

  From Date Subject
Next Message Kris Jurka 2004-06-07 07:47:57 Re: Column quoting in result set getters
Previous Message Markus Schaber 2004-06-06 21:12:09 Re: How to insert binary data

Browse pgsql-performance by date

  From Date Subject
Next Message Gregory S. Williamson 2004-06-07 05:54:00 Re: General performance questions about postgres on Apple
Previous Message Laurent Martelli 2004-06-06 21:12:07 Re: Query involving views