COPY and heap_sync

Lists: pgsql-hackers
From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: COPY and heap_sync
Date: 2014-08-30 06:26:01
Message-ID: CAMkU=1z0YBPi6Svpji8WBd6vABVQ_+Opm2HyDnJQ07Y32av0-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

If you insert tuples with COPY into a table created or truncated in the
same transaction, at the end of the COPY it calls heap_sync.

But there cases were people use COPY in a loop with a small amount of data
in each statement. Now it is calling heap_sync many times, and if NBuffers
is large doing that gets very slow.

Could the heap_sync be safely delayed until the end of the transaction,
rather than the end of the COPY?

Cheers,

Jeff


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY and heap_sync
Date: 2014-08-30 07:36:03
Message-ID: CAA4eK1Ke7+872kFsBSHYFYy8E=Lv1d19wAt0gwQjTvabFZQ=TQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Aug 30, 2014 at 11:56 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
> If you insert tuples with COPY into a table created or truncated in the
same transaction, at the end of the COPY it calls heap_sync.
>
> But there cases were people use COPY in a loop with a small amount of
data in each statement. Now it is calling heap_sync many times, and if
NBuffers is large doing that gets very slow.
>
> Could the heap_sync be safely delayed until the end of the transaction,
rather than the end of the COPY?

Wouldn't unconditionally delaying sync until end of transaction
can lead to burst of I/O at that time especially if there are many
such copy commands in a transaction, leading to delay in some
other operation's that might be happening concurrently in the
system.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY and heap_sync
Date: 2014-08-30 08:05:02
Message-ID: CAOeZVieQ1kQuL+VX5dxLxz9Z9ACy1FDLeoeHFzGWh0U42FsVnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Saturday, August 30, 2014, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Sat, Aug 30, 2014 at 11:56 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com
> <javascript:_e(%7B%7D,'cvml','jeff(dot)janes(at)gmail(dot)com');>> wrote:
> >
> > If you insert tuples with COPY into a table created or truncated in the
> same transaction, at the end of the COPY it calls heap_sync.
> >
> > But there cases were people use COPY in a loop with a small amount of
> data in each statement. Now it is calling heap_sync many times, and if
> NBuffers is large doing that gets very slow.
> >
> > Could the heap_sync be safely delayed until the end of the transaction,
> rather than the end of the COPY?
>
> Wouldn't unconditionally delaying sync until end of transaction
> can lead to burst of I/O at that time especially if there are many
> such copy commands in a transaction, leading to delay in some
> other operation's that might be happening concurrently in the
> system.
>
>
>
>
I agree with that but then, it can provide us the same benefits like group
commit,especially when most of the copy commands touch pages which are
nearby,hence reducing the seek time overhead.

We could look at making it optional through a GUC, since it is useful
albeit for some specific usecases.

Regards,

Atri

--
Regards,

Atri
*l'apprenant*


From: Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY and heap_sync
Date: 2014-08-31 00:32:07
Message-ID: CAFcNs+q4nXW0Y8a=xeJJnFoxVHyM1P-kwRB0qB44oTHJvwR_eg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Aug 30, 2014 at 5:05 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
>
>
> On Saturday, August 30, 2014, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Sat, Aug 30, 2014 at 11:56 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
wrote:
>> >
>> > If you insert tuples with COPY into a table created or truncated in
the same transaction, at the end of the COPY it calls heap_sync.
>> >
>> > But there cases were people use COPY in a loop with a small amount of
data in each statement. Now it is calling heap_sync many times, and if
NBuffers is large doing that gets very slow.
>> >
>> > Could the heap_sync be safely delayed until the end of the
transaction, rather than the end of the COPY?
>>
>> Wouldn't unconditionally delaying sync until end of transaction
>> can lead to burst of I/O at that time especially if there are many
>> such copy commands in a transaction, leading to delay in some
>> other operation's that might be happening concurrently in the
>> system.
>>
>>
>>
>
> I agree with that but then, it can provide us the same benefits like
group commit,especially when most of the copy commands touch pages which
are nearby,hence reducing the seek time overhead.
>
> We could look at making it optional through a GUC, since it is useful
albeit for some specific usecases.
>

It's interesting... maybe something analogous to "SET CONSTRAINTS
DEFERRED"...

SET COPY COMMIT { IMMEDIATE | DEFERRED }

or

SET COPY MODE { IMMEDIATE | DEFERRED }

Just some thoughts!

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY and heap_sync
Date: 2014-08-31 13:10:07
Message-ID: 54031EAF.3060507@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 8/30/14 2:26 AM, Jeff Janes wrote:
> But there cases were people use COPY in a loop with a small amount of
> data in each statement.

What would be the reason for doing that?


From: Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY and heap_sync
Date: 2014-09-01 03:49:39
Message-ID: CAFcNs+rraii9UTtc2LC_jbHfUkyF1tDEqdqn+tFh7EJbmMiNUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Aug 31, 2014 at 10:10 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>
> On 8/30/14 2:26 AM, Jeff Janes wrote:
> > But there cases were people use COPY in a loop with a small amount of
> > data in each statement.
>
> What would be the reason for doing that?
>

I used that to the same thing many times. In a company that I was employed
we developed scripts to migrate data from one database do another.

The first version we used INSERT statements and was very very slow. Then we
wrote a second version changing the INSERT by COPY statements. The
performance was very better, but we believe that could be better, so in
the third version we created some kind of "cache" (using arrays) to
accumulate the records in memory then after N rows we build the COPY
statement with the cache contents and run it. This was a really good
performance improvement.

It's my use case to we have a feature to postpone the heap_sync in COPY
statements. I don't know if it's a feature that a lot of people wants, but
IMHO it could be nice to improve the bulk load operations.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY and heap_sync
Date: 2014-09-02 15:41:22
Message-ID: CAMkU=1zOi9Npj8He8EOuMAgH-Sb2k_i5KoKoauqmpvwRFuE3Nw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Aug 31, 2014 at 6:10 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:

> On 8/30/14 2:26 AM, Jeff Janes wrote:
> > But there cases were people use COPY in a loop with a small amount of
> > data in each statement.
>
> What would be the reason for doing that?
>
>
As far as I can tell, DRY. They need code to do bulk inserts anyway. So,
just use that everywhere even when it is not in bulk.

Also, you can't interleave a copy command with other queries on the same
connection. So you code it to start a COPY, use it until you discover you
need to run a query (because you encounter something not in you local
cache), end the COPY and do that query, then restart the query.

Under some conditions, the interruption occurs very seldom, under other
conditions it is pretty much every row.

Cheers,

Jeff


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY and heap_sync
Date: 2014-09-03 21:11:20
Message-ID: CA+Tgmoa3VXCx9GKAON1_FkVcG6_-uo6=S+fL0zj8N2dVTTBptA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Aug 30, 2014 at 2:26 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> If you insert tuples with COPY into a table created or truncated in the same
> transaction, at the end of the COPY it calls heap_sync.
>
> But there cases were people use COPY in a loop with a small amount of data
> in each statement. Now it is calling heap_sync many times, and if NBuffers
> is large doing that gets very slow.
>
> Could the heap_sync be safely delayed until the end of the transaction,
> rather than the end of the COPY?

I don't think there's any data integrity problem with that, but if the
fsync() should fail it would be reported at commit time rather than in
response to the COPY. That might be OK though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company