Running update in chunks?

From: Tim Uckun <timuckun(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Running update in chunks?
Date: 2013-01-21 08:04:34
Message-ID: CAGuHJrPHRNmmoWOQD7YfArgGHRrUkBzC33ipsJr1tODYJCmMFg@mail.gmail.com
Lists: pgsql-general

I have seen a lot of slow-update questions asked both here and on
Stack Overflow, but they usually involve large tables. In my case the
dataset is fairly small.

I have an app in which I import data and then merge the records with
an existing table. Currently I do most of the heavy lifting in
application code, and it works well enough, but as the imports get
bigger I thought I would rewrite the code to speed it up using
Postgres. Basically I take the data, which I consider to be dirty, and
put it into a table, using hstore to store it. I then run a series of
update queries to locate the "real" records in the various tables. The
import data looks like this https://gist.github.com/4584366 and has
about 98K records in it. The lookup table is very small, with only a
couple of hundred records in it.

This is the query I am running:

update cars.imports i
set make_id = md.make_id
from cars.models md where i.model_id = md.id;

Here is the EXPLAIN ANALYSE output:

"Update on imports i (cost=2.46..49720.34 rows=138858 width=526)
(actual time=51968.553..51968.553 rows=0 loops=1)"
" -> Hash Join (cost=2.46..49720.34 rows=138858 width=526) (actual
time=0.044..408.170 rows=98834 loops=1)"
" Hash Cond: (i.model_id = md.id)"
" -> Seq Scan on imports i (cost=0.00..47808.58 rows=138858
width=516) (actual time=0.010..323.616 rows=98834 loops=1)"
" -> Hash (cost=1.65..1.65 rows=65 width=14) (actual
time=0.026..0.026 rows=65 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 4kB"
" -> Seq Scan on models md (cost=0.00..1.65 rows=65
width=14) (actual time=0.002..0.012 rows=65 loops=1)"
"Total runtime: 51968.602 ms"

This query takes fifty seconds on a MacBook Air with an i7 processor,
eight gigs of RAM and an SSD. I am using Postgres 9.2 installed with
Homebrew, using the standard conf file.

So it seems to me that this query is running as fast as it can, but
it's still much slower than doing things in code one record at a time
(using some memoization).

Anyway... presuming I can't really do anything to speed up this
query, does it make sense to try to do this in chunks, and if so what
is the best technique for doing that? The sketch below shows the kind
of thing I have in mind.
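
For illustration only - a rough sketch, assuming the slices are keyed
on an integer primary key "id" on imports:

-- rough sketch: update one 10K-row slice at a time, committing between runs
-- (assumes imports has an integer primary key "id")
UPDATE cars.imports i
   SET make_id = md.make_id
  FROM cars.models md
 WHERE i.model_id = md.id
   AND i.id >= 0 AND i.id < 10000;   -- then 10000..20000, and so on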


From: Richard Huxton <dev(at)archonet(dot)com>
To: Tim Uckun <timuckun(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Running update in chunks?
Date: 2013-01-21 09:30:59
Message-ID: 50FD0AD3.8020808@archonet.com
Lists: pgsql-general

On 21/01/13 08:04, Tim Uckun wrote:
> This is the query I am running
>
> update cars.imports i
> set make_id = md.make_id
> from cars.models md where i.model_id = md.id;
>
>
> Here is the analyse
Looks like it's the actual update that's taking all the time.
> This query takes fifty seconds on a macbook air with i7 processor and
> eight gigs of RAM and SSD hard drive. I am using postgres 9.2
> installed with homebrew using the standard conf file.
Can you try a couple of things just to check timings? Probably worth
running them under EXPLAIN ANALYSE.

SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id =
md.id;

CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models
md ON i.model_id = md.id;

Now, the first one should take half a second judging by your previous
explain. If the second one takes 50 seconds too, then that's just the
limit of your SSD's write speed. If it's much faster, then something
else is happening.
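
If it's easiest, a sketch of capturing both timings in one psql
session (EXPLAIN ANALYSE works on the SELECT; \timing reports
wall-clock time for the CREATE):

\timing on
EXPLAIN ANALYSE SELECT count(*)
  FROM cars.imports i JOIN cars.models md ON i.model_id = md.id;
-- then run the CREATE TEMP TABLE statement exactly as above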

--
Richard Huxton
Archonet Ltd


From: Tim Uckun <timuckun(at)gmail(dot)com>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Running update in chunks?
Date: 2013-01-21 10:30:06
Message-ID: CAGuHJrOKGPi7wrJocUgj+8B2U5abJW9OdyDrfC0j7goFOu_Fhw@mail.gmail.com
Lists: pgsql-general

> Can you try a couple of things just to check timings. Probably worth EXPLAIN
> ANALYSE.
>
> SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id =
> md.id;

Takes about 300 ms

>
> CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models md
> ON i.model_id = md.id;

Takes about 300 ms

>
> Now the first one should take half a second judging by your previous
> explain. If the second one takes 50 seconds too then that's just the limit
> of your SSD's write. If it's much faster then something else is happening.


From: Richard Huxton <dev(at)archonet(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Running update in chunks?
Date: 2013-01-21 11:12:39
Message-ID: 50FD22A7.4070203@archonet.com
Lists: pgsql-general

On 21/01/13 10:30, Tim Uckun wrote:
>> Can you try a couple of things just to check timings. Probably worth EXPLAIN
>> ANALYSE.
>>
>> SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id =
>> md.id;
>
> Takes about 300 ms
>
>> CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models md
>> ON i.model_id = md.id;
> Takes about 300 ms
OK - so writing all the data takes well under one second, but updating
the same amount takes 50 seconds.

The only differences I can think of are WAL logging (transaction log)
and index updates (the temp table has no indexes).

1. Try CREATE TABLE rather than CREATE TEMP TABLE - if that's still
quick then it's not the time taken to write WAL.
2. Run the update query against your new tt table and see how long that
takes.
3. Add indexes and repeat (in particular I'd be suspicious of the gin
index on "data")

My guess is that it's the time taken to update the "data" index - gin
indexes can be slow to rebuild (although 50 seconds seems *very* slow).
If so there are a few options:
1. Split the table and put whatever this "data" is into an import_data
table - assuming it doesn't change often.
2. Try a fill-factor of 50% or less - keeping the updates on the same
data page as the original might help
3. Drop the gin index before doing your bulk update and rebuild it at
the end. This is a common approach with bulk-loading / updates.
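
For option 3, something like this sketch (I'm guessing at the index
name - check \di for the real one):

-- hypothetical index name; substitute whatever \di shows
DROP INDEX cars.imports_data_idx;

UPDATE cars.imports i
   SET make_id = md.make_id
  FROM cars.models md
 WHERE i.model_id = md.id;

CREATE INDEX imports_data_idx ON cars.imports USING gin (data);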

Oh - I'm assuming you're only updating those rows whose make_id has
changed - that seemed to be the suggestion in your first message. If
not, simply adding "AND i.make_id <> md.make_id" should help; see the
sketch below. Also (and you may well have considered this) - for a
normalised setup you'd just have the model_id in "imports" and look up
the make_id through the "models" table.
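
i.e. something like this sketch - IS DISTINCT FROM also catches rows
where make_id is still NULL, which a plain <> comparison would skip:

UPDATE cars.imports i
   SET make_id = md.make_id
  FROM cars.models md
 WHERE i.model_id = md.id
   AND i.make_id IS DISTINCT FROM md.make_id;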

--
Richard Huxton
Archonet Ltd


From: Tim Uckun <timuckun(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Running update in chunks?
Date: 2013-01-21 23:45:25
Message-ID: CAGuHJrOso5Z6+_f7OWxeE5CV99DGmtZod6iBSwaCuzeLreU73g@mail.gmail.com
Lists: pgsql-general

Oh I forgot

SELECT version();

"PostgreSQL 9.2.2 on x86_64-apple-darwin12.2.1, compiled by Apple
clang version 4.1 (tags/Apple/clang-421.11.65) (based on LLVM 3.1svn),
64-bit"

SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override');

"application_name";"pgAdmin III - Query Tool";"client"
"bytea_output";"escape";"session"
"client_encoding";"UNICODE";"session"
"client_min_messages";"notice";"session"
"DateStyle";"ISO, DMY";"session"
"default_text_search_config";"pg_catalog.english";"configuration file"
"lc_messages";"en_NZ.UTF-8";"configuration file"
"lc_monetary";"en_NZ.UTF-8";"configuration file"
"lc_numeric";"en_NZ.UTF-8";"configuration file"
"lc_time";"en_NZ.UTF-8";"configuration file"
"log_timezone";"NZ";"configuration file"
"max_connections";"20";"configuration file"
"max_stack_depth";"2MB";"environment variable"
"search_path";"chrysler, public";"session"
"shared_buffers";"1600kB";"configuration file"
"TimeZone";"NZ";"configuration file"


From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: Tim Uckun <timuckun(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Running update in chunks?
Date: 2013-01-22 00:15:03
Message-ID: 50FDDA07.7020407@pinpointresearch.com
Lists: pgsql-general

On 01/21/2013 03:45 PM, Tim Uckun wrote:
> Oh I forgot
> ...
> "shared_buffers";"1600kB";"configuration file"

You *reduced* shared buffers to 1.6MB? IIRC the typical default is 32MB
and the most common adjustment is to *increase* shared buffers. Most of
my servers are set to 2GB.

Try bumping that up to a reasonable value
(http://www.postgresql.org/docs/9.2/static/runtime-config-resource.html)
and share the results. Don't forget to restart PG after changing that
setting.
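
For example, in postgresql.conf - a sketch only; a common rule of
thumb is around 25% of RAM, though note that 9.2 on OS X may need the
kernel's kern.sysv.shmmax/shmall limits raised before Postgres will
start with a larger setting:

# postgresql.conf - hypothetical starting point for an 8GB machine
shared_buffers = 1GB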

Cheers,
Steve


From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: Tim Uckun <timuckun(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Running update in chunks?
Date: 2013-01-22 00:23:22
Message-ID: 50FDDBFA.6000301@pinpointresearch.com
Lists: pgsql-general

On 01/21/2013 04:15 PM, Steve Crawford wrote:
> On 01/21/2013 03:45 PM, Tim Uckun wrote:
>> Oh I forgot
>> ...
Me, too. I forgot to ask for the table definition. If there are
variable-length fields like "text" or "varchar", what is the typical
size of the data?

Also, what is the physical size of the table (\dt+ yourtable)?

Perhaps even the output of
select * from pg_stat_user_tables where relname='yourtable';
might be useful.

Cheers,
Steve


From: Tim Uckun <timuckun(at)gmail(dot)com>
To: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Running update in chunks?
Date: 2013-01-22 01:02:58
Message-ID: CAGuHJrNKFjrEuq7LXbLSj8tJ+Ur-tQ-=ARjgQtW=3-0K1hGtMA@mail.gmail.com
Lists: pgsql-general

I already posted the schema earlier. It's a handful of integer fields
with one hstore field.

On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford
<scrawford(at)pinpointresearch(dot)com> wrote:
> select * from pg_stat_user_tables where relname='yourtable';

Messy output

"relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count"
26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22
12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22
12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tim Uckun <timuckun(at)gmail(dot)com>
Cc: Steve Crawford <scrawford(at)pinpointresearch(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Running update in chunks?
Date: 2013-01-22 07:07:02
Message-ID: CAMkU=1zmfVJypqG7GiAmgbAu8Bf_NKYLVuORD+CiBQBE9pcOpA@mail.gmail.com
Lists: pgsql-general

On Monday, January 21, 2013, Tim Uckun wrote:

> I already posted the schema earlier. It's a handful of integer fields
> with one hstore field.
>

One hstore field can easily be equivalent to 50 text fields with an
index on each one.

I'm pretty sure that that is your bottleneck.

What does \di+ show?
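
If you're not in psql, a rough equivalent (just a sketch) is:

-- adjust schemaname to wherever imports lives (cars or chrysler)
SELECT indexname,
       pg_size_pretty(pg_relation_size((schemaname || '.' || indexname)::regclass))
  FROM pg_indexes
 WHERE schemaname = 'cars' AND tablename = 'imports';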

Cheers,

Jeff



From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: Tim Uckun <timuckun(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Running update in chunks?
Date: 2013-01-22 18:50:14
Message-ID: 50FEDF66.7050100@pinpointresearch.com
Lists: pgsql-general

On 01/21/2013 05:02 PM, Tim Uckun wrote:
> I already posted the schema earlier. It's a handful of integer fields
> with one hstore field.

Oh well. I can't find it but maybe it got lost in shipping or eaten by a
spam filter.

>
>
> On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford
> <scrawford(at)pinpointresearch(dot)com> wrote:
>> select * from pg_stat_user_tables where relname='yourtable';
>
>
> Messy output
Don't know if you are in psql but if so, expanded display works for
this. I.e.:
steve(at)[local] => \x
Expanded display is on.
steve(at)[local] => select * from pg_stat_user_tables where relname='footest';
-[ RECORD 1 ]-----+------------------------------
relid | 781691
schemaname | public
relname | footest
seq_scan | 3
seq_tup_read | 609
idx_scan |
idx_tup_fetch |
n_tup_ins | 609
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup | 301
n_dead_tup | 0
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze | 2012-12-19 08:42:23.347368-08
vacuum_count | 0
autovacuum_count | 0
analyze_count | 0
autoanalyze_count | 2

>
> "relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count"
> 26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22
> 12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22
> 12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24
>
>
So at least autovacuum is running (and some manual vacuum and analyze as
well).

Cheers,
Steve


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
Cc: Tim Uckun <timuckun(at)gmail(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Running update in chunks?
Date: 2013-01-22 18:59:14
Message-ID: 20130122185914.GB4221@alvh.no-ip.org
Lists: pgsql-general

Steve Crawford escribió:
> On 01/21/2013 05:02 PM, Tim Uckun wrote:
> >I already posted the schema earlier. It's a handful of integer fields
> >with one hstore field.
>
> Oh well. I can't find it but maybe it got lost in shipping or eaten
> by a spam filter.

This is what we have the archives for:

http://www.postgresql.org/message-id/flat/50FEDF66(dot)7050100(at)pinpointresearch(dot)com#50FEDF66(dot)7050100@pinpointresearch.com

The original message is at the top of the page (obviously).

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Tim Uckun <timuckun(at)gmail(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Running update in chunks?
Date: 2013-01-22 19:22:27
Message-ID: 50FEE6F3.3060402@pinpointresearch.com
Lists: pgsql-general

On 01/22/2013 10:59 AM, Alvaro Herrera wrote:
> Steve Crawford escribió:
>> On 01/21/2013 05:02 PM, Tim Uckun wrote:
>>> I already posted the schema earlier. It's a handful of integer fields
>>> with one hstore field.
>> Oh well. I can't find it but maybe it got lost in shipping or eaten
>> by a spam filter.
> This is what we have the archives are for:
>
> http://www.postgresql.org/message-id/flat/50FEDF66(dot)7050100(at)pinpointresearch(dot)com#50FEDF66(dot)7050100@pinpointresearch.com
>
> The original message is at the top of the page (obviously).
>
I didn't notice that the information was over on the GitHub site
(which, of course, also makes it impossible to search for in my email
and unavailable in the mail archives for those wishing to view it in
the future).

Cheers,
Steve


From: Tim Uckun <timuckun(at)gmail(dot)com>
To: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Running update in chunks?
Date: 2013-01-25 09:00:15
Message-ID: CAGuHJrM6v9=1BhjRGTL-7jUyXKLAdWVj9YVb=xmQfnV4Q3HNGg@mail.gmail.com
Lists: pgsql-general

relid | 26710
schemaname | cars
relname | imports
seq_scan | 280
seq_tup_read | 25873543
idx_scan | 4
idx_tup_fetch | 2749
n_tup_ins | 98926
n_tup_upd | 6350466
n_tup_del | 92
n_tup_hot_upd | 625286
n_live_tup | 98834
n_dead_tup | 0
last_vacuum | 2013-01-25 21:55:36.078614+13
last_autovacuum | 2013-01-25 21:58:40.850546+13
last_analyze | 2013-01-25 21:55:36.305967+13
last_autoanalyze | 2013-01-25 21:51:54.307639+13
vacuum_count | 6
autovacuum_count | 32
analyze_count | 6
autoanalyze_count | 25

On Wed, Jan 23, 2013 at 7:50 AM, Steve Crawford
<scrawford(at)pinpointresearch(dot)com> wrote:
> Don't know if you are in psql but if so, expanded display works for this.
> [snip]