Re: Running update in chunks?

Lists: pgsql-general
From: "Kevin Grittner" <kgrittn(at)mail(dot)com>
To: "Tim Uckun" <timuckun(at)gmail(dot)com>
Cc: "Richard Huxton" <dev(at)archonet(dot)com>,pgsql-general(at)postgresql(dot)org
Subject: Re: Running update in chunks?
Date: 2013-01-22 02:21:50
Message-ID: 20130122022150.120590@gmx.com

Kevin Grittner wrote:

> update imports set make_id = 0
>
> Query returned successfully: 98834 rows affected, 45860 ms execution time.

For difficult problems, there is nothing like a self-contained test
case that someone else can run to reproduce the issue. Here's a
starting point:

create extension if not exists hstore;
create schema cars;
drop table if exists cars.imports;
CREATE TABLE cars.imports
(
id serial NOT NULL,
target_id integer,
batch_id integer,
make_id integer,
model_id integer,
date timestamp without time zone,
division_id integer,
dealer_id integer,
data hstore,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT imports_pkey PRIMARY KEY (id)
);
CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id);
CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id);
CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id);
CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id);
insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x;
vacuum freeze analyze;
\timing on
update cars.imports set make_id = 0;
vacuum analyze;
update cars.imports set make_id = 0;

... and here's what I get when I run it on my desktop computer with
ordinary disk drives and a completely default configuration:

test=# create extension if not exists hstore;
CREATE EXTENSION
Time: 48.032 ms
test=# create schema cars;
CREATE SCHEMA
Time: 8.150 ms
test=# drop table if exists cars.imports;
NOTICE:  table "imports" does not exist, skipping
DROP TABLE
Time: 0.205 ms
test=# CREATE TABLE cars.imports
test-# (
test(# id serial NOT NULL,
test(# target_id integer,
test(# batch_id integer,
test(# make_id integer,
test(# model_id integer,
test(# date timestamp without time zone,
test(# division_id integer,
test(# dealer_id integer,
test(# data hstore,
test(# created_at timestamp without time zone NOT NULL,
test(# updated_at timestamp without time zone NOT NULL,
test(# CONSTRAINT imports_pkey PRIMARY KEY (id)
test(# );
NOTICE:  CREATE TABLE will create implicit sequence "imports_id_seq" for serial column "imports.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "imports_pkey" for table "imports"
CREATE TABLE
Time: 152.677 ms
test=# CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
CREATE INDEX
Time: 6.391 ms
test=# CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id);
CREATE INDEX
Time: 64.668 ms
test=# CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id);
CREATE INDEX
Time: 65.573 ms
test=# CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id);
CREATE INDEX
Time: 64.959 ms
test=# CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id);
CREATE INDEX
Time: 64.906 ms
test=# insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x;
INSERT 0 100000
Time: 2516.559 ms
test=# vacuum freeze analyze;
VACUUM
Time: 3357.778 ms
test=# \timing on
Timing is on.
test=# update cars.imports set make_id = 0;
UPDATE 100000
Time: 2937.241 ms
test=# vacuum analyze;
VACUUM
Time: 2097.426 ms
test=# update cars.imports set make_id = 0;
UPDATE 100000
Time: 3935.939 ms

Ubuntu 12.10
i7-3770 CPU @ 3.40GHz with 16GB RAM
Linux Kevin-Desktop 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
A pair of WDC WD10EALX-009BA0 (15.01H51) 7200 RPM drives in RAID 1.

PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit

test=# SELECT name, current_setting(name), source
test-#   FROM pg_settings
test-#   WHERE source NOT IN ('default', 'override');
           name            |  current_setting   |        source        
----------------------------+--------------------+----------------------
application_name           | psql               | client
client_encoding            | UTF8               | client
DateStyle                  | ISO, MDY           | configuration file
default_text_search_config | pg_catalog.english | configuration file
lc_messages                | en_US.UTF-8        | configuration file
lc_monetary                | en_US.UTF-8        | configuration file
lc_numeric                 | en_US.UTF-8        | configuration file
lc_time                    | en_US.UTF-8        | configuration file
log_timezone               | US/Central         | configuration file
max_connections            | 100                | configuration file
max_stack_depth            | 2MB                | environment variable
shared_buffers             | 32MB               | configuration file
TimeZone                   | US/Central         | configuration file
(13 rows)

I did absolutely no tuning from the default configuration.

So, what timings do you get if you run the identical script? Is
there something you can do to the above script (perhaps in how the
data is populated) that will reproduce the performance you see?

-Kevin


From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: Tim Uckun <timuckun(at)gmail(dot)com>
Cc: Kevin Grittner <kgrittn(at)mail(dot)com>, Richard Huxton <dev(at)archonet(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Running update in chunks?
Date: 2013-01-22 18:57:16
Message-ID: 50FEE10C.3080307@pinpointresearch.com
Lists: pgsql-general

On 01/21/2013 06:21 PM, Kevin Grittner wrote:
> Kevin Grittner wrote:
> ...
> shared_buffers | 32MB | configuration file
> ...
>
> I did absolutely no tuning from the default configuration.
>
But Tim has his shared_buffers set to 1600kB
("shared_buffers";"1600kB";"configuration file"), roughly 1/20 of the
typical 32MB default, which is itself a very low starting value, on a
machine populated with 8GB of RAM.

I'd like to see how it runs with a more reasonable shared_buffers
setting, at the very minimum the 32MB default.
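On 9.2 that means editing postgresql.conf and restarting the server
(ALTER SYSTEM does not exist until 9.4). A minimal sketch; the 256MB
figure is just an example for an 8GB machine, not a recommendation
from this thread:

```sql
-- Check the current setting from psql:
SHOW shared_buffers;

-- Then, in postgresql.conf (a server restart is required for
-- shared_buffers to take effect):
--   shared_buffers = 256MB
```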

Cheers,
Steve


From: Tim Uckun <timuckun(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)mail(dot)com>
Cc: Richard Huxton <dev(at)archonet(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Running update in chunks?
Date: 2013-01-25 08:53:01
Message-ID: CAGuHJrMMD_2uQ+wQUwAOUyrp88Vh_O5S63oD=EKk+mvCpghxjw@mail.gmail.com
Lists: pgsql-general

Sorry I haven't been responsive for a little while.

I ran your script, but created a new schema instead of using my
existing one. My timings were similar to yours (more or less) except
for the vacuums, which took roughly 147891 ms.

On Tue, Jan 22, 2013 at 3:21 PM, Kevin Grittner <kgrittn(at)mail(dot)com> wrote:
> So, what timings do you get if you run the identical script? Is
> there something you can do to the above script (maybe in terms of
> populating data) which will cause the performance you see?
>
> -Kevin