Index slow down insertions...

From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Index slow down insertions...
Date: 2012-07-15 01:14:45
Message-ID: 50021985.2000506@anatec.com
Lists: pgsql-novice pgsql-performance

Hello,

Our Postgres 9.0 DB has one table (the important one) where the bulk of the
insertions happens. We are looking more or less at around 15K to
20K insertions per minute, and my measurements give me a rate of 0.60 to
1 msec per insertion. A summary of the table where the insertions are
happening is as follows:

-- Columns inherited from table feed_all_y2012m07.ship_a_pos_messages:
msg_id bigint NOT NULL DEFAULT nextval('feed_all_y2012m07.messages_msg_id_seq'::regclass),
msg_type smallint NOT NULL,
obj_id integer NOT NULL,
msg_date_rec timestamp without time zone NOT NULL,
msg_text text NOT NULL,
msg_expanded boolean NOT NULL,
msg_time time without time zone,
pos_accuracy boolean NOT NULL DEFAULT false,
pos_raim boolean NOT NULL DEFAULT false,
pos_lon integer NOT NULL DEFAULT (181 * 600000),
pos_lat integer NOT NULL DEFAULT (91 * 60000),
pos_georef1 character varying(2) NOT NULL,
pos_georef2 character varying(2) NOT NULL,
pos_georef3 character varying(2) NOT NULL,
pos_georef4 character varying(2) NOT NULL,
pos_point geometry,
ship_speed smallint NOT NULL,
ship_course smallint NOT NULL,
ship_heading smallint NOT NULL,
ship_second smallint NOT NULL,
ship_radio integer NOT NULL,
ship_status ais_server.nav_status NOT NULL DEFAULT 'NOT_DEFINED'::ais_server.nav_status,
ship_turn smallint NOT NULL DEFAULT 128,
ship_maneuver smallint NOT NULL,
CONSTRAINT ship_a_pos_messages_wk0_pkey PRIMARY KEY (msg_id)

The table is created in tablespace "Data" while its indexes are in tablespace
"Index" (a different HD). While the database is empty the configuration is
flying, but of course that is not always the case. Five days and around
55,000,000 rows later, the insertions are literally so slow that the
application server has to drop inserts in order to keep up. To be precise, we
are now looking at 1 insertion every 5 to 10, sometimes 25 msec!!

After lots of tuning both on the postgres server and the stored procs, and
after installing 18 GB of RAM and appropriately changing shared_buffers,
work_mem etc., we realized that our index hard disk had 100%
utilization and essentially it was taking longer to update the indexes
than to update the table. Well, I took a radical approach and dropped all
the indexes and... miracle, the db came back to life, insertion went
back to a healthy 0.70 msec, but of course now I have no indexes. It is my
belief that I am doing something fundamentally wrong with the index
creation, as 4 indexes cannot really bring a database to a halt. Here are
the indexes I was using:

CREATE INDEX idx_ship_a_pos_messages_wk0_date_pos
ON feed_all_y2012m07.ship_a_pos_messages_wk0
USING btree
(msg_date_rec , pos_georef1 , pos_georef2 , pos_georef3 , pos_georef4 )
TABLESPACE index;

CREATE INDEX idx_ship_a_pos_messages_wk0_date_rec
ON feed_all_y2012m07.ship_a_pos_messages_wk0
USING btree
(msg_date_rec )
TABLESPACE index;

CREATE INDEX idx_ship_a_pos_messages_wk0_object
ON feed_all_y2012m07.ship_a_pos_messages_wk0
USING btree
(obj_id , msg_type , msg_text , msg_date_rec )
TABLESPACE index;

CREATE INDEX idx_ship_a_pos_messages_wk0_pos
ON feed_all_y2012m07.ship_a_pos_messages_wk0
USING btree
(pos_georef1 , pos_georef2 , pos_georef3 , pos_georef4 )
TABLESPACE index;
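
(For reference, a diagnostic sketch showing how large these indexes have grown
and how often they are actually scanned, which may help confirm that index
maintenance is the bottleneck; only the table name is taken from the
definitions above, the rest is illustrative:)

-- Index sizes and scan counts for the busy partition
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'ship_a_pos_messages_wk0';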

As I have run out of ideas, any help will be really appreciated. For the
time being I can live without indexes, but sooner or later people will
need to access the live data. I don't even dare to think what will
happen to the database once I introduce the spatial GIS index that I
need. Question: is there any possibility that I should include the primary
key in my indexes to "help" during indexing? If I remember correctly, MS-SQL
has such a "feature".

Kind Regards
Yiannis


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: Index slow down insertions...
Date: 2012-07-16 10:24:27
Message-ID: 5003EBDB.8030903@anatec.com
Lists: pgsql-novice pgsql-performance

Some more information regarding this "problem". I am starting to believe that
the problem is mainly due to the autovacuum that runs to prevent
wraparound. As our database is heavily used for inserts, wraparounds
happen very often. The vacuums that are triggered to deal with the
situation have an adverse effect on the index HD. In essence, as the
database covers 12 months of data, an autovacuum to prevent wraparound
is more or less constantly present, starving the actual data insertion
process of index HD resources (especially when there are quite a lot of
indexes, as I said in my previous post). Now, given the fact that only the
"current" month is updated with inserts while the previous months are
essentially read-only (static), I think that moving the indexes of the
past months to an archive HD, or dropping those that are not necessary
any more, would probably solve the problem. Does my theory hold any water?
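
(A minimal sketch of what that might look like for one of the previous months;
the June schema, the index names and the "archive" tablespace below are only
illustrative, based on the naming pattern shown earlier:)

-- Relocate a read-only month's index to an archive tablespace
ALTER INDEX feed_all_y2012m06.idx_ship_a_pos_messages_wk0_date_rec
    SET TABLESPACE archive;

-- Or drop an index that is no longer needed for a read-only month
DROP INDEX feed_all_y2012m06.idx_ship_a_pos_messages_wk0_pos;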

Kind Regards
Yiannis


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: Index slow down insertions...
Date: 2012-07-19 12:24:34
Message-ID: 5007FC82.9070007@anatec.com
Lists: pgsql-novice pgsql-performance

Hello again, sorry for bumping the thread, but I think that the more
information I provide, the more likely it is that I get an answer. So, as
I go along, I have stripped the database completely of the additional
indexes that could delay the insertion process, of course
maintaining the pkey and two or three absolutely mandatory indexes for my
select queries. As a result I have a sleek and steady performance of
around 0.70 msec per insertion. However, I have now come full circle:
I have a fast database, but when I try to "select", making optimum
use of the leftover indexes, the insertion process slows down. Yes, my
selections are huge (they are not slow, just huge, as they involve
geographical points etc.), but I am asking if there is any way I can
"prioritise" the insertions over the "selections". These "selections"
happen anyway as a batch process during the night, so I don't really
mind if they take 2 or 5 hours, as long as they are ready at 9:00 am the
next day. Again, any advice will be highly appreciated.

Kind Regards
Yiannis


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: A very long running query....
Date: 2012-07-20 21:19:22
Message-ID: 5009CB5A.7070801@anatec.com
Lists: pgsql-novice pgsql-performance

Hello,
the following query seems to take ages to execute. However, I am
more than sure (as you can see from the EXPLAIN ANALYZE) that it uses all
the correct indexes. In general I have serious issues with joins in my
database. This is a Postgres 9.0 server running PostGIS with the
"_int.sql" contrib enabled. Furthermore, I think that the execution of
this query seriously degrades the performance of the database. I had to
devise this query and run it as an overnight batch to populate a
table, as I couldn't afford to have users execute it over and over on a
"need to do" basis. Unfortunately it is still slow, and sometimes it either
brings down the whole database (my insertions are buffered on the app
server) or it never completes before morning.

SELECT
src_id,
date_trunc('day', message_copies.msg_date_rec) as date_count,
message_copies.pos_georef1,
message_copies.pos_georef2,
message_copies.pos_georef3,
message_copies.pos_georef4,
ais_server.array_accum(CASE WHEN msg_type BETWEEN 1 and 3
THEN message_copies.msg_id END) as msgA_array,
ais_server.array_accum(CASE WHEN msg_type = 18 THEN
message_copies.msg_id END) as msgB_std_array,
ais_server.array_accum(CASE WHEN msg_type = 19 THEN
message_copies.msg_id END) as msgB_ext_array,
uniq
(
ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_A'
THEN obj_mmsi END)
) as mmsi_type_A_array,
uniq
(
ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_B'
THEN obj_mmsi END)
) as mmsi_type_B_array,
avg(ship_speed) / 10.0 as avg_speed,
avg(ship_heading) as avg_heading,
avg(ship_course) / 10.0 as avg_course,
ST_Multi(ST_Collect(ship_pos_messages.pos_point)) as geom
from
feed_all_y2012m07.message_copies join
(feed_all_y2012m07.ship_pos_messages join
ais_server.ship_objects on (ship_pos_messages.obj_id =
ship_objects.obj_id))
on (message_copies.msg_id = ship_pos_messages.msg_id)
where
extract('day' from message_copies.msg_date_rec) = 17
and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17'
and message_copies.src_id = 1
and (message_copies.pos_georef1 || message_copies.pos_georef2
|| message_copies.pos_georef3 || message_copies.pos_georef4) <> ''
and not (message_copies.pos_georef1 ||
message_copies.pos_georef2 || message_copies.pos_georef3 ||
message_copies.pos_georef4) is null
and extract('day' from ship_pos_messages.msg_date_rec) = 17
group by src_id, date_count, message_copies.pos_georef1,
message_copies.pos_georef2, message_copies.pos_georef3,
message_copies.pos_georef4;

What follows is the Explain Analyze:
"HashAggregate (cost=21295.20..21298.51 rows=53 width=148) (actual
time=17832235.321..17832318.546 rows=2340 loops=1)"
" -> Nested Loop (cost=0.00..21293.21 rows=53 width=148) (actual
time=62.188..17801780.764 rows=387105 loops=1)"
" -> Nested Loop (cost=0.00..20942.93 rows=53 width=144)
(actual time=62.174..17783236.718 rows=387105 loops=1)"
" Join Filter: (feed_all_y2012m07.message_copies.msg_id =
feed_all_y2012m07.ship_pos_messages.msg_id)"
" -> Append (cost=0.00..19057.93 rows=53 width=33)
(actual time=62.124..5486473.545 rows=387524 loops=1)"
" -> Seq Scan on message_copies (cost=0.00..0.00
rows=1 width=68) (actual time=0.000..0.000 rows=0 loops=1)"
" Filter: ((src_id = 1) AND
(date_trunc('day'::text, msg_date_rec) = '2012-07-17
00:00:00'::timestamp without time zone) AND (date_part('day'::text,
msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text
|| (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text)
IS NULL)) AND (((((pos_georef1)::text || (pos_georef2)::text) ||
(pos_georef3)::text) || (pos_georef4)::text) <> ''::text))"
" -> Index Scan using
idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2
message_copies (cost=0.00..19057.93 rows=52 width=32) (actual
time=62.124..5486270.845 rows=387524 loops=1)"
" Index Cond: ((date_trunc('day'::text,
msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone) AND
(src_id = 1))"
" Filter: ((date_part('day'::text,
msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text
|| (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text)
IS NULL)) AND (((((pos_georef1)::text || (pos_georef2)::text) ||
(pos_georef3)::text) || (pos_georef4)::text) <> ''::text))"
" -> Append (cost=0.00..35.50 rows=5 width=93) (actual
time=31.684..31.724 rows=1 loops=387524)"
" -> Seq Scan on ship_pos_messages (cost=0.00..0.00
rows=1 width=52) (actual time=0.001..0.001 rows=0 loops=387524)"
" Filter: (date_part('day'::text,
feed_all_y2012m07.ship_pos_messages.msg_date_rec) = 17::double precision)"
" -> Seq Scan on ship_a_pos_messages
ship_pos_messages (cost=0.00..0.00 rows=1 width=52) (actual
time=0.000..0.000 rows=0 loops=387524)"
" Filter: (date_part('day'::text,
feed_all_y2012m07.ship_pos_messages.msg_date_rec) = 17::double precision)"
" -> Index Scan using ship_b_std_pos_messages_pkey
on ship_b_std_pos_messages ship_pos_messages (cost=0.00..9.03 rows=1
width=120) (actual time=0.008..0.008 rows=0 loops=387524)"
" Index Cond:
(feed_all_y2012m07.ship_pos_messages.msg_id =
feed_all_y2012m07.message_copies.msg_id)"
" Filter: (date_part('day'::text,
feed_all_y2012m07.ship_pos_messages.msg_date_rec) = 17::double precision)"
" -> Index Scan using ship_b_ext_pos_messages_pkey
on ship_b_ext_pos_messages ship_pos_messages (cost=0.00..7.90 rows=1
width=120) (actual time=0.004..0.004 rows=0 loops=387524)"
" Index Cond:
(feed_all_y2012m07.ship_pos_messages.msg_id =
feed_all_y2012m07.message_copies.msg_id)"
" Filter: (date_part('day'::text,
feed_all_y2012m07.ship_pos_messages.msg_date_rec) = 17::double precision)"
" -> Index Scan using ship_a_pos_messages_wk2_pkey
on ship_a_pos_messages_wk2 ship_pos_messages (cost=0.00..18.57 rows=1
width=120) (actual time=31.670..31.710 rows=1 loops=387524)"
" Index Cond:
(feed_all_y2012m07.ship_pos_messages.msg_id =
feed_all_y2012m07.message_copies.msg_id)"
" Filter: (date_part('day'::text,
feed_all_y2012m07.ship_pos_messages.msg_date_rec) = 17::double precision)"
" -> Index Scan using ship_objects_pkey on ship_objects
(cost=0.00..6.59 rows=1 width=12) (actual time=0.041..0.044 rows=1
loops=387105)"
" Index Cond: (ship_objects.obj_id =
feed_all_y2012m07.ship_pos_messages.obj_id)"
"Total runtime: 17832338.082 ms"

Some more information: feed_all_y2012m07.message_copies_wk2 has 24.5
million rows just for the 17th of July, and more or less the same number
of rows per day since the 15th, when I started populating it. So I guess
we are looking at around 122 million rows. The tables are populated with
around 16K rows per minute.

As always any help will be greatly appreciated.
Kind Regards
Yiannis


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-20 21:23:52
Message-ID: CAGTBQpa07PMrLA3qsM9Zf9QXcPvgntyVX7qL9CqEwQtK9LH1hA@mail.gmail.com
Lists: pgsql-novice pgsql-performance

On Fri, Jul 20, 2012 at 6:19 PM, Ioannis Anagnostopoulos
<ioannis(at)anatec(dot)com> wrote:
> " -> Nested Loop (cost=0.00..20942.93 rows=53 width=144) (actual
> time=62.174..17783236.718 rows=387105 loops=1)"
> " Join Filter: (feed_all_y2012m07.message_copies.msg_id =
> feed_all_y2012m07.ship_pos_messages.msg_id)"
> " -> Append (cost=0.00..19057.93 rows=53 width=33) (actual
> time=62.124..5486473.545 rows=387524 loops=1)"

Misestimated row counts... did you try running an analyze, or upping
statistic targets?


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-20 21:27:31
Message-ID: 5009CD43.8000300@anatec.com
Lists: pgsql-novice pgsql-performance

On 20/07/2012 22:23, Claudio Freire wrote:
> On Fri, Jul 20, 2012 at 6:19 PM, Ioannis Anagnostopoulos
> <ioannis(at)anatec(dot)com> wrote:
>> " -> Nested Loop (cost=0.00..20942.93 rows=53 width=144) (actual
>> time=62.174..17783236.718 rows=387105 loops=1)"
>> " Join Filter: (feed_all_y2012m07.message_copies.msg_id =
>> feed_all_y2012m07.ship_pos_messages.msg_id)"
>> " -> Append (cost=0.00..19057.93 rows=53 width=33) (actual
>> time=62.124..5486473.545 rows=387524 loops=1)"
> Misestimated row counts... did you try running an analyze, or upping
> statistic targets?
I have run ANALYZE every so often. I think the problem is that as I get
16K new rows every minute, the "stats" are always out of date... Possible?


From: Rosser Schwarz <rosser(dot)schwarz(at)gmail(dot)com>
To: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-20 21:33:03
Message-ID: CAFnxYwhbi6NmdNBiguwA6ObHiD=v7ayaVzrGWMZ=qfeTxZiWEA@mail.gmail.com
Lists: pgsql-novice pgsql-performance

On Fri, Jul 20, 2012 at 2:27 PM, Ioannis Anagnostopoulos
<ioannis(at)anatec(dot)com> wrote:
> On 20/07/2012 22:23, Claudio Freire wrote:
>> Misestimated row counts... did you try running an analyze, or upping
>> statistic targets?
> I have run analyse every so often. I think the problem is that as I get 16K
> new rows every minutes, the "stats" are always out... Possible?

It may not help much with any skew in your data that results from
divergent data appearing, but you can update the statistics targets
for those columns and analyze again, and the planner should have much
better information about the distributions of their data. The max
stats target is 10000, but the default is 100. Increasing it even
just to 500 or 1000 should help the planner significantly.
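
(A minimal sketch of what raising the targets might look like for a couple of
the columns from the query in question; the table, the choice of columns and
the value 1000 here are only illustrative:)

-- Raise the per-column statistics targets, then re-collect statistics
ALTER TABLE feed_all_y2012m07.message_copies_wk2
    ALTER COLUMN src_id SET STATISTICS 1000;
ALTER TABLE feed_all_y2012m07.message_copies_wk2
    ALTER COLUMN pos_georef1 SET STATISTICS 1000;
ANALYZE feed_all_y2012m07.message_copies_wk2;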

rls

--
:wq


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-20 21:33:58
Message-ID: CAGTBQpYY1cVYZtVNt9mdHjhZy=wQZac-aERAcmkeXs2D+Ym7cA@mail.gmail.com
Lists: pgsql-novice pgsql-performance

On Fri, Jul 20, 2012 at 6:27 PM, Ioannis Anagnostopoulos
<ioannis(at)anatec(dot)com> wrote:
> On 20/07/2012 22:23, Claudio Freire wrote:
>>
>> On Fri, Jul 20, 2012 at 6:19 PM, Ioannis Anagnostopoulos
>> <ioannis(at)anatec(dot)com> wrote:
>>>
>>> " -> Nested Loop (cost=0.00..20942.93 rows=53 width=144) (actual
>>> time=62.174..17783236.718 rows=387105 loops=1)"
>>> " Join Filter: (feed_all_y2012m07.message_copies.msg_id =
>>> feed_all_y2012m07.ship_pos_messages.msg_id)"
>>> " -> Append (cost=0.00..19057.93 rows=53 width=33) (actual
>>> time=62.124..5486473.545 rows=387524 loops=1)"
>>
>> Misestimated row counts... did you try running an analyze, or upping
>> statistic targets?
>
> I have run analyse every so often. I think the problem is that as I get 16K
> new rows every minutes, the "stats" are always out... Possible?

Looking at this:

" -> Index Scan using
idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2
message_copies (cost=0.00..19057.93 rows=52 width=32) (actual
time=62.124..5486270.845 rows=387524 loops=1)"
" Index Cond: ((date_trunc('day'::text,
msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone)
AND (src_id = 1))"
" Filter: ((date_part('day'::text,
msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text
|| (pos_georef2)::text) || (pos_georef3)::text) ||
(pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text ||
(pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text)
<> ''::text))"

It's very possible.

I think pg 9.1 had a fix for that, but I'm not sure it will help in
your case, I'd have to know what that index looks like.


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-20 21:52:21
Message-ID: 5009D315.4020409@anatec.com
Lists: pgsql-novice pgsql-performance

On 20/07/2012 22:33, Claudio Freire wrote:
> On Fri, Jul 20, 2012 at 6:27 PM, Ioannis Anagnostopoulos
> <ioannis(at)anatec(dot)com> wrote:
>> On 20/07/2012 22:23, Claudio Freire wrote:
>>> On Fri, Jul 20, 2012 at 6:19 PM, Ioannis Anagnostopoulos
>>> <ioannis(at)anatec(dot)com> wrote:
>>>> " -> Nested Loop (cost=0.00..20942.93 rows=53 width=144) (actual
>>>> time=62.174..17783236.718 rows=387105 loops=1)"
>>>> " Join Filter: (feed_all_y2012m07.message_copies.msg_id =
>>>> feed_all_y2012m07.ship_pos_messages.msg_id)"
>>>> " -> Append (cost=0.00..19057.93 rows=53 width=33) (actual
>>>> time=62.124..5486473.545 rows=387524 loops=1)"
>>> Misestimated row counts... did you try running an analyze, or upping
>>> statistic targets?
>> I have run analyse every so often. I think the problem is that as I get 16K
>> new rows every minutes, the "stats" are always out... Possible?
>
> Looking at this:
>
> " -> Index Scan using
> idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2
> message_copies (cost=0.00..19057.93 rows=52 width=32) (actual
> time=62.124..5486270.845 rows=387524 loops=1)"
> " Index Cond: ((date_trunc('day'::text,
> msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone)
> AND (src_id = 1))"
> " Filter: ((date_part('day'::text,
> msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text
> || (pos_georef2)::text) || (pos_georef3)::text) ||
> (pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text ||
> (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text)
> <> ''::text))"
>
> It's very possible.
>
> I think pg 9.1 had a fix for that, but I'm not sure it will help in
> your case, I'd have to know what that index looks like.
Here is the index:

CREATE INDEX idx_message_copies_wk2_date_src_pos_partial
ON feed_all_y2012m07.message_copies_wk2
USING btree
(date_trunc('day'::text, msg_date_rec), src_id, (((pos_georef1::text
|| pos_georef2::text) || pos_georef3::text) || pos_georef4::text))
TABLESPACE archive
WHERE (((pos_georef1::text || pos_georef2::text) ||
pos_georef3::text) || pos_georef4::text) IS NOT NULL OR NOT
(((pos_georef1::text || pos_georef2::text) || pos_georef3::text) ||
pos_georef4::text) = ''::text;


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: Rosser Schwarz <rosser(dot)schwarz(at)gmail(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-20 21:53:46
Message-ID: 5009D36A.10506@anatec.com
Lists: pgsql-novice pgsql-performance

On 20/07/2012 22:33, Rosser Schwarz wrote:
> On Fri, Jul 20, 2012 at 2:27 PM, Ioannis Anagnostopoulos
> <ioannis(at)anatec(dot)com> wrote:
>> On 20/07/2012 22:23, Claudio Freire wrote:
>>> Misestimated row counts... did you try running an analyze, or upping
>>> statistic targets?
>> I have run analyse every so often. I think the problem is that as I get 16K
>> new rows every minutes, the "stats" are always out... Possible?
> It may not help much with any skew in your data that results from
> divergent data appearing, but you can update the statistics targets
> for those columns and analyze again, and the planner should have much
> better information about the distributions of their data. The max
> stats target is 10000, but the default is 100. Increasing it even
> just to 500 or 1000 should help the planner significantly.
>
> rls
>
I suppose that this is some kind of postgresql.conf tweak?


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-20 22:19:15
Message-ID: 5009D963.9040005@anatec.com
Lists: pgsql-novice pgsql-performance

On 20/07/2012 22:53, Ioannis Anagnostopoulos wrote:
> On 20/07/2012 22:33, Rosser Schwarz wrote:
>> On Fri, Jul 20, 2012 at 2:27 PM, Ioannis Anagnostopoulos
>> <ioannis(at)anatec(dot)com> wrote:
>>> On 20/07/2012 22:23, Claudio Freire wrote:
>>>> Misestimated row counts... did you try running an analyze, or upping
>>>> statistic targets?
>>> I have run analyse every so often. I think the problem is that as I
>>> get 16K
>>> new rows every minutes, the "stats" are always out... Possible?
>> It may not help much with any skew in your data that results from
>> divergent data appearing, but you can update the statistics targets
>> for those columns and analyze again, and the planner should have much
>> better information about the distributions of their data. The max
>> stats target is 10000, but the default is 100. Increasing it even
>> just to 500 or 1000 should help the planner significantly.
>>
>> rls
>>
> I suppose that this is some kind of postgres.conf tweak?
>
On this Ubuntu installation the default_statistics_target = 1000 and not
100. Do you think that this might be an issue?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-20 23:10:02
Message-ID: 3639.1342825802@sss.pgh.pa.us
Lists: pgsql-novice pgsql-performance

Claudio Freire <klaussfreire(at)gmail(dot)com> writes:
> Looking at this:

> " -> Index Scan using
> idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2
> message_copies (cost=0.00..19057.93 rows=52 width=32) (actual
> time=62.124..5486270.845 rows=387524 loops=1)"
> " Index Cond: ((date_trunc('day'::text,
> msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone)
> AND (src_id = 1))"
> " Filter: ((date_part('day'::text,
> msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text
> || (pos_georef2)::text) || (pos_georef3)::text) ||
> (pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text ||
> (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text)
> <> ''::text))"

I think the real problem is that the planner has no hope of doing
anything very accurate with such an unwieldy filter condition. I'd look
at ways of making the filter conditions simpler, perhaps by recasting
the data representation. In particular, that's a horridly bad way of
asking whether some columns are empty, which I gather is the intent.
If you really want to do it just like that, creating an index on the
concatenation expression would guide ANALYZE to collect some stats about
it, but it would probably be a lot more efficient to put together an AND
or OR of tests on the individual columns.

regards, tom lane


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-20 23:56:34
Message-ID: 5009F032.40002@anatec.com
Lists: pgsql-novice pgsql-performance

On 21/07/2012 00:10, Tom Lane wrote:
> Claudio Freire <klaussfreire(at)gmail(dot)com> writes:
>> Looking at this:
>> " -> Index Scan using
>> idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2
>> message_copies (cost=0.00..19057.93 rows=52 width=32) (actual
>> time=62.124..5486270.845 rows=387524 loops=1)"
>> " Index Cond: ((date_trunc('day'::text,
>> msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone)
>> AND (src_id = 1))"
>> " Filter: ((date_part('day'::text,
>> msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text
>> || (pos_georef2)::text) || (pos_georef3)::text) ||
>> (pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text ||
>> (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text)
>> <> ''::text))"
> I think the real problem is that the planner has no hope of doing
> anything very accurate with such an unwieldy filter condition. I'd look
> at ways of making the filter conditions simpler, perhaps by recasting
> the data representation. In particular, that's a horridly bad way of
> asking whether some columns are empty, which I gather is the intent.
> If you really want to do it just like that, creating an index on the
> concatenation expression would guide ANALYZE to collect some stats about
> it, but it would probably be a lot more efficient to put together an AND
> or OR of tests on the individual columns.
>
> regards, tom lane
So what you suggest is to forget altogether the concatenation of
georef1/2/3/4 and instead alter my query with something like:

georef1 is not null and not georef1 = '' ...etc. for georef2, 3 and 4

That would require altering my index to have the four georef columns in it
separately and not as a concatenation, and likewise for the partial
index part. And a final thing: you seem to imply that indexes are
used by the analyser to collect statistics even if they are not used in
queries. So an index serves not only as a way to speed up targeted queries but
also to provide better statistics to the analyzer?

Kind Regards
Yiannis


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-21 08:02:21
Message-ID: 500A620D.4030505@ringerc.id.au
Lists: pgsql-novice pgsql-performance

On 07/21/2012 06:19 AM, Ioannis Anagnostopoulos wrote:

> On this Ubuntu installation the default_statistics_target = 1000 and
> not 100. Do you think that this might be an issue?

Nope. You should generally avoid setting default_statistics_target too
high anyway; leave it where it is and use ALTER TABLE ... ALTER COLUMN
... SET STATISTICS to raise the targets on columns where you're seeing
bad statistics estimates.

http://www.postgresql.org/docs/9.1/static/sql-altertable.html

Also make sure autovacuum is running frequently so it keeps the stats up
to date.

--
Craig Ringer


From: "Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To: "Ioannis Anagnostopoulos" <ioannis(at)anatec(dot)com>
Cc: "Claudio Freire" <klaussfreire(at)gmail(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: A very long running query....
Date: 2012-07-21 09:16:16
Message-ID: C4DAC901169B624F933534A26ED7DF310F96B02D@JENMAIL01.ad.intershop.net
Lists: pgsql-novice pgsql-performance

Hello,
isn't the first test superfluous here?

> where extract('day' from message_copies.msg_date_rec) = 17
> and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17'

> Here is the index:
>
> CREATE INDEX idx_message_copies_wk2_date_src_pos_partial
> ON feed_all_y2012m07.message_copies_wk2
> USING btree
> (date_trunc('day'::text, msg_date_rec),
> src_id,
> (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || pos_georef4::text))
> TABLESPACE archive
> WHERE (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || pos_georef4::text) IS NOT NULL
> OR NOT (((pos_georef1::text || pos_georef2::text) || pos_georef3::text) || pos_georef4::text) = ''::text;

the georef test can be simplified using coalesce:

> and (message_copies.pos_georef1 || message_copies.pos_georef2 || message_copies.pos_georef3 || message_copies.pos_georef4) <> ''
> and not (message_copies.pos_georef1 || message_copies.pos_georef2 || message_copies.pos_georef3 || message_copies.pos_georef4) is null
=>
and coalesce (
(message_copies.pos_georef1 || message_copies.pos_georef2 || message_copies.pos_georef3 || message_copies.pos_georef4),
'') <> ''

In order to avoid this test at query time you might add a boolean column message_copies.pos_has_georef,
and keep it up to date with a before insert or update trigger. This will allow you to shorten your index definition and simplify the planner's task a little bit.
Moreover, it will speed up your query in cases where the index doesn't get used.
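
(A minimal sketch of the suggested flag column and trigger, using the
message_copies_wk2 table from the query; the column, function and trigger
names here are only illustrative:)

-- Hypothetical flag column maintained by a BEFORE INSERT OR UPDATE trigger
ALTER TABLE feed_all_y2012m07.message_copies_wk2 ADD COLUMN pos_has_georef boolean;

CREATE OR REPLACE FUNCTION feed_all_y2012m07.set_pos_has_georef() RETURNS trigger AS $$
BEGIN
    -- true only when the concatenated georef is neither NULL nor empty
    NEW.pos_has_georef := coalesce(NEW.pos_georef1 || NEW.pos_georef2 ||
                                   NEW.pos_georef3 || NEW.pos_georef4, '') <> '';
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_set_pos_has_georef
    BEFORE INSERT OR UPDATE ON feed_all_y2012m07.message_copies_wk2
    FOR EACH ROW EXECUTE PROCEDURE feed_all_y2012m07.set_pos_has_georef();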

As Tom already mentioned, it may make sense not to concatenate the georefs within the index, but to keep them separated, or even to keep them in different indexes.
Which is best depends on the other queries running against this table.

HTH,

Marc Mamin



From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-21 09:22:09
Message-ID: 500A74C1.8020109@anatec.com
Lists: pgsql-novice pgsql-performance

On 21/07/2012 10:16, Marc Mamin wrote:
> Hello,
> isn't the first test superfluous here ?
>
> > where extract('day' from message_copies.msg_date_rec) = 17
> > and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17'
No, because it is used to select a partition. Otherwise it would go
through the whole hierarchy...


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
Cc: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>, Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-21 16:58:20
Message-ID: 20579.1342889900@sss.pgh.pa.us
Lists: pgsql-novice pgsql-performance

[ Please try to trim quotes when replying. People don't want to re-read
the entire thread in every message. ]

Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com> writes:
> On 21/07/2012 10:16, Marc Mamin wrote:
>> isn't the first test superfluous here ?
>>
>>> where extract('day' from message_copies.msg_date_rec) = 17
>>> and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17'

> No because it is used to select a partition. Otherwise it will go
> through the whole hierarchy...

You're using extract(day...) to define partitions? You might want to
rethink that. The planner has got absolutely no intelligence about
the behavior of extract, and in particular doesn't realize that the
date_trunc condition implies the extract condition; so that's another
part of the cause of the estimation error here.

What's usually recommended for partitioning is simple equality or
range constraints, such as "msg_date_rec >= 'date1' AND
msg_date_rec < 'date2'", which the planner does have a fair amount
of intelligence about.

Now, you can generalize that to equality or range constraints using
an expression; for instance there'd be no problem to partition on
date_trunc('day', msg_date_rec) rather than msg_date_rec directly,
so long as your queries always use that same expression. But you
should not expect that the planner can deduce very much about the
correlations between results of different functions.
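
(A sketch of the kind of range constraint described above, using the wk2
partition from the earlier posts as an illustration; the constraint name and
the exact date boundaries are assumptions:)

-- Range-based CHECK constraint that the planner can reason about
ALTER TABLE feed_all_y2012m07.message_copies_wk2
    ADD CONSTRAINT message_copies_wk2_date_range
    CHECK (msg_date_rec >= '2012-07-15' AND msg_date_rec < '2012-07-22');

-- A query phrased against the same column then allows partition pruning
SELECT count(*)
FROM feed_all_y2012m07.message_copies
WHERE msg_date_rec >= '2012-07-17' AND msg_date_rec < '2012-07-18';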

regards, tom lane


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>, Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-21 17:42:31
Message-ID: 500AEA07.1070602@anatec.com
Lists: pgsql-novice pgsql-performance

On 21/07/2012 17:58, Tom Lane wrote:
> [ Please try to trim quotes when replying. People don't want to re-read
> the entire thread in every message. ]
>
> Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com> writes:
>> On 21/07/2012 10:16, Marc Mamin wrote:
>>> isn't the first test superfluous here ?
>>>
>>>> where extract('day' from message_copies.msg_date_rec) = 17
>>>> and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17'
>> No because it is used to select a partition. Otherwise it will go
>> through the whole hierarchy...
> You're using extract(day...) to define partitions? You might want to
> rethink that. The planner has got absolutely no intelligence about
> the behavior of extract, and in particular doesn't realize that the
> date_trunc condition implies the extract condition; so that's another
> part of the cause of the estimation error here.
>
> What's usually recommended for partitioning is simple equality or
> range constraints, such as "msg_date_rec >= 'date1' AND
> msg_date_rec < 'date2'", which the planner does have a fair amount
> of intelligence about.
>
> Now, you can generalize that to equality or range constraints using
> an expression; for instance there'd be no problem to partition on
> date_trunc('day', msg_date_rec) rather than msg_date_rec directly,
> so long as your queries always use that same expression. But you
> should not expect that the planner can deduce very much about the
> correlations between results of different functions.
>
> regards, tom lane
I think you got this wrong here. If you look at the query again you will see
that I do use equality. The problem is that my "equality" works
by extracting the day from the msg_date_rec column. To put it in other
words, to avoid using "extract" I would need an additional
column holding only the day number to perform the equality on. Don't you
feel that this is not right, since I have the actual date? The constraint
within the table that defines the partition is as follows:

CONSTRAINT message_copies_wk0_date CHECK (date_part('day'::text,
msg_date_rec) >= 1::double precision AND date_part('day'::text,
msg_date_rec) <= 7::double precision)

I see no problem with this. The planner gets it right and "hits" the
correct table every time. So unless there is a technique here that I have
completely missed,
where is the problem?

Regards
Yiannis
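
For comparison, that wk0 constraint could also be written as the plain range
Tom describes, with no date_part() call at all (the constraint name and the
July dates are only illustrative):

ALTER TABLE feed_all_y2012m07.message_copies_wk0
    ADD CONSTRAINT message_copies_wk0_date_range
    CHECK (msg_date_rec >= TIMESTAMP '2012-07-01'
       AND msg_date_rec <  TIMESTAMP '2012-07-08');

-- A query written as
--   WHERE msg_date_rec >= '2012-07-03' AND msg_date_rec < '2012-07-04'
-- then selects the partition without needing an extra extract() condition.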


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-21 19:16:36
Message-ID: 500B0014.5070600@anatec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice pgsql-performance

On 21/07/2012 00:10, Tom Lane wrote:
> Claudio Freire <klaussfreire(at)gmail(dot)com> writes:
>> Looking at this:
>> " -> Index Scan using
>> idx_message_copies_wk2_date_src_pos_partial on message_copies_wk2
>> message_copies (cost=0.00..19057.93 rows=52 width=32) (actual
>> time=62.124..5486270.845 rows=387524 loops=1)"
>> " Index Cond: ((date_trunc('day'::text,
>> msg_date_rec) = '2012-07-17 00:00:00'::timestamp without time zone)
>> AND (src_id = 1))"
>> " Filter: ((date_part('day'::text,
>> msg_date_rec) = 17::double precision) AND (NOT (((((pos_georef1)::text
>> || (pos_georef2)::text) || (pos_georef3)::text) ||
>> (pos_georef4)::text) IS NULL)) AND (((((pos_georef1)::text ||
>> (pos_georef2)::text) || (pos_georef3)::text) || (pos_georef4)::text)
>> <> ''::text))"
> I think the real problem is that the planner has no hope of doing
> anything very accurate with such an unwieldy filter condition. I'd look
> at ways of making the filter conditions simpler, perhaps by recasting
> the data representation. In particular, that's a horridly bad way of
> asking whether some columns are empty, which I gather is the intent.
> If you really want to do it just like that, creating an index on the
> concatenation expression would guide ANALYZE to collect some stats about
> it, but it would probably be a lot more efficient to put together an AND
> or OR of tests on the individual columns.
>
> regards, tom lane
OK, regarding the index I use... I followed your second suggestion about
efficiency, with tests on the individual columns, and changed it to:

CREATE INDEX idx_message_copies_wk2_date_src_pos_partial
ON feed_all_y2012m07.message_copies_wk2
USING btree
(date_trunc('day'::text, msg_date_rec), src_id, pos_georef1,
pos_georef2, pos_georef3, pos_georef4)
TABLESPACE "index"
WHERE
pos_georef1 IS NOT NULL
AND NOT pos_georef1::text = ''::text
AND pos_georef2 IS NOT NULL
AND NOT pos_georef2::text = ''::text
AND pos_georef3 IS NOT NULL
AND NOT pos_georef3::text = ''::text
AND pos_georef4 IS NOT NULL
AND NOT pos_georef4::text = ''::text;

The query has been changed as well and now reads as follows:

SELECT
src_id,
date_trunc('day', message_copies.msg_date_rec) as date_count,
message_copies.pos_georef1,
message_copies.pos_georef2,
message_copies.pos_georef3,
message_copies.pos_georef4,
ais_server.array_accum(CASE WHEN msg_type BETWEEN 1 and 3
THEN message_copies.msg_id END) as msgA_array,
ais_server.array_accum(CASE WHEN msg_type = 18 THEN
message_copies.msg_id END) as msgB_std_array,
ais_server.array_accum(CASE WHEN msg_type = 19 THEN
message_copies.msg_id END) as msgB_ext_array,
uniq
(
ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_A'
THEN obj_mmsi END)
) as mmsi_type_A_array,
uniq
(
ais_server.array_accum(CASE WHEN obj_type = 'SHIP_TYPE_B'
THEN obj_mmsi END)
) as mmsi_type_B_array,
avg(ship_speed) / 10.0 as avg_speed,
avg(ship_heading) as avg_heading,
avg(ship_course) / 10.0 as avg_course,
ST_Multi(ST_Collect(ship_pos_messages.pos_point)) as geom
from
feed_all_y2012m07.message_copies join
(feed_all_y2012m07.ship_pos_messages join
ais_server.ship_objects on (ship_pos_messages.obj_id =
ship_objects.obj_id))
on (message_copies.msg_id = ship_pos_messages.msg_id)
where
extract('day' from message_copies.msg_date_rec) = 17
and date_trunc('day', message_copies.msg_date_rec) = '2012-07-17'
and message_copies.src_id = 5
and not message_copies.pos_georef1 = '' and not
message_copies.pos_georef2 = '' and not message_copies.pos_georef3 = ''
and not message_copies.pos_georef4 = ''
and message_copies.pos_georef1 is not null and
message_copies.pos_georef2 is not null and message_copies.pos_georef3 is
not null and message_copies.pos_georef4 is not null
and extract('day' from ship_pos_messages.msg_date_rec) = 17
group by src_id, date_count, message_copies.pos_georef1,
message_copies.pos_georef2, message_copies.pos_georef3,
message_copies.pos_georef4;

I am not sure that I can see an improvement; at least for src_ids that
have lots of msg_ids per day, the query never returned even after running
"explain analyze" for 5 hours. For smaller src_ids
(message-wise) there might be some improvement, or it may just have been
the analyse that I ran. As I said, the stats quickly go out of date
because of the large number of updates. So it looks like
it is not the "funny" "where" concatenation or some kind of index
construction problem, which brings us back to the issue of the
per-column "statistics_target". My problem is that, given the
query plan I provided you yesterday, I am not sure which columns'
statistics_target to touch and what sort of number to introduce. Is
there any rule of thumb?

Kind regards
Yiannis
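
For reference, the per-column setting being discussed is adjusted with
ALTER TABLE ... SET STATISTICS; the columns and the value 1000 below are
only an example, and the table needs a fresh ANALYZE afterwards:

ALTER TABLE feed_all_y2012m07.message_copies_wk2
    ALTER COLUMN src_id SET STATISTICS 1000;
ALTER TABLE feed_all_y2012m07.message_copies_wk2
    ALTER COLUMN pos_georef1 SET STATISTICS 1000;
ANALYZE feed_all_y2012m07.message_copies_wk2;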


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-21 19:19:20
Message-ID: CAGTBQpYTtCsvOL7APqwo2EM+A4GBAB3uGg2gnS1Mkp27b51iYw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice pgsql-performance

On Sat, Jul 21, 2012 at 4:16 PM, Ioannis Anagnostopoulos
<ioannis(at)anatec(dot)com> wrote:
> I am not sure that I can see an improvement; at least for src_ids that have
> lots of msg_ids per day, the query never returned even after running
> "explain analyze" for 5 hours. For smaller src_ids
> (message-wise) there might be some improvement, or it may just have been the
> analyse that I ran. As I said, the stats quickly go out of date because of the
> large number of updates. So it looks like
> it is not the "funny" "where" concatenation or some kind of index
> construction problem, which brings us back to the issue of the
> per-column "statistics_target". My problem is that, given the
> query plan I provided you yesterday, I am not sure which columns'
> statistics_target to touch and what sort of number to introduce. Is there
> any rule of thumb?

What's the size of your index, tables, and such?
In GB I mean, not tuples.


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-21 19:24:24
Message-ID: 500B01E8.1000405@anatec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice pgsql-performance

On 21/07/2012 20:19, Claudio Freire wrote:
> On Sat, Jul 21, 2012 at 4:16 PM, Ioannis Anagnostopoulos
> <ioannis(at)anatec(dot)com> wrote:
>> I am not sure that I can see an improvement; at least for src_ids that have
>> lots of msg_ids per day, the query never returned even after running
>> "explain analyze" for 5 hours. For smaller src_ids
>> (message-wise) there might be some improvement, or it may just have been the
>> analyse that I ran. As I said, the stats quickly go out of date because of the
>> large number of updates. So it looks like
>> it is not the "funny" "where" concatenation or some kind of index
>> construction problem, which brings us back to the issue of the
>> per-column "statistics_target". My problem is that, given the
>> query plan I provided you yesterday, I am not sure which columns'
>> statistics_target to touch and what sort of number to introduce. Is there
>> any rule of thumb?
> What's the size of your index, tables, and such?
> In GB I mean, not tuples.
The message_copies_wk2 table that I currently hit is 13GB with 11GB of
indexes, the ship_a_pos_messages_wk2 table is 17GB with a 2.5MB index, and
the ship_objects table together with its index is approximately 150MB.

Yiannis


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-21 20:10:51
Message-ID: CAGTBQpY6Ma44cvGmbcsjc_BBcScec1KOv7kd2_VRU1Ss9OM3RQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice pgsql-performance

On Fri, Jul 20, 2012 at 6:19 PM, Ioannis Anagnostopoulos
<ioannis(at)anatec(dot)com> wrote:
> (feed_all_y2012m07.ship_pos_messages join
> ais_server.ship_objects on (ship_pos_messages.obj_id = ship_objects.obj_id))
> on (message_copies.msg_id = ship_pos_messages.msg_id)

It's this part of the query that's taking 3.2 hours.

Move the filtered message_copies to a CTE, and the filtered
ship_pos_messages join to another CTE. That should (in my experience)
get you better performance.
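
A sketch of that shape, with the georef/mmsi aggregates trimmed for brevity
and the CTE names purely illustrative, could be:

WITH filtered_copies AS (
    SELECT msg_id, src_id,
           date_trunc('day', msg_date_rec) AS date_count,
           pos_georef1, pos_georef2, pos_georef3, pos_georef4
    FROM feed_all_y2012m07.message_copies
    WHERE extract('day' from msg_date_rec) = 17
      AND date_trunc('day', msg_date_rec) = '2012-07-17'
      AND src_id = 5
      AND pos_georef1 <> '' AND pos_georef2 <> ''
      AND pos_georef3 <> '' AND pos_georef4 <> ''
),
filtered_pos AS (
    SELECT p.msg_id, p.ship_speed, p.ship_heading, p.ship_course, p.pos_point
    -- ship_objects columns (obj_type, obj_mmsi) would also be selected here
    -- for the uniq()/array_accum aggregates; omitted for brevity.
    FROM feed_all_y2012m07.ship_pos_messages p
    JOIN ais_server.ship_objects o ON o.obj_id = p.obj_id
    WHERE extract('day' from p.msg_date_rec) = 17
)
SELECT c.src_id, c.date_count,
       c.pos_georef1, c.pos_georef2, c.pos_georef3, c.pos_georef4,
       avg(p.ship_speed) / 10.0 AS avg_speed,
       ST_Multi(ST_Collect(p.pos_point)) AS geom
FROM filtered_copies c
JOIN filtered_pos p ON p.msg_id = c.msg_id
GROUP BY c.src_id, c.date_count,
         c.pos_georef1, c.pos_georef2, c.pos_georef3, c.pos_georef4;

On 9.0 each CTE is materialized with its own filters applied first, so the
heavy join only ever sees the already-reduced row sets.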


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-21 20:11:27
Message-ID: CAGTBQpb7XZ3Q_tkC86qB9jy0AEoy-BZZAZPQ+qRQhsgOsN3K_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice pgsql-performance

On Sat, Jul 21, 2012 at 5:10 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> <ioannis(at)anatec(dot)com> wrote:
>> (feed_all_y2012m07.ship_pos_messages join
>> ais_server.ship_objects on (ship_pos_messages.obj_id = ship_objects.obj_id))
>> on (message_copies.msg_id = ship_pos_messages.msg_id)
>
> It's this part of the query that's taking 3.2 hours.
>
> Move the filtered message_copies to a CTE, and the filtered
> ship_pos_messages join to another CTE. That should (in my experience)
> get you better performance.

Btw... did you try the hash thing?


From: Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: A very long running query....
Date: 2012-07-21 20:29:28
Message-ID: 500B1128.6020001@anatec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice pgsql-performance

On 21/07/2012 21:11, Claudio Freire wrote:
> On Sat, Jul 21, 2012 at 5:10 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> <ioannis(at)anatec(dot)com> wrote:
>>> (feed_all_y2012m07.ship_pos_messages join
>>> ais_server.ship_objects on (ship_pos_messages.obj_id = ship_objects.obj_id))
>>> on (message_copies.msg_id = ship_pos_messages.msg_id)
>> It's this part of the query that's taking 3.2 hours.
>>
>> Move the filtered message_copies to a CTE, and the filtered
>> ship_pos_messages join to another CTE. That should (in my experience)
>> get you better performance.
> Btw... did you try the hash thing?
Not yet, as at present I am trying to simplify the index by getting the
georefs out of it. I don't know if this is a good idea, but I thought that
since I am not testing (yet) any equality other than making sure that the
georefs are not null or empty, I could avoid having them in the index,
thus reducing its size a lot... at least for now.
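
A slimmed-down version of the earlier partial index along those lines (the
index name is only illustrative) might be:

CREATE INDEX idx_message_copies_wk2_date_src_partial
    ON feed_all_y2012m07.message_copies_wk2
    USING btree (date_trunc('day'::text, msg_date_rec), src_id)
    TABLESPACE "index"
    WHERE pos_georef1 IS NOT NULL AND pos_georef1::text <> ''::text
      AND pos_georef2 IS NOT NULL AND pos_georef2::text <> ''::text
      AND pos_georef3 IS NOT NULL AND pos_georef3::text <> ''::text
      AND pos_georef4 IS NOT NULL AND pos_georef4::text <> ''::text;

The georef columns then appear only in the WHERE clause, so the index
entries stay narrow while the partial predicate still excludes the
empty/null positions.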


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Ioannis Anagnostopoulos" <ioannis(at)anatec(dot)com>, <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Index slow down insertions...
Date: 2012-07-30 22:43:13
Message-ID: 5016C7B1020000250004928E@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice pgsql-performance

Ioannis Anagnostopoulos <ioannis(at)anatec(dot)com> wrote:

> I have stripped completely the database from additional indexes,
> those that possibly delay the insertion process, of course
> maintaining the pkey and 2 or three absolutely mandatory indexes
> for my select queries. As a result I have a sleek and steady
> performance of around 0.70 msec per insertion.

Not bad!

> However I have now closed a full circle as I have a fast database
> but when I try to "select", making optimum usage of the left over
> indexes, the insertion process slows down. Yes my selections are
> huge (they are not slow, just huge as it is about geographical
> points etc) but I am asking if there is anyway that I can
> "prioritise" the insertions over the "selections". These
> "selections" are happening anyway as batch process during night so
> I don't really mind if they will take 2 or 5 hours, as long as
> they are ready at 9.00am next day.

You could try adding back indexes on the most critical columns, one
at a time. You might want to try single-column indexes, rather than
the wide ones you had before. The narrower keys may cut the cost of
maintaining the indexes enough to tolerate a few, and PostgreSQL can
often combine multiple indexes using "bitmap index scans".
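
For instance, a couple of narrow single-column indexes of the kind meant
here (index names and column choice purely as an example) would be:

CREATE INDEX idx_ship_a_pos_messages_wk2_obj_id
    ON feed_all_y2012m07.ship_a_pos_messages_wk2 (obj_id);

CREATE INDEX idx_ship_a_pos_messages_wk2_msg_date_rec
    ON feed_all_y2012m07.ship_a_pos_messages_wk2 (msg_date_rec);

-- a query filtering on both columns can then be served by combining the two
-- indexes with a BitmapAnd, instead of maintaining one wide composite index.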

You could also play with "nice" and "ionice" to reduce priority of
the "select" processes, but watch any such attempt very carefully
until you see what the impact really is.

Since you seem to be relatively satisfied with where you are now,
you should make small changes and be prepared to revert them if
insert performance drops off too much.

-Kevin