Rewriting DISTINCT and losing performance

From: "Chuck D(dot)" <pgsql-performance(at)nullmx(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Rewriting DISTINCT and losing performance
Date: 2007-05-21 04:28:30
Message-ID: 200705202228.32129.pgsql-performance@nullmx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi all,

I know we've covered this before but I'm having trouble with it today.

I have some geographic data in tables that I'm working with. I have a
country, state and city table. I was selecting the country_name out of the
country table but discovered that some countries (like Antarctica) didn't
have cities in the city table.

I resolved to query the country table for only country_name's which had
country_id's in the city table - meaning the country had cities listed.

The problem was I had a couple different sources (in separate tables) with
some extraneous column data so I chose to consolidate the city tables from
the different sources and column data that I don't need because I don't have
the hardware to support it.

That was the end of my query time.

Here's the original table and query:

# \d geo.world_city
Table "geo.world_city"
Column | Type | Modifiers
------------+------------------------+-----------
city_id | integer | not null
state_id | smallint |
country_id | smallint |
rc | smallint |
latitude | numeric(9,7) |
longitude | numeric(10,7) |
dsg | character(5) |
cc1 | character(2) |
adm1 | character(2) |
city_name | character varying(200) |
Indexes:
"world_city_pk" PRIMARY KEY, btree (city_id)
"idx_world_city_cc1" btree (cc1)
"idx_world_city_cc1_adm1" btree (cc1, adm1)
"idx_world_city_country_id" btree (country_id)
"idx_world_city_name_first_letter" btree
(state_id, "substring"(lower(city_name::text), 1, 1))
"idx_world_city_state_id" btree (state_id)

explain analyze
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
(select country_id FROM geo.world_city)
;

QUERY
PLAN
-----------------------------------------------------------------------------
--------------------------------------------------------------------------
Nested Loop IN Join (cost=0.00..167.97 rows=155 width=15) (actual
time=85.502..3479.449 rows=231 loops=1)
-> Seq Scan on country (cost=0.00..6.44 rows=244 width=15) (actual
time=0.089..0.658 rows=244 loops=1)
-> Index Scan using idx_world_city_country_id on world_city
(cost=0.00..8185.05 rows=12602 width=2) (actual time=14.250..14.250 rows=1
loops=244)
Index Cond: (country.country_id = world_city.country_id)
Total runtime: 3479.921 ms

Odd that it took 3 seconds because every previous run has been much quicker.
The next run was:

QUERY
PLAN
-----------------------------------------------------------------------------
------------------------------------------------------------------------
Nested Loop IN Join (cost=0.00..167.97 rows=155 width=15) (actual
time=0.087..6.967 rows=231 loops=1)
-> Seq Scan on country (cost=0.00..6.44 rows=244 width=15) (actual
time=0.028..0.158 rows=244 loops=1)
-> Index Scan using idx_world_city_country_id on world_city
(cost=0.00..8185.05 rows=12602 width=2) (actual time=0.026..0.026 rows=1
loops=244)
Index Cond: (country.country_id = world_city.country_id)
Total runtime: 7.132 ms
(5 rows)

But that was irrelevant. I created a new table and eliminated the data and
it looks like this:

# \d geo.city
Table "geo.city"
Column | Type | Modifiers
------------+------------------------+-----------
city_id | integer | not null
state_id | smallint |
country_id | smallint |
latitude | numeric(9,7) |
longitude | numeric(10,7) |
city_name | character varying(100) |
Indexes:
"city_pk" PRIMARY KEY, btree (city_id)
"idx_city_country_id" btree (country_id) CLUSTER
Foreign-key constraints:
"city_state_id_fk" FOREIGN KEY (state_id) REFERENCES geo.state(state_id)
ON UPDATE CASCADE ON DELETE CASCADE

explain analyze
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
(select country_id FROM geo.city)
;

-- won't complete in a reasonable amount of time.

This one won't use the country_id index. The two tables have almost the same
number of rows:

cmi=# select count(*) from geo.world_city;
count
---------
1953314
(1 row)

cmi=# select count(*) from geo.city;
count
---------
2122712
(1 row)

I tried to force it and didn't see any improvement. I've vacuummed,
analyzed, clustered. Can someone help me to get only the countries who have
cities in the city table in a reasonable amount of time?

-------------------------------------------------------

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Ralph Mason 2007-05-21 05:17:42 Re: Ever Increasing IOWAIT
Previous Message Tom Lane 2007-05-20 21:55:50 Re: Ever Increasing IOWAIT