Rewriting DISTINCT and losing performance

Lists: pgsql-performance
From: "Chuck D(dot)" <pgsql-performance(at)nullmx(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Rewriting DISTINCT and losing performance
Date: 2007-05-21 04:28:30
Message-ID: 200705202228.32129.pgsql-performance@nullmx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hi all,

I know we've covered this before but I'm having trouble with it today.

I have some geographic data in tables that I'm working with. I have a
country, state and city table. I was selecting the country_name out of the
country table but discovered that some countries (like Antarctica) didn't
have cities in the city table.

I resolved to query the country table for only country_name's which had
country_id's in the city table - meaning the country had cities listed.

The problem was I had a couple different sources (in separate tables) with
some extraneous column data so I chose to consolidate the city tables from
the different sources and column data that I don't need because I don't have
the hardware to support it.

That was the end of my query time.

Here's the original table and query:

# \d geo.world_city
Table "geo.world_city"
Column | Type | Modifiers
------------+------------------------+-----------
city_id | integer | not null
state_id | smallint |
country_id | smallint |
rc | smallint |
latitude | numeric(9,7) |
longitude | numeric(10,7) |
dsg | character(5) |
cc1 | character(2) |
adm1 | character(2) |
city_name | character varying(200) |
Indexes:
"world_city_pk" PRIMARY KEY, btree (city_id)
"idx_world_city_cc1" btree (cc1)
"idx_world_city_cc1_adm1" btree (cc1, adm1)
"idx_world_city_country_id" btree (country_id)
"idx_world_city_name_first_letter" btree
(state_id, "substring"(lower(city_name::text), 1, 1))
"idx_world_city_state_id" btree (state_id)

explain analyze
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
(select country_id FROM geo.world_city)
;

QUERY
PLAN
-----------------------------------------------------------------------------
--------------------------------------------------------------------------
Nested Loop IN Join (cost=0.00..167.97 rows=155 width=15) (actual
time=85.502..3479.449 rows=231 loops=1)
-> Seq Scan on country (cost=0.00..6.44 rows=244 width=15) (actual
time=0.089..0.658 rows=244 loops=1)
-> Index Scan using idx_world_city_country_id on world_city
(cost=0.00..8185.05 rows=12602 width=2) (actual time=14.250..14.250 rows=1
loops=244)
Index Cond: (country.country_id = world_city.country_id)
Total runtime: 3479.921 ms

Odd that it took 3 seconds because every previous run has been much quicker.
The next run was:

QUERY
PLAN
-----------------------------------------------------------------------------
------------------------------------------------------------------------
Nested Loop IN Join (cost=0.00..167.97 rows=155 width=15) (actual
time=0.087..6.967 rows=231 loops=1)
-> Seq Scan on country (cost=0.00..6.44 rows=244 width=15) (actual
time=0.028..0.158 rows=244 loops=1)
-> Index Scan using idx_world_city_country_id on world_city
(cost=0.00..8185.05 rows=12602 width=2) (actual time=0.026..0.026 rows=1
loops=244)
Index Cond: (country.country_id = world_city.country_id)
Total runtime: 7.132 ms
(5 rows)

But that was irrelevant. I created a new table and eliminated the data and
it looks like this:

# \d geo.city
Table "geo.city"
Column | Type | Modifiers
------------+------------------------+-----------
city_id | integer | not null
state_id | smallint |
country_id | smallint |
latitude | numeric(9,7) |
longitude | numeric(10,7) |
city_name | character varying(100) |
Indexes:
"city_pk" PRIMARY KEY, btree (city_id)
"idx_city_country_id" btree (country_id) CLUSTER
Foreign-key constraints:
"city_state_id_fk" FOREIGN KEY (state_id) REFERENCES geo.state(state_id)
ON UPDATE CASCADE ON DELETE CASCADE

explain analyze
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
(select country_id FROM geo.city)
;

-- won't complete in a reasonable amount of time.

This one won't use the country_id index. The two tables have almost the same
number of rows:

cmi=# select count(*) from geo.world_city;
count
---------
1953314
(1 row)

cmi=# select count(*) from geo.city;
count
---------
2122712
(1 row)

I tried to force it and didn't see any improvement. I've vacuummed,
analyzed, clustered. Can someone help me to get only the countries who have
cities in the city table in a reasonable amount of time?

-------------------------------------------------------


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-performance(at)nullmx(dot)com
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Rewriting DISTINCT and losing performance
Date: 2007-05-21 09:14:14
Message-ID: 465162E6.4030205@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Chuck,

> explain analyze
> SELECT country_id, country_name
> FROM geo.country
> WHERE country_id IN
> (select country_id FROM geo.city)
> ;
>
> -- won't complete in a reasonable amount of time.

Can we see the plan?

--Josh


From: Richard Huxton <dev(at)archonet(dot)com>
To: pgsql-performance(at)nullmx(dot)com
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Rewriting DISTINCT and losing performance
Date: 2007-05-21 11:40:21
Message-ID: 46518525.9010407@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Chuck D. wrote:
> Table "geo.city"
> Column | Type | Modifiers
> ------------+------------------------+-----------
> city_id | integer | not null
> state_id | smallint |
> country_id | smallint |
> latitude | numeric(9,7) |
> longitude | numeric(10,7) |
> city_name | character varying(100) |
> Indexes:
> "city_pk" PRIMARY KEY, btree (city_id)
> "idx_city_country_id" btree (country_id) CLUSTER
> Foreign-key constraints:
> "city_state_id_fk" FOREIGN KEY (state_id) REFERENCES geo.state(state_id)
> ON UPDATE CASCADE ON DELETE CASCADE

Any good reason why country_id is NULLable?

--
Richard Huxton
Archonet Ltd


From: "Chuck D(dot)" <pgsql-performance(at)nullmx(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Rewriting DISTINCT and losing performance
Date: 2007-05-21 15:09:43
Message-ID: 200705210909.44335.pgsql-performance@nullmx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Monday 21 May 2007 03:14, Josh Berkus wrote:
> Chuck,
>
> Can we see the plan?
>
> --Josh
>

Sorry Josh, I guess I could have just used EXPLAIN instead of EXPLAIN
ANALYZE.

# explain
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
(select country_id FROM geo.city)
;
QUERY PLAN
--------------------------------------------------------------------
Nested Loop IN Join (cost=0.00..1252.60 rows=155 width=15)
Join Filter: (country.country_id = city.country_id)
-> Seq Scan on country (cost=0.00..6.44 rows=244 width=15)
-> Seq Scan on city (cost=0.00..43409.12 rows=2122712 width=2)
(4 rows)

Versus the same query using the older, larger world_city table:

# explain
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
(select country_id FROM geo.world_city)
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Nested Loop IN Join (cost=0.00..23.16 rows=155 width=15)
-> Seq Scan on country (cost=0.00..6.44 rows=244 width=15)
-> Index Scan using idx_world_city_country_id on world_city
(cost=0.00..706.24 rows=12602 width=2)
Index Cond: (country.country_id = world_city.country_id)
(4 rows)


From: "Chuck D(dot)" <pgsql-performance(at)nullmx(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Rewriting DISTINCT and losing performance
Date: 2007-05-21 17:13:23
Message-ID: 200705211113.24689.pgsql-performance@nullmx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Monday 21 May 2007 05:40, Richard Huxton wrote:
> Chuck D. wrote:
>
> Any good reason why country_id is NULLable?

It has been a while since I imported the data so it took some time to examine
it but here is what I found.

In the original data, some cities do not have coutries. Strange huh? Most
were in the Gaza Strip, No Man's Land or disputed territory where several
countries claimed ownership. This is according to USGS and the board of
names.

Recognizing that this did me no good in my application I decided to repair
that data so that country_id could have a NOT NULL modifier.


From: Richard Huxton <dev(at)archonet(dot)com>
To: pgsql-performance(at)nullmx(dot)com
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Rewriting DISTINCT and losing performance
Date: 2007-05-21 17:34:11
Message-ID: 4651D813.5050100@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Chuck D. wrote:
> On Monday 21 May 2007 03:14, Josh Berkus wrote:
>> Chuck,
>>
>> Can we see the plan?
>>
>> --Josh
>>
>
> Sorry Josh, I guess I could have just used EXPLAIN instead of EXPLAIN
> ANALYZE.
>
> # explain
> SELECT country_id, country_name
> FROM geo.country
> WHERE country_id IN
> (select country_id FROM geo.city)
> ;
> QUERY PLAN
> --------------------------------------------------------------------
> Nested Loop IN Join (cost=0.00..1252.60 rows=155 width=15)
> Join Filter: (country.country_id = city.country_id)
> -> Seq Scan on country (cost=0.00..6.44 rows=244 width=15)
> -> Seq Scan on city (cost=0.00..43409.12 rows=2122712 width=2)

The only thing I can think of is that the CLUSTERing on city.country_id
makes the system think it'll be cheaper to seq-scan the whole table.

I take it you have got 2 million rows in "city"?
--
Richard Huxton
Archonet Ltd


From: "Chuck D(dot)" <pgsql-performance(at)nullmx(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Rewriting DISTINCT and losing performance
Date: 2007-05-21 18:17:44
Message-ID: 200705211217.45205.pgsql-performance@nullmx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Monday 21 May 2007 11:34, Richard Huxton wrote:
> Chuck D. wrote:
>
> The only thing I can think of is that the CLUSTERing on city.country_id
> makes the system think it'll be cheaper to seq-scan the whole table.
>
> I take it you have got 2 million rows in "city"?

Well here is where it gets strange. The CLUSTER was just one thing I tried to
do to enhance the performance. I had the same result prior to cluster.

However, after updating that country_id column to NOT NULL and eliminating
NULL values it will use the country_id index and perform quickly. Oddly
enough, the original table, world_city still has NULL values in the
country_id column and it has always used the country_id index.

Doesn't that seem a bit strange? Does it have to do with the smaller size of
the new table maybe?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-performance(at)nullmx(dot)com
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Rewriting DISTINCT and losing performance
Date: 2007-05-21 21:32:05
Message-ID: 29045.1179783125@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Chuck D." <pgsql-performance(at)nullmx(dot)com> writes:
> Doesn't that seem a bit strange? Does it have to do with the smaller size of
> the new table maybe?

No, it seems to be a planner bug:
http://archives.postgresql.org/pgsql-hackers/2007-05/msg00920.php

I imagine that your table statistics are close to the critical point
where a bitmap scan looks cheaper or more expensive than a plain index
scan, and so the chosen plan varies depending on more-or-less chance
factors. Certainly getting rid of NULLs shouldn't have had any direct
impact on this choice.

regards, tom lane