From: Thom Brown <thombrown(at)gmail(dot)com>
To: David Jarvis <thangalin(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Optimize date query for large child tables: GiST or GIN?
Date: 2010-05-20 08:33:15
Message-ID: AANLkTilud7Qc25XZDziBMfzRnn8XCaNCo2jKHtW2-vdF@mail.gmail.com
Lists: pgsql-performance
On 20 May 2010 06:06, David Jarvis <thangalin(at)gmail(dot)com> wrote:
> Hi,
>
> I recently switched from MySQL to PostgreSQL so that I can use PL/R for data
> analysis. The query in its MySQL form (against a more complex table
> structure) takes ~5 seconds to run. I have yet to let the equivalent
> PostgreSQL query finish, as it runs for over a minute. I believe the table
> structure is correct (it is much simpler than the former MySQL structure),
> yet the query performs a full table scan against the parent table's 273
> million rows.
>
> Questions
>
> What is the proper way to index the dates to avoid full table scans?
>
> Options I have considered:
>
> GIN
> GiST
> Rewrite the WHERE clause
> Adding separate year_taken, month_taken, and day_taken columns to the tables
>
> Details
>
> The HashAggregate in the plan shows a cost of 10006220141.11, which is, I
> suspect, astronomically huge. A full table scan is performed on the
> measurement table (which itself holds neither data nor indexes). The table
> aggregates 237 million rows from its child tables. The sluggishness comes
> from this part of the query:
>
> m.taken BETWEEN
> /* Start date. */
> (extract( YEAR FROM m.taken )||'-01-01')::date AND
> /* End date. Calculated by checking to see if the end date wraps
> into the next year. If it does, then add 1 to the current year.
> */
> (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
> sign(
> (extract( YEAR FROM m.taken )||'-12-31')::date -
> (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
> ) AS text)||'-12-31')::date
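As an aside on the expression above: for any value of `taken`, Dec 31 is after Jan 1 of the same year, so the `sign()` term is always positive and the `greatest(..., 0)` clamp always yields 0; the bounds therefore reduce to January 1 and December 31 of `taken`'s own year. A minimal sketch of the same bounds computed with `date_trunc()`, which at least makes that intent plain (the sample date is illustrative only):

```sql
-- Sketch: for any value of d these evaluate to Jan 1 and Dec 31 of
-- d's year, matching the string-concatenation expression above.
SELECT date_trunc('year', d)::date                               AS start_date,
       (date_trunc('year', d) + interval '1 year - 1 day')::date AS end_date
FROM (VALUES (date '2010-05-20')) AS t(d);
```

Note that either formulation wraps `m.taken` in a function, so the predicate cannot use a plain index on the column.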
>
> There are 72 child tables, each having a year index and a station index,
> which are defined as follows:
>
> CREATE TABLE climate.measurement_12_013 (
>   -- Inherited from table climate.measurement:
>   id bigint NOT NULL DEFAULT nextval('climate.measurement_id_seq'::regclass),
>   station_id integer NOT NULL,
>   taken date NOT NULL,
>   amount numeric(8,2) NOT NULL,
>   category_id smallint NOT NULL,
>   flag character varying(1) NOT NULL DEFAULT ' '::character varying,
>   -- End of inherited columns.
>   CONSTRAINT measurement_12_013_category_id_check CHECK (category_id = 7),
>   CONSTRAINT measurement_12_013_taken_check
>     CHECK (date_part('month'::text, taken)::integer = 12)
> )
> INHERITS (climate.measurement);
>
> CREATE INDEX measurement_12_013_s_idx
> ON climate.measurement_12_013
> USING btree
> (station_id);
> CREATE INDEX measurement_12_013_y_idx
> ON climate.measurement_12_013
> USING btree
> (date_part('year'::text, taken));
>
> (Foreign key constraints to be added later.)
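Since the per-child index (and the CHECK constraint) is defined on `date_part('year', taken)` rather than on `taken` itself, a direct range predicate on `taken` can use neither. One alternative, sketched here against the poster's schema but untested, is to index the raw date column:

```sql
-- Sketch: a plain btree index on the date column itself, so predicates
-- of the form "taken >= ... AND taken < ..." can use an index scan.
-- CHECK constraints written directly on taken (rather than on
-- date_part()) would likewise let constraint exclusion skip child
-- tables outside the requested range.
CREATE INDEX measurement_12_013_t_idx
    ON climate.measurement_12_013
    USING btree (taken);
```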
>
> The following query runs abysmally slow due to a full table scan:
>
> SELECT
> count(1) AS measurements,
> avg(m.amount) AS amount
> FROM
> climate.measurement m
> WHERE
> m.station_id IN (
> SELECT
> s.id
> FROM
> climate.station s,
> climate.city c
> WHERE
> /* For one city... */
> c.id = 5182 AND
>
> /* Where stations are within an elevation range... */
> s.elevation BETWEEN 0 AND 3000 AND
>
> /* and within a specific radius... */
> 6371.009 * SQRT(
>   POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
>   (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
>    POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
> ) <= 50
> ) AND
>
> /* Data before 1900 is shaky; insufficient after 2009. */
> extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND
>
> /* Whittled down by category... */
> m.category_id = 1 AND
>
> /* Between the selected days and years... */
> m.taken BETWEEN
> /* Start date. */
> (extract( YEAR FROM m.taken )||'-01-01')::date AND
> /* End date. Calculated by checking to see if the end date wraps
> into the next year. If it does, then add 1 to the current year.
> */
> (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
> sign(
> (extract( YEAR FROM m.taken )||'-12-31')::date -
> (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
> ) AS text)||'-12-31')::date
> GROUP BY
> extract( YEAR FROM m.taken )
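Every date predicate in the query above wraps `m.taken` in `extract()` or a cast, so none of them are sargable. A sketch of the date portion rewritten as a direct range on the column, with the station subquery omitted for brevity (assuming the intent is whole calendar years from 1900 through 2009):

```sql
-- Sketch: the same year window, expressed so an index on taken is
-- usable and child tables outside the range can be excluded.
SELECT count(1)      AS measurements,
       avg(m.amount) AS amount
FROM   climate.measurement m
WHERE  m.category_id = 1
AND    m.taken >= date '1900-01-01'  -- inclusive lower bound
AND    m.taken <  date '2010-01-01'  -- exclusive upper bound
GROUP  BY extract(year FROM m.taken);
```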
>
> What are your thoughts?
>
> Thank you!
>
>
Could you provide the EXPLAIN output for that slow query?
Thom
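For reference, a sketch of how such output is gathered; note that EXPLAIN ANALYZE actually executes the statement and reports per-node row counts and timings, so expect it to run as long as the query itself:

```sql
-- Sketch with an abbreviated predicate for illustration; plain EXPLAIN
-- (without ANALYZE) shows only the planner's estimates.
EXPLAIN ANALYZE
SELECT count(1) AS measurements, avg(m.amount) AS amount
FROM   climate.measurement m
WHERE  m.category_id = 1
GROUP  BY extract(year FROM m.taken);
```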
Next message: Matthew Wakeling, 2010-05-20 13:03:11, "Re: Optimize date query for large child tables: GiST or GIN?"
Previous message: Yeb Havinga, 2010-05-20 08:20:42, "Re: Optimize date query for large child tables: GiST or GIN?"