optimizing daily data storage in Pg

From: P Kishor <punk(dot)kish(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: optimizing daily data storage in Pg
Date: 2010-07-22 14:41:12
Message-ID: AANLkTimfy5zntg4mJbK184AT-bpBSoqovtI8BvNRFQ17@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I have been struggling with this for a while now, have even gone down
a few paths but struck out, so I turn now to the community for ideas.
First, the problem: Store six daily variables for ~ 25 years for cells
in a grid.

* Number of vars = 6
* Number of cells ~ 13 million
* Number of days ~ 9125 (25 * 365)

Optimize the store for two different kinds of queries:

Query one: Retrieve the value of a single var for all or a portion of
the cells for a single day. This is analogous to an image where every
pixel is the value of a single var.

Query two: Retrieve values for all the days or a duration of days for
a single var for a single cell. This is like grabbing a column out of
a table in which each row holds all the vars for a single day.

So, I set about designing the db. The "grid" is in a table with 13 million rows

CREATE TABLE cells (
cell_id INTEGER,
other_data ..
)
WITH (
OIDS=FALSE
)

A single table *where every row is one day's values for one cell* looks like so

CREATE TABLE d (
yr SMALLINT,
yday SMALLINT,
a SMALLINT,
b SMALLINT,
d SMALLINT,
e SMALLINT,
f SMALLINT,
g SMALLINT,
cell_id INTEGER
)
WITH (
OIDS=FALSE
)

The data would look like so

yr yday a b c d e f g cell_id
----------------------------------------------------
1980 1 x x x x x x x 1
..
1980 365 x x x x x x x 1
...
1981 1 x x x x x x x 1
..
1981 365 x x x x x x x 1
...
...
2005 1 x x x x x x x 1
..
2005 365 x x x x x x x 1
......
1980 1 x x x x x x x 2
..
1980 365 x x x x x x x 2
...

I could now (theoretically) conduct my queries like so:

Query 1a: Retrieve the value of a single var for all the cells for a
single day. This is analogous to an image where every pixel is the
value of a single var.

SELECT <var> FROM d WHERE yr = ? AND yday = ?;

I assuming I would need an index on yr and yday, or perhaps even a
compound index on (yr, yday).

Query 1b: Retrieve the value of a single var for a portion of the
cells for a single day. This is analogous to an image where every
pixel is the value of a single var.

SELECT <var> FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);

I assuming I would need an index on yr and yday, or perhaps even a
compound index on (yr, yday) AND an index on cell_id.

Query 2: Retrieve values for all the days or a duration of days for a
single var for a single cell. This is like grabbing a column out of a
table in which each row holds all the vars for a single day.

SELECT <var> FROM d WHERE cell_id = ?;
SELECT <var> FROM d WHERE cell_id IN (?,?,?...);

Once again, an index on cell_id would assist in the above.

The problem: The above table would have 13 M * 9125 rows ~ 118 billion
rows. Huge indexes, slow queries, etc. In fact, major issues loading
the data in the first place. Since I am loading data in batches, I
drop the indexes (takes time), COPY data into the table (takes time),
build the indexes (takes time), experiment with improving the
performance (takes time), fail, rinse, lather, repeat. I actually
tried the above with a subset of data (around 100 M rows) and
experienced all of the above. I don't remember the query times, but
they were awful.

So, I partitioned the table into years like so

CREATE TABLE d_<yyyy> (
CHECK ( yr = <yyyy> )
) INHERITS (d)

Hmmm... still no satisfaction. I ended up with 1 master table + 25
inherited tables. Each of the year tables now had ~ 4.75 billion rows
(13 M * 365), and the queries were still very slow.

So, I partitioned it all by years and days like so

CREATE TABLE d_<yyyy>_<yday> (
CHECK ( yr = <yyyy> AND yday = <yday> )
) INHERITS (d)

Each table now has 13 million rows, and is reasonably fast (although
still not satisfactorily fast), but now I have 9K tables. That has its
own problems. I can't query the master table anymore as Pg tries to
lock all the tables and runs out of memory. Additionally, I can't
anymore conduct query two above. I could do something like

SELECT a FROM d_1980_1 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_2 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_3 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_4 WHERE cell_id = 1
UNION
...

But the above is hardly optimal.

Any suggestions, ideas, brainstorms would be appreciated. Perhaps Pg,
or even a RDBMS, is not the right tool for this problem, in which
case, suggestion for alternatives would be welcome as well.

Right now I am testing this on a dual Xeon dual core 3 GHz Xserve with
12 GB RAM. The PGDATA directory is located on an attached RAID that is
configured as RAID5. Reasonable time for a query would be under 500
ms, although ultimately I would love to have the query be done under
250 ms, perhaps with RAID10, and a machine with more RAM. I have
access to a machine with dual Xeon quad core 3 GHz Xserve with 32 GB
RAM, and an internal RAID, but before I try to move the data, I want
to actually conclusively prove that Pg is the best solution (or not).
Perhaps flat files are better, perhaps blobs are better, or perhaps
Pg's array column type. Any ideas/suggestions welcome.

Oh, one more thing. Once the data are loaded and everything is
working, the data are readonly. They are historical, so they don't
need to be changed.

--
Puneet Kishor

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Matthew Wilson 2010-07-22 14:45:45 How to improve performance in reporting database?
Previous Message Peter Hunsberger 2010-07-22 14:35:08 Re: Bitmask trickiness