10 TB database

From: Artur <a_wronski(at)gazeta(dot)pl>
To: pgsql-general(at)postgresql(dot)org
Subject: 10 TB database
Date: 2009-06-15 12:00:05
Message-ID: 4A3637C5.6030702@gazeta.pl
Lists: pgsql-general

Hi!

We are thinking of creating a stocks-related search engine.
It is an experimental project, just for fun.

The problem is that we expect to have more than 250 GB of data every month.
This data would be in two tables, with about 50,000,000 new rows every month.

We want to have access to all the data, mostly for generating user-requested
reports (aggregation).
We would have about 10 TB of data in three years.

Do you think it is possible to build this with PostgreSQL, and do you have
any idea how to start? :)

Thanks in advance,
Artur


From: Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>
To: Artur <a_wronski(at)gazeta(dot)pl>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: 10 TB database
Date: 2009-06-15 12:29:48
Message-ID: 2f4958ff0906150529pd119314v84d06704288908e0@mail.gmail.com
Lists: pgsql-general

On Mon, Jun 15, 2009 at 1:00 PM, Artur<a_wronski(at)gazeta(dot)pl> wrote:
> Hi!
>
> We are thinking to create some stocks related search engine.
> It is experimental project just for fun.
>
> The problem is that we expect to have more than 250 GB of data every month.
> This data would be in two tables. About 50.000.000 new rows every month.

Well, obviously you need to decrease its size by doing some
normalization then.
If some information is repeated across the table, stick it into a separate
table and assign an id to it.

If you can send me a sample of that data, I could tell you where to cut the size.
I have databases that big under my wing, and that's where
normalization starts to make sense, to save space (and hence speed
things up).
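
For example, something along these lines (the table and column names are
just made up here, since I haven't seen your data):

    -- Unnormalized, the symbol string would be repeated on every row;
    -- normalized, each symbol is stored once and referenced by id.
    CREATE TABLE symbols (
        symbol_id serial PRIMARY KEY,
        symbol    text NOT NULL UNIQUE
    );

    CREATE TABLE ticks (
        symbol_id integer NOT NULL REFERENCES symbols(symbol_id),
        ts        timestamp NOT NULL,
        price     numeric(12,4) NOT NULL,
        volume    bigint
    );

    -- A 4-byte integer instead of a repeated text value adds up quickly
    -- over 50,000,000 rows a month.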

> We want to have access to all the date mostly for generating user requesting
> reports (aggregating).
> We would have about 10TB of data in three years.

For that sort of database you will need partitioning for sure.

Write to me, I can help privately, maybe for a small tribute ;)

--
GJ


From: Whit Armstrong <armstrong(dot)whit(at)gmail(dot)com>
To: Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>
Cc: Artur <a_wronski(at)gazeta(dot)pl>, pgsql-general(at)postgresql(dot)org
Subject: Re: 10 TB database
Date: 2009-06-15 13:01:26
Message-ID: 8ec76080906150601k3313b06sbd1285a09a3d1e89@mail.gmail.com
Lists: pgsql-general

I have a 300GB database, and I would like to look at partitioning as a
possible way to speed it up a bit.

I see the partitioning examples from the documentation:
http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html

Is anyone aware of additional examples or tutorials on partitioning?

Thanks,
Whit

2009/6/15 Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>:
> On Mon, Jun 15, 2009 at 1:00 PM, Artur<a_wronski(at)gazeta(dot)pl> wrote:
>> Hi!
>>
>> We are thinking to create some stocks related search engine.
>> It is experimental project just for fun.
>>
>> The problem is that we expect to have more than 250 GB of data every month.
>> This data would be in two tables. About 50.000.000 new rows every month.
>
> Well, obviously you need to decrease size of it, by doing some
> normalization than.
> If some information is the same across table, stick it into separate
> table, and assign id to it.
>
> If you can send me sample of that data, I could tell you where to cut size.
> I have that big databases under my wings, and that's where
> normalization starts to make sens, to save space (and hence speed
> things up).
>
>> We want to have access to all the date mostly for generating user requesting
>> reports (aggregating).
>> We would have about 10TB of data in three years.
>
> For that sort of database you will need partitioning for sure.
>
>
> Write to me, I can help privately, maybe for a small tribute ;)
>
> --
> GJ
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Artur <a_wronski(at)gazeta(dot)pl>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: 10 TB database
Date: 2009-06-15 17:55:24
Message-ID: 20090615175524.GF7285@alvh.no-ip.org
Lists: pgsql-general

Artur wrote:
> Hi!
>
> We are thinking to create some stocks related search engine.
> It is experimental project just for fun.
>
> The problem is that we expect to have more than 250 GB of data every month.
> This data would be in two tables. About 50.000.000 new rows every month.

Sounds a bit like what Truviso does ...

--
Alvaro Herrera


From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Artur" <a_wronski(at)gazeta(dot)pl>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: 10 TB database
Date: 2009-06-15 18:51:09
Message-ID: D425483C2C5C9F49B5B7A41F89441547029622A7@postal.corporate.connx.com
Lists: pgsql-general

> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org [mailto:pgsql-general-
> owner(at)postgresql(dot)org] On Behalf Of Artur
> Sent: Monday, June 15, 2009 5:00 AM
> To: pgsql-general(at)postgresql(dot)org
> Subject: [GENERAL] 10 TB database
>
> Hi!
>
> We are thinking to create some stocks related search engine.
> It is experimental project just for fun.
>
> The problem is that we expect to have more than 250 GB of data every
> month.
> This data would be in two tables. About 50.000.000 new rows every
> month.
>
> We want to have access to all the date mostly for generating user
> requesting reports (aggregating).
> We would have about 10TB of data in three years.
>
> Do you think is it possible to build this with postgresql and have any
> idea how to start? :)

Consider summarizing this data into a data warehouse.
Most of the data will be historical, and therefore the vast majority of
it will be read-mostly (with the rare write operations probably
consisting mostly of corrections).
You won't want to scan the whole 10TB every time you make a
summarization query.

I have an idea that might make an interesting experiment (a rough sketch
follows below):
Create one table per year/month combination.
Create views that combine all 12 monthly tables into one yearly view.
Create a view that combines all the yearly views into one global view.
The reason I think this suggestion may have some merit is that the
historical trends will not need to be recalculated on a daily basis (though
it would be nice if you could perform calculations against the whole
pile at will on rare occasions). Maintaining separate tables by
month will also reduce the average depth of the b-trees. I would guess that
for the most part, the active calculations will only be against recent
data (e.g. the past 6 months to one year or so). It could also be
interesting to create a view that combines the N most recent months of
data, where N is supplied on the fly (I do not know how difficult it
would be to create this view, or even if it is possible).
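
To make that concrete, here is a rough sketch assuming a simple
price-history table (all names here are invented for illustration):

    -- One table per month; the CHECK constraint documents the date range
    -- and can let the planner skip months a query cannot touch when
    -- constraint_exclusion is enabled.
    CREATE TABLE prices_2009_06 (
        symbol text NOT NULL,
        ts     timestamp NOT NULL,
        price  numeric(12,4) NOT NULL,
        CHECK (ts >= '2009-06-01' AND ts < '2009-07-01')
    );
    -- prices_2009_07, prices_2009_08, ... are defined the same way,
    -- each with its own date range.

    -- A yearly view is just the UNION ALL of its monthly tables.
    CREATE VIEW prices_2009 AS
        SELECT * FROM prices_2009_06
        UNION ALL
        SELECT * FROM prices_2009_07;
        -- ... UNION ALL the remaining months of 2009.

    -- The global view combines the yearly views.
    CREATE VIEW prices_all AS
        SELECT * FROM prices_2009;
        -- ... UNION ALL prices_2010, prices_2011, and so on.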

If you are going to collect a terrific volume of data like this, I
suggest coupling a mathematics package with the data, such as SAS, R,
Octave, SciLab, or Maxima, so that you can effectively support decisions
derived from the data.

You are also going to need high-end hardware to support a database like
this. Just some ideas you might like to test when you start fooling
around with this data.

IMO-YMMV


From: Michelle Konzack <linux4michelle(at)tamay-dogan(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: 10 TB database
Date: 2009-06-16 08:20:16
Message-ID: 20090616082016.GE17083@tamay-dogan.net
Lists: pgsql-general

Hi Artur,

I am the owner of a database about war, war crimes and terrorism with more
than 1.6 TByte, and I am already fscked...

Am 2009-06-15 14:00:05, schrieb Artur:
> Hi!
>
> We are thinking to create some stocks related search engine.
> It is experimental project just for fun.
>
> The problem is that we expect to have more than 250 GB of data every month.

I have only 500 MByte per month...

> This data would be in two tables. About 50.000.000 new rows every month.

around 123,000 new rows per month

> We want to have access to all the date mostly for generating user
> requesting reports (aggregating).
> We would have about 10TB of data in three years.
>
> Do you think is it possible to build this with postgresql and have any
> idea how to start? :)

You have to use a physical cluster like me. Searches in a database of
more than 1 TByte, even when using "tablespace" and "table partitioning",
run into performance issues...

I have now split my database into chunks of 250 GByte using a cluster of
1U servers from Sun Microsystems. Currently I run 8 servers with one
proxy. Each server cost me 2,300 Euro.

Note: On Friday I have a meeting with a Sun partner in
      Germany about a bigger project... where I have
      to increase the performance of my database servers.
      I have to plan for 150,000 customers.

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
Tamay Dogan Network
Debian GNU/Linux Consultant

--
Linux-User #280138 with the Linux Counter, http://counter.li.org/
##################### Debian GNU/Linux Consultant #####################
<http://www.tamay-dogan.net/> Michelle Konzack
<http://www.can4linux.org/> c/o Vertriebsp. KabelBW
<http://www.flexray4linux.org/> Blumenstrasse 2
Jabber linux4michelle(at)jabber(dot)ccc(dot)de 77694 Kehl/Germany
IRC #Debian (irc.icq.com) Tel. DE: +49 177 9351947
ICQ #328449886 Tel. FR: +33 6 61925193


From: Pedro Doria Meunier <pdoria(at)netmadeira(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: 10 TB database
Date: 2009-06-16 08:54:49
Message-ID: 4A375DD9.9070101@netmadeira.com
Lists: pgsql-general


Hello Artur,

We have a database with one table growing by ~1.5M rows each month.
The overall growth for the db is ~1 GB/month.
PostgreSQL 8.2.9 on x86_64 - a very modest Dell R200 with 4 GB of RAM.

Although there was poor planning in the beginning (i.e. no clustering,
no partitioning...) - we weren't expecting the boom :] - that
particular server runs like clockwork, handling hundreds of queries per
minute without any noticeable speed loss.

We're, of course, planning for load balancing at the beginning of next
year ... :)

BR,

Pedro Doria Meunier
GSM: +351 96 17 20 188
Skype: pdoriam

Michelle Konzack wrote:
> Hi Artur,
>
> I am owner of a database about War, Worcrime and Terroism with more
> then 1,6 TByte and I am already fscked...
>
> Am 2009-06-15 14:00:05, schrieb Artur:
>> Hi!
>>
>> We are thinking to create some stocks related search engine. It
>> is experimental project just for fun.
>>
>> The problem is that we expect to have more than 250 GB of data
>> every month.
>
> I have only 500 MByte per month...
>
>> This data would be in two tables. About 50.000.000 new rows every
>> month.
>
> arround 123.000 new rows per month
>
>> We want to have access to all the date mostly for generating user
>> requesting reports (aggregating). We would have about 10TB of
>> data in three years.
>>
>> Do you think is it possible to build this with postgresql and
>> have any idea how to start? :)
>
> You have to use a physical cluster like me. Searches in a
> Database of more then 1 TByte even under using "tablespace" and
> "tablepartitioning" let you run into performance issues...
>
> I have now splited my Database in chunks of 250 GByte using a
> Cluster of 1U Servers from Sun Microsystems. Currently I run 8
> servers with one proxy. Each server cost me 2.300 Euro.
>
> Note: On Friday I have a meeting with a Sun Partner in Germany
> because a bigger project... where I have to increase the
> performance of my database servers. I have to calculate with
> 150.000 customers.
>
> Thanks, Greetings and nice Day/Evening Michelle Konzack
> Systemadministrator Tamay Dogan Network Debian GNU/Linux Consultant
>
>


From: "Todd Lieberman" <tlieberman(at)marchex(dot)com>
To: <pgsql-general(at)postgresql(dot)org>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: 10 TB database
Date: 2009-06-16 15:24:07
Message-ID: 9B4191CF8F73B04BB4F437E1F7F86ED512B5F5DC@exchbe1sea.windows.marchex.com
Lists: pgsql-general

> The problem is that we expect to have more than 250 GB of data every month.

Sounds like Teradata or Netezza territory.


From: Martin Gainty <mgainty(at)hotmail(dot)com>
To: <tlieberman(at)marchex(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: 10 TB database
Date: 2009-06-16 15:34:21
Message-ID: BLU142-W2463502046D8814ABD1A87AE3F0@phx.gbl
Lists: pgsql-general


I would suggest Oracle 11g for a DB of 10TB or greater:
http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html

Martin Gainty
______________________________________________
Disclaimer and confidentiality note

This message is confidential. If you are not the intended recipient, we kindly ask you to notify the sender. Any unauthorized forwarding or copying is not permitted. This message serves only for the exchange of information and has no legally binding effect. Because e-mail can easily be manipulated, we cannot accept any liability for its content.

Subject: Re: [GENERAL] 10 TB database
Date: Tue, 16 Jun 2009 08:24:07 -0700
From: tlieberman(at)marchex(dot)com
To: pgsql-general(at)postgresql(dot)org; pgsql-general(at)postgresql(dot)org


> The problem is that we expect to have more than 250 GB of data every month.

Sounds like Teradata or Netezza territory.



From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Artur <a_wronski(at)gazeta(dot)pl>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: 10 TB database
Date: 2009-06-16 16:13:20
Message-ID: alpine.GSO.2.01.0906161142140.17014@westnet.com
Lists: pgsql-general

On Mon, 15 Jun 2009, Artur wrote:

> We are thinking to create some stocks related search engine.
> It is experimental project just for fun.

I hope your idea of fun involves spending a bunch of money on hardware and
endless tweaking to get data loading every day with appropriate
corrections, because that's just the first round of "fun" on a job like
this.

> The problem is that we expect to have more than 250 GB of data every
> month. This data would be in two tables. About 50.000.000 new rows every
> month. We want to have access to all the date mostly for generating user
> requesting reports (aggregating). We would have about 10TB of data in
> three years. Do you think is it possible to build this with postgresql
> and have any idea how to start? :)

You start by figuring out what sort of business model is going to justify
this very expensive adventure in today's market where buyers of financial
market products are pretty rare, but that's probably not the question you
wanted an answer to.

You can certainly build a server capable of handling this job with
PostgreSQL here in 2009. Get 8 cores, a stack of 24 1TB disks and a RAID
card with a write cache, and you'll have a big enough system to handle the
job. Basic database design isn't too terribly difficult either. Stock
data is trivial to partition up into tiny pieces at the database level
(each day can be its own 250GB partition), and any system capable of
holding that much data is going to have a giant stack of drives spreading
out the disk I/O too.
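
In rough outline, with the inheritance-style partitioning available in
8.3, that looks something like this (the tickdata layout here is made up
for illustration):

    -- Parent table that queries and reports go against.
    CREATE TABLE tickdata (
        symbol text NOT NULL,
        ts     timestamp NOT NULL,
        price  numeric(12,4) NOT NULL
    );

    -- One child table per trading day; the CHECK constraint lets
    -- constraint exclusion skip partitions a query cannot touch.
    CREATE TABLE tickdata_20090615 (
        CHECK (ts >= '2009-06-15' AND ts < '2009-06-16')
    ) INHERITS (tickdata);

    CREATE INDEX tickdata_20090615_ts_idx ON tickdata_20090615 (ts);

    -- Needs constraint_exclusion = on (postgresql.conf or per session).
    SET constraint_exclusion = on;

    -- This only scans the one matching child table.
    SELECT symbol, avg(price)
    FROM tickdata
    WHERE ts >= '2009-06-15' AND ts < '2009-06-16'
    GROUP BY symbol;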

The first level of problems you'll run into are how to keep up with
loading data every day. The main way to get bulk data in PostgreSQL,
COPY, isn't particularly fast, and you'll be hard pressed to keep up with
250GB/day unless you write a custom data loader that keeps multiple cores
going with that load. Commercial databases have some better solutions to
solve this problem in the base product, or easily available from third
party sources.
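
The basic shape of the load itself is one COPY per input file, assuming
daily files and per-day child tables like the ones sketched above (the
file path is made up):

    -- Load each day's file straight into that day's partition, avoiding
    -- any trigger or rule overhead on the parent table.
    COPY tickdata_20090615 (symbol, ts, price)
        FROM '/data/feeds/ticks_20090615.csv' WITH CSV;

Splitting the daily file into pieces and running several such COPYs in
parallel sessions is the usual trick for getting more than one core
involved.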

The much, much bigger problem here is how exactly you're going to provide
a user interface to this data. You can't just give people access to the
whole thing and let them run queries; the first person who executes
something like "select symbol,avg(price) from tickdata group by symbol"
because they want to see the average price of some stock over its lifetime
is going to kill the whole server. You really need to generate the
aggregated reports ahead of time, using an approach like materialized
views, and then only let people grab those. It's possible to manually
create materialized views in PostgreSQL, but that will be yet another bit
of custom development here.
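
A hand-rolled version is essentially just a summary table that you refresh
after each load, something along these lines (again, names are only
illustrative):

    -- Pre-aggregated daily summary that user reports read from.
    CREATE TABLE daily_summary (
        symbol     text NOT NULL,
        trade_day  date NOT NULL,
        avg_price  numeric(12,4),
        tick_count bigint,
        PRIMARY KEY (symbol, trade_day)
    );

    -- After loading a day's partition, append that day's aggregates.
    INSERT INTO daily_summary (symbol, trade_day, avg_price, tick_count)
    SELECT symbol, ts::date, avg(price), count(*)
    FROM tickdata_20090615
    GROUP BY symbol, ts::date;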

The third level of issue is how you scale the app up if you're actually
successful. It's hard enough to get 250GB of daily data loaded into a
single database and storing 10TB of data somewhere; doing the job across a
replicated set of servers, so you can spread the queries out, is even more
"fun" than that.

P.S. If you're not already familiar with how to aggregate properly over a
trading calendar that includes holidays and spots where the market is only
open part of the day, give up now; that's the hardest issue specific to
this particular type of application to get right, and a lot of people
don't realize that early enough in the design process to properly plan for
it.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Whit Armstrong <armstrong(dot)whit(at)gmail(dot)com>
Cc: Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>, Artur <a_wronski(at)gazeta(dot)pl>, pgsql-general(at)postgresql(dot)org
Subject: Re: 10 TB database
Date: 2009-06-16 16:20:25
Message-ID: alpine.GSO.2.01.0906161218280.17014@westnet.com
Lists: pgsql-general

On Mon, 15 Jun 2009, Whit Armstrong wrote:

> Is anyone aware of additional examples or tutorials on partitioning?

http://www.pgcon.org/2007/schedule/events/41.en.html
http://blog.mozilla.com/webdev/2007/05/15/partitioning-fun-in-postgresql/
http://benjamin.smedbergs.us/blog/2007-05-12/when-partitioning-database-tables-explain-your-queries/

In that order really; those go from general commentary down to focusing on
specific issues people tend to run into.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Michelle Konzack <linux4michelle(at)tamay-dogan(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: 10 TB database
Date: 2009-06-16 18:37:09
Message-ID: 20090616183709.GF17083@tamay-dogan.net
Lists: pgsql-general

Hi Greg,

Am 2009-06-16 12:13:20, schrieb Greg Smith:
> The first level of problems you'll run into are how to keep up with
> loading data every day. The main way to get bulk data in PostgreSQL,
> COPY, isn't particularly fast, and you'll be hard pressed to keep up with
> 250GB/day unless you write a custom data loader that keeps multiple cores

AFAIK he was talking about 250 GByte/month, which is around 8 GByte a
day or 300 MByte per hour.

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
Tamay Dogan Network
Debian GNU/Linux Consultant

--
Linux-User #280138 with the Linux Counter, http://counter.li.org/
##################### Debian GNU/Linux Consultant #####################
<http://www.tamay-dogan.net/> Michelle Konzack
<http://www.can4linux.org/> c/o Vertriebsp. KabelBW
<http://www.flexray4linux.org/> Blumenstrasse 2
Jabber linux4michelle(at)jabber(dot)ccc(dot)de 77694 Kehl/Germany
IRC #Debian (irc.icq.com) Tel. DE: +49 177 9351947
ICQ #328449886 Tel. FR: +33 6 61925193


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: 10 TB database
Date: 2009-06-16 19:33:19
Message-ID: alpine.GSO.2.01.0906161519400.120@westnet.com
Lists: pgsql-general

On Tue, 16 Jun 2009, Michelle Konzack wrote:

> Am 2009-06-16 12:13:20, schrieb Greg Smith:
>> you'll be hard pressed to keep up with 250GB/day unless you write a
>> custom data loader that keeps multiple cores
>
> AFAIK he was talking about 250 GByte/month which are around 8 GByte a
> day or 300 MByte per hour

Right, that was just a typo in my response; the comments reflected what he
meant. Note that your averages here presume you can spread that out over
a full 24-hour period--which you often can't, as this type of data tends
to come in a big clump after market close and needs to be loaded ASAP for
it to be useful.

It's harder than most people would guess to sustain that sort of rate
against real-world data (which even fails to import some days) in
PostgreSQL without running into a bottleneck in COPY, WAL traffic, or
database disk I/O (particularly if there's any random access stuff going
on concurrently with the load). Just because your RAID array can write at
hundreds of MB/s does not mean you'll be able to sustain anywhere close to
that during your loading.
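
For reference, the usual configuration knobs to look at for a load-heavy
8.3 box are roughly the following; these are starting points to test
against your own hardware, not tuned values:

    # postgresql.conf excerpt (8.3)
    shared_buffers = 4GB             # a fraction of RAM for the buffer cache
    checkpoint_segments = 64         # fewer, larger checkpoints during loads
    checkpoint_completion_target = 0.9
    wal_buffers = 16MB
    maintenance_work_mem = 1GB       # speeds up index builds after loading
    synchronous_commit = off         # acceptable if a crash only loses a
                                     # batch that can simply be re-loaded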

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>
To: Martin Gainty <mgainty(at)hotmail(dot)com>
Cc: tlieberman(at)marchex(dot)com, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: 10 TB database
Date: 2009-06-16 19:59:37
Message-ID: 2f4958ff0906161259x574c37b8g10f3931af4546ad3@mail.gmail.com
Lists: pgsql-general

2009/6/16 Martin Gainty <mgainty(at)hotmail(dot)com>:
> would suggest Oracle 11 for DB of 10TB or greater
> http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html

You are joking, right?
Better to invest that money in paying someone from -hackers to add any
features required, if there are any!
Or buy a heftier RAID with more disks for that money...

--
GJ


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>
Cc: Martin Gainty <mgainty(at)hotmail(dot)com>, tlieberman(at)marchex(dot)com, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: 10 TB database
Date: 2009-06-16 20:06:17
Message-ID: dcc563d10906161306o3b7a6e1cwfce810906f06b71d@mail.gmail.com
Lists: pgsql-general

2009/6/16 Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>:
> 2009/6/16 Martin Gainty <mgainty(at)hotmail(dot)com>:
>> would suggest Oracle 11 for DB of 10TB or greater
>> http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html
>
> You are joking, right ?
> Better invest that money in paying someone from -hackers to add
> features required, if there will be any !
> Or buy for that heftier RAID, with more disks...

You can throw a metric ton of hardware and development at a problem
for the cost of an Oracle license.