Re: Stock Market Price Data & postgreSQL? HELLPPP Please

Lists: pgsql-general
From: Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Stock Market Price Data & postgreSQL? HELLPPP Please
Date: 2009-08-19 20:22:04
Message-ID: 7e41ba8f0908191322k4d3b6227r129c521bb8270b0b@mail.gmail.com

Hello All

I want to store stock market price data in postgreSQL and some of my
associates are saying this is not workable, not only in postgreSQL but
in any database. Arguments given are listed below. Kindly comment and
if possible suggest solutions and pointers to further reading.

- This is Time Series Data (I don't know what that is except that it
relates to data marked/related to time) and not suited to a RDBMS.
- You need it in Esper (a CEP engine used by Marketcetera, an open
source trading platform) which needs to get a data stream, whereas an
RDBMS will return a recordset.
- A RDBMS including postgreSQL simply cannot handle the amount of
real-time data coming in from a stock exchange.
- A RDBMS simply cannot provide data, even if you build streams from
recordsets returned, at the speeds required by any technical analysis
charting program.

My level of knowledge is basic SysAdmin & Programming Concepts, not
actual programming. I will be getting someone to build a solution for
me. Need the correct direction please. Again, shoestring budget, all
open source pointers please, unless they are for reading & general
knowledge.

I am hoping for an adapted postgreSQL solution; I can't spend more than a
few hundred dollars on this....so please...HELLLPPP ;-)

With best regards.
Sanjay.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Stock Market Price Data & postgreSQL? HELLPPP Please
Date: 2009-08-19 20:34:39
Message-ID: 407d949e0908191334p22961590q9f25271e675bae32@mail.gmail.com
Lists: pgsql-general

On Wed, Aug 19, 2009 at 9:22 PM, Sanjay Arora<sanjay(dot)k(dot)arora(at)gmail(dot)com> wrote:
> - This is Time Series Data (I don't know what that is except that it
> relates to data marked/related to time) and not suited to a RDBMS.
> - You need it in Esper (a CEP engine used by Marketcetera, an open
> source trading platform) which needs to get a data stream, whereas an
> RDBMS will return a recordset.
> - A RDBMS including postgreSQL simply cannot handle the amount of
> real-time data coming in from a stock exchange.
> - A RDBMS simply cannot provide data, even if you build streams from
> recordsets returned, at the speeds required by any technical analysis
> charting program.

There are whole companies doing nothing but providing streaming RDBMSs
used by traders, airline reservation systems, etc.; cf. StreamBase and
Truviso, among others.

I'm not aware of any open source streaming databases, so on a
shoestring budget you're going to be limited to using existing tools.
There are certainly people who scale up Postgres to high OLTP traffic,
but it does require time and money as you scale up (TANSTAAFL). I'm
sure it can be done, Postgres is very flexible, though whether it's a
perfect fit for your needs or not I can't say.

You might also consider whether one of the existing streaming database
system providers would give you a development license for free or on a
trial basis if you hope to be making big money down the road.

--
greg
http://mit.edu/~gsstark/resume.pdf


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Stock Market Price Data & postgreSQL? HELLPPP Please
Date: 2009-08-19 21:15:33
Message-ID: b42b73150908191415q42351485t3a134f3c77aa9066@mail.gmail.com
Lists: pgsql-general

On Wed, Aug 19, 2009 at 4:22 PM, Sanjay Arora<sanjay(dot)k(dot)arora(at)gmail(dot)com> wrote:
> Hello All
>
> I want to store stock market price data in postgreSQL and some of my
> associates are saying this is not workable, not only in postgreSQL but
> in any database. Arguments given are listed below. Kindly comment and
> if possible suggest solutions and pointers to further reading.
>
> - This is Time Series Data (I don't know what that is except that it
> relates to data marked/related to time) and not suited to a RDBMS.
> - You need it in Esper (a CEP engine used by Marketcetera, an open
> source trading platform) which needs to get a data stream, whereas an
> RDBMS will return a recordset.
> - A RDBMS including postgreSQL simply cannot handle the amount of
> real-time data coming in from a stock exchange.
> - A RDBMS simply cannot provide data, even if you build streams from
> recordsets returned, at the speeds required by any technical analysis
> charting program.
>
> My level of knowledge is basic SysAdmin & Programming Concepts, not
> actual programming. I will be getting someone to build a solution for
> me. Need the correct direction please. Again, shoestring budget, all
> open source pointers please, unless they are for reading & general
> knowledge.
>
> I am hoping for an adapted postgreSQL solution, can't spend more than
> few hundred dollars for this....so please...HELLLPPP ;-)

I think your first step is to stand back, think about the problem as a
whole and see if you can't define your requirements better:

*) how much data comes in each day? how much rolls out?
*) what types of analysis of the data would you like to do?
*) what are your performance expectations?

For example, try and describe what you would like to do in detail.
The data you would import, how you would like it structured, and a
typical query. While the amount of price data a stock market produces
is immense, these types of problems tend to scale very well
horizontally. So let's define the basic amount of work you'd like to
do, then build off that.

merlin


From: Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Stock Market Price Data & postgreSQL? HELLPPP Please
Date: 2009-08-19 21:43:01
Message-ID: 7e41ba8f0908191443n1e6a9674v2ec8939eb89c45c@mail.gmail.com
Lists: pgsql-general

On Thu, Aug 20, 2009 at 2:45 AM, Merlin Moncure<mmoncure(at)gmail(dot)com> wrote:

>
> I think your first step is to stand back, think about the problem as a
> whole and see if you can't define your requirements better:
>
> *) how much data comes in each day? how much rolls out?

Need to get some data...will have this tomorrow.

> *) what types of analysis of the data would you like to do?

Basically, what I need to do is feed the tick data to the CEP engine
as a data stream. The CEP engine creates various bar streams, say
2-minute OHLC (open/high/low/close) bars, 3-min, 5-min, 15-min, 30-min,
or 20-tick, 2000-tick, 5000-tick, and so on. These are then fed to
strategies which trade. At least this is how it is done when data
arrives from the online data provider.
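For illustration, this kind of bar-building can be sketched in a few lines. (The `(timestamp_seconds, price)` tick format and the function name are assumptions for the sketch, not Esper's or any data provider's actual format.)

```python
from collections import OrderedDict

def ohlc_bars(ticks, width_seconds):
    """Aggregate (timestamp, price) ticks into fixed-width OHLC bars.

    `ticks` must be sorted by timestamp; returns {bar_start: (o, h, l, c)}.
    """
    bars = OrderedDict()
    for ts, price in ticks:
        start = ts - ts % width_seconds  # bucket the tick into its bar
        if start not in bars:
            bars[start] = [price, price, price, price]  # open/high/low/close
        else:
            bar = bars[start]
            bar[1] = max(bar[1], price)  # high
            bar[2] = min(bar[2], price)  # low
            bar[3] = price               # close tracks the latest tick
    return {s: tuple(b) for s, b in bars.items()}
```

The same function covers the 2-min/5-min/etc. cases just by varying `width_seconds`; tick-count bars (20-tick, 2000-tick) would bucket by tick index instead of timestamp.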

I am sure that this could be done in the database. I had originally
thought of pre-populated views of prebuilt data bars, so building the
bars at query time would not be required. They would be built by
triggers as soon as the data arrived in Postgres. But then everyone
advised that it was exactly because of slow results on this type of
data that CEP engines were built. I was told that with this approach so
many triggers would be running that my processors and RAM would max out,
not to mention the exponential increase in disk space requirements.

And this is for one symbol. On average, 2000-plus scrips need to be analysed.

Again, the analysis is backtesting trade simulation results according
to various strategies, so that means at least three years of data;
if one wants to be a stickler, 5-6 years, but one year is the bare
minimum.

> *) what are your performance expectations?
>
Well, the simulations would need to finish overnight with three
years' worth of data.

> For example, try and describe what you would like to do in detail.
> The data you would import, how you would like it structured, and a
> typical query.  While the amount of price data a stock market produces
> is immense, these types of problems tend to scale very well
> horizontally.  So let's define the basic amount of work you'd like to
> do, then build off that.
>

I'll get back on this by tomorrow. Thanks.

With best regards.
Sanjay.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Stock Market Price Data & postgreSQL? HELLPPP Please
Date: 2009-08-19 22:04:21
Message-ID: 29790.1250719461@sss.pgh.pa.us
Lists: pgsql-general

Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com> writes:
> I want to store stock market price data in postgreSQL and some of my
> associates are saying this is not workable, not only in postgreSQL but
> in any database. Arguments given are listed below. Kindly comment and
> if possible suggest solutions and pointers to further reading.

Actually, the way I came to use Postgres was from a company that was
working on technical analysis of market data. I would recommend looking
at a hybrid approach.

Storing raw tick data in a DBMS at one row per tick is indeed not very
bright. You could possibly make it work if you throw beefy hardware at
the problem, but there's not much point because you're really not
playing to a DBMS's strengths when you do that. Once it's arrived, the
data is static (barring very-infrequent corrections, which in practice
you might never do at all anyway). And the access patterns for it (at
least if you're doing the same type of technical analysis we were
doing) are extremely predictable and linear. So you aren't doing
anything wherein SQL will shine.

On the other hand, there are definitely components of the big picture
where SQL *can* shine. I'd strongly recommend tracking your live
trading positions in an RDBMS, for example. It might be worth managing
your tick data using DBMS entries that represent, say, monthly tick
data files. Activities like removing old data would get a lot easier
that way. (I wish this had occurred to me twelve years ago, but it's
water over the dam now.) Any performance issues can be avoided by
having the technical analysis processes read the static tick data files
directly.

This is generally pretty similar to the frequently-asked question
"should I store a lot of big image or document files directly in the
database, or just use it as an index of external files?". If you dig
around in the PG list archives you'll find more than you want to read
about that on both sides. If you've aggregated the ticks in suitable
volumes you can solve it either way, really, but an external file is a
bit easier to append to during raw data collection.
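A minimal sketch of the "DBMS as an index of external files" idea might look like this (using sqlite3 purely as a self-contained stand-in for PostgreSQL; the `tick_files` table and column names are made up for illustration):

```python
import sqlite3

# A catalog table: one row per (symbol, month), pointing at the external
# tick file.  Dropping a month of data is then a single-row DELETE plus
# an os.remove(), instead of a huge DELETE over millions of per-tick rows.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tick_files (
        symbol    TEXT NOT NULL,
        month     TEXT NOT NULL,      -- e.g. '2009-08'
        file_path TEXT NOT NULL,      -- external file read directly by analysis
        PRIMARY KEY (symbol, month)
    )
""")
conn.execute(
    "INSERT INTO tick_files VALUES ('IBM', '2009-08', '/ticks/IBM/2009-08.dat')"
)

# The analysis process asks the catalog where the raw data lives...
row = conn.execute(
    "SELECT file_path FROM tick_files WHERE symbol = ? AND month = ?",
    ("IBM", "2009-08"),
).fetchone()
# ...and then reads that file directly, bypassing the DBMS for the bulk data.
```

The bulk tick data never enters the database; only the small, queryable catalog does.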

Don't fall into the trap of assuming that all your requirements must be
solved by a single tool. You'll spend all your time building the
perfect tool, and go broke before you finish it. Use a DBMS for the
parts of the problem it's good for, and go outside it for what it isn't.

BTW, we were doing full-speed tick data collection and real-time trading
analysis in the nineties, on machines that my current cell phone would
leave in the dust. The market data volume has grown a lot since then
of course, but the price of hardware has fallen a lot more. I'd not
recommend designing your system on the assumption that raw per-tick
speed is the be-all and end-all.

regards, tom lane


From: Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Stock Market Price Data & postgreSQL? HELLPPP Please
Date: 2009-08-19 22:17:17
Message-ID: 7e41ba8f0908191517q3720ff23s113fab88d324aa03@mail.gmail.com
Lists: pgsql-general

Thanks Tom

For this reply and many earlier ones that have guided me around postgres ;-)

On Thu, Aug 20, 2009 at 3:34 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com> writes:

>  It might be worth managing your
> tick data using DBMS entries that represent, say, monthly tick data
> files.  Activities like removing old data would get a lot easier that
> way.  (I wish this had occurred to me twelve years ago, but it's water
> over the dam now.)

Do you mean something like storing one month's worth of tick data in a
blob-type field and giving the contents to the CEP engine for further
building of the required data streams? This approach would allow big
chunks of data to be handled without caring about the location, naming
& indexing etc. of tick data files. Anything else you can suggest?

Best Regards.
Sanjay.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Stock Market Price Data & postgreSQL? HELLPPP Please
Date: 2009-08-19 22:53:21
Message-ID: 941.1250722401@sss.pgh.pa.us
Lists: pgsql-general

Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com> writes:
> Do you mean something like storing one month worth tick data in a blob
> type field and giving the contents to the CEP engine for further
> building of required data streams?

Either that or put the tick data in an external file and store that
file's pathname in the database row. As I said, you can find plenty
of argumentation on both sides of that in the PG archives.

When SSS were doing this, we had the raw tick data in one set of files
and pre-aggregated bar data in other files (I think we stored 5-min
and daily bars, but it was a long time ago). The analysis functions
would automatically build the bar width they wanted from the widest
stored form that divided the desired width, so as to minimize what they
had to read from disk.
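That "widest stored form that divides the desired width" rule can be sketched as follows (widths in minutes; the stored set here is just an example matching the 5-min and daily bars mentioned above):

```python
def best_stored_width(stored_widths, desired_width):
    """Pick the widest pre-aggregated bar width that evenly divides the
    desired width, so rebuilding bars reads as little data as possible."""
    candidates = [w for w in stored_widths if desired_width % w == 0]
    if not candidates:
        raise ValueError("no stored width divides the desired width")
    return max(candidates)
```

With 5-minute and daily (1440-minute) bars stored, a 15-minute request is rebuilt from 5-minute bars, while a 2-day (2880-minute) request is rebuilt from the far fewer daily bars.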

It does sound like you are doing pretty much exactly what we were
doing. One thing to think about is that the real-time case is actually
much slower and easier to deal with than back-testing. When you are
back-testing, you'd like to test a trading strategy over say a year's
worth of data, and you'd like that to require a bit less than a year
to run, no? So the path you have to optimize is the one feeding stored,
pre-aggregated data into your test engine. The people hollering about
real-time performance are just betraying that they've never built one of
these things. I'm not familiar with this CEP software, but it sounds to
me like you want that as part of the real-time data collection process
and nowhere near your back-testing data path.

Another little tip: if your trading strategies are like ours were,
they need to ramp up for actual trading by pre-scanning some amount of
historical data. So you're going to need a mechanism that starts by
reading the historical data at high speed and smoothly segues into
reading the real-time feed (at which the passage of time in the model
suddenly slows to one-to-one with real time). Also consider what
happens when your machine crashes (it happens) and you need to not only
redo that startup process, but persuade the model that its actual
trading position is whatever you had open. Or you changed the model a
bit and restart as above. The model might now wish it was in a different
position, but it has to cope with your actual position.
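The ramp-up mechanism described above, a reader that replays history at full speed and then segues into the live feed, might be sketched as a generator (the tick tuples and both feed arguments are hypothetical; a real live feed would be a blocking socket reader):

```python
def market_feed(historical_ticks, live_ticks):
    """Yield historical ticks as fast as they can be read, then switch
    to the live feed; the consuming model never sees the seam."""
    last_ts = None
    for tick in historical_ticks:   # high-speed replay of stored data
        last_ts = tick[0]
        yield tick
    for tick in live_ticks:         # from here on, one-to-one with real time
        if last_ts is not None and tick[0] < last_ts:
            continue                # drop any overlap between file and feed
        yield tick
```

After a crash, the same generator restarts from the stored history, and the model's position state is reloaded separately from the positions table in the DBMS.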

The crash reliability of a DBMS is a strong reason why you track your
live positions in one, btw ... your brokerage isn't going to forget you
were short IBM, even if your model crashes.

regards, tom lane


From: Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Stock Market Price Data & postgreSQL? HELLPPP Please
Date: 2009-08-19 23:24:17
Message-ID: 7e41ba8f0908191624g4501b5f7mcbe29ad2c8139acc@mail.gmail.com
Lists: pgsql-general

On Thu, Aug 20, 2009 at 4:23 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> It does sound like you are doing pretty much exactly what we were
> doing.  One thing to think about is that the real-time case is actually
> much slower and easier to deal with than back-testing.  When you are
> back-testing, you'd like to test a trading strategy over say a year's
> worth of data, and you'd like that to require a bit less than a year
> to run, no?  So the path you have to optimize is the one feeding stored,
> pre-aggregated data into your test engine.  The people hollering about
> real-time performance are just betraying that they've never built one of
> these things.  I'm not familiar with this CEP software, but it sounds to
> me like you want that as part of the real-time data collection process
> and nowhere near your back-testing data path.
>
CEP (Complex Event Processing... just check out the explanation para
at http://esper.codehaus.org/ ) is basically running queries on data
streams in memory (I understood that as RDBMS tables in memory)... on
time series data (whatever that is)..... with no concept of storage etc.

BTW, would you please look back a bit in this thread at a post (by
Adrian Klaver; see
http://archives.postgresql.org/pgsql-general/2009-08/msg00745.php )
about a database built by a Berkeley computer science department
project on Postgres 7.2, which stores & queries time series data just
like this and returns data in a stream instead of a recordset. It is
said that this type of database is made for analyzing this type of
data. It has a normal Postgres mode for returning records and a streams
mode for returning results in a stream... which I think is nothing but
a TCP/IP socket which receives results on an ongoing basis.

I have also posted a question in reply to it (see
http://archives.postgresql.org/pgsql-general/2009-08/msg00748.php ),
asking why Postgres itself does not have this (whatever this time
series data is in mathematical terms) when that project itself extends
Postgres 7.2. Will you please have a look at these posts and this
project, which is built on Postgres itself?

And please explain in layman's terms what these guys are doing
differently, and why we cannot have a time series data type (we do have
GIS & tons of other data types) & streamed data returns. Maybe it is
not of immediate use to me, but I understand ODBMS and RDBMS, and I
want to understand this at a conceptual level, at least.
>
> Another little tip: if your trading strategies are like ours were,
> they need to ramp up for actual trading by pre-scanning some amount of
> historical data.  So you're going to need a mechanism that starts by
> reading the historical data at high speed and smoothly segues into
> reading the real-time feed (at which the passage of time in the model
> suddenly slows to one-to-one with real time).  Also consider what
> happens when your machine crashes (it happens) and you need to not only
> redo that startup process, but persuade the model that its actual
> trading position is whatever you had open.  Or you changed the model a
> bit and restart as above.  The model might now wish it was in a different
> position, but it has to cope with your actual position.
>
> The crash reliability of a DBMS is a strong reason why you track your
> live positions in one, btw ... your brokerage isn't going to forget you
> were short IBM, even if your model crashes.

That's very good advice....we had already planned it that way. And yes,
the brokerages are not going to care that we ran out of money while
our server was restarting ;-)

>
>                        regards, tom lane
>


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Sanjay Arora <sanjay(dot)k(dot)arora(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Stock Market Price Data & postgreSQL? HELLPPP Please
Date: 2009-08-20 00:27:31
Message-ID: alpine.GSO.2.01.0908192017260.25972@westnet.com
Lists: pgsql-general

On Wed, 19 Aug 2009, Tom Lane wrote:

> BTW, we were doing full-speed tick data collection and real-time trading
> analysis in the nineties, on machines that my current cell phone would
> leave in the dust. The market data volume has grown a lot since then
> of course, but the price of hardware has fallen a lot more.

Trading volume always grows to where it's just possible to keep up with it
using a well designed app on the fastest hardware available. If you're
always faster than someone else there's money to be made from them using
that fact. The continuous arms race for the fastest scanning and
execution platform keeps volume moving upward in lock step with what
hardware is capable of. The last mainstream news article on this topic
was http://www.nytimes.com/2009/07/24/business/24trading.html

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD