From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Real-life range datasets
Date: 2011-12-20 12:48:14
Message-ID: CAPpHfdt5cEu-=kHwC=2aNQuMQB8ScrH8pK80AdafC20fjp7veQ@mail.gmail.com
Lists: pgsql-hackers

Hackers,

For better GiST indexing of range types, it's important to have real-life
datasets to test on. Real-life range datasets would help to prove (or
reject) some concepts and produce more realistic benchmarks. It would also
be nice to know which queries you expect to run fast on those datasets.
Ideally that would be a real-life set of queries, but your idea of what
typical queries on such datasets look like would also help.
Thanks!

-----
With best regards,
Alexander Korotkov.


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Real-life range datasets
Date: 2011-12-22 08:51:31
Message-ID: 20111222085131.GP7768@ldn-qws-004.delacy.com
Lists: pgsql-hackers

Hello,

We have a table in a PostgreSQL 8.4 database that would make use of date
ranges and exclusion constraints if they were available. Sadly I cannot
give you the data, as it is based on data we are paying for, and under
the relevant licenses we are obliged not to give the data to third
parties.

Basically, the table keeps metadata about financial instruments on
a given day. Thanks to corporate actions (companies merging, splitting
up, etc.) that metadata can differ from one day to the next.

One way to model such a table is:

identifier, date, payload columns...

unique index on (date, identifier)

(and in fact some of the payload columns have per-day unique indices
as well).
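
That model in DDL would be roughly this (a sketch only; the column
names are illustrative, not our actual schema):

create table instruments_by_day (
    identifier text not null,
    date       date not null,
    -- payload columns ...
    unique (date, identifier)
);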

But because there are a large number of rows per day and most don't
change, this is a very wasteful representation.

Instead we use this:

identifier, effective_from, effective_until, payload columns...

And we have some clever plpgsql functions that merge a day's snapshot
into that representation. That happens only a few times per day and
is currently quite slow, mostly because a lot of time is spent in
validation triggers checking that there are no overlapping
(effective_from, effective_until) entries for jane_symbol and a few
other identifiers.
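
That is exactly the check an exclusion constraint over a date range
could take over for us. A minimal sketch of what I mean, written
against the range types proposed for 9.2 (hypothetical, not our actual
schema):

create extension btree_gist;  -- gist support for = on plain columns

create table instruments (
    jane_symbol text not null,
    effective   daterange not null,  -- [effective_from, effective_until]
    -- payload columns ...
    -- no two rows for the same symbol may have overlapping ranges:
    exclude using gist (jane_symbol with =, effective with &&)
);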

The most common operations are:

Get all or most rows of a given day

(select ... from instruments where :date between effective_from and effective_until)

A left join against instruments (again in the normal case constrained
to one day, but in some cases to periods of a week or a few months):

select ... from t left join instruments on
    t.jane_symbol = instruments.jane_symbol
    and t.date between instruments.effective_from
                   and instruments.effective_until
where t.date = X
  and <additional constraints limiting the rows from t>

Here t is a huge table clustered on date, with roughly 500,000 to
2,000,000 entries per day. The left join would match most of the time
(my guess is more than 90%), but there are, sadly, entries in t whose
jane_symbol is not in instruments.
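
For what it's worth, with a range column and a GiST index I would
expect both of the above to become containment queries, along these
lines (again a sketch against the proposed range types, untested):

create index instruments_effective_idx
    on instruments using gist (effective);

-- all rows effective on a given day:
select * from instruments where effective @> date '2011-12-21';

-- the left join, with "between" replaced by containment:
select t.*
from t left join instruments on
    t.jane_symbol = instruments.jane_symbol
    and instruments.effective @> t.date
where t.date = date '2011-12-21';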

Current size (immediately after a cluster):

  table  | toast  | all indices | total
---------+--------+-------------+---------
 1268 MB | 900 MB |      693 MB | 2861 MB

=> select min(effective_from), max(effective_from) from instruments;
    min     |    max
------------+------------
 2011-05-30 | 2011-12-21
(1 row)

b=> select count(*) from instruments where current_date - 1 between effective_from and effective_until ;
 count
--------
 358741
(1 row)

I should be able to give you a table with the same characteristics as
the instruments table but with bogus data, by replacing all entries in
the table with random strings of the same length or something like that.
I can probably take a little bit of time this week or next to generate
such "fake" real-world data ;-) Is there an FTP site to upload the
gzipped pg_dump file to?
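
Concretely, the anonymization I have in mind is something like this
(a sketch; it assumes the identifying columns are text, and md5 caps
each random string at 32 characters, which is plenty for our symbols):

create table instruments_fake as select * from instruments;

update instruments_fake
set jane_symbol = substr(md5(random()::text), 1, length(jane_symbol));
-- ... and likewise for the other payload columns.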

Cheers,

Bene

On 20/12/11 16:48, Alexander Korotkov wrote:
> Hackers,
>
> For better GiST indexing of range types, it's important to have real-life
> datasets to test on. Real-life range datasets would help to prove (or
> reject) some concepts and produce more realistic benchmarks. It would also
> be nice to know which queries you expect to run fast on those datasets.
> Ideally that would be a real-life set of queries, but your idea of what
> typical queries on such datasets look like would also help.
> Thanks!
>
> -----
> With best regards,
> Alexander Korotkov.


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Real-life range datasets
Date: 2011-12-22 15:48:03
Message-ID: Pine.LNX.4.64.1112221945580.14072@sn.sai.msu.ru
Lists: pgsql-hackers

Bene,

we have the pgFoundry project http://pgfoundry.org/projects/dbsamples/.
Since your sample database is very important (for me as well), I suggest
using this site.

Oleg

On Thu, 22 Dec 2011, Benedikt Grundmann wrote:

> I should be able to give you a table with the same characteristics as
> the instruments table but with bogus data, by replacing all entries in
> the table with random strings of the same length or something like that.
> I can probably take a little bit of time this week or next to generate
> such "fake" real-world data ;-) Is there an FTP site to upload the
> gzipped pg_dump file to?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Real-life range datasets
Date: 2011-12-22 17:58:22
Message-ID: 2E0FF105-8AA2-43E9-BCAD-28DC631B7191@justatheory.com
Lists: pgsql-hackers

On Dec 22, 2011, at 7:48 AM, Oleg Bartunov wrote:

> we have the pgFoundry project http://pgfoundry.org/projects/dbsamples/.
> Since your sample database is very important (for me as well), I suggest
> using this site.

Or PGXN.

http://pgxn.org/

You can register an account to upload a distribution like the one you describe here:

http://manager.pgxn.org/account/register

Details on what’s required to distribute on PGXN are here:

http://manager.pgxn.org/howto

Best,

David


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Real-life range datasets
Date: 2011-12-23 08:56:39
Message-ID: CAPpHfdsu3dzfcV6g2OxQUb4mX+rdJySaFpyio5HHuSWUcr-EDg@mail.gmail.com
Lists: pgsql-hackers

Hello,

On Thu, Dec 22, 2011 at 12:51 PM, Benedikt Grundmann <bgrundmann(at)janestreet(dot)com> wrote:

> I should be able to give you a table with the same characteristics as
> the instruments table but with bogus data, by replacing all entries in
> the table with random strings of the same length or something like that.
> I can probably take a little bit of time this week or next to generate
> such "fake" real-world data ;-) Is there an FTP site to upload the
> gzipped pg_dump file to?
>

Thank you very much for your response! I'm going to send you the
details for uploading soon.

-----
With best regards,
Alexander Korotkov.


From: Stefan Keller <sfkeller(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Real-life range datasets
Date: 2012-01-11 02:16:07
Message-ID: CAFcOn2-Bq-tzAfdFVn4wt8mPEcLNHNHebGQO3CRqtUOjHFGCrg@mail.gmail.com
Lists: pgsql-hackers

Hi

I propose OpenStreetMap, which comes in variable sizes, up to more than
250 GB of XML data for the whole world.
It's downloadable from CloudMade.com or Geofabrik.de and can be
imported into PostgreSQL using osm2pgsql.
It's literally a key/value schema of the real world. I'm using the
hstore option of osm2pgsql, and the hstore index is based on GiST.

I'm running a database instance called "PostGIS Terminal", which holds
a daily extract of Switzerland:
http://labs.geometa.info/postgisterminal/?xapi=node%5Bname%3DHochschule%20Rapperswil%5D

This is a typical query; it extracts 'real' highways (of geometry
linestring, aka line/way) with a speed limit >= 100 km/h:

SELECT ST_AsText(way) AS geom
FROM osm_line
WHERE tags @> '"highway"=>"motorway"'
  AND coalesce(substring((tags->'maxspeed') FROM E'[0-9]+')::int, 0) >= 100;
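
The @> containment test in that query is what the hstore GiST index
serves; it is created along these lines (the index name is
illustrative):

CREATE INDEX osm_line_tags_idx ON osm_line USING gist (tags);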

Yours, Stefan

2011/12/23 Alexander Korotkov <aekorotkov(at)gmail(dot)com>:
> Thank you very much for your response! I'm going to send you the
> details for uploading soon.