Re: PostgreSQL Columnar Store for Analytic Workloads

Lists: pgsql-hackers
From: Hadi Moshayedi <hadi(at)citusdata(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: PostgreSQL Columnar Store for Analytic Workloads
Date: 2014-04-03 16:43:07
Message-ID: CAK=1=WrL42oFuF3Mm3i9CaUyzZznO68UXEQEWA-PJQhqpfwJFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dear Hackers,

We at Citus Data have been developing a columnar store extension for
PostgreSQL. Today we are excited to open source it under the Apache v2.0
license.

This columnar store extension uses the Optimized Row Columnar (ORC) format
for its data layout, which improves upon the RCFile format developed at
Facebook, and brings the following benefits:

* Compression: Reduces in-memory and on-disk data size by 2-4x. Can be
extended to support different codecs. We used the functions in
pg_lzcompress.h for compression and decompression.
* Column projections: Only reads column data relevant to the query.
Improves performance for I/O bound queries.
* Skip indexes: Stores min/max statistics for row groups, and uses them to
skip over unrelated rows.

We used the PostgreSQL FDW APIs to make this work. The extension doesn't
implement the writable FDW API, but it uses the process utility hook to
enable COPY command for the columnar tables.

This extension uses PostgreSQL's internal data type representation to store
data in the table, so this columnar store should support all data types
that PostgreSQL supports.

We tried the extension on TPC-H benchmark with 4GB scale factor on a
m1.xlarge Amazon EC2 instance, and the query performance improved by 2x-3x
compared to regular PostgreSQL table. Note that we flushed the page cache
before each test to see the impact on disk I/O.

When data is cached in memory, the performance of cstore_fdw tables were
close to the performance of regular PostgreSQL tables.

For more information, please visit:
* our blog post:
http://citusdata.com/blog/76-postgresql-columnar-store-for-analytics
* our github page: https://github.com/citusdata/cstore_fdw

Feedback from you is really appreciated.

Thanks,
-- Hadi


From: Stefan Keller <sfkeller(at)gmail(dot)com>
To: Hadi Moshayedi <hadi(at)citusdata(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL Columnar Store for Analytic Workloads
Date: 2014-04-08 06:28:09
Message-ID: CAFcOn2_CUt8hkDyEH2tk=wY4EAP1ZiKkauySdYmitykR=VmiTg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Hadi

Do you think that cstore_fd*w* is also welll suited for storing and
retrieving linked data (RDF)?

-S.

2014-04-03 18:43 GMT+02:00 Hadi Moshayedi <hadi(at)citusdata(dot)com>:

> Dear Hackers,
>
> We at Citus Data have been developing a columnar store extension for
> PostgreSQL. Today we are excited to open source it under the Apache v2.0
> license.
>
> This columnar store extension uses the Optimized Row Columnar (ORC) format
> for its data layout, which improves upon the RCFile format developed at
> Facebook, and brings the following benefits:
>
> * Compression: Reduces in-memory and on-disk data size by 2-4x. Can be
> extended to support different codecs. We used the functions in
> pg_lzcompress.h for compression and decompression.
> * Column projections: Only reads column data relevant to the query.
> Improves performance for I/O bound queries.
> * Skip indexes: Stores min/max statistics for row groups, and uses them to
> skip over unrelated rows.
>
> We used the PostgreSQL FDW APIs to make this work. The extension doesn't
> implement the writable FDW API, but it uses the process utility hook to
> enable COPY command for the columnar tables.
>
> This extension uses PostgreSQL's internal data type representation to
> store data in the table, so this columnar store should support all data
> types that PostgreSQL supports.
>
> We tried the extension on TPC-H benchmark with 4GB scale factor on a
> m1.xlarge Amazon EC2 instance, and the query performance improved by 2x-3x
> compared to regular PostgreSQL table. Note that we flushed the page cache
> before each test to see the impact on disk I/O.
>
> When data is cached in memory, the performance of cstore_fdw tables were
> close to the performance of regular PostgreSQL tables.
>
> For more information, please visit:
> * our blog post:
> http://citusdata.com/blog/76-postgresql-columnar-store-for-analytics
> * our github page: https://github.com/citusdata/cstore_fdw
>
> Feedback from you is really appreciated.
>
> Thanks,
> -- Hadi
>
>


From: Hadi Moshayedi <hadi(at)citusdata(dot)com>
To: Stefan Keller <sfkeller(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL Columnar Store for Analytic Workloads
Date: 2014-04-08 14:20:29
Message-ID: CAK=1=Wo7Jj=WtvaSWHcXF0ePC5ob5pOH0wHbUzzxO7sEAMFxQg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Stefan,

On Tue, Apr 8, 2014 at 9:28 AM, Stefan Keller <sfkeller(at)gmail(dot)com> wrote:

> Hi Hadi
>
> Do you think that cstore_fd*w* is also welll suited for storing and
> retrieving linked data (RDF)?
>

I am not very familiar with RDF. Note that cstore_fdw doesn't change the
query language of PostgreSQL, so if your queries are expressible in SQL,
they can be answered using cstore_fdw too. If your data is huge and doesn't
fit in memory, then using cstore_fdw can be beneficial for you.

Can you give some more information about your use case? For example, what
are some of your queries? do you have sample data? how much memory do you
have? how large is the data?

-- Hadi