[rfc] overhauling pgstat.stat

From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: [rfc] overhauling pgstat.stat
Date: 2013-09-04 01:10:31
Message-ID: 52268887.9010509@uptime.jp
Lists: pgsql-hackers

Hi,

I'm considering overhauling pgstat.stat, and would like to know how many
people are interested in this topic.

As you may know, this file can be hundreds of MB in size, because
pgstat.stat holds all access statistics for each database, and the
entire pgstat.stat needs to be read and written frequently.

As a result, pgstat.stat often generates massive I/O operations,
particularly when the database has a large number of tables.

To support multi-tenancy, or just a large number of tables (up to 10k
tables in a single database), I think pgstat.stat needs to be overhauled.

I think using heap and btree storage in pgstat.stat would be preferable,
to reduce reads/writes and to allow updating the access statistics of
specific tables in the pgstat.stat file.

Is this good for us?

Any comments or suggestions?

Regards,
--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 03:52:13
Message-ID: CAOeZVifbqgQFMk1iwu18DEtDSXBxP5jUEERwz2mk=qE3LZavyg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
> Hi,
>
> I'm considering overhauling pgstat.stat, and would like to know how many
> people are interested in this topic.
>
> As you may know, this file can be hundreds of MB in size, because
> pgstat.stat holds all access statistics for each database, and the
> entire pgstat.stat needs to be read and written frequently.
>
> As a result, pgstat.stat often generates massive I/O operations,
> particularly when the database has a large number of tables.
>
> To support multi-tenancy, or just a large number of tables (up to 10k
> tables in a single database), I think pgstat.stat needs to be overhauled.
>
> I think using heap and btree storage in pgstat.stat would be preferable,
> to reduce reads/writes and to allow updating the access statistics of
> specific tables in the pgstat.stat file.
>
> Is this good for us?

Hi,

Nice thought. I/O reduction in pgstat can be really helpful.

I am trying to think of our aim here. Would we be looking to split
pgstat per table, so that the I/O write happens for only a portion of
pgstat? Or reduce the I/O in general?

If the latter, how would using a BTree help us? I would rather go for a
range tree or something. But again, I may be completely wrong.

Please elaborate a bit more on the solution we are trying to
achieve. It seems really interesting.

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 04:07:54
Message-ID: 20130904040754.GD5227@eldon.alvh.no-ip.org
Lists: pgsql-hackers

Satoshi Nagayasu wrote:

> As you may know, this file can be hundreds of MB in size, because
> pgstat.stat holds all access statistics for each database, and the
> entire pgstat.stat needs to be read and written frequently.
>
> As a result, pgstat.stat often generates massive I/O operations,
> particularly when the database has a large number of tables.

We already changed it:

commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Date: Mon Feb 18 17:56:08 2013 -0300

Split pgstat file in smaller pieces

We now write one file per database and one global file, instead of
having the whole thing in a single huge file. This reduces the I/O that
must be done when partial data is required -- which is all the time,
because each process only needs information on its own database anyway.
Also, the autovacuum launcher does not need data about tables and
functions in each database; having the global stats for all DBs is
enough.

Catalog version bumped because we have a new subdir under PGDATA.

Author: Tomas Vondra. Some rework by Álvaro
Testing by Jeff Janes
Other discussion by Heikki Linnakangas, Tom Lane.

I don't oppose further tweaking, of course, but I wonder if you are
considering these changes.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 05:09:00
Message-ID: 5226C06C.1010103@uptime.jp
Lists: pgsql-hackers

Hi,

(2013/09/04 13:07), Alvaro Herrera wrote:
> Satoshi Nagayasu wrote:
>
>> As you may know, this file can be hundreds of MB in size, because
>> pgstat.stat holds all access statistics for each database, and the
>> entire pgstat.stat needs to be read and written frequently.
>>
>> As a result, pgstat.stat often generates massive I/O operations,
>> particularly when the database has a large number of tables.
>
> We already changed it:
>
> commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
> Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> Date: Mon Feb 18 17:56:08 2013 -0300
>
> Split pgstat file in smaller pieces

Thanks for the comments. I forgot to mention that.

Yes, we have already split the single pgstat.stat file into
several pieces.

However, we still need to read/write a large amount of statistics
data when we have a large number of tables in a single database,
or when multiple databases are being accessed. Right?

I think the issue here is that it is necessary to write/read
statistics data even if it has not actually changed.

So, I'm wondering how we can minimize read/write operations
on these statistics data files by using heap and btree storage.

Regards,

>
> commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
> Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> Date: Mon Feb 18 17:56:08 2013 -0300
>
> Split pgstat file in smaller pieces
>
> We now write one file per database and one global file, instead of
> having the whole thing in a single huge file. This reduces the I/O that
> must be done when partial data is required -- which is all the time,
> because each process only needs information on its own database anyway.
> Also, the autovacuum launcher does not need data about tables and
> functions in each database; having the global stats for all DBs is
> enough.
>
> Catalog version bumped because we have a new subdir under PGDATA.
>
> Author: Tomas Vondra. Some rework by Álvaro
> Testing by Jeff Janes
> Other discussion by Heikki Linnakangas, Tom Lane.
>
>
> I don't oppose further tweaking, of course, but I wonder if you are
> considering these changes.
>

--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp


From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 05:24:58
Message-ID: 5226C42A.9080902@uptime.jp
Lists: pgsql-hackers

Hi,

(2013/09/04 12:52), Atri Sharma wrote:
> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
>> Hi,
>>
>> I'm considering overhauling pgstat.stat, and would like to know how many
>> people are interested in this topic.
>>
>> As you may know, this file can be hundreds of MB in size, because
>> pgstat.stat holds all access statistics for each database, and the
>> entire pgstat.stat needs to be read and written frequently.
>>
>> As a result, pgstat.stat often generates massive I/O operations,
>> particularly when the database has a large number of tables.
>>
>> To support multi-tenancy, or just a large number of tables (up to 10k
>> tables in a single database), I think pgstat.stat needs to be overhauled.
>>
>> I think using heap and btree storage in pgstat.stat would be preferable,
>> to reduce reads/writes and to allow updating the access statistics of
>> specific tables in the pgstat.stat file.
>>
>> Is this good for us?
>
> Hi,
>
> Nice thought. I/O reduction in pgstat can be really helpful.
>
> I am trying to think of our aim here. Would we be looking to split
> pgstat per table, so that the I/O write happens for only a portion of
> pgstat? Or reduce the I/O in general?

I prefer the latter.

Under the current implementation, DBAs need to split a single database
into many smaller databases, taking the access locality of the tables
into consideration. That's difficult, and the locality could change in
the future.

And splitting the statistics data into many files (per table,
for example) would cause another performance issue when
collecting/showing statistics at once. Just my guess though.

So, I'm looking for a new way to reduce I/O for the statistics data
in general.

Regards,

>
> If the latter, how would using a BTree help us? I would rather go for a
> range tree or something. But again, I may be completely wrong.
>
> Please elaborate a bit more on the solution we are trying to
> achieve. It seems really interesting.
>
> Regards,
>
> Atri
>
>

--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 06:23:20
Message-ID: 748895BD-08D6-4668-8320-C112CAA21963@gmail.com
Lists: pgsql-hackers

Sent from my iPad

On 04-Sep-2013, at 10:54, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:

> Hi,
>
> (2013/09/04 12:52), Atri Sharma wrote:
>> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
>>> Hi,
>>>
>>> I'm considering overhauling pgstat.stat, and would like to know how many
>>> people are interested in this topic.
>>>
>>> As you may know, this file can be hundreds of MB in size, because
>>> pgstat.stat holds all access statistics for each database, and the
>>> entire pgstat.stat needs to be read and written frequently.
>>>
>>> As a result, pgstat.stat often generates massive I/O operations,
>>> particularly when the database has a large number of tables.
>>>
>>> To support multi-tenancy, or just a large number of tables (up to 10k
>>> tables in a single database), I think pgstat.stat needs to be overhauled.
>>>
>>> I think using heap and btree storage in pgstat.stat would be preferable,
>>> to reduce reads/writes and to allow updating the access statistics of
>>> specific tables in the pgstat.stat file.
>>>
>>> Is this good for us?
>>
>> Hi,
>>
>> Nice thought. I/O reduction in pgstat can be really helpful.
>>
>> I am trying to think of our aim here. Would we be looking to split
>> pgstat per table, so that the I/O write happens for only a portion of
>> pgstat? Or reduce the I/O in general?
>
> I prefer the latter.
>
> Under the current implementation, DBAs need to split a single database
> into many smaller databases, taking the access locality of the tables
> into consideration. That's difficult, and the locality could change in
> the future.
>
> And splitting the statistics data into many files (per table,
> for example) would cause another performance issue when
> collecting/showing statistics at once. Just my guess though.
>
> So, I'm looking for a new way to reduce I/O for the statistics data
> in general.
>
> Regards,
>
>>
>> If the latter, how would using a BTree help us? I would rather go for a
>> range tree or something. But again, I may be completely wrong.
>>
>> Please elaborate a bit more on the solution we are trying to
>> achieve. It seems really interesting.
>>
>> Regards,
>>
>> Atri
>>
>>
>
>

Right, thanks.

How would using heap and BTree help here? Are we looking at a priority queue which supports the main storage system of the stats?

Regards,

Atri


From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 10:09:25
Message-ID: 522706D5.8040008@uptime.jp
Lists: pgsql-hackers

(2013/09/04 15:23), Atri Sharma wrote:
>
>
> Sent from my iPad
>
> On 04-Sep-2013, at 10:54, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
>
>> Hi,
>>
>> (2013/09/04 12:52), Atri Sharma wrote:
>>> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
>>>> Hi,
>>>>
>>>> I'm considering overhauling pgstat.stat, and would like to know how many
>>>> people are interested in this topic.
>>>>
>>>> As you may know, this file can be hundreds of MB in size, because
>>>> pgstat.stat holds all access statistics for each database, and the
>>>> entire pgstat.stat needs to be read and written frequently.
>>>>
>>>> As a result, pgstat.stat often generates massive I/O operations,
>>>> particularly when the database has a large number of tables.
>>>>
>>>> To support multi-tenancy, or just a large number of tables (up to 10k
>>>> tables in a single database), I think pgstat.stat needs to be overhauled.
>>>>
>>>> I think using heap and btree storage in pgstat.stat would be preferable,
>>>> to reduce reads/writes and to allow updating the access statistics of
>>>> specific tables in the pgstat.stat file.
>>>>
>>>> Is this good for us?
>>>
>>> Hi,
>>>
>>> Nice thought. I/O reduction in pgstat can be really helpful.
>>>
>>> I am trying to think of our aim here. Would we be looking to split
>>> pgstat per table, so that the I/O write happens for only a portion of
>>> pgstat? Or reduce the I/O in general?
>>
>> I prefer the latter.
>>
>> Under the current implementation, DBAs need to split a single database
>> into many smaller databases, taking the access locality of the tables
>> into consideration. That's difficult, and the locality could change in
>> the future.
>>
>> And splitting the statistics data into many files (per table,
>> for example) would cause another performance issue when
>> collecting/showing statistics at once. Just my guess though.
>>
>> So, I'm looking for a new way to reduce I/O for the statistics data
>> in general.
>>
>> Regards,
>>
>>>
>>> If the latter, how would using a BTree help us? I would rather go for a
>>> range tree or something. But again, I may be completely wrong.
>>>
>>> Please elaborate a bit more on the solution we are trying to
>>> achieve. It seems really interesting.
>>>
>>> Regards,
>>>
>>> Atri
>>>
>>>
>>
>>
>
> Right, thanks.
>
> How would using heap and BTree help here? Are we looking at a priority queue which supports the main storage system of the stats?

For example, when you read only a single block from your table,
you then need to write all the values in your database statistics.
That often generates a large amount of I/O.

However, if random access is allowed in the statistics, you can
update only a single record for the specific table which you read.
It would be less than 100 bytes for each table.
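
To put it concretely, here is the kind of fixed-size entry I have
in mind. This is only a sketch -- the fields are illustrative, not
the actual PgStat_StatTabEntry layout:

    /* illustrative only -- not the real pgstat entry layout */
    typedef unsigned int Oid;

    typedef struct StatFileEntry
    {
        Oid   tableid;            /* key: the table's OID */
        long  numscans;
        long  tuples_returned;
        long  tuples_inserted;
        long  tuples_updated;
        long  tuples_deleted;
        long  blocks_fetched;
        long  blocks_hit;
        int   isdeleted;          /* would allow delete-in-place */
    } StatFileEntry;

Because every entry has the same size, updating one table's counters
becomes a single seek + write, instead of rewriting the whole file.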

So far, I have no idea how a priority queue could work here.
However, if the statistics are overhauled, PostgreSQL would be able
to host a much larger number of customers more easily.

Regards,
--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 12:13:17
Message-ID: 20130904121317.GY2706@tamriel.snowman.net
Lists: pgsql-hackers

Satoshi,

* Satoshi Nagayasu (snaga(at)uptime(dot)jp) wrote:
> (2013/09/04 13:07), Alvaro Herrera wrote:
> >We already changed it:
> >
> > commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
> > Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> > Date: Mon Feb 18 17:56:08 2013 -0300
> >
> > Split pgstat file in smaller pieces
>
> Thanks for the comments. I forgot to mention that.
>
> Yes, we have already split the single pgstat.stat file into
> several pieces.
>
> However, we still need to read/write a large amount of statistics
> data when we have a large number of tables in a single database,
> or when multiple databases are being accessed. Right?

Would simply also splitting per tablespace help?

> I think the issue here is that it is necessary to write/read
> statistics data even if it has not actually changed.
>
> So, I'm wondering how we can minimize read/write operations
> on these statistics data files by using heap and btree storage.

It does sound like an interesting idea to use heap/btree instead, but I
wonder about the effort involved, particularly around coordinating
access. We wouldn't want to end up introducing additional contention
points by doing this.

Thanks,

Stephen


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 12:36:51
Message-ID: DEB493DE-7A5F-4D22-A175-B902CF3549D3@gmail.com
Lists: pgsql-hackers

Sent from my iPad

On 04-Sep-2013, at 15:39, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:

> (2013/09/04 15:23), Atri Sharma wrote:
>>
>>
>> Sent from my iPad
>>
>> On 04-Sep-2013, at 10:54, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
>>
>>> Hi,
>>>
>>> (2013/09/04 12:52), Atri Sharma wrote:
>>>> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
>>>>> Hi,
>>>>>
>>>>> I'm considering overhauling pgstat.stat, and would like to know how many
>>>>> people are interested in this topic.
>>>>>
>>>>> As you may know, this file can be hundreds of MB in size, because
>>>>> pgstat.stat holds all access statistics for each database, and the
>>>>> entire pgstat.stat needs to be read and written frequently.
>>>>>
>>>>> As a result, pgstat.stat often generates massive I/O operations,
>>>>> particularly when the database has a large number of tables.
>>>>>
>>>>> To support multi-tenancy, or just a large number of tables (up to 10k
>>>>> tables in a single database), I think pgstat.stat needs to be overhauled.
>>>>>
>>>>> I think using heap and btree storage in pgstat.stat would be preferable,
>>>>> to reduce reads/writes and to allow updating the access statistics of
>>>>> specific tables in the pgstat.stat file.
>>>>>
>>>>> Is this good for us?
>>>>
>>>> Hi,
>>>>
>>>> Nice thought. I/O reduction in pgstat can be really helpful.
>>>>
>>>> I am trying to think of our aim here. Would we be looking to split
>>>> pgstat per table, so that the I/O write happens for only a portion of
>>>> pgstat? Or reduce the I/O in general?
>>>
>>> I prefer the latter.
>>>
>>> Under the current implementation, DBAs need to split a single database
>>> into many smaller databases, taking the access locality of the tables
>>> into consideration. That's difficult, and the locality could change in
>>> the future.
>>>
>>> And splitting the statistics data into many files (per table,
>>> for example) would cause another performance issue when
>>> collecting/showing statistics at once. Just my guess though.
>>>
>>> So, I'm looking for a new way to reduce I/O for the statistics data
>>> in general.
>>>
>>> Regards,
>>>
>>>>
>>>> If the latter, how would using a BTree help us? I would rather go for a
>>>> range tree or something. But again, I may be completely wrong.
>>>>
>>>> Please elaborate a bit more on the solution we are trying to
>>>> achieve. It seems really interesting.
>>>>
>>>> Regards,
>>>>
>>>> Atri
>>
>> Right, thanks.
>>
>> How would using heap and BTree help here? Are we looking at a priority queue which supports the main storage system of the stats?
>
> For example, when you read only a single block from your table,
> you then need to write all the values in your database statistics.
> That often generates a large amount of I/O.
>
> However, if random access is allowed in the statistics, you can
> update only a single record for the specific table which you read.
> It would be less than 100 bytes for each table.
>
> So far, I have no idea how a priority queue could work here.
> However, if the statistics are overhauled, PostgreSQL would be able
> to host a much larger number of customers more easily.

Ah, now I get it. Thanks a ton for the detailed explanation.

Yes, a BTree will sufficiently isolate per-table stats here and allow for random access.

Another thing I can think of is having a write-back cache which could probably be used as a buffer before the actual stats write. I am just musing here though.
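
Something like this is what I am imagining -- purely a sketch, all
names invented, and no locking shown:

    /* hypothetical write-back buffer for dirty per-table stats */
    #define DIRTY_MAX 1024

    typedef unsigned int Oid;

    typedef struct DirtyEntry
    {
        Oid   tableid;
        long  counters[8];    /* stand-in for the real per-table counters */
    } DirtyEntry;

    static DirtyEntry dirty[DIRTY_MAX];
    static int        ndirty = 0;

    static void
    stat_flush_dirty(void)
    {
        /* ... write all buffered entries to the stats file in one pass ... */
        ndirty = 0;
    }

    static void
    stat_mark_dirty(const DirtyEntry *ent)
    {
        dirty[ndirty++] = *ent;
        if (ndirty == DIRTY_MAX)
            stat_flush_dirty();
    }

That would batch many small updates into a single pass over the file.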

Regards,

Atri


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Satoshi Nagayasu <snaga(at)uptime(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 12:41:44
Message-ID: CAFj8pRDEiNuyhyepfD63VLCGSZPG28br1Lwg18eyXbyeDv1PaQ@mail.gmail.com
Lists: pgsql-hackers

2013/9/4 Atri Sharma <atri(dot)jiit(at)gmail(dot)com>

>
>
> Sent from my iPad
>
> On 04-Sep-2013, at 15:39, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
>
> > (2013/09/04 15:23), Atri Sharma wrote:
> >>
> >>
> >> Sent from my iPad
> >>
> >> On 04-Sep-2013, at 10:54, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
> >>
> >>> Hi,
> >>>
> >>> (2013/09/04 12:52), Atri Sharma wrote:
> >>>> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I'm considering overhauling pgstat.stat, and would like to know
> >>>>> how many people are interested in this topic.
> >>>>>
> >>>>> As you may know, this file can be hundreds of MB in size, because
> >>>>> pgstat.stat holds all access statistics for each database, and the
> >>>>> entire pgstat.stat needs to be read and written frequently.
> >>>>>
> >>>>> As a result, pgstat.stat often generates massive I/O operations,
> >>>>> particularly when the database has a large number of tables.
> >>>>>
> >>>>> To support multi-tenancy, or just a large number of tables (up to
> >>>>> 10k tables in a single database), I think pgstat.stat needs to be
> >>>>> overhauled.
> >>>>>
> >>>>> I think using heap and btree storage in pgstat.stat would be
> >>>>> preferable, to reduce reads/writes and to allow updating the access
> >>>>> statistics of specific tables in the pgstat.stat file.
> >>>>>
> >>>>> Is this good for us?
> >>>>
> >>>> Hi,
> >>>>
> >>>> Nice thought. I/O reduction in pgstat can be really helpful.
> >>>>
> >>>> I am trying to think of our aim here. Would we be looking to split
> >>>> pgstat per table, so that the I/O write happens for only a portion of
> >>>> pgstat? Or reduce the I/O in general?
> >>>
> >>> I prefer the latter.
> >>>
> >>> Under the current implementation, DBAs need to split a single
> >>> database into many smaller databases, taking the access locality of
> >>> the tables into consideration. That's difficult, and the locality
> >>> could change in the future.
> >>>
> >>> And splitting the statistics data into many files (per table,
> >>> for example) would cause another performance issue when
> >>> collecting/showing statistics at once. Just my guess though.
> >>>
> >>> So, I'm looking for a new way to reduce I/O for the statistics data
> >>> in general.
> >>>
> >>> Regards,
> >>>
> >>>>
> >>>> If the latter, how would using a BTree help us? I would rather go
> >>>> for a range tree or something. But again, I may be completely wrong.
> >>>>
> >>>> Please elaborate a bit more on the solution we are trying to
> >>>> achieve. It seems really interesting.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Atri
> >>
> >> Right, thanks.
> >>
> >> How would using heap and BTree help here? Are we looking at a
> >> priority queue which supports the main storage system of the stats?
> >
> > For example, when you read only a single block from your table,
> > you then need to write all the values in your database statistics.
> > That often generates a large amount of I/O.
> >
> > However, if random access is allowed in the statistics, you can
> > update only a single record for the specific table which you read.
> > It would be less than 100 bytes for each table.
> >
> > So far, I have no idea how a priority queue could work here.
> > However, if the statistics are overhauled, PostgreSQL would be able
> > to host a much larger number of customers more easily.
>
>
> Ah, now I get it. Thanks a ton for the detailed explanation.
>
> Yes, a BTree will sufficiently isolate per-table stats here and allow
> for random access.
>
> Another thing I can think of is having a write-back cache which could
> probably be used as a buffer before the actual stats write. I am just
> musing here though.
>

we very successfully use a tmpfs volume for pgstat files (we use a
backport of the multiple-statfile patch from 9.3 to 9.1)
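
for completeness, the setup is simple; something like this, with the
mount point being only an example:

    # postgresql.conf: keep the frequently rewritten stats on tmpfs
    stats_temp_directory = '/mnt/pg_stats_tmp'

    # /etc/fstab (the directory must be writable by the postgres user)
    tmpfs  /mnt/pg_stats_tmp  tmpfs  size=1G  0  0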

Regards

Pavel

>
> Regards,
>
> Atri
>


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 18:34:09
Message-ID: 52277D21.4060802@fuzzy.cz
Lists: pgsql-hackers

On 4.9.2013 07:24, Satoshi Nagayasu wrote:
> Hi,
>
> (2013/09/04 12:52), Atri Sharma wrote:
>> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga(at)uptime(dot)jp>
>> wrote:
>>> Hi,
>>>
>>> I'm considering overhauling pgstat.stat, and would like to know
>>> how many people are interested in this topic.
>>>
>>> As you may know, this file can be hundreds of MB in size, because
>>> pgstat.stat holds all access statistics for each database, and the
>>> entire pgstat.stat needs to be read and written frequently.
>>>
>>> As a result, pgstat.stat often generates massive I/O operations,
>>> particularly when the database has a large number of tables.
>>>
>>> To support multi-tenancy, or just a large number of tables (up to
>>> 10k tables in a single database), I think pgstat.stat needs to be
>>> overhauled.
>>>
>>> I think using heap and btree storage in pgstat.stat would be
>>> preferable, to reduce reads/writes and to allow updating the access
>>> statistics of specific tables in the pgstat.stat file.
>>>
>>> Is this good for us?
>>
>> Hi,
>>
>> Nice thought. I/O reduction in pgstat can be really helpful.
>>
>> I am trying to think of our aim here. Would we be looking to split
>> pgstat per table, so that the I/O write happens for only a portion
>> of pgstat? Or reduce the I/O in general?
>
> I prefer the latter.
>
> Under the current implementation, DBAs need to split a single database
> into many smaller databases, taking the access locality of the tables
> into consideration. That's difficult, and the locality could change in
> the future.
>
> And splitting the statistics data into many files (per table, for
> example) would cause another performance issue when
> collecting/showing statistics at once. Just my guess though.
>
> So, I'm looking for a new way to reduce I/O for the statistics data
> in general.

Hi,

as one of the authors of the 9.3 patch (per database stats), I planned
to work on this a bit more in the 9.4 cycle. So a few comments / ideas.

I'm not entirely sure splitting the stats per table would be that bad.
After all we already have per-relation data files, so either the users
already have serious problems (so this won't make it noticeably worse)
or it will work fine. But I'm not saying it's the right choice either.

My idea was to keep the per-database stats, but allow some sort of
"random" access - updating / deleting the records in place, adding
records etc. The simplest way I could think of was adding a simple
"index" - a mapping of OID to position in the stat file.

I.e. a simple array of (oid, offset) pairs, stored in oid.stat.index or
something like that. This would make it quite simple to access an
existing record:

1: get position from the index
2: read sizeof(Entry) from the file
3: if it's update, just overwrite the bytes, for delete set isdeleted
flag (needs to be added to all entries)

or reading all the records (just read the whole file as today).
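
To sketch the access path in C -- everything here is illustrative,
and real code would of course need locking and error handling:

    #include <stdio.h>

    typedef unsigned int Oid;

    typedef struct IndexEntry
    {
        Oid   oid;        /* table OID */
        long  offset;     /* byte offset of the entry in the stat file */
    } IndexEntry;

    /* find the offset of a table's entry; -1 if not present */
    static long
    stat_index_lookup(const IndexEntry *idx, int n, Oid oid)
    {
        for (int i = 0; i < n; i++)
            if (idx[i].oid == oid)
                return idx[i].offset;
        return -1;
    }

    /* overwrite one entry in place instead of rewriting the file */
    static int
    stat_update_entry(FILE *statfile, long offset,
                      const void *entry, size_t entrysize)
    {
        if (fseek(statfile, offset, SEEK_SET) != 0)
            return -1;
        return fwrite(entry, entrysize, 1, statfile) == 1 ? 0 : -1;
    }

(The linear lookup is only for brevity; the mapping could be kept
sorted, or be a real tree.)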

regards
Tomas


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 18:42:52
Message-ID: 52277F2C.508@fuzzy.cz
Lists: pgsql-hackers

On 4.9.2013 14:43, Pavel Stehule wrote:
>
>
>
> 2013/9/4 Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
>
>
>
> Sent from my iPad
>
> On 04-Sep-2013, at 15:39, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
>
> > (2013/09/04 15:23), Atri Sharma wrote:
> >>
> >>
> >> Sent from my iPad
> >>
> >> On 04-Sep-2013, at 10:54, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
> >>
> >>> Hi,
> >>>
> >>> (2013/09/04 12:52), Atri Sharma wrote:
> >>>> On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I'm considering overhauling pgstat.stat, and would like to know
> >>>>> how many people are interested in this topic.
> >>>>>
> >>>>> As you may know, this file can be hundreds of MB in size, because
> >>>>> pgstat.stat holds all access statistics for each database, and the
> >>>>> entire pgstat.stat needs to be read and written frequently.
> >>>>>
> >>>>> As a result, pgstat.stat often generates massive I/O operations,
> >>>>> particularly when the database has a large number of tables.
> >>>>>
> >>>>> To support multi-tenancy, or just a large number of tables (up to
> >>>>> 10k tables in a single database), I think pgstat.stat needs to be
> >>>>> overhauled.
> >>>>>
> >>>>> I think using heap and btree storage in pgstat.stat would be
> >>>>> preferable, to reduce reads/writes and to allow updating the access
> >>>>> statistics of specific tables in the pgstat.stat file.
> >>>>>
> >>>>> Is this good for us?
> >>>>
> >>>> Hi,
> >>>>
> >>>> Nice thought. I/O reduction in pgstat can be really helpful.
> >>>>
> >>>> I am trying to think of our aim here. Would we be looking to split
> >>>> pgstat per table, so that the I/O write happens for only a portion
> >>>> of pgstat? Or reduce the I/O in general?
> >>>
> >>> I prefer the latter.
> >>>
> >>> Under the current implementation, DBAs need to split a single
> >>> database into many smaller databases, taking the access locality of
> >>> the tables into consideration. That's difficult, and the locality
> >>> could change in the future.
> >>>
> >>> And splitting the statistics data into many files (per table,
> >>> for example) would cause another performance issue when
> >>> collecting/showing statistics at once. Just my guess though.
> >>>
> >>> So, I'm looking for a new way to reduce I/O for the statistics data
> >>> in general.
> >>>
> >>> Regards,
> >>>
> >>>>
> >>>> If the latter, how would using a BTree help us? I would rather go
> >>>> for a range tree or something. But again, I may be completely wrong.
> >>>>
> >>>> Please elaborate a bit more on the solution we are trying to
> >>>> achieve. It seems really interesting.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Atri
> >>
> >> Right, thanks.
> >>
> >> How would using heap and BTree help here? Are we looking at a
> >> priority queue which supports the main storage system of the stats?
> >
> > For example, when you read only a single block from your table,
> > you then need to write all the values in your database statistics.
> > That often generates a large amount of I/O.
> >
> > However, if random access is allowed in the statistics, you can
> > update only a single record for the specific table which you read.
> > It would be less than 100 bytes for each table.
> >
> > So far, I have no idea how a priority queue could work here.
> > However, if the statistics are overhauled, PostgreSQL would be able
> > to host a much larger number of customers more easily.
>
>
> Ah, now I get it. Thanks a ton for the detailed explanation.
>
> Yes, a BTree will sufficiently isolate per-table stats here and
> allow for random access.
>
> Another thing I can think of is having a write-back cache which
> could probably be used as a buffer before the actual stats write. I
> am just musing here though.
>
>
> we very successfully use a tmpfs volume for pgstat files (we use a
> backport of the multiple-statfile patch from 9.3 to 9.1)

It works quite well as long as you have the objects (tables, indexes,
functions) spread across multiple databases. Once you have one database
with a very large number of objects, tmpfs is not as effective.

It's going to help with stats I/O, but it's not going to help with high
CPU usage (you're reading and parsing the stat files over and over) and
every rewrite creates a copy of the file. So if you have 400MB stats,
you will need 800MB tmpfs + some slack (say, 200MB). That means you'll
use ~1GB tmpfs although 400MB would be just fine. And this 600MB won't
be used for page cache etc.

OTOH, it's true that if you have that many objects, 600MB of RAM is not
going to help you anyway.

Tomas


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 18:50:45
Message-ID: CAFj8pRDA_TT0N_LeZ=mtm4ZDRCDkhuYLQ21oDBYETtGrgpKazA@mail.gmail.com
Lists: pgsql-hackers

>
> >
> > we very successfully use a tmpfs volume for pgstat files (we use a
> > backport of the multiple-statfile patch from 9.3 to 9.1)
>
> It works quite well as long as you have the objects (tables, indexes,
> functions) spread across multiple databases. Once you have one database
> with very large number of objects, tmpfs is not as effective.
>
> It's going to help with stats I/O, but it's not going to help with high
> CPU usage (you're reading and parsing the stat files over and over) and
> every rewrite creates a copy of the file. So if you have 400MB stats,
> you will need 800MB tmpfs + some slack (say, 200MB). That means you'll
> use ~1GB tmpfs although 400MB would be just fine. And this 600MB won't
> be used for page cache etc.
>
> OTOH, it's true that if you have that many objects, 600MB of RAM is not
> going to help you anyway.
>

and just an idea - can we use a database for storing these data? It could
be stored in unlogged tables. Second idea - hold one bgworker as a
persistent in-memory key-value store and hold the data in memory with some
optimizations - using anti-caching and similar in-memory database features.

Pavel

>
> Tomas


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 18:58:11
Message-ID: 522782C3.1010409@fuzzy.cz
Lists: pgsql-hackers

On 4.9.2013 14:13, Stephen Frost wrote:
> * Satoshi Nagayasu (snaga(at)uptime(dot)jp) wrote:
>
>> Yes, we have already split the single pgstat.stat file into
>> several pieces.
>>
>> However, we still need to read/write a large amount of statistics
>> data when we have a large number of tables in a single database, or
>> when multiple databases are being accessed. Right?
>
> Would simply also splitting per tablespace help?

I don't think that's a good solution. Forcing the users to use
tablespaces just to minimize the stats overhead is not that far from
forcing them to use multiple databases.

Tablespaces have disadvantages too, so I'd suggest looking for a
solution meeting the "just works" criterion.

>> I think the issue here is that it is necessary to write/read
>> statistics data even if it has not actually changed.
>>
>> So, I'm wondering how we can minimize read/write operations on
>> these statistics data files by using heap and btree storage.
>
> It does sound like an interesting idea to use heap/btree instead, but
> I wonder about the effort involved, particularly around coordinating
> access. We wouldn't want to end up introducing additional contention
> points by doing this.

Yes, this is definitely an important thing to keep in mind. Maintaining
the "index" (no matter what kind of index will be used) will cause some
contention, but I believe it will be manageable with a bit of effort.

Tomas


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 18:59:17
Message-ID: 20130904185917.GF5227@eldon.alvh.no-ip.org
Lists: pgsql-hackers

Tomas Vondra wrote:

> My idea was to keep the per-database stats, but allow some sort of
> "random" access - updating / deleting the records in place, adding
> records etc. The simplest way I could think of was adding a simple
> "index" - a mapping of OID to position in the stat file.
>
> I.e. a simple array of (oid, offset) pairs, stored in oid.stat.index or
> something like that. This would make it quite simple to access an
> existing record:
>
> 1: get position from the index
> 2: read sizeof(Entry) from the file
> 3: if it's update, just overwrite the bytes, for delete set isdeleted
> flag (needs to be added to all entries)
>
> or reading all the records (just read the whole file as today).

Sounds reasonable. However, I think the index should be a real index,
i.e. have a tree structure that can be walked down, not just a plain
array. If you have a 400 MB stat file, then you must have about 4
million tables, and you will not want to scan such a large array every
time you want to find an entry.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-04 20:46:55
Message-ID: 52279C3F.4090704@fuzzy.cz
Lists: pgsql-hackers

On 4.9.2013 20:59, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>> My idea was to keep the per-database stats, but allow some sort of
>> "random" access - updating / deleting the records in place, adding
>> records etc. The simplest way I could think of was adding a simple
>> "index" - a mapping of OID to position in the stat file.
>>
>> I.e. a simple array of (oid, offset) pairs, stored in oid.stat.index or
>> something like that. This would make it quite simple to access an
>> existing record:
>>
>> 1: get position from the index
>> 2: read sizeof(Entry) from the file
>> 3: if it's update, just overwrite the bytes, for delete set isdeleted
>> flag (needs to be added to all entries)
>>
>> or reading all the records (just read the whole file as today).
>
> Sounds reasonable. However, I think the index should be a real index,
> i.e. have a tree structure that can be walked down, not just a plain
> array. If you have a 400 MB stat file, then you must have about 4
> million tables, and you will not want to scan such a large array every
> time you want to find an entry.

I was thinking about a sorted array, so a bisection would be a simple
and fast way to search. New items could be added to another small
unsorted array (say, 1000 elements), and the main array would be
extended and re-sorted only when the small one gets full.
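
Roughly like this -- again only a sketch, with sizes and names being
arbitrary:

    #include <stdlib.h>

    typedef unsigned int Oid;

    typedef struct IndexEntry { Oid oid; long offset; } IndexEntry;

    static int
    entry_cmp(const void *a, const void *b)
    {
        Oid x = ((const IndexEntry *) a)->oid;
        Oid y = ((const IndexEntry *) b)->oid;
        return (x > y) - (x < y);
    }

    /* bisection over the big sorted array, linear scan of the overflow */
    static IndexEntry *
    stat_index_find(IndexEntry *sorted, int nsorted,
                    IndexEntry *overflow, int noverflow, Oid oid)
    {
        IndexEntry  key = {oid, 0};
        IndexEntry *hit = bsearch(&key, sorted, nsorted,
                                  sizeof(IndexEntry), entry_cmp);

        if (hit)
            return hit;
        for (int i = 0; i < noverflow; i++)
            if (overflow[i].oid == oid)
                return &overflow[i];
        return NULL;
    }

Merging the overflow into the sorted array and qsort()-ing the result
once per ~1000 insertions keeps the amortized cost low.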

Tomas


From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-05 03:19:10
Message-ID: 5227F82E.8020101@uptime.jp
Lists: pgsql-hackers

(2013/09/05 3:50), Pavel Stehule wrote:
> > we very successfully use a tmpfs volume for pgstat files (we use a
> > backport of the multiple-statfile patch from 9.3 to 9.1)
>
> It works quite well as long as you have the objects (tables, indexes,
> functions) spread across multiple databases. Once you have one database
> with a very large number of objects, tmpfs is not as effective.
>
> It's going to help with stats I/O, but it's not going to help with high
> CPU usage (you're reading and parsing the stat files over and over) and
> every rewrite creates a copy of the file. So if you have 400MB stats,
> you will need 800MB tmpfs + some slack (say, 200MB). That means you'll
> use ~1GB tmpfs although 400MB would be just fine. And this 600MB won't
> be used for page cache etc.
>
> OTOH, it's true that if you have that many objects, 600MB of RAM is not
> going to help you anyway.
>
>
> and just an idea - can we use a database for storing these data? It
> could be stored in unlogged tables. Second idea - hold one bgworker
> as a persistent in-memory key-value store and hold the data in memory
> with some optimizations - using anti-caching and similar in-memory
> database features.

Yeah, I'm interested in this idea too.

If the stats collector has a dedicated connection to the backend in
order to store statistics into dedicated tables, we can easily take
advantage of index (btree, or hash?) and heap storage.

Is this worth trying?

Regards,
--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp


From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-05 03:28:33
Message-ID: 5227FA61.6050605@uptime.jp
Lists: pgsql-hackers

(2013/09/05 3:59), Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>> My idea was to keep the per-database stats, but allow some sort of
>> "random" access - updating / deleting the records in place, adding
>> records etc. The simplest way I could think of was adding a simple
>> "index" - a mapping of OID to position in the stat file.
>>
>> I.e. a simple array of (oid, offset) pairs, stored in oid.stat.index or
>> something like that. This would make it quite simple to access an
>> existing record:
>>
>> 1: get position from the index
>> 2: read sizeof(Entry) from the file
>> 3: if it's update, just overwrite the bytes, for delete set isdeleted
>> flag (needs to be added to all entries)
>>
>> or reading all the records (just read the whole file as today).
>
> Sounds reasonable. However, I think the index should be a real index,
> i.e. have a tree structure that can be walked down, not just a plain
> array. If you have a 400 MB stat file, then you must have about 4
> million tables, and you will not want to scan such a large array every
> time you want to find an entry.

I thought of an array structure at first.

But, for now, I think we should have a real index for the
statistics data because we already have several index implementations,
and it will allow us to minimize read/write operations.

BTW, what kind of index would be preferred for this purpose?
btree or hash?

If we use btree, do we need "range scan" capability on the statistics
tables? I have no idea so far.

Regards,
--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-05 03:54:57
Message-ID: 969AF419-93E5-45A3-846A-357F94F77DAB@gmail.com
Lists: pgsql-hackers

Sent from my iPad

On 05-Sep-2013, at 8:58, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:

> (2013/09/05 3:59), Alvaro Herrera wrote:
>> Tomas Vondra wrote:
>>
>>> My idea was to keep the per-database stats, but allow some sort of
>>> "random" access - updating / deleting the records in place, adding
>>> records etc. The simplest way I could think of was adding a simple
>>> "index" - a mapping of OID to position in the stat file.
>>>
>>> I.e. a simple array of (oid, offset) pairs, stored in oid.stat.index or
>>> something like that. This would make it quite simple to access an
>>> existing record:
>>>
>>> 1: get position from the index
>>> 2: read sizeof(Entry) from the file
>>> 3: if it's update, just overwrite the bytes, for delete set isdeleted
>>> flag (needs to be added to all entries)
>>>
>>> or reading all the records (just read the whole file as today).
>>
>> Sounds reasonable. However, I think the index should be a real index,
>> i.e. have a tree structure that can be walked down, not just a plain
>> array. If you have a 400 MB stat file, then you must have about 4
>> million tables, and you will not want to scan such a large array every
>> time you want to find an entry.
>
> I thought of an array structure at first.
>
> But, for now, I think we should have a real index for the
> statistics data because we already have several index implementations,
> and it will allow us to minimize read/write operations.
>
> BTW, what kind of index would be preferred for this purpose?
> btree or hash?
>
> If we use btree, do we need "range scan" capability on the statistics
> tables? I have no idea so far.
>

The thing I am interested in is range scan. That is the reason I wish to explore range tree usage here, maybe as a secondary index.

Regards,

Atri


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-05 05:29:14
Message-ID: 20130905052914.GA6067@eldon.alvh.no-ip.org
Lists: pgsql-hackers

Satoshi Nagayasu wrote:

> But, for now, I think we should have a real index for the
> statistics data because we already have several index implementations,
> and it will allow us to minimize read/write operations.
>
> BTW, what kind of index would be preferred for this purpose?
> btree or hash?

I find it hard to get excited about using the AM interface for this
purpose. To me it makes a lot more sense to have separate, much
simpler code. We don't need any transactionality, user defined types,
user defined operators, or anything like that.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Satoshi Nagayasu <snaga(at)uptime(dot)jp>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-05 07:36:35
Message-ID: CAOeZVic+oAzt7c08-Zj4N6tUkAAVGsCD9JJoFqPMa63jP1GueA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 5, 2013 at 10:59 AM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Satoshi Nagayasu wrote:
>
>> But, for now, I think we should have a real index for the
>> statistics data because we already have several index implementations,
>> and it will allow us to minimize read/write operations.
>>
>> BTW, what kind of index would be preferred for this purpose?
>> btree or hash?
>
> I find it hard to get excited about using the AM interface for this
> purpose. To me it makes a lot more sense to have separate, much
> simpler code. We don't need any transactionality, user defined types,
> user defined operators, or anything like that.

+1.

But would not rewriting a lot of existing functionality potentially
lead to points of contention and/or much more effort?

Regards,

Atri
--
Regards,

Atri
l'apprenant


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-07 22:57:34
Message-ID: 522BAF5E.1090001@fuzzy.cz
Lists: pgsql-hackers

On 5.9.2013 09:36, Atri Sharma wrote:
> On Thu, Sep 5, 2013 at 10:59 AM, Alvaro Herrera
> <alvherre(at)2ndquadrant(dot)com> wrote:
>> Satoshi Nagayasu wrote:
>>
>>> But, for now, I think we should have a real index for the
>>> statistics data because we already have several index implementations,
>>> and it will allow us to minimize read/write operations.
>>>
>>> BTW, what kind of index would be preferred for this purpose?
>>> btree or hash?
>>
>> I find it hard to get excited about using the AM interface for
>> this purpose. To me it makes a lot more sense to have separate,
>> much simpler code. We don't need any transactionality, user
>> defined types, user defined operators, or anything like that.
>
> +1.
>
> But would not rewriting a lot of existing functionality
> potentially lead to points of contention and/or much more effort?

Don't forget the stats are written only by the postmaster; all the
regular backends only read it (and eventually send updates back). But
yes, contention might be a problem, because there will have to be some
kind of locking that is not needed now when the postmaster is writing
a fresh copy into a new file.

But I think we need to implement something and then measure this,
because even with the contention the overall performance might
actually improve.

I'd vote to try a simple approach first - adding some simple array
'index' allowing fast access to particular records etc. At least that
was my plan. But feel free to implement something more advanced (e.g. a
BTree storage) and we can compare the results.

Tomas


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-07 23:09:40
Message-ID: 522BB234.50607@fuzzy.cz
Lists: pgsql-hackers

On 5.9.2013 07:29, Alvaro Herrera wrote:
> Satoshi Nagayasu wrote:
>
>> But, for now, I think we should have a real index for the
>> statistics data because we already have several index implementations, and
>> it will allow us to minimize read/write operations.
>>
>> BTW, what kind of index would be preferred for this purpose? btree
>> or hash?
>
> I find it hard to get excited about using the AM interface for this
> purpose. To me it makes a lot more sense to have separate, much
> simpler code. We don't need any transactionality, user defined
> types, user defined operators, or anything like that.

+1 to these concerns

And I think using regular tables might actually cause more harm than
good. For example, let's say we have a large database with many
objects (which is the aim of this thread) and high activity - sessions
accessing objects, i.e. updating many "rows" in the stats tables.

Now, the stats table is likely to bloat thanks to the MVCC
copy-on-update. Not a good thing, and it might easily happen that the
price for maintenance of the table will be much higher than what we
saved.

There are probably other similar scenarios.

Tomas


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-08 05:01:15
Message-ID: 24272C3B-3EE3-4B7A-8C51-86893B129598@gmail.com
Lists: pgsql-hackers

Sent from my iPad

On 08-Sep-2013, at 4:27, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> On 5.9.2013 09:36, Atri Sharma wrote:
>> On Thu, Sep 5, 2013 at 10:59 AM, Alvaro Herrera
>> <alvherre(at)2ndquadrant(dot)com> wrote:
>>> Satoshi Nagayasu wrote:
>>>
>>>> But, for now, I think we should have a real index for the
>>>> statistics data, because we already have several index
>>>> implementations, and it will allow us to minimize read/write
>>>> operations.
>>>>
>>>> BTW, what kind of index would be preferred for this purpose?
>>>> btree or hash?
>>>
>>> I find it hard to get excited about using the AM interface for
>>> this purpose. To me it makes a lot more sense to have separate,
>>> much simpler code. We don't need any transactionality, user
>>> defined types, user defined operators, or anything like that.
>>
>> +1.
>>
>> But, would not rewriting a lot of existing functionality
>> potentially lead to points of contention and/or much more effort?
>
> Don't forget the stats are written only by the postmaster; all the
> regular backends only read it (and eventually send updates back). But
> yes, contention might be a problem, because there will have to be some
> kind of locking that is not needed now, when the postmaster simply
> writes a fresh copy into a new file.
>
> But I think we need to implement something and then measure it,
> because even with the contention, the overall performance might
> actually improve.
>
> I'd vote for trying a simple approach first - adding a simple array
> 'index' allowing fast access to particular records, etc. At least that
> was my plan. But feel free to implement something more advanced (e.g. a
> BTree storage) and we can compare the results.
>
>
+1 on implementing a simple approach first and measuring it.

My focus here is to identify what kinds of queries we expect to serve
from the stats. I think someone mentioned range queries upthread. I
feel we should be looking at range trees as a secondary index, if not
as the primary storage, for pgstat.
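
To illustrate the kind of range query I mean (a rough sketch with
made-up names, assuming the stats records are keyed by table OID and
kept sorted - not a concrete proposal): even without a real range
tree, two binary searches bound the slice of records to fetch.

#include <stdio.h>

typedef unsigned int Oid;

/* Find the first position in a sorted OID array that is >= key. */
static size_t
oid_lower_bound(const Oid *a, size_t n, Oid key)
{
    size_t      lo = 0,
                hi = n;

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

int
main(void)
{
    /* sorted table OIDs; e.g. one customer's schema = contiguous range */
    Oid         tables[] = {16384, 16390, 16399, 16410, 16425};
    size_t      n = sizeof(tables) / sizeof(tables[0]);

    /* serve the range query [16390, 16410]: */
    size_t      lo = oid_lower_bound(tables, n, 16390);
    size_t      hi = oid_lower_bound(tables, n, 16410 + 1);

    for (size_t i = lo; i < hi; i++)
        printf("would fetch stats record for OID %u\n", tables[i]);
    return 0;
}

A real range tree buys more for multi-dimensional conditions, but for
a single sorted key this already avoids touching unrelated records.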

Regards,

Atri


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-08 05:22:02
Message-ID: D037D8DF-3EB6-4A6B-8374-6222D2BB5050@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08-Sep-2013, at 4:27, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> On 5.9.2013 09:36, Atri Sharma wrote:
>> On Thu, Sep 5, 2013 at 10:59 AM, Alvaro Herrera
>> <alvherre(at)2ndquadrant(dot)com> wrote:
>>> Satoshi Nagayasu wrote:
>>>
>>>> But, for now, I think we should have a real index for the
>>>> statistics data, because we already have several index
>>>> implementations, and it will allow us to minimize read/write
>>>> operations.
>>>>
>>>> BTW, what kind of index would be preferred for this purpose?
>>>> btree or hash?
>>>
>>> I find it hard to get excited about using the AM interface for
>>> this purpose. To me it makes a lot more sense to have separate,
>>> much simpler code. We don't need any transactionality, user
>>> defined types, user defined operators, or anything like that.
>>
>> +1.
>>
>> But, would not rewriting a lot of existing functionality
>> potentially lead to points of contention and/or much more effort?
>
> Don't forget the stats are written only by the postmaster; all the
> regular backends only read it (and eventually send updates back). But
> yes, contention might be a problem, because there will have to be some
> kind of locking that is not needed now, when the postmaster simply
> writes a fresh copy into a new file.
>
> But I think we need to implement something and then measure it,
> because even with the contention, the overall performance might
> actually improve.
>
> I'd vote for trying a simple approach first - adding a simple array
> 'index' allowing fast access to particular records, etc. At least that
> was my plan. But feel free to implement something more advanced (e.g. a
> BTree storage) and we can compare the results.

Another thing I would want to explore is whether we could somehow
prioritise the more frequently accessed records to reduce their access
times even further. I am thinking along the lines of self-organising
lists. I am not sure if and how this idea of self-organisation could
be implemented in a BTree or any other tree, though.
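
For example, a move-to-front list, one classic self-organising scheme
(again just a sketch with made-up names, not a concrete proposal):

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int Oid;

typedef struct StatNode
{
    Oid         tableoid;
    struct StatNode *next;
} StatNode;

/*
 * Find a stats record; on a hit, unlink it and relink it at the head,
 * so frequently accessed tables drift toward the front of the list.
 */
static StatNode *
lookup_mtf(StatNode **head, Oid tableoid)
{
    StatNode  **link = head;

    for (StatNode *node = *head; node != NULL;
         link = &node->next, node = node->next)
    {
        if (node->tableoid == tableoid)
        {
            *link = node->next; /* unlink from current position */
            node->next = *head; /* relink at the head */
            *head = node;
            return node;
        }
    }
    return NULL;
}

static StatNode *
push(StatNode *head, Oid tableoid)
{
    StatNode   *n = malloc(sizeof(StatNode));

    n->tableoid = tableoid;
    n->next = head;
    return n;
}

int
main(void)
{
    StatNode   *head = NULL;

    for (Oid oid = 16384; oid < 16389; oid++)
        head = push(head, oid);

    lookup_mtf(&head, 16384);   /* the "hot" table moves to the front */

    for (StatNode *node = head; node != NULL; node = node->next)
        printf("%u ", node->tableoid);
    printf("\n");                /* (leaks the list; fine for a demo) */
    return 0;
}

In a tree, the analogous trick would be something like splaying, where
accessed keys rotate toward the root.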

Regards,

Atri


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-08 21:04:19
Message-ID: CAMkU=1wVpbsy_0o0LoPaqB1oVYdrAE7mLW-0Lm5L+DOP7p+mtQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 3, 2013 at 10:09 PM, Satoshi Nagayasu <snaga(at)uptime(dot)jp> wrote:
> Hi,
>
>
> (2013/09/04 13:07), Alvaro Herrera wrote:
>>
>> Satoshi Nagayasu wrote:
>>
>>> As you may know, this file can be hundreds of MB in size, because
>>> pgstat.stat holds all access statistics for each database, and it needs
>>> to read/write the entire pgstat.stat frequently.
>>>
>>> As a result, pgstat.stat often generates massive I/O operations,
>>> particularly when the database has a large number of tables.
>>
>>
>> We already changed it:
>
>>
>> commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
>> Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
>> Date: Mon Feb 18 17:56:08 2013 -0300
>>
>> Split pgstat file in smaller pieces
>
> Thanks for the comments. I forgot to mention that.
>
> Yes, we have already split the single pgstat.stat file into
> several pieces.
>
> However, we still need to read/write a large amount of statistics
> data when we have a large number of tables in a single database,
> or multiple databases being accessed. Right?

Do you have a test case for measuring this? I vaguely remember from
when I was testing the split patch that, after that improvement, the
remaining load was so low that there was little point in optimizing it
further.

Cheers,

Jeff


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-08 23:19:31
Message-ID: 522D0603.4040407@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 8.9.2013 23:04, Jeff Janes wrote:
> On Tue, Sep 3, 2013 at 10:09 PM, Satoshi Nagayasu <snaga(at)uptime(dot)jp>
> wrote:
>> Hi,
>>
>>
>> (2013/09/04 13:07), Alvaro Herrera wrote:
>>>
>>> Satoshi Nagayasu wrote:
>>>
>>>> As you may know, this file can be hundreds of MB in size,
>>>> because pgstat.stat holds all access statistics for each
>>>> database, and it needs to read/write the entire pgstat.stat
>>>> frequently.
>>>>
>>>> As a result, pgstat.stat often generates massive I/O operations,
>>>> particularly when the database has a large number of tables.
>>>
>>>
>>> We already changed it:
>>
>>>
>>> commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
>>> Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
>>> Date: Mon Feb 18 17:56:08 2013 -0300
>>>
>>> Split pgstat file in smaller pieces
>>
>> Thanks for the comments. I forgot to mention that.
>>
>> Yes, we have already split the single pgstat.stat file into several
>> pieces.
>>
>> However, we still need to read/write a large amount of statistics
>> data when we have a large number of tables in a single database, or
>> multiple databases being accessed. Right?
>
> Do you have a test case for measuring this? I vaguely remember from
> when I was testing the split patch that, after that improvement, the
> remaining load was so low that there was little point in optimizing
> it further.

This is actually a pretty good point. Creating a synthetic test case is
quite simple - just create 1,000,000 tables in a single database - but
I'm wondering whether it's actually realistic. Do we have a real-world
example where the current "one stat file per db" is not enough?
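
For anyone who wants to try it, a throwaway generator might look like
the sketch below (libpq; the connection string, table names and count
are placeholders, and IIRC each table may also need to be accessed
once before a stats entry for it appears):

/* build: cc gen_tables.c -I$(pg_config --includedir) \
 *                        -L$(pg_config --libdir) -lpq */
#include <stdio.h>
#include <libpq-fe.h>

#define NTABLES 100000          /* scale up to 1000000 for the extreme case */

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=statstest");  /* placeholder DSN */
    char        sql[128];

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    for (int i = 0; i < NTABLES; i++)
    {
        PGresult   *res;

        snprintf(sql, sizeof(sql), "CREATE TABLE stat_t%d (id int)", i);
        res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "table %d: %s", i, PQerrorMessage(conn));
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}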

The reason why I worked on the split patch is that our application is
slightly crazy and creates a lot of tables (+ indexes) on the fly, and
as we have up to a thousand databases on each host, we often ended up
with a huge stat file.

Splitting the stat file improved that considerably, although that's
partially because we have the stats on a tmpfs, so I/O is not a problem,
and the CPU overhead is negligible thanks to splitting the stats per
database.

But AFAIK there are operating systems where creating a filesystem in RAM
is not that simple - e.g. Windows. In such cases, even a moderate number
of objects may be a significant issue I/O-wise. But then again, I can't
really think of a reasonable system creating that many objects in a
single database (except for e.g. a shared database using schemas instead
of databases).

Tomas


From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-09 07:37:54
Message-ID: 522D7AD2.4060203@uptime.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

(2013/09/09 8:19), Tomas Vondra wrote:
> On 8.9.2013 23:04, Jeff Janes wrote:
>> On Tue, Sep 3, 2013 at 10:09 PM, Satoshi Nagayasu <snaga(at)uptime(dot)jp>
>> wrote:
>>> Hi,
>>>
>>>
>>> (2013/09/04 13:07), Alvaro Herrera wrote:
>>>>
>>>> Satoshi Nagayasu wrote:
>>>>
>>>>> As you may know, this file can be hundreds of MB in size,
>>>>> because pgstat.stat holds all access statistics for each
>>>>> database, and it needs to read/write the entire pgstat.stat
>>>>> frequently.
>>>>>
>>>>> As a result, pgstat.stat often generates massive I/O operations,
>>>>> particularly when the database has a large number of tables.
>>>>
>>>>
>>>> We already changed it:
>>>
>>>>
>>>> commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
>>>> Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
>>>> Date: Mon Feb 18 17:56:08 2013 -0300
>>>>
>>>> Split pgstat file in smaller pieces
>>>
>>> Thanks for the comments. I forgot to mention that.
>>>
>>> Yes, we have already split the single pgstat.stat file into several
>>> pieces.
>>>
>>> However, we still need to read/write a large amount of statistics
>>> data when we have a large number of tables in a single database, or
>>> multiple databases being accessed. Right?
>>
>> Do you have a test case for measuring this? I vaguely remember from
>> when I was testing the split patch that, after that improvement, the
>> remaining load was so low that there was little point in optimizing
>> it further.
>
> This is actually a pretty good point. Creating a synthetic test case is
> quite simple - just create 1,000,000 tables in a single database - but
> I'm wondering whether it's actually realistic. Do we have a real-world
> example where the current "one stat file per db" is not enough?

I have several assumptions in mind:

- A single shared database contains thousands of customers.
- Each customer has hundreds of tables and indexes.
- Customers are separated by schemas (namespaces) within the single database.
- The application server uses connection pooling for performance reasons.
- The workload (locality of table access) cannot be predicted.

Looks reasonable?

> The reason why I worked on the split patch is that our application is
> slightly crazy and creates a lot of tables (+ indexes) on the fly, and
> as we have up to a thousand databases on each host, we often ended up
> with a huge stat file.
>
> Splitting the stat file improved that considerably, although that's
> partially because we have the stats on a tmpfs, so I/O is not a problem,
> and the CPU overhead is negligible thanks to splitting the stats per
> database.

I agree that splitting a single large database into several pieces,
like thousands of tiny databases, could be an option in some cases.

However, what I intend here is to eliminate those limitations on
database design.

In fact, when considering connection pooling, splitting a database
is not a good idea, because AFAIK many connection poolers manage
connections per database.

So, I'd like to support 100k tables in a single database.

Any comments?

Regards,

> But AFAIK there are operating systems where creating a filesystem in RAM
> is not that simple - e.g. Windows. In such cases, even a moderate number
> of objects may be a significant issue I/O-wise. But then again, I can't
> really think of a reasonable system creating that many objects in a
> single database (except for e.g. a shared database using schemas instead
> of databases).
>
> Tomas
>
>

--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-09 16:50:33
Message-ID: CA+Tgmobkwo9miuDiMB6Z8MfEt2E+oP1ivQ+=9qa_4TfkqEZYSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 7, 2013 at 6:57 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> Don't forget the stats are written only by the postmaster; all the
> regular backends only read it (and eventually send updates back).

The postmaster, or the stats collector?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Tomas Vondra" <tv(at)fuzzy(dot)cz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [rfc] overhauling pgstat.stat
Date: 2013-09-09 18:07:12
Message-ID: 60593e3ea639cd2ae728035100d034e2.squirrel@sq.gransy.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 9 Září 2013, 18:50, Robert Haas wrote:
> On Sat, Sep 7, 2013 at 6:57 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> Don't forget the stats are written only by the postmaster; all the
>> regular backends only read it (and eventually send updates back).
>
> The postmaster, or the stats collector?

Stats collector, of course. I meant to point out that the write activity
comes from a single dedicated process, which may not be that obvious, and
I somehow managed to name the incorrect one.

Tomas