9.3 feature proposal: vacuumdb -j #

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-13 21:50:32
Message-ID: 4F10A728.7090403@agliodbs.com
Lists: pgsql-hackers

Hackers,

It occurs to me that I would find it quite personally useful if the
vacuumdb utility was multiprocess capable.

For example, just today I needed to manually analyze a database with
over 500 tables, on a server with 24 cores. And I needed to know when
the analyze was done, because it was part of a downtime. I had to
resort to a python script.

I'm picturing doing this in the simplest way possible: get the list of
tables and indexes, divide them by the number of processes, and give
each child process its own list.
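
For illustration, a rough sketch of that simple split, in Python (a
hypothetical example, not my actual script; table names are passed to
psql unquoted):

import multiprocessing
import subprocess

def worker(tables, dbname):
    # Each child grinds through its own fixed list of tables.
    for t in tables:
        subprocess.check_call(['psql', '-d', dbname, '-c',
                               'ANALYZE %s;' % t])

def analyze_all(tables, dbname, nproc):
    # Deal the table list out round-robin across nproc children up front.
    chunks = [tables[i::nproc] for i in range(nproc)]
    procs = [multiprocessing.Process(target=worker, args=(chunk, dbname))
             for chunk in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()    # returns only once every ANALYZE has finished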

Any reason not to hack on this for 9.3?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Jan Lentfer <Jan(dot)Lentfer(at)web(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-13 22:03:16
Message-ID: 4F10AA24.6000608@web.de
Lists: pgsql-hackers

On 13.01.2012 22:50, Josh Berkus wrote:
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores. And I needed to know when
> the analyze was done, because it was part of a downtime. I had to
> resort to a python script.
>
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.
>
> Any reason not to hack on this for 9.3?

I don't see any reason not to do it, but plenty of reasons to do it.
Right now I have systems hosting many databases that I need to vacuum
full from time to time. I have wrapped vacuumdb in a shell script to
actually use all the capacity that is available. A plain vacuumdb -faz
just isn't that useful on large machines anymore.

Jan


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-13 22:09:53
Message-ID: 201201132309.53263.andres@anarazel.de
Lists: pgsql-hackers

On Friday, January 13, 2012 10:50:32 PM Josh Berkus wrote:
> Hackers,
>
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores. And I needed to know when
> the analyze was done, because it was part of a downtime. I had to
> resort to a python script.
>
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.
That doesn't sound like a good idea. It's way too likely that you will end up
with one backend doing all the work because it got some big tables.

I don't think this task deserves threads or subprocesses. Multiple
connections from one process seem way more sensible and mostly avoid the
above problem.
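
A minimal sketch of that single-process, multi-connection approach, using
psycopg2's asynchronous mode purely for illustration (the real vacuumdb
would use libpq's async API in C; DSN handling and identifier quoting are
hand-waved):

import select
import psycopg2
from psycopg2 import extensions

def wait_ready(conn):
    # Drive an async connection until its current operation completes.
    while True:
        state = conn.poll()
        if state == extensions.POLL_OK:
            return
        elif state == extensions.POLL_READ:
            select.select([conn.fileno()], [], [])
        elif state == extensions.POLL_WRITE:
            select.select([], [conn.fileno()], [])

def vacuum_all(dsn, tables, nconn):
    # Async connections are autocommit, so VACUUM is allowed on them.
    conns = [psycopg2.connect(dsn, async_=1) for _ in range(nconn)]
    for c in conns:
        wait_ready(c)               # finish the connection handshake
    todo = list(tables)             # ideally sorted biggest-first
    busy = {}                       # connection -> cursor running a command
    while todo or busy:
        for c in conns:
            if c not in busy and todo:
                cur = c.cursor()
                cur.execute('VACUUM ANALYZE %s;' % todo.pop(0))
                busy[c] = cur       # keep the cursor alive until done
        # Sleep until at least one busy connection has data, then poll it.
        ready, _, _ = select.select([c.fileno() for c in busy], [], [])
        for c in list(busy):
            if c.fileno() in ready and c.poll() == extensions.POLL_OK:
                del busy[c]         # command finished; connection is free

A free connection is handed the next table as soon as it finishes one, so
no connection sits idle while another works through all the big tables.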

Andres


From: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-13 22:12:35
Message-ID: 4F10AC53.9070009@timbira.com
Lists: pgsql-hackers

On 13-01-2012 18:50, Josh Berkus wrote:
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
It is in the middle of my TODO list. reindexdb is in the plans too.

--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-13 22:15:48
Message-ID: CAFNqd5U71fpJ28Xy7EvUXqAcV1ghom+XpPcW3HcTDTwSO9Xkeg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 13, 2012 at 4:50 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores.   And I needed to know when
> the analyze was done, because it was part of a downtime.  I had to
> resort to a python script.
>
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.

I think "simplest" isn't *quite* best...

There's the risk that all the big tables get tied to one child, and so
the one child is doing them serially.

Better:

Have two logical tasks:
a) A process that manages the list, and
b) Child processes doing vacuums.

Each time a child completes a table, it asks the parent for another one.

So if there are 8 big tables and 12 child processes, it's *certain*
that the 8 big tables will be spread across the children.

It guarantees that the child processes will all be busy until there
are fewer tables left than there are child processes.
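
A sketch of that handout pattern in Python (hypothetical; a shared queue
stands in for explicit parent/child messages, and table names go to psql
unquoted):

import multiprocessing
import subprocess

def worker(queue, dbname):
    while True:
        table = queue.get()
        if table is None:           # sentinel: the list is exhausted
            return
        subprocess.check_call(['psql', '-d', dbname, '-c',
                               'VACUUM ANALYZE %s;' % table])

def vacuum_all(tables, dbname, nworkers):
    queue = multiprocessing.Queue()
    for t in tables:                # ideally sorted biggest-first
        queue.put(t)
    for _ in range(nworkers):
        queue.put(None)             # one sentinel per worker
    workers = [multiprocessing.Process(target=worker, args=(queue, dbname))
               for _ in range(nworkers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()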

--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-13 22:16:04
Message-ID: 4F10AD24.1030104@agliodbs.com
Lists: pgsql-hackers

On 1/13/12 2:12 PM, Euler Taveira de Oliveira wrote:
> On 13-01-2012 18:50, Josh Berkus wrote:
>> It occurs to me that I would find it quite personally useful if the
>> vacuumdb utility was multiprocess capable.
>>
> It is in the mid of my TODO list. reindexdb is in the plans too.

I'm even happier to have someone else do it. ;-)

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Susanne Ebrecht <susanne(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-17 12:18:53
Message-ID: 4F15672D.6080904@2ndquadrant.com
Lists: pgsql-hackers

On 13.01.2012 22:50, Josh Berkus wrote:
> Hackers,
>
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores. And I needed to know when
> the analyze was done, because it was part of a downtime. I had to
> resort to a python script.
>
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.
>
> Any reason not to hack on this for 9.3?
>

Hello,

I like the idea - but ...
I would prefer an option that lets the user say how many cores the work
should be spread across. Something like --share-cores=N.

The default would be the machine's total core count, but users should be
able to say: OK, my machine has 24 cores, but I want vacuumdb to use just
12 of them.

Especially at startups you find machines that aren't database-only
machines. Often database and web server share a single machine.

You could also have several clusters running on the same machine to offer
your business in different languages (one cluster per language). I have
already seen such a setup.

There might be other work running on those cores, so the user should be
able to decide how many cores vacuumdb gets to share.

Susanne

--
Dipl. Inf. Susanne Ebrecht - 2ndQuadrant
PostgreSQL Development, 24x7 Support, Training and Services
www.2ndQuadrant.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Susanne Ebrecht <susanne(at)2ndquadrant(dot)com>
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-17 12:23:19
Message-ID: 201201171323.19582.andres@anarazel.de
Lists: pgsql-hackers

On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
> I would prefer an option that lets the user say how many cores the work
> should be spread across. Something like --share-cores=N.
Uhm. -j # does exactly that, or am I missing your point?

Andres


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Susanne Ebrecht <susanne(at)2ndquadrant(dot)com>
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-17 12:33:06
Message-ID: CAJKUy5hJLSsJBAii1YaWa7jvGqZY2m78cWOc05xsQuTj0ypUYA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
>> I would prefer an option that lets the user say how many cores the work
>> should be spread across. Something like --share-cores=N.
> Uhm. -j # does exactly that, or am I missing your point?
>

Not really.

If you have 12 cores and you say -j 12, you get one process per core.
With Susanne's suggestion, AFAIUI, you could say -j 12 --share-cores=6,
so you would use only 6 of the 12 cores and run 2 processes per core.

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Susanne Ebrecht <susanne(at)2ndquadrant(dot)com>
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-17 13:55:47
Message-ID: 4F157DE3.2070801@dunslane.net
Lists: pgsql-hackers

On 01/17/2012 07:33 AM, Jaime Casanova wrote:
> On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
>>> I would prefer an option that lets the user say how many cores the work
>>> should be spread across. Something like --share-cores=N.
>> Uhm. -j # does exactly that, or am I missing your point?
>>
> Not really.
>
> If you have 12 cores and you say -j 12, you get one process per core.
> With Susanne's suggestion, AFAIUI, you could say -j 12 --share-cores=6,
> so you would use only 6 of the 12 cores and run 2 processes per core.
>

That looks messy. IMNSHO it should work just like pg_restore's -j.

cheers

andrew


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Jaime Casanova <jaime(at)2ndquadrant(dot)com>, Susanne Ebrecht <susanne(at)2ndquadrant(dot)com>
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-17 14:04:47
Message-ID: 201201171504.47925.andres@anarazel.de
Lists: pgsql-hackers

On Tuesday, January 17, 2012 01:33:06 PM Jaime Casanova wrote:
> On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
> >> I would prefer an option that lets the user say how many cores the
> >> work should be spread across. Something like --share-cores=N.
> >
> > Uhm. -j # does exactly that, or am I missing your point?
>
> Not really.
>
> If you have 12 cores and you say -j 12, you get one process per core.
> With Susanne's suggestion, AFAIUI, you could say -j 12 --share-cores=6,
> so you would use only 6 of the 12 cores and run 2 processes per core.
I don't really get what that would achieve. If vacuumdb itself is the
bottleneck in any of this, we did something *very* wrong. (In my opinion,
using processes for this is pointless anyway; using async queries seems
much easier for this special case, especially for distributing individual
commands.)
I also don't see how you could enforce sharing cores on the server side
(well, there are cpusets, but we're surely not going to introduce use of
those just for vacuumdb).

Andres


From: Jim Nasby <jim(at)nasby(dot)net>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-18 00:09:53
Message-ID: B6107D76-1824-400C-ADC9-F01AA1D19E7A@nasby.net
Lists: pgsql-hackers

On Jan 13, 2012, at 4:15 PM, Christopher Browne wrote:
> Have two logical tasks:
> a) A process that manages the list, and
> b) Child processes doing vacuums.
>
> Each time a child completes a table, it asks the parent for another one.

There is also a middle ground, because having a scheduling process sounds like a lot more work than what Josh was proposing:

CREATE TEMP SEQUENCE s;
SELECT relname, nextval('s') % <number of backends> AS backend_number
FROM ( SELECT relname
       FROM pg_class
       ORDER BY relpages
     ) t;

Each backend then takes just the tables whose backend_number matches its own number.

Of course, having an actual scheduling process is most likely the most efficient.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Christopher Browne <cbbrowne(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.3 feature proposal: vacuumdb -j #
Date: 2012-01-18 01:24:59
Message-ID: 4F161F6B.9020808@dunslane.net
Lists: pgsql-hackers

On 01/17/2012 07:09 PM, Jim Nasby wrote:
> On Jan 13, 2012, at 4:15 PM, Christopher Browne wrote:
>> Have two logical tasks:
>> a) A process that manages the list, and
>> b) Child processes doing vacuums.
>>
>> Each time a child completes a table, it asks the parent for another one.
> There is also a middle ground, because having a scheduling process sounds like a lot more work than what Josh was proposing:
>
> CREATE TEMP SEQUENCE s;
> SELECT relname, nextval('s') % <number of backends> AS backend_number
> FROM ( SELECT relname
>        FROM pg_class
>        ORDER BY relpages
>      ) t;
>
> Each backend then takes just the tables whose backend_number matches its own number.
>
> Of course, having an actual scheduling process is most likely the most efficient.

We already have a model for this in parallel pg_restore. It would
probably not be terribly hard to adapt to parallel vacuum.

cheers

andrew