Re: Index bloat problem?

Lists: pgsql-performance
From: Bill Chandler <billybobc1210(at)yahoo(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-perform <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index bloat problem?
Date: 2005-04-21 17:38:55
Message-ID: 20050421173855.76286.qmail@web51410.mail.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


--- Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Bill,
>
> > What about if an out-of-the-ordinary number of
> rows
> > were deleted (say 75% of rows in the table, as
> opposed
> > to normal 5%) followed by a 'VACUUM ANALYZE'?
> Could
> > things get out of whack because of that situation?
>
> Yes. You'd want to run REINDEX after and event like
> that. As you should now.
>
> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>

Thank you. Though I must say, that is very
discouraging. REINDEX is a costly operation, timewise
and due to the fact that it locks out other processes
from proceeding. Updates are constantly coming in and
queries are occurring continuously. A REINDEX could
potentially bring the whole thing to a halt.

Honestly, this seems like an inordinate amount of
babysitting for a production application. I'm not
sure if the client will be willing to accept it.

Admittedly my knowledge of the inner workings of an
RDBMS is limited, but could somebody explain to me why
this would be so? If you delete a bunch of rows why
doesn't the index get updated at the same time? Is
this a common issue among all RDBMSs or is it
something that is PostgreSQL specific? Is there any
way around it?

thanks,

Bill

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Bill Chandler <billybobc1210(at)yahoo(dot)com>
Cc: pgsql-perform <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index bloat problem?
Date: 2005-04-21 17:42:38
Message-ID: 200504211042.38633.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Bill,

> Honestly, this seems like an inordinate amount of
> babysitting for a production application. I'm not
> sure if the client will be willing to accept it.

Well, then, tell them not to delete 75% of the rows in a table at once. I
imagine that operation brought processing to a halt, too.

> Admittedly my knowledge of the inner workings of an
> RDBMS is limited, but could somebody explain to me why
> this would be so? If you delete a bunch of rows why
> doesn't the index get updated at the same time?

It does get updated. What doesn't happen is the space getting reclaimed. In
a *normal* data situation, those dead nodes would be replaced with new index
nodes. However, a mass-delete-in-one-go messes that system up.

> Is
> this a common issue among all RDBMSs or is it
> something that is PostgreSQL specific?

Speaking from experience, this sort of thing affects MSSQL as well, although
the maintenance routines are different.

--
Josh Berkus
Aglio Database Solutions
San Francisco


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Bill Chandler <billybobc1210(at)yahoo(dot)com>
Cc: pgsql-perform <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index bloat problem?
Date: 2005-04-21 17:44:48
Message-ID: 200504211044.48794.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Bill,

> Honestly, this seems like an inordinate amount of
> babysitting for a production application. I'm not
> sure if the client will be willing to accept it.

Well, then, tell them not to delete 75% of the rows in a table at once. I
imagine that operation brought processing to a halt, too.

If the client isn't willing to accept the consequences of their own bad data
management, I'm not really sure what you expect us to do about it.

> Admittedly my knowledge of the inner workings of an
> RDBMS is limited, but could somebody explain to me why
> this would be so? If you delete a bunch of rows why
> doesn't the index get updated at the same time?

It does get updated. What doesn't happen is the space getting reclaimed. In
a *normal* data situation, the dead nodes are recycled for new rows. But
doing a massive delete operation upsets that, and generally needs to be
followed by a REINDEX.

> Is
> this a common issue among all RDBMSs or is it
> something that is PostgreSQL specific?

Speaking from experience, this sort of thing affects MSSQL as well, although
the maintenance routines are different.

--
Josh Berkus
Aglio Database Solutions
San Francisco


From: Michael Guerin <guerin(at)rentec(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Bill Chandler <billybobc1210(at)yahoo(dot)com>, pgsql-perform <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index bloat problem?
Date: 2005-04-21 18:24:57
Message-ID: 4267EFF9.2000908@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


>>Is
>>this a common issue among all RDBMSs or is it
>>something that is PostgreSQL specific?
>>
>>
>
>Speaking from experience, this sort of thing affects MSSQL as well, although
>the maintenance routines are different.
>
>
>
Yes, this is true with MSSQL too, however sql server implements a defrag
index that doesn't lock up the table..

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_dbcc_30o9.asp

"DBCC INDEXDEFRAG can defragment clustered and nonclustered indexes on
tables and views. DBCC INDEXDEFRAG defragments the leaf level of an
index so that the physical order of the pages matches the left-to-right
logical order of the leaf nodes, thus improving index-scanning performance.

....Every five minutes, DBCC INDEXDEFRAG will report to the user an
estimated percentage completed. DBCC INDEXDEFRAG can be terminated at
any point in the process, and *any completed work is retained.*"

-michael


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Michael Guerin <guerin(at)rentec(dot)com>
Cc: Bill Chandler <billybobc1210(at)yahoo(dot)com>, pgsql-perform <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index bloat problem?
Date: 2005-04-21 18:28:43
Message-ID: 200504211128.43937.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Michael,

> ....Every five minutes, DBCC INDEXDEFRAG will report to the user an
> estimated percentage completed. DBCC INDEXDEFRAG can be terminated at
> any point in the process, and *any completed work is retained.*"

Keen. Sounds like something for our TODO list.

--
Josh Berkus
Aglio Database Solutions
San Francisco


From: Alex Turner <armtuk(at)gmail(dot)com>
To: Bill Chandler <billybobc1210(at)yahoo(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-perform <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index bloat problem?
Date: 2005-04-21 19:12:09
Message-ID: 33c6269f0504211212b0163ac@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Same thing happens in Oracle

ALTER INDEX <blah> rebuild

To force a rebuild. It will mark the free blocks as 'free' below the
PCTFREE value for the tablespace.

Basically If you build an index with 9999 entries. and each entry is
1/4 of a block, the database will write 2500 blocks to the disk. If
you delete a random 75% of the index values, you will now have 2500
blocks that have 75% free space. The database will reuse that free
space in those blocks as you insert new values, but until then, you
still have 2500 blocks worth of data on a disk, that is only 25% full.
Rebuilding the index forces the system to physically re-allocate all
that data space, and now you have just 2499 entries, that use 625
blocks.

I'm not sure that 'blocks' is the correct term in postgres, it's
segments in Oracle, but the concept remains the same.

Alex Turner
netEconomist

On 4/21/05, Bill Chandler <billybobc1210(at)yahoo(dot)com> wrote:
>
> --- Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> > Bill,
> >
> > > What about if an out-of-the-ordinary number of
> > rows
> > > were deleted (say 75% of rows in the table, as
> > opposed
> > > to normal 5%) followed by a 'VACUUM ANALYZE'?
> > Could
> > > things get out of whack because of that situation?
> >
> > Yes. You'd want to run REINDEX after and event like
> > that. As you should now.
> >
> > --
> > Josh Berkus
> > Aglio Database Solutions
> > San Francisco
> >
>
> Thank you. Though I must say, that is very
> discouraging. REINDEX is a costly operation, timewise
> and due to the fact that it locks out other processes
> from proceeding. Updates are constantly coming in and
> queries are occurring continuously. A REINDEX could
> potentially bring the whole thing to a halt.
>
> Honestly, this seems like an inordinate amount of
> babysitting for a production application. I'm not
> sure if the client will be willing to accept it.
>
> Admittedly my knowledge of the inner workings of an
> RDBMS is limited, but could somebody explain to me why
> this would be so? If you delete a bunch of rows why
> doesn't the index get updated at the same time? Is
> this a common issue among all RDBMSs or is it
> something that is PostgreSQL specific? Is there any
> way around it?
>
> thanks,
>
> Bill
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam? Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>


From: Dave Chapeskie <pgsql(at)ddm(dot)wox(dot)org>
To: pgsql-perform <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index bloat problem?
Date: 2005-04-21 19:33:05
Message-ID: 20050421193305.GA84813@ddm.wox.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Thu, Apr 21, 2005 at 11:28:43AM -0700, Josh Berkus wrote:
> Michael,
>
> > ....Every five minutes, DBCC INDEXDEFRAG will report to the user an
> > estimated percentage completed. DBCC INDEXDEFRAG can be terminated at
> > any point in the process, and *any completed work is retained.*"
>
> Keen. Sounds like something for our TODO list.
>
> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco

See http://archives.postgresql.org/pgsql-general/2005-03/msg01465.php
for my thoughts on a non-blocking alternative to REINDEX. I got no
replies to that message. :-(

I've almost got a working solution integrated in the backend that does
correct WAL logging and everything. (Writing the code to write and
replay WAL logs for complicated operations can be very annoying!)

For now I've gone with a syntax of:

REINDEX INDEX btree_index_name INCREMENTAL;

(For now it's not a proper index AM (accessor method), instead the
generic index code knows this is only supported for btrees and directly
calls the btree_compress function.)

It's not actually a REINDEX per-se in that it doesn't rebuild the whole
index. It holds brief exclusive locks on the index while it shuffles
items around to pack the leaf pages fuller. There were issues with the
code I attached to the above message that have been resolved with the
new code. With respect to the numbers provided in that e-mail the new
code also recycles more pages than before.

Once I've finished it up I'll prepare and post a patch.

--
Dave Chapeskie
OpenPGP Key ID: 0x3D2B6B34


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Dave Chapeskie <pgsql(at)ddm(dot)wox(dot)org>
Cc: pgsql-perform <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index bloat problem?
Date: 2005-04-21 19:55:01
Message-ID: 200504211255.01510.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Dave,

> See http://archives.postgresql.org/pgsql-general/2005-03/msg01465.php
> for my thoughts on a non-blocking alternative to REINDEX. I got no
> replies to that message. :-(

Well, sometimes you have to be pushy. Say, "Hey, comments please?"

The hackers list is about 75 posts a day, it's easy for people to lose track
of stuff they meant to comment on.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Dave Chapeskie <pgsql(at)ddm(dot)wox(dot)org>, pgsql-perform <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index bloat problem?
Date: 2005-04-23 03:12:38
Message-ID: 20050423031238.GA58835@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

You would be interested in
http://archives.postgresql.org/pgsql-hackers/2005-04/msg00565.php

On Thu, Apr 21, 2005 at 03:33:05PM -0400, Dave Chapeskie wrote:
> On Thu, Apr 21, 2005 at 11:28:43AM -0700, Josh Berkus wrote:
> > Michael,
> >
> > > ....Every five minutes, DBCC INDEXDEFRAG will report to the user an
> > > estimated percentage completed. DBCC INDEXDEFRAG can be terminated at
> > > any point in the process, and *any completed work is retained.*"
> >
> > Keen. Sounds like something for our TODO list.
> >
> > --
> > Josh Berkus
> > Aglio Database Solutions
> > San Francisco
>
> See http://archives.postgresql.org/pgsql-general/2005-03/msg01465.php
> for my thoughts on a non-blocking alternative to REINDEX. I got no
> replies to that message. :-(
>
>
> I've almost got a working solution integrated in the backend that does
> correct WAL logging and everything. (Writing the code to write and
> replay WAL logs for complicated operations can be very annoying!)
>
> For now I've gone with a syntax of:
>
> REINDEX INDEX btree_index_name INCREMENTAL;
>
> (For now it's not a proper index AM (accessor method), instead the
> generic index code knows this is only supported for btrees and directly
> calls the btree_compress function.)
>
> It's not actually a REINDEX per-se in that it doesn't rebuild the whole
> index. It holds brief exclusive locks on the index while it shuffles
> items around to pack the leaf pages fuller. There were issues with the
> code I attached to the above message that have been resolved with the
> new code. With respect to the numbers provided in that e-mail the new
> code also recycles more pages than before.
>
> Once I've finished it up I'll prepare and post a patch.
>
> --
> Dave Chapeskie
> OpenPGP Key ID: 0x3D2B6B34
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo(at)postgresql(dot)org
>

--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"