pg_autovacuum Version 2 Design Document: Exec Summary: pg_autovacuum was initially released as a contrib module in PostgreSQL v7.4. The version in 7.4 is by design very simple. No configuration is required, and very little configuration is possible. Despite these limitations it was voted the most popular new feature of PostgreSQL v7.4 according to the survey held on postgresql.org (http://www.postgresql.org/survey.php?View=1&SurveyID=23). Despite it's popularity there is much room for improvement. This document sets out to define the most important improvements that would help pg_autovacuum to become a truly powerful asset to the suite of tools that come with PostgreSQL. Current Problems & Limitations: Based on user feedback from people using pg_autovacuum in the field, and my own observations, there are a number of problems and limitation with pg_autovacuum. They are: Inability to customize thresholds on a per table basis Inability to set default thresholds on a per database basis Inability to exclude specific databases / tables from pg_autovacuum monitoring Inability to schedule vacuums during off-peak times Lack of integration related to startup and shutdown Ignorance of VACUUM and ANALYZE operations performed outside pg_autovacuum (requires backend integration? or can listen / notify can be used?) Lack of logging options / syslog integration / log rotation options Create table fails because template1 is busy I'm not sure how to address all of these concerns, or that they all should be addressed right now. One of my big questions is backend integration. I am leaning towards leaving pg_autovacuum as a client application in contrib for one more release. During this time, I can continue to tweak and improve pg_autovacuum so that we will have a very good idea what the final product should be before we make it a standard backend process. For PostgreSQL 7.5, I plan to implement these new features: 1.Per database defaults and per table thresholds (including total exclusion) 2.Persistent data 3.Single-Pass Mode (external scheduling from cron etc...) 4.Off peak scheduling 1. Per Database defaults and Per table Thresholds: There are differing opinions as to the best way to providing these this feature. The primary debate is where to save the configuration data. I see three options: 1.Store config data inside a special pg_autovacuum table inside existing databases that wants custom settings. 2.Use a config file. This would require some additional coding to add the required parsing, but is possible. 3.Create a pg_autovacuum database inside any cluster that wants to customize their settings. Since many people do not like tools that clutter their databases by adding tables, I think option 1 (adding a pg_autovacuum table to existing databases) is right out. Using a config file would be Ok, but would require additional parsing code. My preference is option 3. Since pg_autovacuum will (hopefully) eventually become an integrated part of the backend, it will eventually be able to add required data to the system catalogs. Given these two premises, as long as pg_autovacuum remains a contrib module it could use it's own database to mimic having system tables. If this database exists, it will be used, if it does not exist, then pg_autovacuum will work just as it did in the 7.4 release with very limited options available to it. The user will be able to specify a non-default database. Table Structure for database specific defaults and table specific thresholds: databases_defaults: (will reference the pg_class system table) id serial primary key exclude_database boolean default_vacuum_scaling_factor float default_vacuum_base_value int default_analyze_scaling_factor float default_analyze_base_value int dboid oid references pg_database.oid table_thresholds id serial primary key exclude_table boolean (exclude this table) vacuum_scaling_factor float (equivalent to -v) vacuum_base_value int (equivalent to -V) vacuum_threshold float (if > 0, use this threshold) analyze_scaling_factor float (equivalent to -a) analyze_base_value int (equivalent to -A) analyze_threshold float (if > 0 use this threshold) relid oid references pg_classs.relid 2.Persistent pg_autovacuum Data: Right now pg_autovacuum has no memory of what was going on the last time it was run. So if significant changes have happened while pg_autovacuum is not running, they will not be counted in the analysis of when to perform a vacuum or analyze operation which can result in under vacuuming. So, pg_autovacuum should occasionally write down it's numbers to the database. The data will be stored in an additional table called table_data table_data id serial primary key CountAtLastAnalyze long CountAtLastVacuum long table_thresholds_id int references table_thresholds 3.Single-Pass Mode (External Scheduling): I have received requests to be able to run pg_autovacuum only on request (not as a daemon) making only one pass over all the tables (not looping indefinately). The advantage being that it will operate more like the current vacuum command except that it will only vacuum tables that need to be vacuumed. This feature could be useful as long as pg_autovacuum exists outside the backend. If pg_autovacuum gets integrated into the backend and gets automatically started as a daemon during startup, then this option will no longer make sense. Once we have persistent data (Step 2) then we can easily operate in Single-Pass Mode. 4.Off-Peak Scheduling: A fundamental advantage of our vacuum system is that the work required to reclaim table space is taken out of the critical path and can be moved to and off-peak time when cycles are less precious. One of the drawbacks of the current pg_autovacuum is that it doesn't have any way to factor this in. In it's simplest form (which I will implement first) I would add the ability to add a second set of thresholds that will be active only during an ˇ°off-peakˇ± time that can be specified in the pg_autovacuum database, perhaps in a general_settings table. A few notes about things I'm not planning on working on at the moment. Better logging options: An additional logging could would be added to the pg_autovacuum database and will log all activity (vacuums and analyzes) along with their corresponding duration. Syslog support. I'm not sure this is really needed, but a simple patch was submitted by one user and perhaps that can be reviewed / improved and applied.