Re: pg_stat_statements: calls under-estimation propagation

From: Daniel Farina <drfarina(at)acm(dot)org>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_stat_statements: calls under-estimation propagation
Date: 2012-12-29 12:21:56
Message-ID: CACN56+MZHS3rGCpWk+ieFFSWC56uQ9w3e9MOWKugwAg9irMaOw@mail.gmail.com
Lists: pgsql-hackers

Attached is a cumulative patch that attempts to address the points
below. The individual deltas can be seen at
https://github.com/fdr/postgres.git, branch
error-prop-pg_stat_statements-v2.

On Fri, Dec 28, 2012 at 9:58 AM, Peter Geoghegan <peter(at)2ndquadrant(dot)com> wrote:
> However, with this approach, calls_underest values might appear to the
> user in what might be considered an astonishing order. Now, I'm not
> suggesting that that's a real problem - just that they may not be the
> semantics we want, particularly as we can reasonably defer assigning a
> calls_underest until a sticky entry is "unstuck", and an entry becomes
> user-visible, within pgss_store().

Fix for this as I understand it:

*** a/contrib/pg_stat_statements/pg_stat_statements.c
--- b/contrib/pg_stat_statements/pg_stat_statements.c
***************
*** 1036,1041 **** pgss_store(const char *query, uint32 queryId,
--- 1036,1042 ----
e->counters.usage = USAGE_INIT;

e->counters.calls += 1;
+ e->counters.calls_underest = pgss->calls_max_underest;
e->counters.total_time += total_time;
e->counters.rows += rows;
e->counters.shared_blks_hit += bufusage->shared_blks_hit;
***************
*** 1264,1272 **** entry_alloc(pgssHashKey *key, const char *query,
int query_len, bool sticky)
/* set the appropriate initial usage count */
entry->counters.usage = sticky ? pgss->cur_median_usage : USAGE_INIT;

- /* propagate calls under-estimation bound */
- entry->counters.calls_underest = pgss->calls_max_underest;
-
/* re-initialize the mutex each time ... we assume no one using it */
SpinLockInit(&entry->mutex);
/* ... and don't forget the query text */
--- 1265,1270 ----

> Also, it seems like you should initialise pgss->calls_max_underest,
> within pgss_shmem_startup().

Easy enough. Somehow I had wrongly assumed that zero-initialization
was guaranteed for the shmem startup functions.

*** a/contrib/pg_stat_statements/pg_stat_statements.c
--- b/contrib/pg_stat_statements/pg_stat_statements.c
***************
*** 426,431 **** pgss_shmem_startup(void)
--- 426,432 ----
{
/* First time through ... */
pgss->lock = LWLockAssign();
+ pgss->calls_max_underest = 0;
pgss->query_size = pgstat_track_activity_query_size;
pgss->cur_median_usage = ASSUMED_MEDIAN_INIT;
}

> You should probably serialise the value
> to disk, and initialise it to 0 if there is no such value to begin
> with.

I prefer a different approach here: just compute it while loading the
entries from disk, since each entry's calls plus its under-estimation
bound yields a candidate for the new global pessimum (worst-case)
under-estimation value. For example, an entry read back with
calls = 10 and calls_underest = 3 means the global bound must be at
least 13.

*** a/contrib/pg_stat_statements/pg_stat_statements.c
--- b/contrib/pg_stat_statements/pg_stat_statements.c
***************
*** 419,424 **** pgss_shmem_startup(void)
--- 419,425 ----
int query_size;
int buffer_size;
char *buffer = NULL;
+ int64 calls_max_underest = 0;

if (prev_shmem_startup_hook)
prev_shmem_startup_hook();
***************
*** 440,446 **** pgss_shmem_startup(void)
{
/* First time through ... */
pgss->lock = LWLockAssign();
! pgss->calls_max_underest = 0;
pgss->query_size = pgstat_track_activity_query_size;
pgss->cur_median_usage = ASSUMED_MEDIAN_INIT;
}
--- 441,447 ----
{
/* First time through ... */
pgss->lock = LWLockAssign();
! pgss->calls_max_underest = calls_max_underest;
pgss->query_size = pgstat_track_activity_query_size;
pgss->cur_median_usage = ASSUMED_MEDIAN_INIT;
}
***************
*** 528,533 **** pgss_shmem_startup(void)
--- 529,545 ----
temp.query_len,
query_size - 1);

+ /*
+ * Compute maxima of under-estimation over the read entries
+ * for reinitializing pgss->calls_max_underest.
+ */
+ {
+ int64 cur_underest;
+
+ cur_underest = temp.calls + temp.calls_underest;
+ calls_max_underest = Max(calls_max_underest, cur_underest);
+ }
+
/* make the hashtable entry (discards old entries if too many) */
entry = entry_alloc(&temp.key, buffer, temp.query_len, false);

***************
*** 535,540 **** pgss_shmem_startup(void)
--- 547,559 ----
entry->counters = temp.counters;
}

+ /*
+ * Reinitialize global under-estimation information from the
+ * computed maxima, if any. Otherwise, calls_max_underest should
+ * be zero.
+ */
+ pgss->calls_max_underest = calls_max_underest;
+
pfree(buffer);
FreeFile(file);

> I think you probably should have created a
> PG_STAT_STATEMENTS_COLS_V1_1 macro, since that version of the module
> is now legacy, like *V1_0 is in HEAD.

Indeed. Prepare to scroll; this change is not very complex, but it
looks a bit more bloated:

*** a/contrib/pg_stat_statements/pg_stat_statements.c
--- b/contrib/pg_stat_statements/pg_stat_statements.c
***************
*** 96,101 **** typedef struct pgssHashKey
--- 96,115 ----
} pgssHashKey;

/*
+ * Identifies the tuple format detected by pg_stat_statements.
+ *
+ * Used to identify features of newer formats and enable smooth
+ * upgrades: one can install a new pg_stat_statements binary while
+ * running with the old SQL function definitions.
+ */
+ typedef enum pgssTupVersion
+ {
+ PGSS_TUP_V1_0 = 1,
+ PGSS_TUP_V1_1,
+ PGSS_TUP_LATEST
+ } pgssTupVersion;
+
+ /*
* The actual stats counters kept within pgssEntry.
*/
typedef struct Counters
***************
*** 1078,1083 **** pg_stat_statements_reset(PG_FUNCTION_ARGS)
--- 1092,1098 ----
}

#define PG_STAT_STATEMENTS_COLS_V1_0 14
+ #define PG_STAT_STATEMENTS_COLS_V1_1 18
#define PG_STAT_STATEMENTS_COLS 19

/*
***************
*** 1095,1101 **** pg_stat_statements(PG_FUNCTION_ARGS)
bool is_superuser = superuser();
HASH_SEQ_STATUS hash_seq;
pgssEntry *entry;
! bool sql_supports_v1_1_counters = true;

if (!pgss || !pgss_hash)
ereport(ERROR,
--- 1110,1116 ----
bool is_superuser = superuser();
HASH_SEQ_STATUS hash_seq;
pgssEntry *entry;
! pgssTupVersion detected_version;

if (!pgss || !pgss_hash)
ereport(ERROR,
***************
*** 1116,1123 **** pg_stat_statements(PG_FUNCTION_ARGS)
/* Build a tuple descriptor for our result type */
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
if (tupdesc->natts == PG_STAT_STATEMENTS_COLS_V1_0)
! sql_supports_v1_1_counters = false;

per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
oldcontext = MemoryContextSwitchTo(per_query_ctx);
--- 1131,1158 ----
/* Build a tuple descriptor for our result type */
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
+
+ /* Perform version detection */
if (tupdesc->natts == PG_STAT_STATEMENTS_COLS_V1_0)
! detected_version = PGSS_TUP_V1_0;
! else if (tupdesc->natts == PG_STAT_STATEMENTS_COLS_V1_1)
! detected_version = PGSS_TUP_V1_1;
! else if (tupdesc->natts == PG_STAT_STATEMENTS_COLS)
! detected_version = PGSS_TUP_LATEST;
! else
! {
! /*
! * Couldn't identify the tuple format. Raise error.
! *
! * This is an exceptional case that may only happen in bizarre
! * situations, since it is thought that every released version
! * of pg_stat_statements has a matching schema.
! */
! ereport(ERROR,
! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! errmsg("pg_stat_statements schema is not supported "
! "by its installed binary")));
! }

per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
oldcontext = MemoryContextSwitchTo(per_query_ctx);
***************
*** 1175,1196 **** pg_stat_statements(PG_FUNCTION_ARGS)
continue;

values[i++] = Int64GetDatumFast(tmp.calls);
! values[i++] = Int64GetDatumFast(tmp.calls_underest);
values[i++] = Float8GetDatumFast(tmp.total_time);
values[i++] = Int64GetDatumFast(tmp.rows);
values[i++] = Int64GetDatumFast(tmp.shared_blks_hit);
values[i++] = Int64GetDatumFast(tmp.shared_blks_read);
! if (sql_supports_v1_1_counters)
values[i++] = Int64GetDatumFast(tmp.shared_blks_dirtied);
values[i++] = Int64GetDatumFast(tmp.shared_blks_written);
values[i++] = Int64GetDatumFast(tmp.local_blks_hit);
values[i++] = Int64GetDatumFast(tmp.local_blks_read);
! if (sql_supports_v1_1_counters)
values[i++] = Int64GetDatumFast(tmp.local_blks_dirtied);
values[i++] = Int64GetDatumFast(tmp.local_blks_written);
values[i++] = Int64GetDatumFast(tmp.temp_blks_read);
values[i++] = Int64GetDatumFast(tmp.temp_blks_written);
! if (sql_supports_v1_1_counters)
{
values[i++] = Float8GetDatumFast(tmp.blk_read_time);
values[i++] = Float8GetDatumFast(tmp.blk_write_time);
--- 1210,1232 ----
continue;

values[i++] = Int64GetDatumFast(tmp.calls);
! if (detected_version >= PGSS_TUP_LATEST)
! values[i++] = Int64GetDatumFast(tmp.calls_underest);
values[i++] = Float8GetDatumFast(tmp.total_time);
values[i++] = Int64GetDatumFast(tmp.rows);
values[i++] = Int64GetDatumFast(tmp.shared_blks_hit);
values[i++] = Int64GetDatumFast(tmp.shared_blks_read);
! if (detected_version >= PGSS_TUP_V1_1)
values[i++] = Int64GetDatumFast(tmp.shared_blks_dirtied);
values[i++] = Int64GetDatumFast(tmp.shared_blks_written);
values[i++] = Int64GetDatumFast(tmp.local_blks_hit);
values[i++] = Int64GetDatumFast(tmp.local_blks_read);
! if (detected_version >= PGSS_TUP_V1_1)
values[i++] = Int64GetDatumFast(tmp.local_blks_dirtied);
values[i++] = Int64GetDatumFast(tmp.local_blks_written);
values[i++] = Int64GetDatumFast(tmp.temp_blks_read);
values[i++] = Int64GetDatumFast(tmp.temp_blks_written);
! if (detected_version >= PGSS_TUP_V1_1)
{
values[i++] = Float8GetDatumFast(tmp.blk_read_time);
values[i++] = Float8GetDatumFast(tmp.blk_write_time);

> I wonder if the way that pg_stat_statements throws its hands up when
> it comes to crash safety (i.e. the contents of the hash table are
> completely lost) could be a concern here. In other words, a program
> tasked with tracking execution costs over time and graphing
> time-series data from snapshots has to take responsibility for
> ensuring that there hasn't been a crash (or, indeed, a reset).

> Another issue is that I don't think that what you've done here solves
> the problem of uniquely identify each entry over time, in the same way
> that simply exposing the hash value would. I'm concerned with the
> related problem to the problem solved here - simply identifying the
> entry uniquely. As we've already discussed, the query string is an
> imperfect proxy for each entry, even with constants swapped with '?'
> characters (and even when combined with the userid and dbid values -
> it's still not the same as the hashtable key, in a way that is likely
> to bother people that use pg_stat_statements for long enough).

These were not express goals of the patch, but since you are inviting
features, attached is a bonus patch that exposes the queryid and adds
the notion of a "statistics session": an identifier that is re-rolled
whenever the stats file could not be read or the stats are reset, and
that can therefore fully explain all obvious causes of retrograde
motion in the statistics. It too is cumulative, so it includes the
under-estimation field.

Notably, I also opted to nullify the extra pg_stat_statements fields
whenever the query text would read "insufficient privileges" (the
query text itself is spared from this censorship), because I feel
that a bit too much information leaks from pg_stat_statements'
statistics to ignore, especially after adding the query id. Since the
common theme here is identifying queries, I have called it
"pg_stat_statements-identification", and it can be found in the git
repo above under the same name (...-v1).
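
For illustration only, the session mechanism could look roughly like
the sketch below. This is not the attached patch's actual code; the
field name stats_session and the helper pgss_session_reroll are
hypothetical:

/*
 * Hypothetical sketch: stamp the shared state with a fresh random
 * identifier whenever the saved stats could not be read or the stats
 * are reset, so snapshot-taking tools can detect discontinuities.
 */
static void
pgss_session_reroll(void)
{
	/* caller is assumed to hold pgss->lock exclusively */
	pgss->stats_session = ((uint64) random() << 32) | (uint64) random();
}

A snapshotting program would then compare the session value between
snapshots: if it changed, all counters must be treated as freshly
restarted rather than as having moved backwards.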

--
fdr

Attachment Content-Type Size
add-pg_stat_statements-calls-underestimation-v2.patch.gz application/x-gzip 3.3 KB
pg_stat_statements-identification-v1.patch.gz application/x-gzip 6.4 KB
