[PATCH] optional cleaning queries stored in pg_stat_statements

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-06 01:58:00
Message-ID: 4EB5E9A8.9060309@fuzzy.cz
Lists: pgsql-hackers

Hi everyone,

I propose a patch that would allow optional "cleaning" of queries
tracked in pg_stat_statements, compressing the result and making it more
readable.

The default behavior is that when the same query is run with different
parameter values, it's tracked as separate queries (the query strings do
differ).

A small example - when you run "pgbench -S" you'll get queries like
these:

SELECT abalance FROM pgbench_accounts WHERE aid = 12433
SELECT abalance FROM pgbench_accounts WHERE aid = 2322
SELECT abalance FROM pgbench_accounts WHERE aid = 52492

and so on, and each one is listed separately in pg_stat_statements,
which often pollutes the view.

The patch implements a simple "cleaning" that replaces the parameter
values with generic strings - e.g. numbers are turned to ":n", so the
queries mentioned above are turned to

SELECT abalance FROM pgbench_accounts WHERE aid = :n

and thus tracked as a single query in pg_stat_statements.

The patch provides an enum GUC (pg_stat_statements.clean) with three
options - none, basic and aggressive. The default is "none"; "basic"
performs the value replacement described above, and "aggressive" performs
some additional cleaning - for example, replacing multiple spaces with a
single one.
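
For the sake of concreteness, a minimal sketch of how such an enum GUC
could be registered from the module's _PG_init() follows; the enum and
variable names are illustrative assumptions, not necessarily what the
attached diff uses:

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/guc.h"

    PG_MODULE_MAGIC;

    typedef enum
    {
        PGSS_CLEAN_NONE,
        PGSS_CLEAN_BASIC,
        PGSS_CLEAN_AGGRESSIVE
    } PGSSCleanLevel;

    static const struct config_enum_entry clean_options[] = {
        {"none", PGSS_CLEAN_NONE, false},
        {"basic", PGSS_CLEAN_BASIC, false},
        {"aggressive", PGSS_CLEAN_AGGRESSIVE, false},
        {NULL, 0, false}
    };

    static int  pgss_clean = PGSS_CLEAN_NONE;

    void
    _PG_init(void)
    {
        /* register pg_stat_statements.clean as a superuser-settable enum GUC */
        DefineCustomEnumVariable("pg_stat_statements.clean",
                                 "Selects how aggressively query strings are cleaned.",
                                 NULL,
                                 &pgss_clean,
                                 PGSS_CLEAN_NONE,
                                 clean_options,
                                 PGC_SUSET,
                                 0,
                                 NULL, NULL, NULL);
    }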

The parsing is intentionally very simple and cleans the query in a
single pass (a toy sketch follows below). It currently handles three
literal types:

a) string (basic, C-style escaped, Unicode-escaped, $-escaped)
b) numeric (although 1.925e-3 is replaced by :n-:n)
c) boolean (true/false)

There is probably room for improvement (e.g. handling UESCAPE).
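
For illustration only, here is a toy standalone sketch of the kind of
single-pass numeric replacement that "basic" cleaning performs. It is
deliberately simplified (it ignores strings, booleans and identifiers
containing digits) and is not the code in the attached patch:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Replace runs of digits (optionally containing a decimal point)
     * with the ":n" placeholder, in a single left-to-right pass.
     */
    static void
    clean_numbers(const char *src, char *dst, size_t dstlen)
    {
        size_t  j = 0;

        while (*src && j + 3 < dstlen)
        {
            if (isdigit((unsigned char) *src))
            {
                /* swallow the whole numeric literal */
                while (isdigit((unsigned char) *src) || *src == '.')
                    src++;
                dst[j++] = ':';
                dst[j++] = 'n';
            }
            else
                dst[j++] = *src++;
        }
        dst[j] = '\0';
    }

    int
    main(void)
    {
        char    buf[256];

        clean_numbers("SELECT abalance FROM pgbench_accounts WHERE aid = 12433",
                      buf, sizeof(buf));
        puts(buf);  /* SELECT abalance FROM pgbench_accounts WHERE aid = :n */
        return 0;
    }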

Tomas

Attachment Content-Type Size
pg_stat_statements_clean.diff text/plain 10.7 KB

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-06 02:16:20
Message-ID: CAEYLb_VYp8Y8LG52uL_xvonoqOBoJSvYNjjt32hZC8qJ3Utr0g@mail.gmail.com
Lists: pgsql-hackers

2011/11/6 Tomas Vondra <tv(at)fuzzy(dot)cz>:
> Hi everyone,

> The patch implements a simple "cleaning" that replaces the parameter
> values with generic strings - e.g. numbers are turned to ":n", so the
> queries mentioned above are turned to
>
>   SELECT abalance FROM pgbench_accounts WHERE aid = :n
>
> and thus tracked as a single query in pg_stat_statements.

I'm a couple of days away from posting a much better principled
implementation of pg_stat_statements normalisation. To normalise, we
perform a selective serialisation of the query tree, which is hashed.
This has the additional benefit of considering queries equivalent even
when, for example, there is a different amount of whitespace, or if
queries differ only in their use of aliases, or if queries differ only
in that one uses a noise word the other doesn't. It also does things
like intelligently distinguishing between queries with different
limit/offset constant values, as these constants are deemed to be
differentiators of queries for our purposes. A guiding principle that
I've followed is that anything that could result in a different plan
is a differentiator of queries.

I intend to maintain a backwards compatible version, as this can be
expected to work with earlier versions of Postgres.

There will be additional infrastructure added to the parser to support
normalisation of query strings for the patch I'll be submitting (that
obviously won't be supported in the version that builds against
existing Postgres versions that I'll make available). Essentially,
I'll be adding a length field to certain nodes, to go alongside the
existing location field (currently just used to highlight an offending
token in an error message outside the parser). pg_stat_statements will
then use the location and length field of const nodes to swap out
constants in the query string.
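
Purely as an illustration of that idea (not the actual code - the helper
name and the omission of error handling are assumptions), the splice
itself could look something like:

    #include <stdlib.h>
    #include <string.h>

    /*
     * Given a Const token's byte offset and length within the original
     * query text, return a copy with a placeholder spliced in.
     */
    static char *
    replace_const(const char *query, int location, int length,
                  const char *placeholder)
    {
        size_t  qlen = strlen(query);
        size_t  plen = strlen(placeholder);
        char   *out = malloc(qlen - length + plen + 1);

        memcpy(out, query, location);               /* text before the constant */
        memcpy(out + location, placeholder, plen);  /* the placeholder itself */
        strcpy(out + location + plen,               /* text after the constant */
               query + location + length);
        return out;
    }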

It's unfortunate that there was a duplicate effort here. With respect,
the approach that you've taken probably would have turned out to have
been a bit of a tar pit - I'm reasonably sure that had you pursued it,
you'd have found yourself duplicating quite a bit of the parser as new
problems came to light.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-06 03:42:20
Message-ID: 4EB6021C.2080700@fuzzy.cz
Lists: pgsql-hackers

Dne 6.11.2011 03:16, Peter Geoghegan napsal(a):
> 2011/11/6 Tomas Vondra <tv(at)fuzzy(dot)cz>:
>> Hi everyone,
>
>> The patch implements a simple "cleaning" that replaces the parameter
>> values with generic strings - e.g. numbers are turned to ":n", so the
>> queries mentioned above are turned to
>>
>> SELECT abalance FROM pgbench_accounts WHERE aid = :n
>>
>> and thus tracked as a single query in pg_stat_statements.
>
> I'm a couple of days away from posting a much better principled
> implementation of pg_stat_statements normalisation. To normalise, we
> perform a selective serialisation of the query tree, which is hashed.

OK, my patch definitely is not the only possible approach, and if there's
a way to get more info from the planner, the results surely might be
better. My goal was to provide a simple patch; if yours solves the problem
better, then I'll be more than happy to withdraw mine.

> This has the additional benefit of considering queries equivalent even
> when, for example, there is a different amount of whitespace, or if
> queries differ only in their use of aliases, or if queries differ only
> in that one uses a noise word the other doesn't. It also does things

Yup, that's nice. And achieving the same behavior (aliases, noise) would
require a much more complex parser than the one I wrote.

> like intelligently distinguishing between queries with different
> limit/offset constant values, as these constants are deemed to be
> differentiators of queries for our purposes. A guiding principle that
> I've followed is that anything that could result in a different plan
> is a differentiator of queries.

Not sure if I understand it correctly - do you want to keep multiple
records for a query, one for each differentiator (different limit/offset
values, etc.)? That does not make much sense to me - even a different
parameter value may cause a change of plan, so I don't see much
difference between a query parameter and limit/offset constant values.

If it were possible to compare the actual plan (or at least a hash of the
plan) and keep one record for each plan, that'd be extremely nice. You'd
see that the query was executed with 3 different plans, along with the
number of calls, average duration, etc.

> I intend to maintain a backwards compatible version, as this can be
> expected to work with earlier versions of Postgres.
>
> There will be additional infrastructure added to the parser to support
> normalisation of query strings for the patch I'll be submitting (that
> obviously won't be supported in the version that builds against
> existing Postgres versions that I'll make available). Essentially,
> I'll be adding a length field to certain nodes, to go alongside the
> existing location field (currently just used to highlight an offending
> token in an error message outside the parser). pg_stat_statements will
> then use the location and length field of const nodes to swap out
> constants in the query string.
>
> It's unfortunate that there was a duplicate effort here. With respect,
> the approach that you've taken probably would have turned out to have
> been a bit of a tar pit - I'm reasonably sure that had you pursued it,
> you'd have found yourself duplicating quite a bit of the parser as new
> problems came to light.

The duplicate effort is not a problem. I needed to learn something more
about the internals and how to use the executor hooks, and
pg_stat_statements is a neat place to learn that. The patch is rather a
side effect of this effort, so nothing was wasted on my side.

Anyway, you're right that the approach I've taken is not very extensible
without building a parser parallel to the actual one. It was meant as a
simple solution that doesn't solve the problem perfectly, just
sufficiently for what I need.

Tomas


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-06 12:06:58
Message-ID: m239e1tpy5.fsf@2ndQuadrant.fr
Lists: pgsql-hackers

Tomas Vondra <tv(at)fuzzy(dot)cz> writes:
> If it were possible to compare the actual plan (or at least a hash of the
> plan) and keep one record for each plan, that'd be extremely nice. You'd
> see that the query was executed with 3 different plans, along with the
> number of calls, average duration, etc.

I like that idea. How does that integrate with current efforts?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-06 15:08:21
Message-ID: 2223.1320592101@sss.pgh.pa.us
Lists: pgsql-hackers

Peter Geoghegan <peter(at)2ndquadrant(dot)com> writes:
> I'm a couple of days away from posting a much better principled
> implementation of pg_stat_statements normalisation. To normalise, we
> perform a selective serialisation of the query tree, which is hashed.

That seems like an interesting approach, and definitely a lot less of
a kluge than what Tomas suggested. Are you intending to hash the raw
grammar output tree, the parse analysis result, the rewriter result,
or the actual plan tree? I don't necessarily have a preconceived notion
about which is best (I can think of pluses and minuses for each), but
we should hear your explanation of which one you chose and what your
reasoning was.

> ... It also does things
> like intelligently distinguishing between queries with different
> limit/offset constant values, as these constants are deemed to be
> differentiators of queries for our purposes. A guiding principle that
> I've followed is that anything that could result in a different plan
> is a differentiator of queries.

This claim seems like bunk, unless you're hashing the plan tree,
in which case it's tautological. As an example, switching from
"where x < 1" to "where x < 2" could make the planner change plans,
if there's a sufficiently large difference in the selectivity of the two
cases (or even if there's a very small difference, but we're right at
the tipping point where the estimated costs are equal). There's no way
to know this in advance unless you care to duplicate all the planner
cost estimation logic. And I don't think you actually want to classify
"where x < 1" and "where x < 2" differently just because they *might*
give different plans in corner cases.

I'm not real sure whether it's better to classify on the basis of
similar plans or similar original queries, anyway. This seems like
something that could be open for debate about use-cases.

> There will be additional infrastructure added to the parser to support
> normalisation of query strings for the patch I'll be submitting (that
> obviously won't be supported in the version that builds against
> existing Postgres versions that I'll make available). Essentially,
> I'll be adding a length field to certain nodes,

This seems like a good way to get your patch rejected: adding overhead
to the core system for the benefit of a feature that not everyone cares
about is problematic. Why do you need it anyway? Surely it's possible
to determine the length of a literal token after the fact.

More generally, if you're hashing anything later than the raw grammar
tree, I think that generating a suitable representation of the queries
represented by a single hash entry is going to be problematic anyway.
There could be significant differences --- much more than just the
values of constants --- between query strings that end up being
semantically the same. Or for that matter we could have identical query
strings that wind up being considered different because of the action of
search_path or other context.

It might be that the path of least resistance is to document that we
select one of the actual query strings "at random" to represent all the
queries that went into the same hash entry, and not even bother with
trying to strip out constants. The effort required to do that seems
well out of proportion to the reward, if we can't do a perfect job of
representing the common aspects of the queries.

regards, tom lane


From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Peter Geoghegan" <peter(at)2ndquadrant(dot)com>, "Tomas Vondra" <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-06 15:57:50
Message-ID: 214eea0a7c532feeb959b7ae7169d1d7.squirrel@sq.gransy.com
Lists: pgsql-hackers

On 6 Listopad 2011, 16:08, Tom Lane wrote:
> Peter Geoghegan <peter(at)2ndquadrant(dot)com> writes:
>> I'm a couple of days away from posting a much better principled
>> implementation of pg_stat_statements normalisation. To normalise, we
>> perform a selective serialisation of the query tree, which is hashed.
>
> That seems like an interesting approach, and definitely a lot less of
> a kluge than what Tomas suggested. Are you intending to hash the raw
> grammar output tree, the parse analysis result, the rewriter result,
> or the actual plan tree? I don't necessarily have a preconceived notion
> about which is best (I can think of pluses and minuses for each), but
> we should hear your explanation of which one you chose and what your
> reasoning was.

Could you describe the pluses and minuses? My understanding is that the
later the hash is computed, the more it reflects how the queries were
actually executed. Would it make sense to turn this into a GUC and leave
the decision up to the user, something like

pg_stat_statements.hash_phase = {grammar|analysis|rewriter|plan}

So that the user could decide how coarse the output should be?

> I'm not real sure whether it's better to classify on the basis of
> similar plans or similar original queries, anyway. This seems like
> something that could be open for debate about use-cases.

Well, that's a really tricky question - there can be different queries
with the same plan. I think that grouping queries solely by the plan is
not very useful, so the original query should be involved somehow.

What about using two hashes - a hash of the grammar tree (grammar_hash)
and a hash of the rewriter output (rewriter_hash)? pg_stat_statements
would then group the queries by the (grammar_hash, rewriter_hash) pair and
include those two columns in the output.

So I could select the rows with the same grammar_hash to see observed
plans for the given query, or select rows with the same rewriter_hash to
see queries leading to that particular plan.
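
To illustrate, the composite key such a scheme implies for the shared
hash table might look roughly like this (the field names and the exact
set of fields are assumptions for the sketch):

    #include "postgres.h"

    typedef struct pgssHashKey
    {
        Oid     userid;         /* user OID */
        Oid     dbid;           /* database OID */
        uint32  grammar_hash;   /* hash of the raw grammar tree */
        uint32  rewriter_hash;  /* hash of the rewriter output */
    } pgssHashKey;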

To make this actually usable, it's important to provide access to the
plans, so that the user can take a rewriter_hash and look up the plan
somehow. This is not needed for grammar_hash, because the query string
will be there, but the actual plan might change (due to autoanalyze etc.).

But maybe this is severe over-engineering and far too complicated.

> It might be that the path of least resistance is to document that we
> select one of the actual query strings "at random" to represent all the
> queries that went into the same hash entry, and not even bother with
> trying to strip out constants. The effort required to do that seems
> well out of proportion to the reward, if we can't do a perfect job of
> representing the common aspects of the queries.

Yes, once we have the hashes we can surely use a random query string with
the values included. But it'd be nice to have the actual plans stored
somewhere, so that it's possible to see them later.

Tomas


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-06 19:56:12
Message-ID: 4EB6E65C.90406@2ndQuadrant.com
Lists: pgsql-hackers

On 11/06/2011 01:08 PM, Tom Lane wrote:
> Peter Geoghegan<peter(at)2ndquadrant(dot)com> writes:
>
>> ... It also does things
>> like intelligently distinguishing between queries with different
>> limit/offset constant values, as these constants are deemed to be
>> differentiators of queries for our purposes. A guiding principle that
>> I've followed is that anything that could result in a different plan
>> is a differentiator of queries.
>>
> This claim seems like bunk, unless you're hashing the plan tree,
> in which case it's tautological.

Peter's patch adds a planner hook and hashes at that level. Since this
cat is rapidly escaping its bag and impacting other contributors, we
might as well share the work in progress early if anyone has a burning
desire to look at the code. A diff against the version I've been
internally reviewing in prep for community submission is at
https://github.com/greg2ndQuadrant/postgres/compare/master...pg_stat_statements_norm
Easier to browse it than to ask probing questions about it, I think. Apologies
to Peter for sharing his work before he was completely ready; there is a
list of known problems with it. I also regret the thread hijack here.

The first chunk of code is a Python-based regression test program, and
an open item here is the best way to turn that into a standard
regression test set.

>> There will be additional infrastructure added to the parser to support
>> normalisation of query strings for the patch I'll be submitting (that
>> obviously won't be supported in the version that builds against
>> existing Postgres versions that I'll make available). Essentially,
>> I'll be adding a length field to certain nodes,
>>
> This seems like a good way to get your patch rejected: adding overhead
> to the core system for the benefit of a feature that not everyone cares
> about is problematic.

Struggling around this area is the main reason this code hasn't been
submitted yet. To step back for a moment, there are basically three
options that any code like this can take with regard to storing the
processed query text used as the key:

1) Use the first instance of that query seen as the "reference" version
2) Use the last instance seen
3) Transform the text of the query in a way that's stable across all
duplicates of that statement, and output that

Downstream tools operating on this data, things that will sample
pg_stat_statements multiple times, need some sort of stable query text
they can operate on. It really needs to survive server restart too.
Neither (1) nor (2) seems like a usable solution. Both have already been
implemented in Peter's patch, with (2) being what's in the current patch.
How best to do (3) instead is not obvious, though.

Doing the query matching at the planner level seems effective at
side-stepping the problem of needing to parse the SQL, which is where
most implementations of this idea get stuck doing fragile
transformations. My own first try at this back in September and Tomas's
patch both fall into that category. That part of Peter's work seems to
work as expected. That still leaves the problem of "parsed query ->
stable normalized string". I think that is an easier one to solve than
directly taking on the entirety of "query text -> stable normalized
string", though. Peter has an idea he's exploring for how to implement
that, and any amount of overhead it adds for people who don't use this
feature is an obvious concern. There are certainly use cases that don't
care about this problem, ones that would happily take (1) or (2). I'd
rather not ship yet another not-quite-right-for-everyone version of
pg_stat_statements, though; solving too limited a use case is the big
problem with the one that's already there.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us


From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-06 22:26:46
Message-ID: CAEYLb_V0KRPMppgKsSEZVatOMp52DNXxV=DpuyTA_JQzd9LZiw@mail.gmail.com
Lists: pgsql-hackers

On 6 November 2011 15:08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Are you intending to hash the raw
> grammar output tree, the parse analysis result, the rewriter result,
> or the actual plan tree?

I'm hashing the Query tree from a planner plugin (function assigned to
planner_hook), immediately prior to it being passed to
standard_planner. A major consideration was backwards compatibility;
at one point, it looked like this all could be done without adding any
new infrastructure, with perhaps an officially sanctioned 9.2 version
of pg_stat_statements that could be built against earlier pg versions.
Other than that, it seems intuitively obvious that this should happen
as late as possible in the parsing stage - any transformation
performed prior to planning was considered essential to the query,
even if that potentially meant that successive calls of the same query
string were considered different due to external factors. These things
probably don't come up too often in practice, but I think that they're
nice to have, as they prevent the DBA from looking at statistics that
aren't actually representative.
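
A rough sketch of that hook placement, simplified to hash a full
nodeToString() dump rather than the selective serialisation described
above (the function names are illustrative, not the actual patch):

    #include "postgres.h"
    #include "fmgr.h"
    #include "access/hash.h"
    #include "nodes/params.h"
    #include "optimizer/planner.h"

    PG_MODULE_MAGIC;

    static planner_hook_type prev_planner_hook = NULL;

    static PlannedStmt *
    pgss_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
    {
        /* hash the Query tree just before it reaches standard_planner */
        char   *serialized = nodeToString(parse);
        uint32  queryId = DatumGetUInt32(hash_any((unsigned char *) serialized,
                                                  strlen(serialized)));

        pfree(serialized);
        /* ... stash queryId somewhere the executor hooks can find it ... */
        (void) queryId;

        if (prev_planner_hook)
            return prev_planner_hook(parse, cursorOptions, boundParams);
        return standard_planner(parse, cursorOptions, boundParams);
    }

    void
    _PG_init(void)
    {
        prev_planner_hook = planner_hook;
        planner_hook = pgss_planner;
    }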

>> A guiding principle that
>> I've followed is that anything that could result in a different plan
>> is a differentiator of queries.
>
> This claim seems like bunk, unless you're hashing the plan tree,
> in which case it's tautological.

I think I may have been unclear, for which I apologise. Certainly, if
the principle I (mis)described was followed to the letter, I'd end up
with something that was exactly the same as what we already have. I
merely meant to suggest, with an awareness of the fact that I was
saying something slightly controversial, that because the planner is
/invariably/ sensitive to changes in limit/offset constants,
particularly large changes, it made sense to have those constants as
differentiators, and it certainly made sense for a limit n to
differentiate the same query with and without one. I do of
course appreciate that the use of a different constant in quals could
result in a different plan being generated due to their selectivity
estimates varying.

> I'm not real sure whether it's better to classify on the basis of
> similar plans or similar original queries, anyway.  This seems like
> something that could be open for debate about use-cases.

Indeed - it's generally difficult to reason about what behaviour is
optimal when attempting to anticipate the needs of every Postgres DBA.
In this case, I myself lean strongly towards similar original queries,
because classifying on the basis of similar plans isn't likely to be
nearly as actionable, and there are really no obvious precedents to
draw on - how can it be turned into a useful tool?

>> There will be additional infrastructure added to the parser to support
>> normalisation of query strings for the patch I'll be submitting (that
>> obviously won't be supported in the version that builds against
>> existing Postgres versions that I'll make available). Essentially,
>> I'll be adding a length field to certain nodes,
>
> This seems like a good way to get your patch rejected: adding overhead
> to the core system for the benefit of a feature that not everyone cares
> about is problematic.

It's problematic, but I believe that it can be justified. Without
being glib, exactly no one cares about the location of tokens that
correspond to Const nodes until they have an error that occurs outside
the parser that is attributable to a node/corresponding token, which,
in a production environment, could take a long time. All I'm doing is
moving slightly more towards the default Bison representation of
YYLTYPE, where the column and row of both the beginning and end of
each token are stored. I'm storing location and length rather than
just location, which is a modest change. Not everyone cares about this
feature, but plenty do. It will be particularly useful to have the
Query representation stable, even to the point where it is stable
between pg_stat_statements_reset() calls - third party tools can rely
on the stability of the query string.

> Why do you need it anyway?  Surely it's possible
> to determine the length of a literal token after the fact.

Probably, but not in a manner that I deem to be well-principled. I
want to push the onus of bookkeeping for the replacement of tokens
onto the parser, which is authoritative - otherwise, I'll end up with
a kludge that is liable to fall out of sync. The community will have
the opportunity to consider whether that trade-off makes sense.

> More generally, if you're hashing anything later than the raw grammar
> tree, I think that generating a suitable representation of the queries
> represented by a single hash entry is going to be problematic anyway.
> There could be significant differences --- much more than just the
> values of constants --- between query strings that end up being
> semantically the same.  Or for that matter we could have identical query
> strings that wind up being considered different because of the action of
> search_path or other context.

I don't see that that has to be problematic. I consider that to be a
feature, and it can be documented as such. I think that any scenario
anyone might care to describe where this is a real problem will be
mostly contrived.

> It might be that the path of least resistance is to document that we
> select one of the actual query strings "at random" to represent all the
> queries that went into the same hash entry, and not even bother with
> trying to strip out constants.  The effort required to do that seems
> well out of proportion to the reward, if we can't do a perfect job of
> representing the common aspects of the queries.

Up until quite recently, this patch actually updated the query string,
so that a new set of constants was seen after each query call - that's
what Greg posted a link to. The part of the patch that normalises the
query string and the related backend infrastructure are fairly neat
adjuncts to what you see there. So while I'll be asking someone to
commit the patch with that infrastructure, because the adjunct adds
quite a lot of value to the patch, there is no need to have that be a
blocker.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-06 23:00:37
Message-ID: 22410.1320620437@sss.pgh.pa.us
Lists: pgsql-hackers

Peter Geoghegan <peter(at)2ndquadrant(dot)com> writes:
> On 6 November 2011 15:08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Are you intending to hash the raw
>> grammar output tree, the parse analysis result, the rewriter result,
>> or the actual plan tree?

> I'm hashing the Query tree from a planner plugin (function assigned to
> planner_hook), immediately prior to it being passed to
> standard_planner.

IOW, the rewriter result. Why that choice? That seems like about the
least useful of the four possibilities, since it provides no certainty
about the plan while also being as far from the original text as you
can get (thus making the problem of what to display pretty hard).

> A major consideration was backwards compatibility;

This is not a consideration that the community is likely to weigh
heavily, or indeed at all. We aren't going to back-port this feature
into prior release branches, and we aren't going to want to contort its
definition to make that easier. If 2ndquadrant wants to ship a
back-branch version with the feature, you can certainly also back-port
a patch that adds a hook where you need it, if you need a new hook.

But frankly I'm a bit surprised that you're not hashing the query plan,
since you could get that in the ExecutorStart hook that
pg_stat_statements already uses. With a hook placed somewhere else, you
add additional implementation problems of matching up the calls to that
hook with later calls to the executor hook.
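
For reference, a minimal sketch of that alternative - hashing the plan
tree from the ExecutorStart hook, again simplified to a full
nodeToString() dump, with illustrative names and the hook installation
omitted:

    #include "postgres.h"
    #include "access/hash.h"
    #include "executor/executor.h"

    static ExecutorStart_hook_type prev_ExecutorStart = NULL;

    static void
    pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        /* derive the hash from the finished plan tree */
        char   *plan_str = nodeToString(queryDesc->plannedstmt);
        uint32  planId = DatumGetUInt32(hash_any((unsigned char *) plan_str,
                                                 strlen(plan_str)));

        pfree(plan_str);
        /* ... look up or create the stats entry keyed by planId ... */
        (void) planId;

        if (prev_ExecutorStart)
            prev_ExecutorStart(queryDesc, eflags);
        else
            standard_ExecutorStart(queryDesc, eflags);
    }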

>> I'm not real sure whether it's better to classify on the basis of
>> similar plans or similar original queries, anyway. This seems like
>> something that could be open for debate about use-cases.

> Indeed - it's generally difficult to reason about what behaviour is
> optimal when attempting to anticipate the needs of every Postgres DBA.
> In this case, I myself lean strongly towards similar original queries,
> because classifying on the basis of similar plans isn't likely to be
> nearly as actionable, and there are really no obvious precedents to
> draw on - how can it be turned into a useful tool?

Hm, well, if that's what you think, why so late in the sequence?
Hashing the raw grammar output would be the best way to identify similar
original queries, ISTM. You'd also have a better chance of getting
people to hold still for the extra overhead fields, if they didn't
need to propagate further than that representation.

Basically, I think there are fairly clear arguments for using a hash of
the raw grammar output, if you lean towards the "classify by original
query" viewpoint; or a hash of the plan tree, if you lean towards the
"classify by plan" viewpoint. I don't see what use-cases would make it
optimal to hash one of the intermediate stages. I was hoping you had an
argument stronger than "it'll be easy to back-port" for that.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-07 14:03:09
Message-ID: CA+Tgmoac9sHHPsyntCf+Y8afTdFRJN3kVR0SYjVceemq5drCTw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 6, 2011 at 2:56 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Peter's patch adds a planner hook and hashes at that level.  Since this cat
> is rapidly escaping its bag and impacting other contributors, we might as
> well share the work in progress early if anyone has a burning desire to look
> at the code.  A diff against the version I've been internally reviewing in
> prep for community submission is at
> https://github.com/greg2ndQuadrant/postgres/compare/master...pg_stat_statements_norm
>  Easier to browse it than to ask probing questions about it, I think.  Apologies to
> Peter for sharing his work before he was completely ready; there is a list
> of known problems with it.  I also regret the thread hijack here.

I think it's an established principle that the design for features
like this should, for best results, be discussed on -hackers before
writing a lot of code. That not only avoids duplication of effort
(which is nice), but also allows design decisions like "what exactly
should we hash and where should the hooks be?" to be ironed out early,
before they can force a major rewrite. Of course, there's no hard
requirement, but the archives are littered with patches that crashed
and burned because the author wrote them in isolation and then, upon
receiving feedback from the community that necessitated a full
rewrite, gave up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To:
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-07 15:27:46
Message-ID: 4EB7F8F2.1070608@2ndQuadrant.com
Lists: pgsql-hackers

On 11/07/2011 09:03 AM, Robert Haas wrote:
> I think it's an established principle that the design for features
> like this should, for best results, be discussed on -hackers before
> writing a lot of code.

You can see from the commit history this idea is less than a month old.
Do we need to get community approval before writing a proof of concept
now? Everyone would still be arguing about how to get started had that
path been taken. If feedback says this needs a full rewrite, fine. We
are familiar with how submitting patches works here.


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-07 15:46:17
Message-ID: 4EB7FD49.8050009@2ndQuadrant.com
Lists: pgsql-hackers

On 11/06/2011 06:00 PM, Tom Lane wrote:
> Peter Geoghegan<peter(at)2ndquadrant(dot)com> writes:
>
> > A major consideration was backwards compatibility;
> This is not a consideration that the community is likely to weigh
> heavily, or indeed at all. We aren't going to back-port this feature
> into prior release branches, and we aren't going to want to contort its
> definition to make that easier.

Being able to ship a better pg_stat_statements that can run against
earlier versions as an extension has significant value to the
community. Needing to parse log files to do statement profiling is a
major advocacy issue for PostgreSQL. If we can get something useful
that's possible to test as an extension earlier than the 9.2
release--and make it available to more people running older versions
too--that has some real value to users, and for early production testing
of what will go into 9.2.

The point where this crosses over to requiring server-side code to
operate at all is obviously a deal breaker on that idea. So far that
line hasn't been crossed, and we're trying to stage testing against
older versions on real-world queries. As long as it's possible to keep
that goal without making the code more difficult to deal with, I
wouldn't dismiss that as a useless distinction.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-07 15:59:20
Message-ID: CA+TgmoYBnfYOOn-a=BFnQfVXQRWSLeiTJej0+qVb=03gEc3_Yg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 7, 2011 at 10:27 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 11/07/2011 09:03 AM, Robert Haas wrote:
>>
>> I think it's an established principle that the design for features
>> like this should, for best results, be discussed on -hackers before
>> writing a lot of code.
>
> You can see from the commit history this idea is less than a month old.  Do
> we need to get community approval before writing a proof of concept now?
>  Everyone would still be arguing about how to get started had that path been
> taken.  If feedback says this needs a full rewrite, fine.  We are familiar
> with how submitting patches works here.

Eh, obviously that didn't come off the way I meant it, since I already
got one other off-list reply suggesting that my tone there may not
have been the best. Sorry.

I suppose in a way I am taking myself to task as much as anyone. I
have been spending a lot of time thinking about things lately, and I
haven't been as good about converting those ideas into a form suitable
for posting on -hackers, and actually doing it, as I historically have
been; I feel like I need to get back to doing that more. But
I honestly don't care one way or the other how you or Peter develop
your patches, beyond the fact that they contain a lot of good ideas
which I would like to see work their way into the tree; and of course
I do know that you are already familiar with the process.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-25 18:52:18
Message-ID: CAMkU=1w6LgDCGfAkNEDLZpSu6vgNY0_jRd97MPHMdUztLzmwkQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 5, 2011 at 8:42 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> Dne 6.11.2011 03:16, Peter Geoghegan napsal(a):
>> 2011/11/6 Tomas Vondra <tv(at)fuzzy(dot)cz>:
>>> Hi everyone,
>>
>>> The patch implements a simple "cleaning" that replaces the parameter
>>> values with generic strings - e.g. numbers are turned to ":n", so the
>>> queries mentioned above are turned to
>>>
>>>   SELECT abalance FROM pgbench_accounts WHERE aid = :n
>>>
>>> and thus tracked as a single query in pg_stat_statements.
>>
>> I'm a couple of days away from posting a much better principled
>> implementation of pg_stat_statements normalisation. To normalise, we
>> perform a selective serialisation of the query tree, which is hashed.
>
> OK, my patch definitely is not the only possible approach, and if there's
> a way to get more info from the planner, the results surely might be
> better. My goal was to provide a simple patch; if yours solves the problem
> better, then I'll be more than happy to withdraw mine.

Hi Tomas,

Given Peter's patch on the same subject, should we now mark this one
as rejected in the commitfest app?
https://commitfest.postgresql.org/action/patch_view?id=681

Thanks,

Jeff


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-12-01 10:56:36
Message-ID: 4ED75D64.9010907@2ndQuadrant.com
Lists: pgsql-hackers

On 11/25/2011 10:52 AM, Jeff Janes wrote:
> Given Peter's patch on the same subject, should we now mark this one
> as rejected in the commitfest app?
> https://commitfest.postgresql.org/action/patch_view?id=681

That may be premature. While the testing I've done so far suggests
Peter's idea is the better route forward, the much higher complexity
level does mean it's surely possible for it to be rejected. We have a
reviewer signed up for Peter's submission now; I think what to do with
Tomas's entry will be an easier call to make once that report is in.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us