Re: Doing better at HINTing an appropriate column within errorMissingColumn()

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Ian Barwick <ian(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Jim Nasby <jim(at)nasby(dot)net>, Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Subject: Re: Doing better at HINTing an appropriate column within errorMissingColumn()
Date: 2014-12-04 02:21:24
Message-ID: CAM3SWZRhLLdPD0JHQeXQpVVv54CC_j37t8T-oz5f3SxystShHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Basically, the case in which I think it's helpful to issue a
> suggestion here is when the user has used the table name rather than
> the alias name. I wonder if it's worth checking for that case
> specifically, in lieu of what you've done here, and issuing a totally
> different hint in that case ("HINT: You must refer to this as column
> as "prime_minister.id" rather than "cameron.id").

Well, if an alias is used, and you refer to an attribute using a
non-alias name (i.e. the original table name), then you'll already get
an error suggesting that the alias be used instead -- of course,
that's nothing new. It doesn't matter to the existing hinting
mechanism if the attribute name is otherwise wrong. Once you fix the
code to use the alias suggested, you'll then get this new
Levenshtein-based hint.

> Another idea, which I think I like less well, is to check the
> Levenshtein distance between the allowed alias and the entered alias
> and, if that's within the half-the-shorter-length threshold, consider
> possible matches from that RTE, charge the distance between the
> correct alias and the entered alias as a penalty to each potential
> column match.

I don't about that either. Aliases are often totally arbitrary,
particularly for ad-hoc queries, which is what this is aimed at.

> What I think won't do is to look at a situation where the user has
> entered automobile.id and suggest that maybe they meant student.iq, or
> even student.id.

I'm not sure I follow. If there is an automobile.ip, then it will be
suggested. If there is no automobile column that's much of a match (so
no "automobile.ip", say), then student.id will be suggested (and not
student.iq, *even if there is no student.id* - the final quality check
saves us). So this is possible:

postgres=# select iq, * from student, automobile;
ERROR: 42703: column "iq" does not exist
LINE 1: select iq, * from student, automobile;
^
HINT: Perhaps you meant to reference the column "student"."id".
postgres=# select automobile.iq, * from student, automobile;
ERROR: 42703: column automobile.iq does not exist
LINE 1: select automobile.iq, * from student, automobile;
^

(note that using the table name makes us *not* see a suggestion where
we otherwise would).

The point is that there is a fixed penalty for a wrong user-specified
alias, but all relation RTEs are considered.

> The amount of difference between the names has got to
> matter for the RTE names, just as it does for the column names.

I think it makes sense that it matters by a fixed amount. Besides,
this seems complicated enough already - I don't won't to add more
complexity to worry about equidistant (but still actually valid)
RTE/table/alias names.

It sounds like your concern here is mostly a concern about the
relative distance among multiple matches, as opposed to the absolute
quality of suggestions. The former seems a lot less controversial than
the latter was, though - the user always gets the best match, or the
join pair of best matches, or no match when this new hinting mechanism
is involved.

I attach a new revision. The revision:

* Uses default costs for Levenshtein distance.

* Still charges extra for a non-alias-matching match (although it only
charges a fixed distance of 1 extra). This has regression test
coverage.

* Applies a generic final quality check that enforces a requirement
that a hint have a distance of no greater than 50% of the total string
size. No special treatment of shorter strings is involved anymore.

* Moves almost everything out of scanRTEForColumn() as you outlined
(into a new function, updateFuzzyAttrMatchState(), per your
suggestion).

* Moves dropped column detection into updateFuzzyAttrMatchState(), per
your suggestion.

* Still does the "if (rte->rtekind == RTE_JOIN)" thing in the existing
function searchRangeTableForCol().

I am quite confident that a suggestion from a join RTE will never be
useful, to either the existing use of searchRangeTableForCol() or this
expanded use, and it makes more sense to me to put it there. In fact,
the existing use of searchRangeTableForCol() is really rather similar
to this, and will give up on the first identical match (which is taken
as evidence that there is a attribute of that name, but isn't visible
at this level of the query). So I have not followed your suggestion
here.

Thoughts?
--
Peter Geoghegan

Attachment Content-Type Size
0001-Levenshtein-distance-column-HINT.patch text/x-patch 25.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2014-12-04 03:08:18 Re: tracking commit timestamps
Previous Message Peter Eisentraut 2014-12-04 01:40:49 Re: [COMMITTERS] pgsql: Fix whitespace