Re: PATCH: Allow empty targets in unaccent dictionary

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Mohammad Alhashash <alhashash(at)alhashash(dot)net>
Subject: Re: PATCH: Allow empty targets in unaccent dictionary
Date: 2014-06-30 19:19:17
Message-ID: 20035.1404155957@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com> writes:
> I've attached a patch to contrib/unaccent as outlined in my review the
> other day.

I went to commit this, and while testing I realized that the current
implementation of unaccent_lexize is only capable of coping with "src"
strings that are single characters in the current encoding. I'm not
sure exactly how complicated it would be to fix that, but it might
not be terribly difficult. (Overlapping matches would be the main
complication, likely.)

Anyway, this raises the question of whether the current patch is
actually a desirable way to do things, or whether it would be better if
the unaccenting rules were like "base-char accent-char" -> "base-char".
Presumably, that would require more rules, but it would prevent deletion
of a standalone accent-char, which might or might not be a good thing.
Also, if there are any contexts where the right translation of an
accent-char depends on the base-char, you couldn't do it with the
patch as it stands. I don't know any languages that use separate
accent-chars, so I'm not qualified to opine on whether this is important.

It's not unlikely that we want this patch *and* an improvement that allows
multi-character src strings, but I held off committing for the moment
until somebody weighs in with an opinion about that.

In any case, the patch failed to update the user-facing docs
(unaccent.sgml) so we need some more work in that department. The
user-facing docs are already pretty weak in that they fail to explain the
whitespace rules, much less clarify that the src must be exactly a single
character.

If we don't improve the code to allow multi-character src, I wonder
whether the rule-reading code shouldn't be improved to forbid such
cases. I'm also wondering why it silently ignores malformed lines
instead of bleating. But that's more or less orthogonal to this patch.

Lastly, I didn't especially like the coding details of either proposed
patch, and rewrote it as attached. I didn't see a need to introduce
a "state 5", and I also didn't agree with changing the initial values
of trg/trglen to be meaningful. Right now, those variables are
initialized only to keep the compiler from bleating; the initial values
aren't actually used anywhere. It didn't seem like a good idea for
the trgxxx variables to be different from the srcxxx variables in that
respect.

regards, tom lane

Attachment Content-Type Size
empty-unaccent-v3.patch text/x-diff 1.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christian Ullrich 2014-06-30 19:28:03 Re: PostgreSQL in Windows console and Ctrl-C
Previous Message Andres Freund 2014-06-30 19:11:48 Re: Cluster name in ps output