From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org, Mohammad Alhashash <alhashash(at)alhashash(dot)net> |
Subject: | Re: PATCH: Allow empty targets in unaccent dictionary |
Date: | 2014-06-30 19:19:17 |
Message-ID: | 20035.1404155957@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com> writes:
> I've attached a patch to contrib/unaccent as outlined in my review the
> other day.
I went to commit this, and while testing I realized that the current
implementation of unaccent_lexize is only capable of coping with "src"
strings that are single characters in the current encoding. I'm not
sure exactly how complicated it would be to fix that, but it might
not be terribly difficult. (Overlapping matches would be the main
complication, likely.)
Anyway, this raises the question of whether the current patch is
actually a desirable way to do things, or whether it would be better if
the unaccenting rules were like "base-char accent-char" -> "base-char".
Presumably, that would require more rules, but it would prevent deletion
of a standalone accent-char, which might or might not be a good thing.
Also, if there are any contexts where the right translation of an
accent-char depends on the base-char, you couldn't do it with the
patch as it stands. I don't know any languages that use separate
accent-chars, so I'm not qualified to opine on whether this is important.
It's not unlikely that we want this patch *and* an improvement that allows
multi-character src strings, but I held off committing for the moment
until somebody weighs in with an opinion about that.
In any case, the patch failed to update the user-facing docs
(unaccent.sgml) so we need some more work in that department. The
user-facing docs are already pretty weak in that they fail to explain the
whitespace rules, much less clarify that the src must be exactly a single
character.
If we don't improve the code to allow multi-character src, I wonder
whether the rule-reading code shouldn't be improved to forbid such
cases. I'm also wondering why it silently ignores malformed lines
instead of bleating. But that's more or less orthogonal to this patch.
Lastly, I didn't especially like the coding details of either proposed
patch, and rewrote it as attached. I didn't see a need to introduce
a "state 5", and I also didn't agree with changing the initial values
of trg/trglen to be meaningful. Right now, those variables are
initialized only to keep the compiler from bleating; the initial values
aren't actually used anywhere. It didn't seem like a good idea for
the trgxxx variables to be different from the srcxxx variables in that
respect.
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
empty-unaccent-v3.patch | text/x-diff | 1.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Christian Ullrich | 2014-06-30 19:28:03 | Re: PostgreSQL in Windows console and Ctrl-C |
Previous Message | Andres Freund | 2014-06-30 19:11:48 | Re: Cluster name in ps output |