Re: String comparision in PostgreSQL

Lists: pgsql-general
From: Nicola Cisternino <ncister(at)tiscali(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: String comparision in PostgreSQL
Date: 2012-08-28 14:46:55
Message-ID: 503CD9DF.40808@tiscali.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Hi all,
I'm valutating a complex porting of our application based on Sybase
SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your
opinion about searching/ordering funcionality.
The problem is about string comparision.
MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple
definition of case sensitive/insensitive behavior using char, varchar
and text field type.
In PostgreSQL I've already tried to use "citext", lower() function
(applied to indexes, too ...), ILIKE an so on ..... but nothing really
work as I need (poor performances ...) !!
My questions are:
1) Why PostgreSQL don't use COLLATE to manage case sensitive /
insensitive comparision (I think it's the best and ANSI standard way ....) ?
2) Can I build a custom COLLATION (for example named: "NOCASE" ....) to
apply to my DB objects ? What's the way (... some example ? ) ???

Thanks.
Best Regards.
Nicola.


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Nicola Cisternino <ncister(at)tiscali(dot)it>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-29 15:08:21
Message-ID: CAHyXU0zrMgadjRV8=DG68LxH9oJbLh7Z2C9+5tKZ4LcAqTgyTQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Aug 28, 2012 at 9:46 AM, Nicola Cisternino <ncister(at)tiscali(dot)it> wrote:
> Hi all,
> I'm valutating a complex porting of our application based on Sybase
> SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your
> opinion about searching/ordering funcionality.
> The problem is about string comparision.
> MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple
> definition of case sensitive/insensitive behavior using char, varchar and
> text field type.
> In PostgreSQL I've already tried to use "citext", lower() function (applied
> to indexes, too ...), ILIKE an so on ..... but nothing really work as I need
> (poor performances ...) !!

hm, poor performance? can you elaborate?

merlin


From: Nicola Cisternino <ncister(at)tiscali(dot)it>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-29 15:56:01
Message-ID: 503E3B91.2040107@tiscali.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Il 29/08/2012 17.08, Merlin Moncure ha scritto:
> On Tue, Aug 28, 2012 at 9:46 AM, Nicola Cisternino<ncister(at)tiscali(dot)it> wrote:
>> Hi all,
>> I'm valutating a complex porting of our application based on Sybase
>> SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your
>> opinion about searching/ordering funcionality.
>> The problem is about string comparision.
>> MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple
>> definition of case sensitive/insensitive behavior using char, varchar and
>> text field type.
>> In PostgreSQL I've already tried to use "citext", lower() function (applied
>> to indexes, too ...), ILIKE an so on ..... but nothing really work as I need
>> (poor performances ...) !!
> hm, poor performance? can you elaborate?
>
> merlin
The same query using " .... LIKE <value> ...." is completed in 15 ms
while using " .... ILIKE <value> ...." the execution time is 453 ms ....


From: Chris Angelico <rosuav(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-29 16:09:06
Message-ID: CAPTjJmo1f0oTip6mU-52DUSkf7B7wW-OMAn57XUDi3ggytJf7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Aug 30, 2012 at 1:56 AM, Nicola Cisternino <ncister(at)tiscali(dot)it> wrote:
> The same query using " .... LIKE <value> ...." is completed in 15 ms while
> using " .... ILIKE <value> ...." the execution time is 453 ms ....

Sounds to me like (pun not intended) there's an index that's being
used in one case and not in the other.

But taking this back a step: Do you really need case-insensitive
comparisons? They become pretty much impossible once you start looking
at internationalization. Sure, it's easy in ASCII. You just mask off
one bit and off you go. But truly case insensitive matching gets
really hairy. Can you redo things with case sensitive searches,
possibly with some forcing of case in simple situations? For instance,
you accept a name prefix from the user, look at it and find that it's
all ASCII; lower-case it, then upper-case the first letter, then add a
percent sign, and use a case-sensitive LIKE. That's going to produce
correct results in most cases, and is way faster than truly case
insensitive searching.

ChrisA


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Nicola Cisternino <ncister(at)tiscali(dot)it>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-29 16:45:15
Message-ID: CAHyXU0x=CFYjT16VLh57v-NUJZdA9HUzhSpJJm+ahMKfWuLOuQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Aug 29, 2012 at 10:56 AM, Nicola Cisternino <ncister(at)tiscali(dot)it> wrote:
> Il 29/08/2012 17.08, Merlin Moncure ha scritto:
>
> On Tue, Aug 28, 2012 at 9:46 AM, Nicola Cisternino <ncister(at)tiscali(dot)it>
> wrote:
>
> Hi all,
> I'm valutating a complex porting of our application based on Sybase
> SqlAnywhere on PostgreSQL (I've love it ...) and I'd like to have your
> opinion about searching/ordering funcionality.
> The problem is about string comparision.
> MS Sql server, MySql, SqlAnywhere and other DB engine allow a simple
> definition of case sensitive/insensitive behavior using char, varchar and
> text field type.
> In PostgreSQL I've already tried to use "citext", lower() function (applied
> to indexes, too ...), ILIKE an so on ..... but nothing really work as I need
> (poor performances ...) !!
>
> hm, poor performance? can you elaborate?
>
> merlin
>
> The same query using " .... LIKE <value> ...." is completed in 15 ms while
> using " .... ILIKE <value> ...." the execution time is 453 ms ....

citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower()

create table foo(f text);
create index on foo(lower(f));
select * from f where lower(f) = 'abc%'

this will be index optimized and fast as long as you specified C
locale for your database.

merlin


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Nicola Cisternino <ncister(at)tiscali(dot)it>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-30 02:01:48
Message-ID: 503EC98C.3090601@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 08/28/2012 10:46 PM, Nicola Cisternino wrote:

> 1) Why PostgreSQL don't use COLLATE to manage case sensitive /
> insensitive comparision (I think it's the best and ANSI standard way ....) ?

Support for per-column collations in PG was only added relatively
recently - in 9.1, by the looks:

http://www.postgresql.org/docs/9.1/static/collation.html

Prior to that, there was no meaningful way to use case insensitive
collations, as these would affect the whole database, including system
tables, which could break all sorts of things in new and exciting ways.

With the advent of per-column and per-operation collation control,
case-insensitive collations become very appealing.

One of the challenges with adding case insensitive collations is that,
AFAIK, collations are implemented using the operating system charset and
locale support, which may not offer case insensitive collation directly.

Another challenge is the rather ... variable ... meaning of "case
insensitive". Results are likely to vary between host platforms and
versions.

Still, now that per-col / per-op collation is supported, it'd be nice to
have. I don't know if it's just a matter of needing someone with the
desire and time (or funding) and expertise to design and build it, or if
there'd be issues with getting it accepted.

--
Craig Ringer


From: Nicola Cisternino <ncister(at)tiscali(dot)it>
To: Chris Angelico <rosuav(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-30 08:36:40
Message-ID: 503F2618.6070003@tiscali.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Il 29/08/2012 18.09, Chris Angelico ha scritto:
> On Thu, Aug 30, 2012 at 1:56 AM, Nicola Cisternino<ncister(at)tiscali(dot)it> wrote:
>> The same query using " .... LIKE<value> ...." is completed in 15 ms while
>> using " .... ILIKE<value> ...." the execution time is 453 ms ....
> Sounds to me like (pun not intended) there's an index that's being
> used in one case and not in the other.
>
> But taking this back a step: Do you really need case-insensitive
> comparisons? They become pretty much impossible once you start looking
> at internationalization. Sure, it's easy in ASCII. You just mask off
> one bit and off you go. But truly case insensitive matching gets
> really hairy. Can you redo things with case sensitive searches,
> possibly with some forcing of case in simple situations? For instance,
> you accept a name prefix from the user, look at it and find that it's
> all ASCII; lower-case it, then upper-case the first letter, then add a
> percent sign, and use a case-sensitive LIKE. That's going to produce
> correct results in most cases, and is way faster than truly case
> insensitive searching.
>
> ChrisA
>
>
Yes I need case-insensitive comparision and the best way to (optionally)
obtain it is without alter all application queries that already works
with other DB engine (!!)
My simply questions are:
1) Can be a custom collation a solution for my needs ?
2) How can create a custom collation (... what steps ...) ?
Thanks.


From: Nicola Cisternino <ncister(at)tiscali(dot)it>
To: ringerc(at)ringerc(dot)id(dot)au
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-30 09:16:40
Message-ID: 503F2F78.7060805@tiscali.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Il 30/08/2012 4.01, Craig Ringer ha scritto:
> On 08/28/2012 10:46 PM, Nicola Cisternino wrote:
>
>> 1) Why PostgreSQL don't use COLLATE to manage case sensitive /
>> insensitive comparision (I think it's the best and ANSI standard way
>> ....) ?
>
> Support for per-column collations in PG was only added relatively
> recently - in 9.1, by the looks:
>
> http://www.postgresql.org/docs/9.1/static/collation.html
>
> Prior to that, there was no meaningful way to use case insensitive
> collations, as these would affect the whole database, including system
> tables, which could break all sorts of things in new and exciting ways.
>
> With the advent of per-column and per-operation collation control,
> case-insensitive collations become very appealing.
>
> One of the challenges with adding case insensitive collations is that,
> AFAIK, collations are implemented using the operating system charset
> and locale support, which may not offer case insensitive collation
> directly.
>
> Another challenge is the rather ... variable ... meaning of "case
> insensitive". Results are likely to vary between host platforms and
> versions.
>
> Still, now that per-col / per-op collation is supported, it'd be nice
> to have. I don't know if it's just a matter of needing someone with
> the desire and time (or funding) and expertise to design and build it,
> or if there'd be issues with getting it accepted.
>
> --
> Craig Ringer
>
Thus the problem is that " .... collations are implemented using the
operating system charset and locale support ... " while, other engines,
implements collations internally ..... is it right ?
Thanks.


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Nicola Cisternino <ncister(at)tiscali(dot)it>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-30 10:45:25
Message-ID: 503F4445.4070809@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 08/30/2012 05:16 PM, Nicola Cisternino wrote:

> Thus the problem is that " .... collations are implemented using the
> operating system charset and locale support ... " while, other engines,
> implements collations internally ..... is it right ?

That's my understanding, but I don't know which other database systems
you're talking about because you've never specifically named any.

It's entirely possible that some other DBMSs use the system locale and
collation support with internal downcasing, for example.

All I know, I've already said, but I'm not an expert on Pg's innards.

--
Craig Ringer


From: Nicola Cisternino <ncister(at)tiscali(dot)it>
To: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-30 10:54:42
Message-ID: 503F4672.9040405@tiscali.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Il 30/08/2012 12.45, Craig Ringer ha scritto:
> On 08/30/2012 05:16 PM, Nicola Cisternino wrote:
>
>> Thus the problem is that " .... collations are implemented using the
>> operating system charset and locale support ... " while, other engines,
>> implements collations internally ..... is it right ?
>
> That's my understanding, but I don't know which other database systems
> you're talking about because you've never specifically named any.
>
> It's entirely possible that some other DBMSs use the system locale and
> collation support with internal downcasing, for example.
>
> All I know, I've already said, but I'm not an expert on Pg's innards.
>
> --
> Craig Ringer
Tahnk you for replay.
At this point, the solution could be a new, custom, operating system
collation .... (something like: en_CI_US.UTF-8) ....


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Nicola Cisternino <ncister(at)tiscali(dot)it>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-30 11:15:38
Message-ID: 503F4B5A.2000707@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 08/30/2012 06:54 PM, Nicola Cisternino wrote:

> At this point, the solution could be a new, custom, operating system
> collation .... (something like: en_CI_US.UTF-8) ....

As far as I know - and as I said, I'm hardly an expert in Pg's guts -
there's no way to create a case insensitive collation as things stand.

Creating case insensitive collations in the C library is probably more
work than you could possibly expect, if it's even possible at all on
some platforms.

You'd have a better chance investigating whether it could be possible to
"wrap" an operating system collation with lower-casing. This is likely
to be something you will want to work with some experienced Pg
developers with, most likely on a funded work basis because you're
unlikely to find anyone who wants to implement case insensitive
collations for fun in their spare time.

If you're seriously interested in enhancing PostgreSQL to support case
insensitive collation, and you have the funds to sponsor the significant
amount of work required, you could ask on pgsql-hackers and see if
anyone's interested, or contact
http://www.postgresql.org/support/professional_support/ . This is not a
trivial job.

It will help if you are able to provide *specific* *test* *cases*
showing exactly how you think it should work, with samples showing how
it works on other specifically named products. No more hand-waving.

--
Craig Ringer


From: "F(dot) BROUARD / SQLpro" <sqlpro(at)club-internet(dot)fr>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-30 12:09:28
Message-ID: 503F57F8.4070206@club-internet.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Le 30/08/2012 12:45, Craig Ringer a écrit :

> That's my understanding, but I don't know which other database systems
> you're talking about because you've never specifically named any.
>
In his primary post he talk about SQL Server, Sybase and MySQL wich does
good jobs with collation....

Almost a majority of RDBMS have collation support wich is very important
for non english languages, that represents about 90% of the planet
languages !

This feature has always been a "black spot" in PG, and the most
important topic to discourage to use it in professionnal applications.

A +

--
Frédéric BROUARD - expert SGBDR et SQL - MVP SQL Server - 06 11 86 40 66
Le site sur le langage SQL et les SGBDR : http://sqlpro.developpez.com
Enseignant Arts & Métiers PACA, ISEN Toulon et CESI/EXIA Aix en Provence
Audit, conseil, expertise, formation, modélisation, tuning, optimisation
*********************** http://www.sqlspot.com *************************


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: "F(dot) BROUARD / SQLpro" <sqlpro(at)club-internet(dot)fr>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-30 14:34:47
Message-ID: CAOR=d=3gPV_6joao5eOuE8U+SW2iLD5bFVFf7d_1KCcrNVPG3Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Aug 30, 2012 at 6:09 AM, F. BROUARD / SQLpro
<sqlpro(at)club-internet(dot)fr> wrote:
> Le 30/08/2012 12:45, Craig Ringer a écrit :
>
>
>> That's my understanding, but I don't know which other database systems
>> you're talking about because you've never specifically named any.
>>
> In his primary post he talk about SQL Server, Sybase and MySQL wich does
> good jobs with collation....
>
> Almost a majority of RDBMS have collation support wich is very important for
> non english languages, that represents about 90% of the planet languages !
>
> This feature has always been a "black spot" in PG, and the most important
> topic to discourage to use it in professionnal applications.

Citations please.

PostgreSQL has excellent collation support.
http://www.postgresql.org/docs/9.1/static/collation.html

Sybase performance on upper() case insensitive searchs:
http://stackoverflow.com/questions/81268/case-insensitive-search-on-sybase

MySQL case insensitive searchs rely on indexing upper or lower
functions just like PostgreSQL and use seq scans for collation induced
case insensitive searchs:
http://use-the-index-luke.com/sql/where-clause/functions/case-insensitive-search

Further PostgreSQL has the citext type:
http://www.postgresql.org/docs/9.1/static/citext.html
Which can be handy for case insensitive searches but can ONLY do case
insensitive stuff.


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: "F(dot) BROUARD / SQLpro" <sqlpro(at)club-internet(dot)fr>, pgsql-general(at)postgresql(dot)org
Subject: Re: String comparision in PostgreSQL
Date: 2012-08-30 15:09:57
Message-ID: CAHyXU0wstnLneRDs8EP_P=S96+uk=QSFQyikTv9Gv0eXNGC9sg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Aug 30, 2012 at 9:34 AM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
> On Thu, Aug 30, 2012 at 6:09 AM, F. BROUARD / SQLpro
> <sqlpro(at)club-internet(dot)fr> wrote:
>> Le 30/08/2012 12:45, Craig Ringer a écrit :
>>
>>
>>> That's my understanding, but I don't know which other database systems
>>> you're talking about because you've never specifically named any.
>>>
>> In his primary post he talk about SQL Server, Sybase and MySQL wich does
>> good jobs with collation....
>>
>> Almost a majority of RDBMS have collation support wich is very important for
>> non english languages, that represents about 90% of the planet languages !
>>
>> This feature has always been a "black spot" in PG, and the most important
>> topic to discourage to use it in professionnal applications.
>
> Citations please.
>
> PostgreSQL has excellent collation support.
> http://www.postgresql.org/docs/9.1/static/collation.html
>
> Sybase performance on upper() case insensitive searchs:
> http://stackoverflow.com/questions/81268/case-insensitive-search-on-sybase
>
> MySQL case insensitive searchs rely on indexing upper or lower
> functions just like PostgreSQL and use seq scans for collation induced
> case insensitive searchs:
> http://use-the-index-luke.com/sql/where-clause/functions/case-insensitive-search
>
> Further PostgreSQL has the citext type:
> http://www.postgresql.org/docs/9.1/static/citext.html
> Which can be handy for case insensitive searches but can ONLY do case
> insensitive stuff.

Yeah. In particular, lower() approaches for case insensitive
searching have always worked and IMSNHO remain the best way.
Expression based searching and indexing is a real strong point for
postgres and is the 'go to' method for solving a broad array of
problems. The fairest complaint you can make is that historically
you've had to ditch performance to get good collation features -- and
this is mostly solved.

I guess the biggest problem that remains is the inability to use LIKE
searches for index through utf8 ordered indexes.

merlin