Q: Structured index - which one runs faster?

Lists: pgsql-adminpgsql-generalpgsql-performance
From: Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at>
To: "pgsql-admin" <pgsql-admin(at)postgresql(dot)org>, "pgsql-general" <pgsql-general(at)postgresql(dot)org>, "pgsql-performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Q: Structured index - which one runs faster?
Date: 2003-05-22 20:41:22
Message-ID: 5.1.1.6.2.20030522223909.05929600@mail.vogelsinger.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

Hi all,

sorry for reposting this to the lists, but I feel I posted this at the
wrong time of day, since now a lot more of you gurus are reading, and I
really need some knowledgeable input... thanks for consideration :)

I have a question concerning table/key layout.

I need to store an ID value that consists of three numerical elements:
- ident1 char(5)
- ident2 char(5)
- nodeid int4

I need an index on these columns. Insert, delete, and lookup operations
this in this need to be as fast as possible. Now I have two options:

(a) creating an index on all three columns, or
(b) create a single varchar column combining all three components into a
single string, like "ident1:ident2:nodeid" and indexing this column only.

There will be a couple of million rows in this table, the values in
question are not unique.

Which would be faster in your opinion? (a) or (b)?

Thanks for any insight,

--
>O Ernest E. Vogelsinger
(\) ICQ #13394035
^ http://www.vogelsinger.at/

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
message can get through to the mailing list cleanly

--
>O Ernest E. Vogelsinger
(\) ICQ #13394035
^ http://www.vogelsinger.at/


From: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
To: Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at>
Cc: pgsql-admin <pgsql-admin(at)postgresql(dot)org>, pgsql-general <pgsql-general(at)postgresql(dot)org>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [GENERAL] Q: Structured index - which one runs faster?
Date: 2003-05-22 22:23:44
Message-ID: Pine.LNX.4.33.0305221623150.25804-100000@css120.ihs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

On Thu, 22 May 2003, Ernest E Vogelsinger wrote:

> Hi all,
>
> sorry for reposting this to the lists, but I feel I posted this at the
> wrong time of day, since now a lot more of you gurus are reading, and I
> really need some knowledgeable input... thanks for consideration :)
>
>
> I have a question concerning table/key layout.
>
> I need to store an ID value that consists of three numerical elements:
> - ident1 char(5)
> - ident2 char(5)
> - nodeid int4
>
> I need an index on these columns. Insert, delete, and lookup operations
> this in this need to be as fast as possible. Now I have two options:
>
> (a) creating an index on all three columns, or
> (b) create a single varchar column combining all three components into a
> single string, like "ident1:ident2:nodeid" and indexing this column only.
>
> There will be a couple of million rows in this table, the values in
> question are not unique.
>
> Which would be faster in your opinion? (a) or (b)?

Generally speaking, b should be faster, but a should be more versatile.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at>
Cc: "pgsql-admin" <pgsql-admin(at)postgresql(dot)org>, "pgsql-general" <pgsql-general(at)postgresql(dot)org>, "pgsql-performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Q: Structured index - which one runs faster?
Date: 2003-05-22 22:53:00
Message-ID: 11951.1053643980@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at> writes:
> (a) creating an index on all three columns, or
> (b) create a single varchar column combining all three components into a
> single string, like "ident1:ident2:nodeid" and indexing this column only.

I can't imagine that (b) is a good idea ... it's dubious that you are
saving anything on the indexing, and you're sure adding a lot of space
to the table, not to mention maintenance effort, potential for bugs,
etc.

It might be worth creating the index so that the "least non-unique"
column is mentioned first, if there's a clear winner in those terms.
That would minimize the number of times that comparisons have to look at
the additional columns.

regards, tom lane


From: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
To: Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [ADMIN] Q: Structured index - which one runs faster?
Date: 2003-05-22 23:00:54
Message-ID: 20030522155056.R47562-100000@megazone23.bigpanda.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

On Thu, 22 May 2003, Ernest E Vogelsinger wrote:

[response only to -performance]

> sorry for reposting this to the lists, but I feel I posted this at the
> wrong time of day, since now a lot more of you gurus are reading, and I
> really need some knowledgeable input... thanks for consideration :)

It just takes time. :)

> I have a question concerning table/key layout.
>
> I need to store an ID value that consists of three numerical elements:
> - ident1 char(5)
> - ident2 char(5)
> - nodeid int4

This seems like a somewhat odd key layout, why char(5) for the first
two parts if they're numeric as well?

> I need an index on these columns. Insert, delete, and lookup operations
> this in this need to be as fast as possible. Now I have two options:
>
> (a) creating an index on all three columns, or
> (b) create a single varchar column combining all three components into a
> single string, like "ident1:ident2:nodeid" and indexing this column only.
>
> There will be a couple of million rows in this table, the values in
> question are not unique.
>
> Which would be faster in your opinion? (a) or (b)?

Generally, you're probably better off with an index on the three columns.
Otherwise either your clients need to composite the value for the varchar
column or the system does in triggers for insert/update.

Also, what kinds of lookups are you going to be doing? Only lookups based
on all three parts of the key or will you ever be searching based on parts
of the keys?


From: Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at>
To: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [ADMIN] Q: Structured index - which one runs faster?
Date: 2003-05-22 23:36:14
Message-ID: 5.1.1.6.2.20030523012954.0400a730@mail.vogelsinger.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

Thanks for replying :)

At 01:00 23.05.2003, Stephan Szabo said:
--------------------[snip]--------------------
>On Thu, 22 May 2003, Ernest E Vogelsinger wrote:
>
>> I need to store an ID value that consists of three numerical elements:
>> - ident1 char(5)
>> - ident2 char(5)
>> - nodeid int4
>
>This seems like a somewhat odd key layout, why char(5) for the first
>two parts if they're numeric as well?

It's not odd - ident1 and ident2 are in fact logical identifiers that _are_
character values, no numbers.

>Generally, you're probably better off with an index on the three columns.
>Otherwise either your clients need to composite the value for the varchar
>column or the system does in triggers for insert/update.

This table will be used by a PHP library accessing it - no direct client
intervention (except the developers and they should know what they're doing ;-)

>Also, what kinds of lookups are you going to be doing? Only lookups based
>on all three parts of the key or will you ever be searching based on parts
>of the keys?

Hmm. Yes, lookups on parts of the keys will be possible, but only from left
to right, ident1 having the highest precedence, followed by ident2 and
finally by nodeid.

These columns will never be modified once inserted. The only operations
these columns will be affected are insert and delete, and lookup of course.
I'm not so concerned with delete since this will not happen too often, but
inserts will, and lookups of course permanently, and both operations must
be as fast as possible, even with gazillions of rows...

--
>O Ernest E. Vogelsinger
(\) ICQ #13394035
^ http://www.vogelsinger.at/


From: Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-admin" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Q: Structured index - which one runs faster?
Date: 2003-05-22 23:43:06
Message-ID: 5.1.1.6.2.20030523013910.03095bb8@mail.vogelsinger.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

At 00:53 23.05.2003, Tom Lane said:
--------------------[snip]--------------------
>Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at> writes:
>> (a) creating an index on all three columns, or
>> (b) create a single varchar column combining all three components into a
>> single string, like "ident1:ident2:nodeid" and indexing this column only.
>
>I can't imagine that (b) is a good idea ... it's dubious that you are
>saving anything on the indexing, and you're sure adding a lot of space
>to the table, not to mention maintenance effort, potential for bugs,
>etc.
>
>It might be worth creating the index so that the "least non-unique"
>column is mentioned first, if there's a clear winner in those terms.
>That would minimize the number of times that comparisons have to look at
>the additional columns.
--------------------[snip]--------------------

Thanks for replying :)

Do you know if there's a general performance difference between numeric
(int4) and character (fixed-size char[5]) columns? The ident1 and ident2
columns are planned to be char[5], only the third column (with least
precedence) will be numeric.

The application is still in the design phase, so I still could fiddle
around that and make that char[5] numeric with an additional mapping
(@runtime, not in the DB) if this will increase performance.

Thanks,

--
>O Ernest E. Vogelsinger
(\) ICQ #13394035
^ http://www.vogelsinger.at/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at>
Cc: "pgsql-admin" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Q: Structured index - which one runs faster?
Date: 2003-05-23 00:00:14
Message-ID: 12259.1053648014@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at> writes:
> Do you know if there's a general performance difference between numeric
> (int4) and character (fixed-size char[5]) columns? The ident1 and ident2
> columns are planned to be char[5], only the third column (with least
> precedence) will be numeric.

int4 is certainly faster to compare than char(n), but I wouldn't contort
your database design on that basis... if the idents aren't naturally
integers, don't force them to be.

regards, tom lane


From: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
To: Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [ADMIN] Q: Structured index - which one runs faster?
Date: 2003-05-23 06:42:36
Message-ID: 20030522233646.Y58014-100000@megazone23.bigpanda.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance


On Fri, 23 May 2003, Ernest E Vogelsinger wrote:

> Thanks for replying :)
>
> At 01:00 23.05.2003, Stephan Szabo said:
> --------------------[snip]--------------------
> >On Thu, 22 May 2003, Ernest E Vogelsinger wrote:
> >
> >> I need to store an ID value that consists of three numerical elements:
> >> - ident1 char(5)
> >> - ident2 char(5)
> >> - nodeid int4
> >
> >This seems like a somewhat odd key layout, why char(5) for the first
> >two parts if they're numeric as well?
>
> It's not odd - ident1 and ident2 are in fact logical identifiers that _are_
> character values, no numbers.

The reason I mentioned it is that the original said, "three numerical
elements" ;)

> >Also, what kinds of lookups are you going to be doing? Only lookups based
> >on all three parts of the key or will you ever be searching based on parts
> >of the keys?
>
> Hmm. Yes, lookups on parts of the keys will be possible, but only from left
> to right, ident1 having the highest precedence, followed by ident2 and
> finally by nodeid.

The multi-column index helps for those as well, as long as you put the
columns in the precedence order. If they're ordered ident1,ident2,nodeid
then it'll potentially use it for searches on ident1 or ident1 and ident2
if it thinks that the condition is selective enough.


From: Vivek Khera <khera(at)kcilink(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: [ADMIN] Q: Structured index - which one runs faster?
Date: 2003-05-23 15:09:00
Message-ID: x7ptm94s4z.fsf@yertle.int.kciLink.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

A related question:

Are any of these indexes redundant:

CREATE UNIQUE INDEX user_list_id_email ON user_list (owner_id,user_email);
CREATE INDEX user_list_owner_id ON user_list (owner_id);
CREATE INDEX user_list_oid_created ON user_list (owner_id,user_created);

In particular, is user_list_owner_id redundant to
user_list_oid_created? Will the latter be used for queries such as

SELECT user_fname from user_list where owner_id=34

If so, I can drop the owner_id index. the _id columns are integers,
created is a datetime, and email is a string. owner_id is also a
foreign key into the owners table (via REFERENCES), if that matters.

I'd try it out by dropping the index, but reindexing it takes a *LONG*
time which I cannot afford to be unavailable.

Thanks.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D. Khera Communications, Inc.
Internet: khera(at)kciLink(dot)com Rockville, MD +1-240-453-8497
AIM: vivekkhera Y!: vivek_khera http://www.khera.org/~vivek/


From: Reece Hart <rkh(at)gene(dot)COM>
To: Ernest E Vogelsinger <ernest(at)vogelsinger(dot)at>
Cc: pgsql-admin <pgsql-admin(at)postgresql(dot)org>, pgsql-general <pgsql-general(at)postgresql(dot)org>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [GENERAL] Q: Structured index - which one runs faster?
Date: 2003-05-23 16:46:25
Message-ID: 1053708385.29339.26.camel@tallac
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

Ernest-

> (a) creating an index on all three columns, or
> (b) create a single varchar column combining all three components into a
> single string, like "ident1:ident2:nodeid" and indexing this column only.
>
> There will be a couple of million rows in this table, the values in
> question are not unique.

I'd go with (a). (b) is not very flexible (e.g., lookup by ident2
only), and any speed advantage will require knowing in advance the
optimal key order (i1:i2:n v. n:i2:i1 v. ...). I'd expect it would be
comparable to a multi-column index for speed.

(a) can really be implemented in 3 ways:
(a1) an index of all 3 columns
(a2) an index on /each/ of 3 columns
(a3) a multi-column index AND separate indices on the others.
e.g., index (i1,i2,n), and index (i2) and index (n)

The choice of which is fastest depends a lot on the distribution of keys
in each column and whether you need to do lookups on only one or two
columns. Again, once you choose (b), you're kinda stuck with treating
the compound key as a single entity (without incurring a big performance
hit); (a) will allow you to experiment with optimal indexing without
affecting code.

Since it sounds like you've already got the data loaded, I (probably
others) would be interested in any timing runs you do.

-Reece

--
Reece Hart, Ph.D. rkh(at)gene(dot)com, http://www.gene.com/
Genentech, Inc. 650/225-6133 (voice), -5389 (fax)
Bioinformatics and Protein Engineering
1 DNA Way, MS-93 http://www.in-machina.com/~reece/
South San Francisco, CA 94080-4990 reece(at)in-machina(dot)com, GPG: 0x25EC91A0


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Vivek Khera <khera(at)kcilink(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [ADMIN] Q: Structured index - which one runs faster?
Date: 2003-05-23 16:50:20
Message-ID: 20030523165020.GA11094@wolff.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

On Fri, May 23, 2003 at 11:09:00 -0400,
Vivek Khera <khera(at)kcilink(dot)com> wrote:
> A related question:
>
> Are any of these indexes redundant:
>
> CREATE UNIQUE INDEX user_list_id_email ON user_list (owner_id,user_email);
> CREATE INDEX user_list_owner_id ON user_list (owner_id);
> CREATE INDEX user_list_oid_created ON user_list (owner_id,user_created);
>
> In particular, is user_list_owner_id redundant to
> user_list_oid_created? Will the latter be used for queries such as

Yes. Any prefix of a multicolumn index can be used for queries. They
(prefixes) won't be usable by foreign key references because even if the
index as a whole is unique, the prefixes won't necessarily be.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Vivek Khera <khera(at)kcilink(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [ADMIN] Q: Structured index - which one runs faster?
Date: 2003-05-23 17:38:37
Message-ID: 17188.1053711517@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

Vivek Khera <khera(at)kcilink(dot)com> writes:
> Are any of these indexes redundant:

> CREATE UNIQUE INDEX user_list_id_email ON user_list (owner_id,user_email);
> CREATE INDEX user_list_owner_id ON user_list (owner_id);
> CREATE INDEX user_list_oid_created ON user_list (owner_id,user_created);

> In particular, is user_list_owner_id redundant to
> user_list_oid_created?

Any of the three indexes can be used for a search on owner_id alone, so
yeah, user_list_owner_id is redundant. It would be marginally faster to
use user_list_owner_id for such a search, just because it's physically
smaller than the other two indexes, but against that you have to balance
the extra update cost of maintaining the additional index.

Also, I can imagine scenarios where even a pure SELECT query load could
find the extra index to be a net loss: if you have a mix of queries that
use two or all three indexes, and the indexes don't fit in kernel disk
cache but just one or two would, then you'll lose on extra I/O as the
indexes compete for cache space. Not sure how likely that scenario is,
but it's something to think about.

regards, tom lane


From: Vivek Khera <khera(at)kcilink(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: [ADMIN] Q: Structured index - which one runs faster?
Date: 2003-05-23 18:04:28
Message-ID: 16078.25260.512996.337914@yertle.int.kciLink.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

>>>>> "TL" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> In particular, is user_list_owner_id redundant to
>> user_list_oid_created?

TL> Any of the three indexes can be used for a search on owner_id alone, so
TL> yeah, user_list_owner_id is redundant. It would be marginally faster to
TL> use user_list_owner_id for such a search, just because it's physically
TL> smaller than the other two indexes, but against that you have to balance
TL> the extra update cost of maintaining the additional index.

This is great info. That extra index is gonna be nuked in about 37.23
seconds... It takes up a lot of space and is wasting time with
updates and inserts, which happen a *lot* on that table (nearly 10
million rows).

Thanks!


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Vivek Khera <khera(at)kcilink(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [ADMIN] Q: Structured index - which one runs faster?
Date: 2003-05-23 18:30:03
Message-ID: i0pscvo0kfalgh7bplilq8gk0rg5q8gh34@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-general pgsql-performance

On 23 May 2003 11:09:00 -0400, Vivek Khera <khera(at)kcilink(dot)com> wrote:
> CREATE UNIQUE INDEX user_list_id_email ON user_list (owner_id,user_email);
> CREATE INDEX user_list_owner_id ON user_list (owner_id);
> CREATE INDEX user_list_oid_created ON user_list (owner_id,user_created);
>
>In particular, is user_list_owner_id redundant to
>user_list_oid_created?

In theory yes, but in practice it depends ...

> Will the latter be used for queries such as
>
> SELECT user_fname from user_list where owner_id=34

All other things being equal, the planner tends to estimate higher
costs for the multi column index. This has to do with its attempt to
adjust correlation for the additional index columns. So unless the
physical order of tuples is totally unrelated to owner_id, I'd expect
it to choose the single column index.

>If so, I can drop the owner_id index.

If the planner estimates the cost for an user_list_id_email or
user_list_oid_created index scan lower than for a seq scan, you will
notice no difference.

But under unfortunate circumstances it might choose a seq scan ...

Servus
Manfred