Re: Difference between array column type and separate table

Lists: pgsql-general
From: Mike Christensen <mike(at)kitchenpc(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Difference between array column type and separate table
Date: 2009-05-02 06:40:53
Message-ID: 7aa638e00905012340u62c9fd17t5b97e5ba808f3ad4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Let's say you have a table called Threads, and each thread can have zero or
more "tags" associated with it. A tag is just a byte which maps to some
enum somewhere.
There's two ways I can think of to do this. The first would be to have:

create table Threads (
Id uuid not null,
Posted timestamp not null,
Subject varchar(255) not null,
Replies int4 not null,
PosterId uuid not null,
primary key (Id)
);

and a separate table for each tag on a thread:

create table ThreadTags (
Id uuid not null,
Tag int2 not null,
ThreadId uuid not null,
primary key (Id)
)

I can query for all threads with a certain tag using a join:

select * from Threads T
inner join ThreadTags tag ON tag.ThreadId = T.Id AND tag.Tag = 5;

This should work fine. However, recently I was digging through Postgres
manuals and found that you can store arrays of stuff in a column. Using
this, I could completely get rid of ThreadTags and have a table like this:

create table Threads (
Id uuid not null,
Posted timestamp not null,
Subject varchar(255) not null,
Replies int4 not null,
PosterId uuid not null,
Tags int2[],
primary key (Id)
);

and then find threads using the ANY function:

select * from Threads where 5 = ANY (Tags);

To me this seems cleaner, but I'm wondering about performance. If I had
millions of threads, is a JOIN going to be faster? I guess what I'm asking
about is the underlying implementation of ANY. Is it doing
a sequential search? Can I index Tags and will ANY() then use that index?
Any other opinions on what option is better?

One thing about the array approach is I'm using NHibernate which doesn't
really seem to want to support Postgres arrays without a whole bunch of
custom driver code and IUserTypes and junk, so I'd like to make sure this
architecture is best before I commit to it. Thanks!!

Mike


From: Mike Christensen <mike(at)kitchenpc(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Difference between array column type and separate table
Date: 2009-05-02 07:33:59
Message-ID: 7aa638e00905020033y7847632fw8e439f417c482f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Let's say you have a table called Threads, and each thread can have zero or
more "tags" associated with it. A tag is just a byte which maps to some
enum somewhere.
There's two ways I can think of to do this. The first would be to have:

create table Threads (
Id uuid not null,
Posted timestamp not null,
Subject varchar(255) not null,
Replies int4 not null,
PosterId uuid not null,
primary key (Id)
);

and a separate table for each tag on a thread:

create table ThreadTags (
Id uuid not null,
Tag int2 not null,
ThreadId uuid not null,
primary key (Id)
)

I can query for all threads with a certain tag using a join:

select * from Threads T
inner join ThreadTags tag ON tag.ThreadId = T.Id AND tag.Tag = 5;

This should work fine. However, recently I was digging through Postgres
manuals and found that you can store arrays of stuff in a column. Using
this, I could completely get rid of ThreadTags and have a table like this:

create table Threads (
Id uuid not null,
Posted timestamp not null,
Subject varchar(255) not null,
Replies int4 not null,
PosterId uuid not null,
Tags int2[],
primary key (Id)
);

and then find threads using the ANY function:

select * from Threads where 5 = ANY (Tags);

To me this seems cleaner, but I'm wondering about performance. If I had
millions of threads, is a JOIN going to be faster? I guess what I'm asking
about is the underlying implementation of ANY. Is it doing
a sequential search? Can I index Tags and will ANY() then use that index?
Any other opinions on what option is better?

One thing about the array approach is I'm using NHibernate which doesn't
really seem to want to support Postgres arrays without a whole bunch of
custom driver code and IUserTypes and junk, so I'd like to make sure this
architecture is best before I commit to it. Thanks!!

Mike


From: Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
To: Mike Christensen <mike(at)kitchenpc(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Difference between array column type and separate table
Date: 2009-05-02 10:14:53
Message-ID: BBE2D362-CAAF-420B-9F8B-37C24A395303@solfertje.student.utwente.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On May 2, 2009, at 9:33 AM, Mike Christensen wrote:

> Using this, I could completely get rid of ThreadTags and have a
> table like this:
>
> create table Threads (
> Id uuid not null,
> Posted timestamp not null,
> Subject varchar(255) not null,
> Replies int4 not null,
> PosterId uuid not null,
> Tags int2[],
> primary key (Id)
> );
>
> and then find threads using the ANY function:
>
> select * from Threads where 5 = ANY (Tags);
>
> To me this seems cleaner, but I'm wondering about performance. If I
> had millions of threads, is a JOIN going to be faster? I guess what
> I'm asking about is the underlying implementation of ANY. Is it
> doing a sequential search? Can I index Tags and will ANY() then use
> that index? Any other opinions on what option is better?

If you modify the array the entire array needs to be rewritten. I
don't think you'd want that with millions of threads in it. I don't
think array values are indexable either. So while they're probably
faster to query for small amounts of threads, the join is likely
faster to query for large amounts (provided they're indexed properly,
of course).

If you want to be sure, play around with explain analyse with both
implementations.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,49fc1d20129743379199738!


From: Mike Christensen <mike(at)kitchenpc(dot)com>
To: Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Difference between array column type and separate table
Date: 2009-05-02 18:03:14
Message-ID: 7aa638e00905021103g5fae8f47v16bb25d4cdb3cda1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Yeah I talked with some other SQL guru friends of mine and they all agree
the separate table is the way to go for a number of reasons, so that's what
I'll stick with. It was just one of those things where you see a new
feature and try to find an excuse to try it out <g>
Thanks!
Mike

On Sat, May 2, 2009 at 3:14 AM, Alban Hertroys <
dalroi(at)solfertje(dot)student(dot)utwente(dot)nl> wrote:

> On May 2, 2009, at 9:33 AM, Mike Christensen wrote:
>
> Using this, I could completely get rid of ThreadTags and have a table like
>> this:
>>
>> create table Threads (
>> Id uuid not null,
>> Posted timestamp not null,
>> Subject varchar(255) not null,
>> Replies int4 not null,
>> PosterId uuid not null,
>> Tags int2[],
>> primary key (Id)
>> );
>>
>> and then find threads using the ANY function:
>>
>> select * from Threads where 5 = ANY (Tags);
>>
>> To me this seems cleaner, but I'm wondering about performance. If I had
>> millions of threads, is a JOIN going to be faster? I guess what I'm asking
>> about is the underlying implementation of ANY. Is it doing a sequential
>> search? Can I index Tags and will ANY() then use that index? Any other
>> opinions on what option is better?
>>
>
> If you modify the array the entire array needs to be rewritten. I don't
> think you'd want that with millions of threads in it. I don't think array
> values are indexable either. So while they're probably faster to query for
> small amounts of threads, the join is likely faster to query for large
> amounts (provided they're indexed properly, of course).
>
> If you want to be sure, play around with explain analyse with both
> implementations.
>
> Alban Hertroys
>
> --
> If you can't see the forest for the trees,
> cut the trees and you'll see there is no forest.
>
>
> !DSPAM:880,49fc1d1e129741592332518!
>
>
>


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
Cc: Mike Christensen <mike(at)kitchenpc(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Difference between array column type and separate table
Date: 2009-05-03 16:19:43
Message-ID: 49FDC41F.6030202@cheapcomplexdevices.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Alban Hertroys wrote:
> On May 2, 2009, at 9:33 AM, Mike Christensen wrote:
>
>> ...
>> create table Threads ( ... Tags int2[], ...);
>>
>> To me this seems cleaner, but I'm wondering about performance. If I
>> had millions of threads, is a JOIN going to be faster? ...
>
> ...I don't think array values are indexable either. ...

Of course they are indexable if you construct the right index
http://www.postgresql.org/docs/current/static/intarray.html

For certain queries, the array will probably be *much* faster
than the extra tables. For example a query like this:

SELECT * FROM Threads WHERE Threads.tags @@ '1&(2|3)'::query_int;

that does a single indexscan will be far far faster using an
extra table, where the query would look something like:

select * from Threads T where
id in (select threadid from threadtags where id = 1)
and
id in (select threadid from threadtags where id = 2 or id = 3);

requiring 3 indexscans on a table with many more rows, and
some potentially rather expensive joins.