Re: Putting an INDEX on a boolean field?

Lists: pgsql-sql
From: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
To: pgsql-sql(at)postgresql(dot)org
Subject: Putting an INDEX on a boolean field?
Date: 2005-06-17 04:14:45
Message-ID: 20050617011336.Q90456@ganymede.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql


Does that make sense? Would it ever get used? I can't see it, but
figured I'd ask ...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664


From: David Dick <david_dick(at)iprimus(dot)com(dot)au>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Putting an INDEX on a boolean field?
Date: 2005-06-17 04:35:45
Message-ID: 42B25321.30507@iprimus.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

how about an very large table with a "processed" type flag?
uru
-Dave

Marc G. Fournier wrote:
>
> Does that make sense? Would it ever get used? I can't see it, but
> figured I'd ask ...
>
> ----
> Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Putting an INDEX on a boolean field?
Date: 2005-06-17 04:41:49
Message-ID: 15887.1118983309@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

"Marc G. Fournier" <scrappy(at)postgresql(dot)org> writes:
> Does that make sense? Would it ever get used?

It could get used if one of the two values is far less frequent than the
other. Personally I'd think about a partial index instead ...

regards, tom lane


From: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Putting an INDEX on a boolean field?
Date: 2005-06-17 05:04:12
Message-ID: 20050617020039.P90456@ganymede.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Fri, 17 Jun 2005, Tom Lane wrote:

> "Marc G. Fournier" <scrappy(at)postgresql(dot)org> writes:
>> Does that make sense? Would it ever get used?
>
> It could get used if one of the two values is far less frequent than the
> other. Personally I'd think about a partial index instead ...

Hrmmmm, hadn't thought of that ... wouldn't you have to build two indexes
(one for true, one for false) for this to be completely effective? unless
you know all your queries are going to search for one, but not the other?

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664


From: Jaime Casanova <systemguards(at)gmail(dot)com>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-sql(at)postgresql(dot)org
Subject: Re: Putting an INDEX on a boolean field?
Date: 2005-06-17 05:42:22
Message-ID: c2d9e70e0506162242557a7e65@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On 6/17/05, Marc G. Fournier <scrappy(at)postgresql(dot)org> wrote:
> On Fri, 17 Jun 2005, Tom Lane wrote:
>
> > "Marc G. Fournier" <scrappy(at)postgresql(dot)org> writes:
> >> Does that make sense? Would it ever get used?
> >
> > It could get used if one of the two values is far less frequent than the
> > other. Personally I'd think about a partial index instead ...
>
> Hrmmmm, hadn't thought of that ... wouldn't you have to build two indexes
> (one for true, one for false) for this to be completely effective? unless
> you know all your queries are going to search for one, but not the other?
>

I guess it will be effective only if you know wich value will be less
frequent... on the other value a sequential scan will be a win, isn't
it?

--
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)


From: "Erik Aronesty" <erik(at)memebot(dot)com>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Putting an INDEX on a boolean field?
Date: 2005-06-17 10:40:34
Message-ID: 000a01c57328$fe00f990$1400a8c0@yellow
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

In my database, the "sites" table is large, and the "usersites" table
has only a few sites per userid - so it should be looked in first. I'm
surprised that I had to juggle my query around (below), rather than
trusting the optimizer to figure this out for me.

Should I start looking to figure out why the optimizer didn't figure out
that it should be doing this sort of thing? Or should I just keep
juggling with subselects, since it's not that hard to do.

This query results in a sequential scan:

select sites.*, blocks from sites left join quota on
sites.host_u = quota.host and quota.date = ('2005-06-16 23:55:05-04')
where sites.id in (select siteid from usersites where userid = 1);

versus this one which does not:

explain analyze select sites.*, blocks from (select * from sites
where id in (select siteid from usersites where userid = 1)) as sites
left join quota on sites.host_u = quota.host and quota.date =
('2005-06-16 23:55:05-04');

The tables have been vacuumed/analyzed, etc.

---slow---

explain analyze select sites.*, blocks from sites left join quota on
sites.host_u = quota.host and quota.date = ('2005-06-16 23:55:05-04')
where sites.id in (select siteid from usersites where userid = 1);

---

Hash IN Join (cost=3183.30..3295.50 rows=13 width=158) (actual
time=4865.895..4942.097 rows=10 loops=1)
Hash Cond: ("outer".id = "inner".siteid)
-> Merge Left Join (cost=3173.52..3263.12 rows=4493 width=158)
(actual time=4856.212..4939.329 rows=4443 loops=1)
Merge Cond: ("outer"."?column24?" = "inner".host)
-> Sort (cost=2786.62..2797.72 rows=4443 width=154) (actual
time=4811.499..4816.164 rows=4443 loops=1)
Sort Key: (sites.host_u)::text
-> Seq Scan on sites (cost=0.00..2517.43 rows=4443
width=154) (actual time=11.611..4598.849 rows=4443 loops=1)
-> Sort (cost=386.91..398.13 rows=4489 width=26) (actual
time=44.638..46.307 rows=4454 loops=1)
Sort Key: quota.host
-> Index Scan using quota_date_idx on quota
(cost=0.00..114.60 rows=4489 width=26) (actual time=0.069..10.780
rows=4453 loops=1)
Index Cond: (date = '2005-06-16
23:55:05-04'::timestamp with time zone)
-> Hash (cost=9.75..9.75 rows=12 width=4) (actual time=0.086..0.086
rows=0 loops=1)
-> Index Scan using usersites_userid_idx on usersites
(cost=0.00..9.75 rows=12 width=4) (actual time=0.047..0.076 rows=10
loops=1)
Index Cond: (userid = 1)
Total runtime: 4944.575 ms
(15 rows)

---fast---

explain analyze select sites.*, blocks from (select * from sites where
id in (select siteid from usersites where userid = 1)) as sites left
join quota on sites.host_u = quota.host and quota.date = ('2005-06-16
23:55:05-04');

---

Merge Left Join (cost=468.77..491.41 rows=13 width=158) (actual
time=46.547..53.669 rows=10 loops=1)
Merge Cond: ("outer"."?column24?" = "inner".host)
-> Sort (cost=81.86..81.89 rows=12 width=154) (actual
time=0.450..0.454 rows=10 loops=1)
Sort Key: (public.sites.host_u)::text
-> Nested Loop (cost=9.78..81.65 rows=12 width=154) (actual
time=0.129..0.392 rows=10 loops=1)
-> HashAggregate (cost=9.78..9.78 rows=12 width=4)
(actual time=0.084..0.095 rows=10 loops=1)
-> Index Scan using usersites_userid_idx on
usersites (cost=0.00..9.75 rows=12 width=4) (actual time=0.040..0.067
rows=10 loops=1)
Index Cond: (userid = 1)
-> Index Scan using sites_pkey on sites
(cost=0.00..5.98 rows=1 width=154) (actual time=0.017..0.018 rows=1
loops=10)
Index Cond: (sites.id = "outer".siteid)
-> Sort (cost=386.91..398.13 rows=4489 width=26) (actual
time=44.971..46.042 rows=3741 loops=1)
Sort Key: quota.host
-> Index Scan using quota_date_idx on quota
(cost=0.00..114.60 rows=4489 width=26) (actual time=0.025..10.643
rows=4453 loops=1)
Index Cond: (date = '2005-06-16 23:55:05-04'::timestamp
with time zone)
Total runtime: 54.988 ms
(15 rows)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Erik Aronesty" <erik(at)memebot(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Putting an INDEX on a boolean field?
Date: 2005-06-19 18:46:24
Message-ID: 18815.1119206784@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

"Erik Aronesty" <erik(at)memebot(dot)com> writes:
> Should I start looking to figure out why the optimizer didn't figure out
> that it should be doing this sort of thing?

It looks to me that the problem is that convert_IN_to_join() is not
being smart about where to attach the IN's subselect to the join tree.
It's just adding it to the top FROM-expression, so that the join tree
is effectively
((sites left join quota) IN-join usersites)
and since we don't currently allow any rearrangement of outer joins,
this cannot be rearranged into
((sites IN-join usersites) left join quota)
as you'd like.

The really clean solution to this would be to implement logic about when
it is safe to rearrange the join order of outer joins. But I think
that's a fairly hard problem in general. A shorter-term solution might
be possible by teaching convert_IN_to_join() to attach the IN subselect
further down in the join tree, using logic similar to what we use to
decide where ordinary WHERE quals can bubble down to.

regards, tom lane