Re: Help with sql

Lists: pgsql-general
From: Perry Smith <pedzsan(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Help with sql
Date: 2012-07-06 21:34:57
Message-ID: 06772287-B2D0-4C9C-826C-7D49E1015978@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Hi Guys,

This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list, please let me know.

It is also, perhaps, a really silly question.

This query (without the 'explain' keyword) , when executed takes forever and a day:

> condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps );
> QUERY PLAN
> ----------------------------------------------------------------------------------------------
> Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4)
> Filter: (NOT (SubPlan 1))
> SubPlan 1
> -> Materialize (cost=0.00..2517.78 rows=95852 width=4)
> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
> (5 rows)

This query returns within a second:

> condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps );
> QUERY PLAN
> ----------------------------------------------------------------------------------------------
> Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4)
> Filter: (NOT (hashed SubPlan 1))
> SubPlan 1
> -> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4)
> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
> (5 rows)

The difference is the "distinct" keyword in the inner select.

What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using "sets" and an item (I thought) can not be in a set more than once.

Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operator is "in" or "not in" ?

And if I can throw in another question on top: is there a different method other than "not in" that would work better?

Thank you guys for the help and a really awesome database.
pedz
(this is *suppose* to be sent as plain text... I hope my mailer does what it is told)


From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Help with sql
Date: 2012-07-07 00:43:32
Message-ID: 4FF78634.9080800@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 07/06/2012 03:34 PM, Perry Smith wrote:
> Hi Guys,
>
> This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list, please let me know.
>
> It is also, perhaps, a really silly question.
>
> This query (without the 'explain' keyword) , when executed takes forever and a day:
>
>> condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps );
>> QUERY PLAN
>> ----------------------------------------------------------------------------------------------
>> Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4)
>> Filter: (NOT (SubPlan 1))
>> SubPlan 1
>> -> Materialize (cost=0.00..2517.78 rows=95852 width=4)
>> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
>> (5 rows)
>
> This query returns within a second:
>
>> condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps );
>> QUERY PLAN
>> ----------------------------------------------------------------------------------------------
>> Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4)
>> Filter: (NOT (hashed SubPlan 1))
>> SubPlan 1
>> -> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4)
>> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
>> (5 rows)
>
> The difference is the "distinct" keyword in the inner select.
>
> What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using "sets" and an item (I thought) can not be in a set more than once.
>
> Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operator is "in" or "not in" ?
>
> And if I can throw in another question on top: is there a different method other than "not in" that would work better?
>
> Thank you guys for the help and a really awesome database.
> pedz
> (this is *suppose* to be sent as plain text... I hope my mailer does what it is told)
>
>

Well they are distinct records, they just may have the same values. And
I'm not trying to be flippant. We don't see the structure of those
table: are all the id fields involved primary keys or with unique index
coverage? Does "not exists ( select fileset.id = fileset_id from
service_pack_fileset_map) change the behaviour?


From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: Perry Smith <pedzsan(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Help with sql
Date: 2012-07-07 00:56:39
Message-ID: 4FF78947.5080902@pinpointresearch.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 07/06/2012 02:34 PM, Perry Smith wrote:
> Hi Guys,
>
> This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list, please let me know.
>
> It is also, perhaps, a really silly question.
>
> This query (without the 'explain' keyword) , when executed takes forever and a day:
>
>> condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps );
>> QUERY PLAN
>> ----------------------------------------------------------------------------------------------
>> Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4)
>> Filter: (NOT (SubPlan 1))
>> SubPlan 1
>> -> Materialize (cost=0.00..2517.78 rows=95852 width=4)
>> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
>> (5 rows)
> This query returns within a second:
>
>> condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps );
>> QUERY PLAN
>> ----------------------------------------------------------------------------------------------
>> Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4)
>> Filter: (NOT (hashed SubPlan 1))
>> SubPlan 1
>> -> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4)
>> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
>> (5 rows)
> The difference is the "distinct" keyword in the inner select.
>
> What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using "sets" and an item (I thought) can not be in a set more than once.
>
> Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operator is "in" or "not in" ?
>
> And if I can throw in another question on top: is there a different method other than "not in" that would work better?
Actually it is *very* PostgreSQL specific. In fact, it may even be
PostgreSQL *version* specific as you are delving into how the planner
decides how to handle a query.

It appears that the planner is assuming, based on collected stats and
available indexes, that there will be roughly 1/6 the records returned
by the "distinct" query and thus chose a different method to join the
records. One useful piece of information would be the indexes on the two
tables.

As to other methods, you can use:
... where not exists (select 1 from service_pack_fileset_maps where
fileset_id = filesets.id)...
(Note: as alluded to above, ...not in... works better in some releases
and ...not exists... better in others due to improvements over time.)

Still another method:
select id from filesets except select fileset_id from
service_pack_fileset_maps;

Cheers,
Steve


From: Perry Smith <pedzsan(at)gmail(dot)com>
To: "<pgsql-general(at)postgresql(dot)org>" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Help with sql
Date: 2012-07-07 13:48:06
Message-ID: 1E64A6FD-FF31-48DE-B1BA-0C50F7AB4521@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


On Jul 6, 2012, at 7:56 PM, Steve Crawford wrote:

> On 07/06/2012 02:34 PM, Perry Smith wrote:
>> Hi Guys,
>>
>> This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list, please let me know.
>>
>> It is also, perhaps, a really silly question.
>>
>> This query (without the 'explain' keyword) , when executed takes forever and a day:
>>
>>> condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps );
>>> QUERY PLAN
>>> ----------------------------------------------------------------------------------------------
>>> Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4)
>>> Filter: (NOT (SubPlan 1))
>>> SubPlan 1
>>> -> Materialize (cost=0.00..2517.78 rows=95852 width=4)
>>> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
>>> (5 rows)
>> This query returns within a second:
>>
>>> condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps );
>>> QUERY PLAN
>>> ----------------------------------------------------------------------------------------------
>>> Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4)
>>> Filter: (NOT (hashed SubPlan 1))
>>> SubPlan 1
>>> -> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4)
>>> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4)
>>> (5 rows)
>> The difference is the "distinct" keyword in the inner select.
>>
>> What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using "sets" and an item (I thought) can not be in a set more than once.
>>
>> Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operator is "in" or "not in" ?
>>
>> And if I can throw in another question on top: is there a different method other than "not in" that would work better?
> Actually it is *very* PostgreSQL specific. In fact, it may even be PostgreSQL *version* specific as you are delving into how the planner decides how to handle a query.
>
> It appears that the planner is assuming, based on collected stats and available indexes, that there will be roughly 1/6 the records returned by the "distinct" query and thus chose a different method to join the records. One useful piece of information would be the indexes on the two tables.
>
> As to other methods, you can use:
> ... where not exists (select 1 from service_pack_fileset_maps where fileset_id = filesets.id)...
> (Note: as alluded to above, ...not in... works better in some releases and ...not exists... better in others due to improvements over time.)
>
> Still another method:
> select id from filesets except select fileset_id from service_pack_fileset_maps;

Thanks guys.

Small side note: I thought I saw "set difference" in the documentation but I couldn't find it. It appears "EXCEPT" is set difference. Thank you for that tidbit.

The database is mostly static. I run through a very lengthy process to populate the database maybe once a month and then it is 99% read-only. By far, most of the accesses are via a view that I have that is rather long and ugly so I won't paste it in. I've tried to make this particular view as fast as possible so the indexes, etc are what I think will help that out.

The version is psql (PostgreSQL) 9.0.4

Each table has a key of "id" -- this database was created by / used by Ruby on Rails and that is how it likes to do things.

> condor_development=> \d service_pack_fileset_maps
> Table "public.service_pack_fileset_maps"
> Column | Type | Modifiers
> -----------------+-----------------------------+------------------------------------------------------------------------
> id | integer | not null default nextval('service_pack_fileset_maps_id_seq'::regclass)
> service_pack_id | integer | not null
> fileset_id | integer | not null
> created_at | timestamp without time zone |
> updated_at | timestamp without time zone |
> Indexes:
> "service_pack_fileset_maps_pkey" PRIMARY KEY, btree (id)
> "service_pack_fileset_maps_service_pack_id_key" UNIQUE, btree (service_pack_id, fileset_id)
> "index_service_pack_fileset_maps_on_fileset_id" btree (fileset_id)
> Foreign-key constraints:
> "service_pack_fileset_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE
> "service_pack_fileset_maps_service_pack_id_fkey" FOREIGN KEY (service_pack_id) REFERENCES service_packs(id) ON DELETE CASCADE DEFERRABLE

> condor_development=> \d filesets
> Table "public.filesets"
> Column | Type | Modifiers
> ------------+-----------------------------+-------------------------------------------------------
> id | integer | not null default nextval('filesets_id_seq'::regclass)
> lpp_id | integer | not null
> vrmf | character varying(255) | not null
> created_at | timestamp without time zone |
> updated_at | timestamp without time zone |
> Indexes:
> "filesets_pkey" PRIMARY KEY, btree (id)
> "filesets_lpp_id_key" UNIQUE, btree (lpp_id, vrmf)
> Foreign-key constraints:
> "filesets_lpp_id_fkey" FOREIGN KEY (lpp_id) REFERENCES lpps(id) ON DELETE CASCADE DEFERRABLE
> Referenced by:
> TABLE "fileset_aix_file_maps" CONSTRAINT "fileset_aix_file_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE
> TABLE "fileset_ptf_maps" CONSTRAINT "fileset_ptf_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE
> TABLE "package_fileset_maps" CONSTRAINT "package_fileset_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE
> TABLE "service_pack_fileset_maps" CONSTRAINT "service_pack_fileset_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE
> TABLE "upd_pc_views" CONSTRAINT "upd_pc_views_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE

Thank you again for your help,
pedz


From: Chris Angelico <rosuav(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Help with sql
Date: 2012-07-07 14:39:12
Message-ID: CAPTjJmr_iBNAXJeSw0KLJTJMye6uGPtBfjYt=e2Shqc43hFKug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Sat, Jul 7, 2012 at 11:48 PM, Perry Smith <pedzsan(at)gmail(dot)com> wrote:
> The database is mostly static. I run through a very lengthy process to populate the database maybe once a month and then it is 99% read-only.

Do you run an ANALYZE on the table after populating it? Postgres needs
up-to-date statistics for best results. I'd recommend doing an
explicit 'VACUUM ANALYZE' once your data's loaded, and then try your
queries after that - it might not do much, but it also might give a
massive improvement.

ChrisA