Optimization rules for semi and anti joins

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Optimization rules for semi and anti joins
Date: 2009-02-10 20:10:50
Message-ID: 23610.1234296650@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote (in response to Kevin Grittner's recent issues):
> Reflecting on this further, I suspect there are also some bugs in the
> planner's rules about when semi/antijoins can commute with other joins;

After doing some math I've concluded this is in fact the case. Anyone
want to check my work?

regards, tom lane

-----------------------------------

The basic outer-join identities used up through 8.3 are (as quoted from
optimizer/README):

1. (A leftjoin B on (Pab)) innerjoin C on (Pac)
= (A innerjoin C on (Pac)) leftjoin B on (Pab)

where Pac is a predicate referencing A and C, etc (in this case, clearly
Pac cannot reference B, or the transformation is nonsensical).

2. (A leftjoin B on (Pab)) leftjoin C on (Pac)
= (A leftjoin C on (Pac)) leftjoin B on (Pab)

3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)
= A leftjoin (B leftjoin C on (Pbc)) on (Pab)

Identity 3 only holds if predicate Pbc must fail for all-null B rows
(that is, Pbc is strict for at least one column of B).

How do these relate to semi/antijoins?

Semijoin can be rearranged as freely as inner join
--------------------------------------------------

We have these properties:

S1. (A semijoin B on (Pab)) semijoin C on (Pac)
= (A semijoin C on (Pac)) semijoin B on (Pab)

The two semijoins amount to independent filter conditions for each A row,
so we can test them in either order.

S2. (A semijoin B on (Pab)) innerjoin C on (Pac)
= (A innerjoin C on (Pac)) semijoin B on (Pab)

This is only really safe if Pab and Pac are nonvolatile, since we might
evaluate them different numbers of times in the second form, but that is
true for innerjoin associativity as well. Currently, since we only create
semijoins from EXISTS subqueries with nonvolatile WHERE clauses, we know
a priori that Pab is nonvolatile; and we have never bothered to worry
about the volatility of innerjoin clauses, so I don't see that there's
a reason to be extra careful with Pac.

We also have to consider whether semijoin re-associates with outer joins
in the same ways as an inner join does.

S3. (A leftjoin B on (Pab)) semijoin C on (Pac)
= (A semijoin C on (Pac)) leftjoin B on (Pab)

This also works as long as the quals are nonvolatile.

S4. (A antijoin B on (Pab)) semijoin C on (Pac)
= (A semijoin C on (Pac)) antijoin B on (Pab)

Again, these are independent conditions on each A row.

Hence semijoins can be rearranged just as freely as inner joins.

Antijoin is a tad more strict than left join
--------------------------------------------

We have these properties:

A1. (A antijoin B on (Pab)) innerjoin C on (Pac)
= (A innerjoin C on (Pac)) antijoin B on (Pab)

True given nonvolatile quals.

A2. (A antijoin B on (Pab)) antijoin C on (Pac)
= (A antijoin C on (Pac)) antijoin B on (Pab)

Again, these are independent conditions on each A row.

A3? (A antijoin B on (Pab)) antijoin C on (Pbc)
= A antijoin (B antijoin C on (Pbc)) on (Pab)

This one unfortunately fails for antijoins. For example assume that
all three are one-column relations with equality join conditions, and
A = (1), (2)
B = (1)
C = (1)
The antijoin of A and B is (2,NULL), and antijoining that to C
gives (2,NULL,NULL). But the antijoin of B and C is empty, so
the second form produces output (1,NULL,NULL), (2,NULL,NULL).

We also have to consider mixed antijoin/leftjoin cases in identities
2 and 3.

A4. (A antijoin B on (Pab)) leftjoin C on (Pac)
= (A leftjoin C on (Pac)) antijoin B on (Pab)

True given nonvolatile quals.

A5. (A leftjoin B on (Pab)) antijoin C on (Pac)
= (A antijoin C on (Pac)) leftjoin B on (Pab)

This is just a restatement of A4.

A6. (A antijoin B on (Pab)) leftjoin C on (Pbc)
= A antijoin (B leftjoin C on (Pbc)) on (Pab)

The second form is in fact equivalent to null-extending the A/B antijoin
--- the actual contents of C cannot affect the result. So we could just
drop C altogether. (I'm not going to do anything about that now, but
it's something to consider for the planned join-elimination optimization.)
In the first form, if Pbc is strict on B then it must fail for all rows of
the antijoin result so we get the null-extended A/B result. If Pbc is not
strict then the first form might duplicate some rows in the antijoin
result, or produce non-null-extended rows. So in this case the identity
holds only if Pbc is strict, which is the same as for left joins.

A7? (A leftjoin B on (Pab)) antijoin C on (Pbc)
= A leftjoin (B antijoin C on (Pbc)) on (Pab)

This identity fails even if both qual clauses are strict (try it in
the same example as above).

So: identity 3 fails if C is an antijoin, but otherwise antijoins
associate like leftjoins.

The bottom line is that the current code isn't strict enough for antijoins
but is too strict for semijoins. Also, if we fix the second part of that,
we can still allow an EXISTS in the ON condition of an outer join to be
flattened --- it's just the NOT EXISTS case that poses a risk of
getting a worse plan from flattening.


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-10 20:41:46
Message-ID: FAEBF2D0-40DA-434D-92D2-3DDB33F8B357@hi-media.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Le 10 févr. 09 à 21:10, Tom Lane a écrit :

> I wrote (in response to Kevin Grittner's recent issues):
>> Reflecting on this further, I suspect there are also some bugs in the
>> planner's rules about when semi/antijoins can commute with other
>> joins;
>
> After doing some math I've concluded this is in fact the case. Anyone
> want to check my work?

I don't know how easy it would be to do, but maybe the Coq formal
proof management system could help us here:
http://coq.inria.fr/

The harder part in using coq might well be to specify the problem the
way you just did, so...

HTH,
--
dim


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-10 21:51:01
Message-ID: 603c8f070902101351i41077552j780054bf36c50654@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> A6. (A antijoin B on (Pab)) leftjoin C on (Pbc)
> = A antijoin (B leftjoin C on (Pbc)) on (Pab)
>
> The second form is in fact equivalent to null-extending the A/B antijoin
> --- the actual contents of C cannot affect the result. So we could just

I don't understand why antijoins need to null-extend the tuple at all.
It seems that it would be cheaper and all-around simpler to just pass
through the left-hand tuple unchanged.

In the case of a semijoin, it's theoretically possible that there
could be syntax which allows access to the attributes of the outer
side of the relation, though IN and EXISTS do not. But with an
antijoin that's just nonsense, so I don't quite understand why we're
handling it as we are.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-10 22:03:32
Message-ID: 25300.1234303412@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I don't understand why antijoins need to null-extend the tuple at all.

Well, we are talking theoretical definition here, not implementation.
But if you need an example where the column values can be referenced:

select * from a left join b on (a.id = b.id)
where b.id is null

8.4 does recognize this as an antijoin, if the join operator is strict.

> In the case of a semijoin, it's theoretically possible that there
> could be syntax which allows access to the attributes of the outer
> side of the relation, though IN and EXISTS do not.

Actually, that makes less sense than the antijoin case. For antijoin
there is a well-defined value for the extended columns, ie null. For
a semijoin the RHS values might come from any of the rows that happen
to join to the current LHS row, so I'm just as happy that it's
syntactically impossible to reference them.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-10 22:23:04
Message-ID: 603c8f070902101423r38053e15i28690c5fa5eaefe4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 10, 2009 at 5:03 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I don't understand why antijoins need to null-extend the tuple at all.
>
> Well, we are talking theoretical definition here, not implementation.
> But if you need an example where the column values can be referenced:
>
> select * from a left join b on (a.id = b.id)
> where b.id is null
>
> 8.4 does recognize this as an antijoin, if the join operator is strict.

Oh, I see. Hmm.

>> In the case of a semijoin, it's theoretically possible that there
>> could be syntax which allows access to the attributes of the outer
>> side of the relation, though IN and EXISTS do not.
>
> Actually, that makes less sense than the antijoin case. For antijoin
> there is a well-defined value for the extended columns, ie null. For
> a semijoin the RHS values might come from any of the rows that happen
> to join to the current LHS row, so I'm just as happy that it's
> syntactically impossible to reference them.

You might some day want to optimize this case as a semijoin, or
something similar to a semijoin:

SELECT foo.a, (SELECT bar.b FROM bar WHERE bar.a = foo.a) FROM foo;

...Robert


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 01:09:03
Message-ID: 36e682920902101709n392e081fw57662a084ad87b0f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 10, 2009 at 3:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I wrote (in response to Kevin Grittner's recent issues):
> > Reflecting on this further, I suspect there are also some bugs in the
> > planner's rules about when semi/antijoins can commute with other joins;
>
> After doing some math I've concluded this is in fact the case. Anyone
> want to check my work?

FWIW, the logic looks correct to me.

--
Jonah H. Harris, Senior DBA
myYearbook.com


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 01:12:56
Message-ID: 36e682920902101712w3bc22c18t703f5b2d84e21f94@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 10, 2009 at 8:09 PM, Jonah H. Harris <jonah(dot)harris(at)gmail(dot)com>wrote:

> On Tue, Feb 10, 2009 at 3:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> I wrote (in response to Kevin Grittner's recent issues):
>> > Reflecting on this further, I suspect there are also some bugs in the
>> > planner's rules about when semi/antijoins can commute with other joins;
>>
>> After doing some math I've concluded this is in fact the case. Anyone
>> want to check my work?
>
>
> FWIW, the logic looks correct to me.

Cripes! I just had an idea and it looks like the buggers beat me to it :(

http://www.google.com/patents?id=4bqBAAAAEBAJ&dq=null+aware+anti-join

--
Jonah H. Harris, Senior DBA
myYearbook.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 01:41:45
Message-ID: 28499.1234316505@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:
> Cripes! I just had an idea and it looks like the buggers beat me to it :(
> http://www.google.com/patents?id=4bqBAAAAEBAJ&dq=null+aware+anti-join

I wonder if the USPTO is really clueless enough to accept this?
Claim 1 would give Oracle ownership of the definition of NOT IN,
and few of the other claims seem exactly non-obvious either.

regards, tom lane


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 02:06:28
Message-ID: 36e682920902101806m14eaa27cr9bd721716194b84c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 10, 2009 at 8:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:
> > Cripes! I just had an idea and it looks like the buggers beat me to it
> :(
> > http://www.google.com/patents?id=4bqBAAAAEBAJ&dq=null+aware+anti-join
>
> I wonder if the USPTO is really clueless enough to accept this?
> Claim 1 would give Oracle ownership of the definition of NOT IN,
> and few of the other claims seem exactly non-obvious either.

Yeah, I just looked up semi and anti-join optimization patents and
Oracle/IBM have a ton. What an obvious exploitation of math for business
gain. I doubt they'd be enforceable. I wish they'd just do away with
software patents altogether :(

--
Jonah H. Harris, Senior DBA
myYearbook.com


From: David Fetter <david(at)fetter(dot)org>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 13:05:20
Message-ID: 20090211130520.GF26601@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 10, 2009 at 08:12:56PM -0500, Jonah H. Harris wrote:
> On Tue, Feb 10, 2009 at 8:09 PM, Jonah H. Harris <jonah(dot)harris(at)gmail(dot)com>wrote:
>
> > On Tue, Feb 10, 2009 at 3:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> >> I wrote (in response to Kevin Grittner's recent issues):
> >> > Reflecting on this further, I suspect there are also some bugs
> >> > in the planner's rules about when semi/antijoins can commute
> >> > with other joins;
> >>
> >> After doing some math I've concluded this is in fact the case.
> >> Anyone want to check my work?
> >
> >
> > FWIW, the logic looks correct to me.
>
> Cripes! I just had an idea and it looks like the buggers beat me to
> it :(
>
> http://www.google.com/patents?id=4bqBAAAAEBAJ&dq=null+aware+anti-join

As has been discussed here many, many times, the only kind of person
who should be doing a patent search is a company's IP attorney, which
you are not, and even if you were, under no circumstances would such a
person paste that link in a public forum.

Should we have a kick-off policy for this kind of misbehavior?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Gianni Ciolli <gianni(dot)ciolli(at)2ndquadrant(dot)it>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 13:24:31
Message-ID: 20090211132431.GA3400@eee.gi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

On Tue, Feb 10, 2009 at 09:41:46PM +0100, Dimitri Fontaine wrote:
> Hi,
>
> Le 10 févr. 09 à 21:10, Tom Lane a écrit :
>
>> I wrote (in response to Kevin Grittner's recent issues):
>>> Reflecting on this further, I suspect there are also some bugs in the
>>> planner's rules about when semi/antijoins can commute with other
>>> joins;
>>
>> After doing some math I've concluded this is in fact the case. Anyone
>> want to check my work?
>
> I don't know how easy it would be to do, but maybe the Coq formal proof
> management system could help us here:
> http://coq.inria.fr/
>
> The harder part in using coq might well be to specify the problem the
> way you just did, so...

formal theorem proving and mechanized mathematics happen to be one of
my research topics in the last few years; so I think that my
experience could be helpful with such problems.

(Actually instead of Coq I use HOL Light, whereas other people in my
research group work with Coq; but both of them are for the same
purpose)

Perhaps a way to begin would be to start writing a formalization of
the above rules, in order to assess the problem quickly, and then to
get back to pg-hackers soon.

Any comments/warnings/suggestions ?

Thank you,
Dr. Gianni Ciolli - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
gianni(dot)ciolli(at)2ndquadrant(dot)it | www.2ndquadrant.it


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 14:36:38
Message-ID: 36e682920902110636n5e54d0m7701653f8f33a5ba@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 11, 2009 at 8:05 AM, David Fetter <david(at)fetter(dot)org> wrote:

> As has been discussed here many, many times, the only kind of person
> who should be doing a patent search is a company's IP attorney, which
> you are not, and even if you were, under no circumstances would such a
> person paste that link in a public forum.

First of all, it was not an intentional patent search. Secondly, I don't
believe there's any restriction of explicitly what can and cannot be posted
on a public Postgres mailing list.

> Should we have a kick-off policy for this kind of misbehavior?

Shut up David.

--
Jonah H. Harris, Senior DBA
myYearbook.com


From: David Fetter <david(at)fetter(dot)org>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 16:19:38
Message-ID: 20090211161938.GC8599@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 11, 2009 at 09:36:38AM -0500, Jonah H. Harris wrote:
> On Wed, Feb 11, 2009 at 8:05 AM, David Fetter <david(at)fetter(dot)org> wrote:
>
> > As has been discussed here many, many times, the only kind of
> > person who should be doing a patent search is a company's IP
> > attorney, which you are not, and even if you were, under no
> > circumstances would such a person paste that link in a public
> > forum.
>
> First of all, it was not an intentional patent search.

I don't see anything in patent law or case law that talks about
"intentional."

> Secondly, I don't believe there's any restriction of explicitly what
> can and cannot be posted on a public Postgres mailing list.

We have plenty of such restrictions. Take the Nazi spammer, for
example, and what he's doing is just offensive and silly. What you're
doing exposes people to real, substantive harm.

> > Should we have a kick-off policy for this kind of misbehavior?
>
> Shut up David.

Not a chance.

This is a very big deal, as you are exposing every US PostgreSQL
contributor to triple damages for "knowing infringement." Are you
saying you're going to pay all that out of your own pocket? Are you
making a legal commitment, say, with a few tens of million dollars in
escrow to back it?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 16:44:33
Message-ID: 36e682920902110844t1780d912q9791a529e05f6b4e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 11, 2009 at 11:19 AM, David Fetter <david(at)fetter(dot)org> wrote:

> This is a very big deal, as you are exposing every US PostgreSQL
> contributor to triple damages for "knowing infringement." Are you
> saying you're going to pay all that out of your own pocket? Are you
> making a legal commitment, say, with a few tens of million dollars in
> escrow to back it?

Per IRC, this discussion will (and likely should) be taken elsewhere.

--
Jonah H. Harris, Senior DBA
myYearbook.com


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: David Fetter <david(at)fetter(dot)org>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 16:50:32
Message-ID: 20090211165032.GL8123@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David,

* David Fetter (david(at)fetter(dot)org) wrote:
> On Wed, Feb 11, 2009 at 09:36:38AM -0500, Jonah H. Harris wrote:
> > Secondly, I don't believe there's any restriction of explicitly what
> > can and cannot be posted on a public Postgres mailing list.
>
> We have plenty of such restrictions. Take the Nazi spammer, for
> example, and what he's doing is just offensive and silly. What you're
> doing exposes people to real, substantive harm.

Do you have a court case backing that statement? If not, then I think
you're playing a bit too much of the lawyer for a public mailing list.

> This is a very big deal, as you are exposing every US PostgreSQL
> contributor to triple damages for "knowing infringement."

Again, it's not at all clear that such a claim would stand up in court
and threatening to kick people off of public mailing lists for talking
about patents is patently ridiculous. You could make a similar claim
that we should go through our mail archive and remove any post that ever
talked about a patent in case we're required to provide web access logs
that show someone looked at a page that talked about a patent.

And all of that without even bringing up the fact that core folks have
talked about patents on this list in the past.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Fetter <david(at)fetter(dot)org>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 17:10:42
Message-ID: 11098.1234372242@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Fetter <david(at)fetter(dot)org> writes:
> On Wed, Feb 11, 2009 at 09:36:38AM -0500, Jonah H. Harris wrote:
>> Secondly, I don't believe there's any restriction of explicitly what
>> can and cannot be posted on a public Postgres mailing list.

> We have plenty of such restrictions. Take the Nazi spammer, for
> example, and what he's doing is just offensive and silly.

My take on the Nazi spammer is that we're banning him because he's
off-topic for these lists. We'd ban someone spamming on a less
offensive topic too, as long as it was unrelated to Postgres.

Unfortunately, while software patents are pretty offensive, they can
hardly be said to be off-topic.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gianni Ciolli <gianni(dot)ciolli(at)2ndquadrant(dot)it>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 17:28:32
Message-ID: 11341.1234373312@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gianni Ciolli <gianni(dot)ciolli(at)2ndquadrant(dot)it> writes:
> On Tue, Feb 10, 2009 at 09:41:46PM +0100, Dimitri Fontaine wrote:
>> I don't know how easy it would be to do, but maybe the Coq formal proof
>> management system could help us here:
>> http://coq.inria.fr/
>>
>> The harder part in using coq might well be to specify the problem the
>> way you just did, so...

> formal theorem proving and mechanized mathematics happen to be one of
> my research topics in the last few years; so I think that my
> experience could be helpful with such problems.

Unless you've got a prover that already understands the concepts of
outer joins etc, I'd think that teaching it about that would require
enough work and introduce enough possibilities for human error so as
to make the exercise pretty much moot. The identities I put up don't
look that complicated to me...

regards, tom lane


From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 17:43:18
Message-ID: 831948C5-D718-429D-9A64-204EEA6EE575@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11 Feb 2009, at 00:03, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Actually, that makes less sense than the antijoin case. For antijoin
> there is a well-defined value for the extended columns, ie null. For
> a semijoin the RHS values might come from any of the rows that happen
> to join to the current LHS row, so I'm just as happy that it's
> syntactically impossible to reference them.

Actually I think the way mysql users used to spell EXISTS/IN before
mysql supported them would qualify as a semijoin where you can access
the columns:

SELECT distinct a.* from a,b WHERE a.id = b.id

To access columns from b in postgres you would have to use DISTINCT ON.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgreSQL(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-11 23:57:48
Message-ID: 4993119C.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote (in response to Kevin Grittner's recent issues):
>> Reflecting on this further, I suspect there are also some bugs in
>> the planner's rules about when semi/antijoins can commute with
>> other joins;
>
> After doing some math I've concluded this is in fact the case.
> Anyone want to check my work?

> Hence semijoins can be rearranged just as freely as inner joins.

Made sense. Agreed.

> A6. (A antijoin B on (Pab)) leftjoin C on (Pbc)
> = A antijoin (B leftjoin C on (Pbc)) on (Pab)
>
> The second form is in fact equivalent to null-extending the A/B
> antijoin --- the actual contents of C cannot affect the result. So
> we could just drop C altogether. (I'm not going to do anything
> about that now, but it's something to consider for the planned
> join-elimination optimization.) In the first form, if Pbc is strict
> on B then it must fail for all rows of the antijoin result so we get
> the null-extended A/B result. If Pbc is not strict then the first
> form might duplicate some rows in the antijoin result, or produce
> non-null-extended rows. So in this case the identity holds only if
> Pbc is strict, which is the same as for left joins.

How do you get the first form as a starting point? Aren't we limited
to NOT IN or NOT EXISTS clauses for B? Those can't really contribute
columns to the result can they? (I'm probably missing something
here.)

The rest of it made sense, although two identities jumped to mind
which weren't listed.

(A semijoin B on (Pab)) antijoin C on (Pac)
= (A antijoin C on (Pac)) semijoin B on (Pab)

This one turned out, on closer inspection, to be a restatement of S4.

(A semijoin B on (Pab)) antijoin C on (Pbc)
= A semijoin (B antijoin C on (Pbc)) on (Pab)

I think this one is true, and it doesn't seem to be mentioned, unless
I'm missing something. It seems potentially useful.

Hopefully I didn't miss your point entirely.

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgreSQL(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Kevin Grittner" <Kgrittn(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-12 00:13:53
Message-ID: 49931561.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> I wrote:
>> A6.
> [less coherent version of a question already asked and answered]

Got that part on a reread of the thread. Sorry for asking that after
it had been addressed. No need to answer again.

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-12 00:33:46
Message-ID: 23867.1234398826@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> A6. (A antijoin B on (Pab)) leftjoin C on (Pbc)
>> = A antijoin (B leftjoin C on (Pbc)) on (Pab)

> How do you get the first form as a starting point?

Not sure if you can in SQL, but the point of the identity is you can
apply the transformation in either direction. Consider this version
of the second form:

select ... from A
where not exists(select 1 from B left join C on B.y = C.y where B.x = A.x)

The identity says that if B.y = C.y is strict we can antijoin A to B
first (because, in fact, the join to C is pointless here).

Anyway, whether the identity is really useful for antijoins isn't what
I'm concerned about --- I was just trying to see if it was okay for the
planner's join ordering logic to treat left and anti joins the same.
Seems it's not :-(

> (A semijoin B on (Pab)) antijoin C on (Pbc)
> = A semijoin (B antijoin C on (Pbc)) on (Pab)

> I think this one is true, and it doesn't seem to be mentioned, unless
> I'm missing something. It seems potentially useful.

Hmm, it doesn't seem terribly well-defined --- the values of B are
indeterminate above the semijoin in the first case, so having Pbc refer
to them doesn't seem like a good idea. In particular, it seems like in
the first case the semijoin could randomly choose a B row that has a
join partner in C, causing the A row to disappear from the result, when
the same A row has another B partner that does not join to C --- and the
second form would find that B partner and allow the A row to be output.

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-12 13:06:38
Message-ID: 4993CA7E.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>> (A semijoin B on (Pab)) antijoin C on (Pbc)
>> = A semijoin (B antijoin C on (Pbc)) on (Pab)
>
>> I think this one is true, and it doesn't seem to be mentioned,
>> unless I'm missing something. It seems potentially useful.
>
> Hmm, it doesn't seem terribly well-defined --- the values of B are
> indeterminate above the semijoin in the first case, so having Pbc
> refer to them doesn't seem like a good idea. In particular, it
> seems like in the first case the semijoin could randomly choose a B
> row that has a join partner in C, causing the A row to disappear
> from the result, when the same A row has another B partner that does
> not join to C --- and the second form would find that B partner and
> allow the A row to be output.

I was looking at it from the abstraction that A semijoin B could be
treated as the equivalent of an inner join with duplicate A rows from
the result removed before the final result of the enclosing query. It
seems you've been interpreting it as meaning the inner join of A to
the first (arbitrarily chosen) row of B found. It appears that these
two views of it generate the same results for the other identities,
but not this one.

The first case here could be implemented as an inner join which
included (in addition to any columns needed for other purposes) a row
identifier for A and all the columns of B which are needed for the Pbc
predicate. The antijoin could be performed on that result, after
which duplicate A rows would be eliminated, as well as the row
identifier and the B columns. A simplified (and only slightly
artificial) example of where this could buy orders of magnitude
improvement in run time follows.

Imagine that A is a Party table with ten million rows. Imagine that B
and C are sets of rows within a hundred million row CaseHist table
which records events on cases, some of which are associated with
parties. B represents warrant issuing events, each of which is
related to a party. C represents warrant disposition events, each of
which is related to a warrant issuing event. Both tables are indexed
on case number and a sequence number. Party has a name index.

You've got a name, and you want a list of outstanding warrants for
parties with a matching name.

The second case above would be the natural way to write the query.
The first case, implemented as I describe, would be orders of
magnitude faster.

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Kevin Grittner" <Kgrittn(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-12 13:29:18
Message-ID: 4993CFCE.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> I wrote:
> You've got a name, and you want a list of outstanding warrants for
> parties with a matching name.

Correction, if that was the list you wanted, you would use an inner
join, not a semijoin. For purposes of this illustration I guess you
would be looking for a list of parties who have outstanding warrants,
not a list of the warrants themselves.

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-20 15:59:03
Message-ID: 9519.1235145543@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Hence semijoins can be rearranged just as freely as inner joins.

I guess nobody checked my work, because that claim is bogus. Consider

A semijoin (B innerjoin C on (Pbc)) on (Pab)
=? (A semijoin B on (Pab)) innerjoin C on (Pbc)

In the second form the inner join is now using indeterminate B values.
What's more, if there are multiple C rows joining to some B, we could
get duplicated A rows, which can never happen in the first form. So
semijoins do not commute with inner joins in their RHS. A more accurate
statement seems to be that semijoins can be treated like innerjoins
for the purposes of rearrangement of other special joins.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-20 16:13:56
Message-ID: 603c8f070902200813o298ac32akdc40325d53a0a7a8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 20, 2009 at 10:59 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> Hence semijoins can be rearranged just as freely as inner joins.
>
> I guess nobody checked my work, because that claim is bogus.

I spent some time reading your email and thinking through the cases,
but I completely failed to notice this.

Sorry,

...Robert