SQL for removing duplicates?

Lists: pgsql-novice
From: <kynn(at)panix(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: SQL for removing duplicates?
Date: 2006-06-13 16:11:41
Message-ID: 200606131611.k5DGBff21012@panix3.panix.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice

Hi. I'm stumped. I have a large table (about 8.5M records), let's
call it t, whose columns include x and y. I want to remove records
from this table so that any pair of values for these two fields appear
only once. (This will get rid of about 15% of the records in t.)

One simple solution would be something like

CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;

This works, but it uses a lot of space. I would prefer to simply cull
the unwanted records from t, but I just can't figure out the SQL for
it. Any help with it would be *much* appreciated.

Thanks!

kj


From: Brad Nicholson <bnichols(at)ca(dot)afilias(dot)info>
To: kynn(at)panix(dot)com
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: SQL for removing duplicates?
Date: 2006-06-13 18:26:04
Message-ID: 448F033C.1020708@ca.afilias.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice

kynn(at)panix(dot)com wrote:
> Hi. I'm stumped. I have a large table (about 8.5M records), let's
> call it t, whose columns include x and y. I want to remove records
> from this table so that any pair of values for these two fields appear
> only once. (This will get rid of about 15% of the records in t.)
>
> One simple solution would be something like
>
> CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM t;
> DROP TABLE t;
> ALTER TABLE tmp RENAME TO t;
>
> This works, but it uses a lot of space. I would prefer to simply cull
> the unwanted records from t, but I just can't figure out the SQL for
> it. Any help with it would be *much* appreciated.

If your table is created with OIDs, this should work. If not add a
unique column to the table and use that in place of oid.

DELETE FROM t where oid IN (select t2.oid from t t2 EXCEPT SELECT
max(t3.oid) from t t3 group by t3.x, t3.y);

Also note, the query plan for this is going to be very ugly, it might
very well be cheaper to use the solution that you initially mentioned.

--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.


From: <kynn(at)panix(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: SQL for removing duplicates?
Date: 2006-06-13 22:22:37
Message-ID: 200606132222.k5DMMbv21040@panix3.panix.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice

From: Brad Nicholson <bnichols(at)ca(dot)afilias(dot)info>
Date: Jun 13, 2006 2:26 PM
Subject: Re: [NOVICE] SQL for removing duplicates?
To: kynn(at)panix(dot)com

kynn(at)panix(dot)com wrote:
> Hi. I'm stumped. I have a large table (about 8.5M records), let's
> call it t, whose columns include x and y. I want to remove records
> from this table so that any pair of values for these two fields appear
> only once. (This will get rid of about 15% of the records in t.)
>
> One simple solution would be something like
>
> CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM t;
> DROP TABLE t;
> ALTER TABLE tmp RENAME TO t;
>
> This works, but it uses a lot of space. I would prefer to simply cull
> the unwanted records from t, but I just can't figure out the SQL for
> it. Any help with it would be *much* appreciated.

If your table is created with OIDs, this should work. If not add a
unique column to the table and use that in place of oid.

DELETE FROM t where oid IN (select t2.oid from t t2 EXCEPT SELECT
max(t3.oid) from t t3 group by t3.x, t3.y);

Also note, the query plan for this is going to be very ugly, it might
very well be cheaper to use the solution that you initially mentioned.

Thanks! That's just what I was looking for.

kj

--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.

------=_Part_9560_6324061.1150236298505
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

<br><br>---------- Forwarded message ----------<br><span class="gmail_quote">From: <b class="gmail_sendername">Brad Nicholson</b> &lt;<a href="mailto:bnichols(at)ca(dot)afilias(dot)info">bnichols(at)ca(dot)afilias(dot)info</a>&gt;<br>Date: Jun 13, 2006 2:26 PM
<br>Subject: Re: [NOVICE] SQL for removing duplicates?<br>To: <a href="mailto:kynn(at)panix(dot)com">kynn(at)panix(dot)com</a><br>Cc: <a href="mailto:pgsql-novice(at)postgresql(dot)org">pgsql-novice(at)postgresql(dot)org</a><br><br></span><a href="mailto:kynn(at)panix(dot)com">
kynn(at)panix(dot)com</a> wrote:<br>&gt; Hi.&nbsp;&nbsp;I'm stumped.&nbsp;&nbsp;I have a large table (about 8.5M records), let's<br>&gt; call it t, whose columns include x and y.&nbsp;&nbsp;I want to remove records<br>&gt; from this table so that any pair of values for these two fields appear
<br>&gt; only once.&nbsp;&nbsp;(This will get rid of about 15% of the records in t.)<br>&gt;<br>&gt; One simple solution would be something like<br>&gt;<br>&gt;&nbsp;&nbsp; CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM t;<br>&gt;&nbsp;&nbsp; DROP TABLE t;
<br>&gt;&nbsp;&nbsp; ALTER TABLE tmp RENAME TO t;<br>&gt;<br>&gt; This works, but it uses a lot of space.&nbsp;&nbsp;I would prefer to simply cull<br>&gt; the unwanted records from t, but I just can't figure out the SQL for<br>&gt; it.&nbsp;&nbsp;Any help with it would be *much* appreciated.
<br><br><br>If your table is created with OIDs, this should work.&nbsp;&nbsp;If not add a<br>unique column to the table and use that in place of oid.<br><br>DELETE FROM t where oid IN (select t2.oid from t t2 EXCEPT SELECT<br>max(t3.oid
) from t t3 group by t3.x, t3.y);<br><br>Also note, the query plan for this is going to be very ugly, it might<br>very well be cheaper to use the solution that you initially mentioned.<br><br><br><br><br><br><br>--<br>Brad Nicholson&nbsp;&nbsp;416-673-4106
<br>Database Administrator, Afilias Canada Corp.<br><br>

------=_Part_9560_6324061.1150236298505--