Re: Indexes on Aggregate Functions

From: Kevin Brannen <kevinb(at)nurseamerica(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: serial columns & loads misfeature?
Date: 2002-06-28 18:28:20
Message-ID: 3D1CAAC4.2060202@nurseamerica.net
Lists: pgsql-general

I'm new to Postgres, so sorry if this is easy, but I did search the docs
and could find nothing to answer this...

I've got a Mysql DB that I've dumped out and am trying to insert into a
Pg DB, as we transition. There were a few changes I had to do to the
data, but they were easy--except for the auto_increment columns. :-/

After I created the DB, I inserted the data (thousands of inserts) via
psql. All went well. Then I started testing the changed code (Perl)
and when I went to insert, I got a "dup key" error.

It took me a while to figure out what was going on, but I can recreate
the problem with:

create table test (s serial, i int);
insert into test values (1,1);
insert into test values (2,2);
insert into test values (3,3);
insert into test (i) values (4);
ERROR: Cannot insert a duplicate key into unique index test_s_key

I was expecting the system to realize new "keys" had been inserted, and
so when the "nextval" that implicitly happens on a serial field is run,
it would "know" that it was too small and return "max(s)+1". [FWIW, my
expectations in this area were set by my experience with Informix and
mysql, both of which do this; not sure if other RDBMSs do.]

I realize I can fix this by writing a Perl/DBI script to read the list
of sequences, and do [conceptually]:

get the list of sequences
from the sequence name, retrieve the table name
n=select max(id)+1 from table;
select setval('seq_name',n) from seq_name;

and things will be fine hereafter, but surely this is a common
enough problem after a bulk load that there is something already built
in to handle this and I just don't have it configured correctly (or is
this a bug?).
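
(Spelled out for a single table, one iteration of that conceptual loop
would look something like the following; "seq_name", "mytable", and
"id" are just the placeholders from the list above, not real names:)

```sql
-- One iteration of the conceptual loop above, in plain SQL.
-- "seq_name", "mytable", and "id" are placeholders.
SELECT setval('seq_name', (SELECT max(id) + 1 FROM mytable));
```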

Oh, this on a RH 7.2 system with Pg 7.1.3.

TIA for any help in understanding this better!
Kevin


From: Gregory Seidman <gss+pg(at)cs(dot)brown(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: serial columns & loads misfeature?
Date: 2002-06-28 18:50:46
Message-ID: 20020628185046.GA13066@cs.brown.edu
Lists: pgsql-general

Kevin Brannen sez:
} I'm new to Postgres, so sorry if this is easy, but I did search the docs
} and could find nothing to answer this...
}
} I've got a Mysql DB that I've dumped out and am trying to insert into a
} Pg DB, as we transition. There were a few changes I had to do to the
} data, but they were easy--except for the auto_increment columns. :-/
}
} After I created the DB, I inserted the data (thousands of inserts) via
} psql. All went well. Then I started testing the changed code (Perl)
} and when I went to insert, I got a "dup key" error.
[...]
} and things will be fine from here after, but surely this is a common
} enough problem after a bulk load that there is something already built
} in to handle this and I just don't have it configured correctly (or is
} this a bug?).

It's a known problem. I ran into the exact same thing (also transferring
from MySQL to PostgreSQL). The right way to do it is to add a line after
all the inserts for the table (I am assuming you have a big SQL file dumped
by mysql or whatever):

SELECT setval('seq_name', max(serial_column)) FROM appropriate_table;

Unfortunately, I don't think even pg_dump produces this line, though I
could be wrong. I suppose one could set up a trigger/rule to update the
sequence, but that's probably overkill and costly in performance.

} Oh, this on a RH 7.2 system with Pg 7.1.3.
} TIA for any help in understanding this better!
} Kevin
--Greg


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Kevin Brannen <kevinb(at)nurseamerica(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: serial columns & loads misfeature?
Date: 2002-06-28 19:01:58
Message-ID: 9bcphu8igep4e0ilul4269jm77rs7tk5lc@4ax.com
Lists: pgsql-general

On Fri, 28 Jun 2002 13:28:20 -0500, Kevin Brannen
<kevinb(at)nurseamerica(dot)net> wrote:
>create table test (s serial, i int);
>insert into test values (1,1);
>insert into test values (2,2);
>insert into test values (3,3);
>insert into test (i) values (4);
>ERROR: Cannot insert a duplicate key into unique index test_s_key
Kevin,

SELECT setval('test_s_seq', 3);

should do what you need.

Servus
Manfred


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brannen <kevinb(at)nurseamerica(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: serial columns & loads misfeature?
Date: 2002-06-28 19:03:39
Message-ID: 5982.1025291019@sss.pgh.pa.us
Lists: pgsql-general

Kevin Brannen <kevinb(at)nurseamerica(dot)net> writes:
> I was expecting the system to realize new "keys" had been inserted, and
> so when the "nextval" that implicitly happens on a serial field is run,
> it would "know" that it was too small and return "max(s)+1".

It does not, and I'm not convinced that it should. Manual insertion
of values into an autonumbering column seems like, well, a manual
operation. If you're going to bypass the autonumbering then you should
also be responsible for setting the sequence counter to whatever you
want it to be afterwards.

> I realize I can fix this by writting a Perl/DBI script to read the list
> of sequences, and do [conceptually]:

The usual locution is just

select setval('seqname', (select max(col) + 1 from table));

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: gss+pg(at)cs(dot)brown(dot)edu
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: serial columns & loads misfeature?
Date: 2002-06-28 19:11:35
Message-ID: 6057.1025291495@sss.pgh.pa.us
Lists: pgsql-general

Gregory Seidman <gss+pg(at)cs(dot)brown(dot)edu> writes:
> Unfortunately, I don't think even pg_dump produces this line, though I
> could be wrong.

pg_dump restores the sequence to whatever it was at when dumped.
As it should ...

regards, tom lane


From: nconway(at)klamath(dot)dyndns(dot)org (Neil Conway)
To: Gregory Seidman <gss+pg(at)cs(dot)brown(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: serial columns & loads misfeature?
Date: 2002-06-28 19:21:21
Message-ID: 20020628192121.GA5727@klamath.dyndns.org
Lists: pgsql-general

On Fri, Jun 28, 2002 at 02:50:46PM -0400, Gregory Seidman wrote:
> Kevin Brannen sez:
> } After I created the DB, I inserted the data (thousands of inserts) via
> } psql. All went well. Then I started testing the changed code (Perl)
> } and when I went to insert, I got a "dup key" error.
> [...]
> } and things will be fine from here after, but surely this is a common
> } enough problem after a bulk load that there is something already built
> } in to handle this and I just don't have it configured correctly (or is
> } this a bug?).
>
> It's a known problem.

IMHO, it's not a problem. If you abuse sequences (i.e. INSERT into a
SERIAL column a value that is not generated by its sequence), you're
going to run into trouble -- so the easy solution is, "don't do that".

> I ran into the exact same thing (also transferring
> from MySQL to PostgreSQL). The right way to do it is to add a line after
> all the inserts for the table (I am assuming you have a big SQL file dumped
> by mysql or whatever):
>
> SELECT setval('seq_name', max(serial_column)) FROM appropriate_table;
>
> Unfortunately, I don't think even pg_dump produces this line, though I
> could be wrong.

You're wrong -- pg_dump will setval() the sequence to the value it had
when the dump was made.

Cheers,

Neil

--
Neil Conway <neilconway(at)rogers(dot)com>
PGP Key ID: DB3C29FC


From: Jason Earl <jason(dot)earl(at)simplot(dot)com>
To: Kevin Brannen <kevinb(at)nurseamerica(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: serial columns & loads misfeature?
Date: 2002-06-28 19:21:30
Message-ID: 878z4z43b9.fsf@npa01zz001.simplot.com
Lists: pgsql-general

Kevin Brannen <kevinb(at)nurseamerica(dot)net> writes:

> I'm new to Postgres, so sorry if this is easy, but I did search the
> docs and could find nothing to answer this...
>
> I've got a Mysql DB that I've dumped out and am trying to insert into
> a Pg DB, as we transition. There were a few changes I had to do to
> the data, but they were easy--except for the auto_increment
> columns. :-/
>
> After I created the DB, I inserted the data (thousands of inserts) via
> psql. All went well. Then I started testing the changed code (Perl)
> and when I went to insert, I got a "dup key" error.
>
> It took me awhile to figure out what was going on, but I can recreate
> the problem with:
>
> create table test (s serial, i int);
> insert into test values (1,1);
> insert into test values (2,2);
> insert into test values (3,3);
> insert into test (i) values (4);
> ERROR: Cannot insert a duplicate key into unique index test_s_key
>
> I was expecting the system to realize new "keys" had been inserted,
> and so when the "nextval" that implicitly happens on a serial field is
> run, it would "know" that it was too small and return "max(s)+1".
> [FWIW, my expectations in this area were set by my experience with
> Informix and mysql, both do this; not sure if other RDBMs do.]
>
> I realize I can fix this by writing a Perl/DBI script to read the
> list of sequences, and do [conceptually]:
>
> get the list of sequences
> from the sequence name, retrieve the table name
> n=select max(id)+1 from table;
> select setval('seq_name',n) from seq_name;
>
> and things will be fine from here after, but surely this is a common
> enough problem after a bulk load that there is something already built
> in to handle this and I just don't have it configured correctly (or is
> this a bug?).
>
> Oh, this on a RH 7.2 system with Pg 7.1.3.
>
> TIA for any help in understanding this better!
> Kevin

First of all, since you are migrating to PostgreSQL you really might
consider trying the newest version of PostgreSQL. The 7.1 series was
good, but the 7.2 series is better.

With that out of the way here's an explanation of how PostgreSQL's
serial type works. The serial type is really nothing more than a
wrapper around a PostgreSQL sequence and a table with a default value
for the insert. However, for this default to be used you have to make
sure that you don't assign s a value manually. When you insert into
test like this:

INSERT INTO test VALUES (1,1);

you are deliberately short-circuiting the default value. This is
useful because there are times when you don't want the next value from
your sequence. When you insert into the table like this:

INSERT INTO test (i) VALUES (4);

You didn't specify a value for s, so PostgreSQL uses the default,
which, since you haven't yet drawn a value from the sequence, is still 1.
This is almost certainly what you want your application to do once you
get it ported over to PostgreSQL.

The trick, when importing data from some other source, is to make sure
that you update the sequence so that it is set to a number higher than
the data you imported. Using your example, you only need to add
one line:

CREATE TABLE test (s serial, i int);
INSERT INTO test VALUES (1,1);
INSERT INTO test VALUES (2,2);
INSERT INTO test VALUES (3,3);
SELECT setval('test_s_seq', 4);
INSERT INTO test (i) VALUES (4);

You could even do something like:

SELECT setval('test_s_seq', (SELECT max(s) + 1 FROM test));

PostgreSQL doesn't use indexes to evaluate aggregate functions (like
max()), so on large tables it would be faster to write that as:

SELECT setval('test_s_seq', (SELECT s + 1 FROM test ORDER BY s DESC
LIMIT 1));

That's somewhat ugly, but it's much, much faster. Just another little
PostgreSQL hint that is likely to come in handy :).

So the trick is to simply import your data from MySQL, reset your
sequences using setval, and then make sure that your queries do not
supply a value for your serial columns so that PostgreSQL will supply
a default value from your sequence. As long as you let PostgreSQL
fill in the default value PostgreSQL's serial type should do precisely
what you need.

Jason


From: Andrew Sullivan <andrew(at)libertyrms(dot)info>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: serial columns & loads misfeature?
Date: 2002-06-28 20:26:54
Message-ID: 20020628162654.Q32484@mail.libertyrms.com
Lists: pgsql-general

On Fri, Jun 28, 2002 at 02:50:46PM -0400, Gregory Seidman wrote:
>
> SELECT setval('seq_name', max(serial_column)) FROM appropriate_table;
>
> Unfortunately, I don't think even pg_dump produces this line, though I
> could be wrong.

You are indeed wrong. It is produced by pg_dump (well, actually, it
doesn't use the max() function, but the result is the same).

A

--
----
Andrew Sullivan 87 Mowat Avenue
Liberty RMS Toronto, Ontario Canada
<andrew(at)libertyrms(dot)info> M6K 3E3
+1 416 646 3304 x110


From: Kevin Brannen <kevinb(at)nurseamerica(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: serial columns & loads misfeature?
Date: 2002-06-28 21:51:05
Message-ID: 3D1CDA49.40407@nurseamerica.net
Lists: pgsql-general

Trying for 2 at once...

Tom Lane wrote:
> Kevin Brannen <kevinb(at)nurseamerica(dot)net> writes:
>
>>I was expecting the system to realize new "keys" had been inserted, and
>>so when the "nextval" that implicitly happens on a serial field is run,
>>it would "know" that it was too small and return "max(s)+1".
>
>
> It does not, and I'm not convinced that it should. Manual insertion
> of values into an autonumbering column seems like, well, a manual
> operation. If you're going to bypass the autonumbering then you should
> also be responsible for setting the sequence counter to whatever you
> want it to be afterwards.
>

Neil Conway wrote:
> On Fri, Jun 28, 2002 at 02:50:46PM -0400, Gregory Seidman wrote:
>
>>Kevin Brannen sez:
...
>>} and things will be fine from here after, but surely this is a common
>>} enough problem after a bulk load that there is something already built
>>} in to handle this and I just don't have it configured correctly (or is
>>} this a bug?).
>>
>>It's a known problem.
>
>
> IMHO, it's not a problem. If you abuse sequences (i.e. INSERT into a
> SERIAL column a value that is not generated by its sequence), you're
> going to run into trouble -- so the easy solution is, "don't do that".

I picked these 2 comments just because they were expressing the "other
viewpoint". I think that when all is said and done, there is no correct
answer to whether the SERIAL column (or actually the underlying
sequence) should or should not pay attention to other inserted values.
Some people expect it one way, because we came from other products that
do that, while others don't, because they think it incorrect or for
some other reason.

I suppose I could change the question to: Why can't we have it both
ways? You know, have that ability controlled by a param in the
postgresql.conf file; it could even be defaulted to "off". I really
think this could be important, especially if the ultimate goal is world
domination. ;-) After all, bulk loads and transformations are a fact of
life that must be dealt with (so "don't do that" is not an option unless
I don't use Pg, else I'll never migrate away from mysql).

And I really appreciate all the advice and help given to help me
understand the Pg way on this topic, and the magic statement(s) to fix
it! A Perl program to reset all 35 sequences should be simple.
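
(For the record, that script would boil down to one statement per
sequence, along these lines; the names here are placeholders following
the <table>_<column>_seq naming that SERIAL uses by default:)

```sql
-- Placeholder names; repeat for each of the 35 sequences.
SELECT setval('test_s_seq', (SELECT max(s) FROM test));
```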

Thanks!
Kevin


From: Gregory Seidman <gss+pg(at)cs(dot)brown(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: serial columns & loads misfeature?
Date: 2002-06-28 22:16:30
Message-ID: 20020628221630.GA24082@cs.brown.edu
Lists: pgsql-general

Kevin Brannen sez:
[...]
} I suppose I could change the question to: Why can't we have it both
} ways? You know, have that ability controlled by a param in the
} postgresql.conf file; it could even be defaulted to "off". I really
} think this could be important, especially if the ultimate goal is world
} domination. ;-) After all, bulk loads and transformations are a fact of
} life that must be dealt with (so "don't do that" is not an option unless
} I don't use Pg, else I'll never migrate away from mysql).

You can have it both ways, at what may or may not be an acceptable
performance penalty. A trigger/rule can be written to update the sequence
upon any insertion. For that matter, since SERIAL is just shorthand for
creating a sequence and setting the column default to a function call, one
can simply write it longhand and replace the function call with a function
you wrote. For example (this requires some hoop-jumping to make the
references come out right):

CREATE TABLE Foo (
    id int UNIQUE NOT NULL,  -- no default yet
    -- ... whatever else
    PRIMARY KEY (id)
);

CREATE FUNCTION next_foo() RETURNS int
AS 'SELECT COALESCE(max(id), 0)+1 FROM Foo'
LANGUAGE SQL;

ALTER TABLE Foo ALTER COLUMN id SET DEFAULT next_foo();

...and every time you insert a row without specifying the id by hand, the
id field will default to one more than the largest value currently in the
table. If you were feeling really clever and didn't mind the performance
penalty, you could even write a function that would reuse gaps.
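
(A rough, untested sketch of that gap-reusing variant, against the same
Foo table; note it won't reclaim a gap below the current minimum id:)

```sql
-- Untested sketch: take the smallest id whose successor is free,
-- plus one; fall back to 1 on an empty table. Foo as defined above.
CREATE FUNCTION next_foo_gap() RETURNS int
AS 'SELECT COALESCE(
        (SELECT min(f.id) + 1 FROM Foo f
          WHERE NOT EXISTS
            (SELECT 1 FROM Foo g WHERE g.id = f.id + 1)),
        1)'
LANGUAGE SQL;
```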

Basically, the default behavior is the one with the minimum performance
hit.

[...]
} Thanks!
} Kevin
--Greg


From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Gregory Seidman <gss+pg(at)cs(dot)brown(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: serial columns & loads misfeature?
Date: 2002-06-28 22:53:55
Message-ID: c5pphugtmoebbrv3448in8fi61eabjpfhb@4ax.com
Lists: pgsql-general

On Fri, 28 Jun 2002 18:16:30 -0400, Gregory Seidman
<gss+pg(at)cs(dot)brown(dot)edu> wrote:
>CREATE TABLE Foo (
> id int UNIQUE NOT NULL, -- no default yet
> -- ... whatever else
> primary key (id)
>);
>
>CREATE FUNCTION next_foo() RETURNS int
>AS 'SELECT COALESCE(max(id), 0)+1 FROM Foo'
>LANGUAGE SQL;
>
>ALTER TABLE Foo ALTER COLUMN id SET DEFAULT next_foo();

Greg, yes this works, but ...

it turns your application into a single user system, because
concurrent transactions could see the same max(id) and therefore try
to insert duplicate keys,

and the performance hit soon gets unacceptable, because select max()
always does a table scan.
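
(The race, as a timeline; the session labels and values are
illustrative:)

```sql
-- Two concurrent sessions, both using the max(id)+1 default:
--   A: SELECT COALESCE(max(id),0)+1 FROM Foo;  -- sees max = 41, gets 42
--   B: SELECT COALESCE(max(id),0)+1 FROM Foo;  -- also sees 41, gets 42
--   A: INSERT INTO Foo (id) VALUES (42);       -- succeeds
--   B: INSERT INTO Foo (id) VALUES (42);       -- duplicate key error
-- nextval() is atomic and never hands out the same value twice,
-- which is why the sequence-based default doesn't have this race.
```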

Kevin, stick to serials. They offer the best performance and you will
get used to them. And you do the migration only once, don't you?

Servus
Manfred


From: Lee Harr <missive(at)frontiernet(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: serial columns & loads misfeature?
Date: 2002-06-29 22:45:00
Message-ID: afld9c$1eni$1@news.hub.org
Lists: pgsql-general

> After I created the DB, I inserted the data (thousands of inserts) via
> psql. All went well. Then I started testing the changed code (Perl)
> and when I went to insert, I got a "dup key" error.
>
> It took me awhile to figure out what was going on, but I can recreate
> the problem with:
>
> create table test (s serial, i int);
> insert into test values (1,1);
> insert into test values (2,2);
> insert into test values (3,3);
> insert into test (i) values (4);
> ERROR: Cannot insert a duplicate key into unique index test_s_key
>

With these inserts, you are bypassing the SERIAL mechanism
(which works through a DEFAULT value on the column).

> I was expecting the system to realize new "keys" had been inserted, and
> so when the "nextval" that implicitly happens on a serial field is run,
> it would "know" that it was too small and return "max(s)+1". [FWIW, my
> expectations in this area were set by my experience with Informix and
> mysql, both do this; not sure if other RDBMs do.]
>

I can certainly see the advantage of having the SERIAL columns set
properly by some kind of OtherDB --> Postgres conversion tool, but
I do not think there is a need for a different mechanism in the
usual case.


From: Curt Sampson <cjs(at)cynic(dot)net>
To: Jason Earl <jason(dot)earl(at)simplot(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Indexes on Aggregate Functions
Date: 2002-07-01 06:18:09
Message-ID: Pine.NEB.4.43.0207011513000.408-100000@angelic.cynic.net
Lists: pgsql-general

On 28 Jun 2002, Jason Earl wrote:

> SELECT setval('test_s_seq', (SELECT max(s) + 1 FROM test));
>
> PostgreSQL doesn't use the indexes on aggregate functions (like max())
> so it would be faster on large tables to write that as:
>
> SELECT setval('test_s_seq', (SELECT s + 1 FROM test ORDER BY s DESC
> LIMIT 1));

I've wondered about this, actually. Why doesn't postgres use the
indexes? For something like MAX(s) it would certainly be a lot faster.

Another cool optimisation that MS SQL Server does is, if the
information requested is looked up in an index, and all the columns
you're retrieving are already in the index data, it doesn't bother
retrieving the values from the table itself (though presumably it
checks for triggers to execute and so on). E.g., if you have

CREATE TABLE foo (
    mykey int PRIMARY KEY,
    otherkey_part1 int NOT NULL,
    otherkey_part2 varchar(255) NOT NULL,
    morestuff varchar(255) NOT NULL);
CREATE INDEX otherkeys ON foo (otherkey_part1, otherkey_part2);

SELECT otherkey_part1, otherkey_part2 FROM foo
    WHERE otherkey_part1 = 17;

that query will read only the index, not the table itself.

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC


From: Alvaro Herrera <alvherre(at)atentus(dot)com>
To: Curt Sampson <cjs(at)cynic(dot)net>
Cc: Jason Earl <jason(dot)earl(at)simplot(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Indexes on Aggregate Functions
Date: 2002-07-01 07:01:50
Message-ID: Pine.LNX.4.44.0207010257130.16429-100000@cm-lcon-46-187.cm.vtr.net
Lists: pgsql-general

Curt Sampson dijo:

> On 28 Jun 2002, Jason Earl wrote:
>
> > SELECT setval('test_s_seq', (SELECT max(s) + 1 FROM test));
> >
> > PostgreSQL doesn't use the indexes on aggregate functions (like max())
> > so it would be faster on large tables to write that as:
> >
> > SELECT setval('test_s_seq', (SELECT s + 1 FROM test ORDER BY s DESC
> > LIMIT 1));
>
> I've wondered about this, actually. Why doesn't postgres use the
> indexes? For something like MAX(s) it would certainly be a lot faster.

Because Postgres does not know what operator is running inside the
aggregate, it doesn't know whether a given index can be used:
aggregates are opaque functions. Maybe a hack could be written that
ties an operator to an aggregate, allowing indexes to be used; I
don't know if it's possible.

> Another cool optimisation that MS SQL Server does is, if the
> information requested is looked up in an index, and all the columns
> you're retrieving are already in the index data, it doesn't bother
> retrieving the values from the table itself (though presumably it
> checks for triggers to execute and so on). E.g., if you have

I think Postgres cannot do this because the index doesn't contain the
needed transaction visibility information; it's on the heap, so it has
to go there anyway.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"Entristecido, Wutra
echa a Freyr a rodar
y a nosotros al mar" (cancion de Las Barreras)