COPY performance

From: "Nick Fankhauser" <nickf(at)ontko(dot)com>
To: "pgsql-admin" <pgsql-admin(at)postgresql(dot)org>
Subject: A *short* planner question
Date: 2002-04-12 21:56:03
Message-ID: NEBBLAAHGLEEPCGOBHDGMECJEMAA.nickf@ontko.com
Lists: pgsql-admin pgsql-general

I know I'm about to become a pest, but I promise, this is a short one!

Before doing the explain below, I specifically did a verbose analyze and noted
that the results seemed in line with what I expected. I'm on PostgreSQL v7.1.3.

Here's the query that runs too slowly (it takes about 30 seconds on a 1.2GHz
Athlon system with 512MB RAM):

monroe=# explain
monroe-# select
monroe-# actor.actor_id,
monroe-# actor.role_class_code,
monroe-# actor.actor_full_name,
monroe-# actor.actor_person_date_of_birth
monroe-# from actor
monroe-# where exists (select 'x'
monroe(# from actor_case_assignment,case_data
monroe(# where actor_case_assignment.actor_id = actor.actor_id
monroe(# and actor_case_assignment.case_id = case_data.case_id
monroe(# and case_data.case_disp_global_code = 'Open')
monroe-# and upper(actor_full_name) like 'RAY, J%';

NOTICE: QUERY PLAN:

Index Scan using actor_upper_full_name on actor  (cost=0.00..1284478.54 rows=2799 width=40)
  SubPlan
    ->  Nested Loop  (cost=0.00..19291.12 rows=42 width=24)
          ->  Seq Scan on case_data  (cost=0.00..5424.69 rows=2871 width=12)
          ->  Index Scan using actor_case_assignment_both on actor_case_assignment  (cost=0.00..4.82 rows=1 width=12)

The index actor_case_assignment_both is on (case_id, actor_id), in that order.
There is also an index on actor_case_assignment(actor_id).
There is a unique index on case_data(case_id).
There is an index on upper(actor_full_name).
case_data.case_disp_global_code has only two distinct values ('Open', 'Closed').
actor contains 279855 rows.
case_data contains 168775 rows.
actor_case_assignment contains 753610 rows.
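
For reference, in DDL terms the indexes above come to roughly this (a sketch;
the names are the ones that appear in the explain output in this thread):

create index actor_upper_full_name on actor (upper(actor_full_name));
create unique index actor_case_assignment_both
    on actor_case_assignment (case_id, actor_id);  -- unique: rejects duplicate assignments
create index actor_case_assignment_fk1 on actor_case_assignment (actor_id);
create unique index case_data_case_id on case_data (case_id);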

So... why wouldn't the planner do this:

Index scan on actor for upper(actor_full_name) like 'RAY, J%' returns a small set (rows < 1000)
For each actor_id in the set:
    do an index scan using actor_case_assignment(actor_id) to return
        actor_case_assignment.case_id (rows < 2000)
    For each case_id:
        do an index scan to return case_data.case_disp_global_code
        scan each code for 'Open' (rows < 100)

Any thoughts on how I might encourage the planner to use such a strategy?

Another thing that doesn't make sense to me, but doesn't hurt the plan:
although we have analyze stats on actor, the planner seems to be using the
default selectivity of .01 on upper(actor_full_name) and predicts that 2799
rows will be returned. However, if I get rid of the upper(), it uses the stats
and predicts that 1 row will be returned. Does the use of an index on a
function make the planner stop using the stats?

Thanks,

-Nick


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: nickf(at)ontko(dot)com
Cc: "pgsql-admin" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: A *short* planner question
Date: 2002-04-12 22:29:46
Message-ID: 11224.1018650586@sss.pgh.pa.us
Lists: pgsql-admin pgsql-general

"Nick Fankhauser" <nickf(at)ontko(dot)com> writes:
> So... why wouldn't the planner do this:

Offhand I think the planner should have considered that plan; evidently
it thought it was more expensive than this plan. (Perhaps it was right;
how selective is the actor_case_assignment(actor_id) index?) You could
check by temporarily dropping the actor_case_assignment_both index and
seeing what plan you get. You don't even have to really drop it; try

begin;
drop index ...;
explain ...;
rollback;

Should work, without the pain of rebuilding the index afterwards...
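
For your case that would be something like (the explain target being the same
query as in your first message):

begin;
drop index actor_case_assignment_both;
explain select ... ;   -- rerun the query from above
rollback;              -- undoes the drop, since DDL is transactional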

> Another thing that doesn't make sense to me, but doesn't hurt the plan:
> although we have analyze stats on actor, the planner seems to be using the
> default selectivity of .01 on upper(actor_full_name) and predicts that 2799
> rows will be returned. However, if I get rid of the upper(), it uses the
> stats and predicts that 1 row will be returned. Does the use of an index on
> a function make the planner stop using the stats?

There are no stats computed for the values of functional indexes, at
present, thus no way for the planner to derive any realistic estimate for
the selectivity of that clause. I have a private TODO item about that,
but I'm not sure if the public TODO list mentions it.

regards, tom lane


From: "Nick Fankhauser" <nickf(at)ontko(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-admin" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: A *short* planner question
Date: 2002-04-12 23:31:48
Message-ID: NEBBLAAHGLEEPCGOBHDGOECNEMAA.nickf@ontko.com
Lists: pgsql-admin pgsql-general


> You could
> check by temporarily dropping the actor_case_assignment_both index and
> seeing what plan you get.

Here is the result:

Index Scan using actor_upper_full_name on actor  (cost=0.00..1544484.16 rows=3051 width=40)
  SubPlan
    ->  Nested Loop  (cost=0.00..21275.72 rows=42 width=24)
          ->  Index Scan using actor_case_assignment_fk1 on actor_case_assignment  (cost=0.00..9221.62 rows=2696 width=12)
          ->  Index Scan using case_data_case_id on case_data  (cost=0.00..4.46 rows=1 width=12)

Lightning-fast, but I need that index on both ids for other purposes.

The problem is that I need the index on both foreign keys because I use it
to kick out duplicate entry attempts during my load process. (Duplicate
actors are ok, and duplicate cases are ok, but an actor can only be assigned
to a case once, so the combination must be unique.)

Fortunately, your info on the function index not using stats got me thinking
in a profitable direction:

I'm using the "exists" subquery only because when I used "distinct", I got
even worse performance. I think this was because the planner chose not to
use my index on upper(actor_full_name) when I simply joined all of the
tables. If I reword the query to use "distinct", and eliminate also the
upper() on my constraint, the query really flies.

So I think my solution will be to add a new column called
"upper_actor_full_name" to my "actor" table, and add a bit of code to my
load process that will populate this field with upper(actor_full_name). It's
a bit of a kludge, but should work until the day that you get to adding
stats for function indexes.
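
Roughly (a sketch; the index name is made up):

alter table actor add column upper_actor_full_name varchar;
update actor set upper_actor_full_name = upper(actor_full_name);
create index actor_upper_fullname_plain on actor (upper_actor_full_name);
vacuum analyze actor;   -- so the new column gets real statistics

-- the query predicate then becomes
--   where upper_actor_full_name like 'RAY, J%'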

Thanks for the help.

-Nick

--------------------------------------------------------------------------
Nick Fankhauser nickf(at)ontko(dot)com Phone 1.765.935.4283 Fax 1.765.962.9788
Ray Ontko & Co. Software Consulting Services http://www.ontko.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: nickf(at)ontko(dot)com
Cc: "pgsql-admin" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: A *short* planner question
Date: 2002-04-13 00:27:56
Message-ID: 11984.1018657676@sss.pgh.pa.us
Lists: pgsql-admin pgsql-general

"Nick Fankhauser" <nickf(at)ontko(dot)com> writes:
>> You could
>> check by temporarily dropping the actor_case_assignment_both index and
>> seeing what plan you get.

> Here is the result:

> Index Scan using actor_upper_full_name on actor  (cost=0.00..1544484.16 rows=3051 width=40)
>   SubPlan
>     ->  Nested Loop  (cost=0.00..21275.72 rows=42 width=24)
>           ->  Index Scan using actor_case_assignment_fk1 on actor_case_assignment  (cost=0.00..9221.62 rows=2696 width=12)
>           ->  Index Scan using case_data_case_id on case_data  (cost=0.00..4.46 rows=1 width=12)

> Lightning-fast, but I need that index on both ids for other purposes.

Hmm. If the outer side of the nestloop were actually hitting 2696 rows
on average, it wouldn't be "lightning fast". So the planner's failure
to choose this plan is probably due to this overestimate. You said you
were still on 7.1.*, right? It'd be interesting to know if 7.2 gets
this right; it has more detailed stats and hopefully would make a better
estimate of the number of matches.

regards, tom lane


From: "Nick Fankhauser" <nickf(at)ontko(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-admin" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: A *short* planner question
Date: 2002-04-13 16:03:59
Message-ID: NEBBLAAHGLEEPCGOBHDGMEDFEMAA.nickf@ontko.com
Lists: pgsql-admin pgsql-general


> Hmm. If the outer side of the nestloop were actually hitting 2696 rows
> on average, it wouldn't be "lightning fast". So the planner's failure
> to choose this plan is probably due to this overestimate.

This may be a reasonable estimate. As you might have gathered, we're
matching people to court cases. The typical person (hopefully) has only one
case that they're involved with, but an attorney may be in the 1K-10K range
and a judge in 10K-100K. Since they all live in the same table, the estimate
may not be all that far off. In addition, since we're using a "like", typing
in only an initial char would get roughly 1/26 of the rows back. Does the
planner look at the number of leading chars provided when coming up with a
plan?
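
Incidentally, one quick way to see how skewed that distribution really is
(a sketch):

select actor_id, count(*) as assignments
from actor_case_assignment
group by actor_id
order by count(*) desc
limit 10;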

> still on 7.1.*, right? It'd be interesting to know if 7.2 gets
> this right; it has more detailed stats and hopefully would make a better
> estimate of the number of matches.

7.2 on our development box is on the middle-priority TODO list. (2-4 weeks.)
I'll provide feedback when we get there.

As always, thanks for the help!

-Nick

--------------------------------------------------------------------------
Nick Fankhauser nickf(at)ontko(dot)com Phone 1.765.935.4283 Fax 1.765.962.9788
Ray Ontko & Co. Software Consulting Services http://www.ontko.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: nickf(at)ontko(dot)com
Cc: "pgsql-admin" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: A *short* planner question
Date: 2002-04-13 16:28:22
Message-ID: 16054.1018715302@sss.pgh.pa.us
Lists: pgsql-admin pgsql-general

"Nick Fankhauser" <nickf(at)ontko(dot)com> writes:
> ... In addition, since we're using a "like", typing
> in only an initial char would get roughly 1/26 of the rows back. Does the
> planner look at the number of leading chars provided when coming up with a
> plan?

Yeah it does, if it can see a constant pattern for the LIKE or regex match.
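
Roughly speaking, a constant pattern such as

    where upper(actor_full_name) like 'RAY, J%'

can be treated, for estimation and index-range purposes, about like

    where upper(actor_full_name) >= 'RAY, J' and upper(actor_full_name) < 'RAY, K'

whereas a pattern that isn't constant at plan time gives the planner nothing
to look at, so it falls back to a default guess. (That's a sketch of the idea;
the details of the transformation depend on locale.)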

regards, tom lane


From: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: COPY performance
Date: 2002-04-13 18:13:06
Message-ID: Pine.LNX.4.21.0204131840360.3278-100000@ponder.fairway2k.co.uk
Lists: pgsql-admin pgsql-general

I'm trying to COPY in a table of 1 million rows. The table is created by:

create table chat_post_new (
    session_id  INT NOT NULL references chat_session (session_id),
    poster_name VARCHAR(32) NOT NULL,
    time        TIMESTAMP NOT NULL,
    post_number INT NOT NULL,
    FTS         txtidx
);

The old definition had the integer columns as int2s and the FTS column wasn't
there. Because I wanted to reorder the rows, adding them in a more natural
order (by time), I created the data file read in by the copy command using

SELECT session_id || '\t' || poster_name || '\t' || time || '\t' || post_number
FROM chat_post
ORDER BY time

After removing the first and last couple of lines so that only the data is in
the file, renaming the original table, and creating the new version, I tried
running:

COPY chat_post FROM 'file-path'

The data file is 40MB and has somewhere over 1000000 rows of data. During the
copy I can sit and watch the memory footprint of the backend process growing.
It eventually gets to something like 80MB, with 10-20MB less for its resident
set size. I've never seen this copy complete, because it can't get sufficient
resources from the system after this point.

So I split the file into 300000-row chunks and ran four COPY commands:

Initially when a COPY kicks off there is a fair amount of disk activity, which
is to be expected, but then the memory footprint starts growing (from a fresh
connection size), disk activity drops off to unnoticeable unless I watch and
listen, and the CPU usage increases. The memory usage stabilises around 30MB,
with 24MB resident, before the CPU utilisation finishes creeping up.
Eventually, and not long after starting, the backend process is clocking up
99% of the CPU.

So why does the memory usage increase so? Why is the process only IO bound
initially? And why is the process so CPU intensive?

I'm thinking that the data file is read entirely into memory before any
processing of the data is done, but I can't work out why the file isn't just
mmap()ed (STDIN can surely be special-cased), and why the CPU cost of
converting the ascii data to internal representations is so high in a process
that I would have thought to be very nearly IO bound.

I could take a look at the code (not something I've done yet, I must admit)
and see if I can contribute by doing the mmap business if it's seen as a good
idea.

--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants


From: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY performance
Date: 2002-04-13 18:19:21
Message-ID: Pine.LNX.4.21.0204131916080.3278-100000@ponder.fairway2k.co.uk
Lists: pgsql-admin pgsql-general


On Sat, 13 Apr 2002, Nigel J. Andrews wrote:
>
>
> I'm trying to COPY in a table of 1 million rows. The table is created by:
>
> [sniped]
>
> mmap()ed, STDIN can be special cased surely, and how about all the CPU power
> required for the ascii to internal representations being so high in a process
> that I would have thought to be very nearly IO bound?
>
> I could take a look at the code, not something I've done yet I must admit, and
> see if I can contribute by doing the mmap business if it's seen as a good idea.

I forgot to mention that even with 99% of the CPU it still takes a pretty long
time to load 300,000 rows; it's still working on the second batch, and I think
I kicked the first off 30-60 minutes ago.

Also:

=> select version();
                                version
----------------------------------------------------------------------
 PostgreSQL 7.2.1 on i386-unknown-freebsd3.3, compiled by GCC 2.7.2.3

--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants


From: Joe Conway <mail(at)joeconway(dot)com>
To: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY performance
Date: 2002-04-13 18:38:40
Message-ID: 3CB87B30.1010203@joeconway.com
Lists: pgsql-admin pgsql-general

Nigel J. Andrews wrote:
>
> I'm trying to COPY in a table of 1 million rows. The table is created by:
>
> create table chat_post_new (
>     session_id  INT NOT NULL references chat_session (session_id),
>     poster_name VARCHAR(32) NOT NULL,
>     time        TIMESTAMP NOT NULL,
>     post_number INT NOT NULL,
>     FTS         txtidx
> );
>
> The old definition had the integer columns as int2s and the FTS column wasn't
> there. Because I wanted to reorder the rows, add them in a more natural order
> (by time), I created the data file read in by the copy command using
>
> SELECT session_id || '\t' || poster_name || '\t' || time || '\t' || post_number
> FROM chat_post
> ORDER BY time
>
> After removing the first and last couple of lines, so that only the data is in
> the file, renaming the original table and creating the new version I tried
> running:
>
> COPY chat_post FROM 'file-path'
>

I'm not sure if this is your problem, but I believe you need the same
number of columns in your file as in your table. So if you want FTS to be
null, append \N to the output, e.g.

SELECT session_id || '\t' || poster_name || '\t' || time || '\t' ||
    post_number || '\t\\N'   -- backslash doubled so the literal emits \N
FROM chat_post
ORDER BY time

Also, you should probably leave the foreign key reference off the table
and add it after you load the data.
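
That is, something along these lines (a sketch; the constraint name is made up):

create table chat_post_new (
    session_id  INT NOT NULL,        -- no references clause yet
    poster_name VARCHAR(32) NOT NULL,
    time        TIMESTAMP NOT NULL,
    post_number INT NOT NULL,
    FTS         txtidx
);

COPY chat_post_new FROM 'file-path';

alter table chat_post_new
    add constraint chat_post_session_fk
    foreign key (session_id) references chat_session (session_id);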

Hope this helps,

Joe


From: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY performance
Date: 2002-04-13 19:04:04
Message-ID: Pine.LNX.4.21.0204131959070.3278-100000@ponder.fairway2k.co.uk
Lists: pgsql-admin pgsql-general


On Sat, 13 Apr 2002, Joe Conway wrote:

> Nigel J. Andrews wrote:
> >
> > I'm trying to COPY in a table of 1 million rows. The table is created by:
> >
> > create table chat_post_new (
> >     session_id  INT NOT NULL references chat_session (session_id),
> >     poster_name VARCHAR(32) NOT NULL,
> >     time        TIMESTAMP NOT NULL,
> >     post_number INT NOT NULL,
> >     FTS         txtidx
> > );
> >
> > The old definition had the integer columns as int2s and the FTS column wasn't
> > there. Because I wanted to reorder the rows, add them in a more natural order
> > (by time), I created the data file read in by the copy command using
> >
> > SELECT session_id || '\t' || poster_name || '\t' || time || '\t' || post_number
> > FROM chat_post
> > ORDER BY time
> >
> > After removing the first and last couple of lines, so that only the data is in
> > the file, renaming the original table and creating the new version I tried
> > running:
> >
> > COPY chat_post FROM 'file-path'
> >
>
> I'm not sure if this is your problem, but I believe you need the same
> number of columns in your file as your table. So if you want FTS to be
> null, append \N to the output, e.g.
>
> SELECT session_id || '\t' || poster_name || '\t' || time || '\t' ||
> post_number || '\t\\N'
> FROM chat_post
> ORDER BY time
>

Ah, thanks for that tip. I don't know if it'll make a difference (I'll try
later), but I had wondered what that \N was doing at the end of a copy out to
file I did before.

> Also, you should probably leave the foreign key reference off the table
> and add it after you load the data.

I should have mentioned that I'm doing the copy in to the table in a
transaction block with all constraints deferred. That should mean it's only at
the commit stage that the foreign key will be checked, right?

> Hope this helps,

Well, the \N is useful to know, thanks.

>
> Joe
>

--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants


From: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
To: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY performance
Date: 2002-04-13 19:18:59
Message-ID: 20020413121415.J86939-100000@megazone23.bigpanda.com
Lists: pgsql-admin pgsql-general


On Sat, 13 Apr 2002, Nigel J. Andrews wrote:

> > Also, you should probably leave the foreign key reference off the table
> > and add it after you load the data.
>
> I should have mentioned that I'm doing the copy in to the table in a
> transaction block with all constraints deferred. That should mean it's only at
> the commit stage that foreign key will be checked right?

With the definition shown, I believe your constraint is not deferrable, so
setting the constraint mode to deferred won't help. In any case it would still
need to save the information on the triggers to run.
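
To get commit-time checking the constraint would have to be declared
deferrable in the first place, e.g. (a sketch):

create table chat_post_new (
    session_id  INT NOT NULL
        references chat_session (session_id)
        deferrable initially deferred,
    poster_name VARCHAR(32) NOT NULL,
    time        TIMESTAMP NOT NULL,
    post_number INT NOT NULL,
    FTS         txtidx
);

begin;
COPY chat_post_new FROM 'file-path';
commit;   -- the foreign key checks fire here, not per statement

-- though the pending-trigger information still accumulates per row,
-- so this wouldn't avoid the memory growth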


From: Joe Conway <mail(at)joeconway(dot)com>
To: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY performance
Date: 2002-04-13 19:19:34
Message-ID: 3CB884C6.2050307@joeconway.com
Lists: pgsql-admin pgsql-general

Nigel J. Andrews wrote:
>
>
> I should have mentioned that I'm doing the copy in to the table in a
> transaction block with all constraints deferred. That should mean
> it's only at the commit stage that foreign key will be checked
> right?
>

I'm not really sure; maybe someone else can chime in.

But I can say that I imported about 38 million rows of similar size
(actually, a bit larger) using COPY in under an hour with 7.2RC3. I had
no indexes or references defined during the COPY. I never saw memory use
grow beyond about 10MB IIRC -- possibly in part because I forgot to
increase shared buffers beyond the default before I started.

FWIW, hardware was dual P3 833 (I think), with 2GB RAM, hardware
mirrored SCSI hard-drives, Red Hat 7.2.

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
Cc: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY performance
Date: 2002-04-13 20:05:21
Message-ID: 17514.1018728321@sss.pgh.pa.us
Lists: pgsql-admin pgsql-general

Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:
> On Sat, 13 Apr 2002, Nigel J. Andrews wrote:
>> I should have mentioned that I'm doing the copy in to the table in a
>> transaction block with all constraints deferred. That should mean it's only at
>> the commit stage that foreign key will be checked right?

> With the definition shown, I believe your constraint is not deferrable so
> setting the constraint mode to deferred won't help. In any case it'd still
> need to be saving the information on the triggers to run.

In any case the RI trigger firings will be postponed till end of query.
I suspect that the memory growth is due to the list of pending trigger
firings. The advice to add the REFERENCES constraint after you've
loaded the table seems good to me.

Another possibility is that there's some memory leak associated with the
txtidx data type; I dunno how thoroughly that type has been tested...

regards, tom lane


From: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY performance
Date: 2002-04-13 20:19:14
Message-ID: Pine.LNX.4.21.0204132114050.3278-100000@ponder.fairway2k.co.uk
Lists: pgsql-admin pgsql-general


On Sat, 13 Apr 2002, Tom Lane wrote:

> Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:
> > On Sat, 13 Apr 2002, Nigel J. Andrews wrote:
> >> I should have mentioned that I'm doing the copy in to the table in a
> >> transaction block with all constraints deferred. That should mean it's only at
> >> the commit stage that foreign key will be checked right?
>
> > With the definition shown, I believe your constraint is not deferrable so
> > setting the constraint mode to deferred won't help. In any case it'd still
> > need to be saving the information on the triggers to run.
>
> In any case the RI trigger firings will be postponed till end of query.
> I suspect that the memory growth is due to the list of pending trigger
> firings. The advice to add the REFERENCES constraint after you've
> loaded the table seems good to me.
>
> Another possibility is that there's some memory leak associated with the
> txtidx data type; I dunno how thoroughly that type has been tested...

I believe I have seen large memory footprints at other times, and I haven't
used the txtidx type before. However, I will also do a test loading into a
table with just the standard types. If it turns out to be associated with the
new column I'll sort out who to email, and probably also report here so that
people can get 'closure'.

I'll do the test of loading the table without the foreign key set also, of
course.

Thanks for the quick replies, folks; don't you people ever go home?

--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants


From: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY performance
Date: 2002-04-13 23:17:23
Message-ID: Pine.LNX.4.21.0204132356040.3278-100000@ponder.fairway2k.co.uk
Lists: pgsql-admin pgsql-general

On Sat, 13 Apr 2002, Nigel J. Andrews wrote:
> On Sat, 13 Apr 2002, Tom Lane wrote:
>
> > Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:
> > > On Sat, 13 Apr 2002, Nigel J. Andrews wrote:
> > >> I should have mentioned that I'm doing the copy in to the table in a
> > >> transaction block with all constraints deferred. That should mean it's only at
> > >> the commit stage that foreign key will be checked right?
> >
> > > With the definition shown, I believe your constraint is not deferrable so
> > > setting the constraint mode to deferred won't help. In any case it'd still
> > > need to be saving the information on the triggers to run.
> >
> > In any case the RI trigger firings will be postponed till end of query.
> > I suspect that the memory growth is due to the list of pending trigger
> > firings. The advice to add the REFERENCES constraint after you've
> > loaded the table seems good to me.
> >
> > Another possibility is that there's some memory leak associated with the
> > txtidx data type; I dunno how thoroughly that type has been tested...
>
> I believe I have seen large memory footprints at other times and I haven't used
> the txtidx type before. However, I will also do a test loading into a table
> with just the standard types. If it turns out to be associated with the new
> column I'll sort out who to email and probably also report on here just so
> people can get 'closure'.
>
> I'll do the test of loading the table without the foreign key set also, of
> course.

Right, I'm not even going to bother doing the \N test to stick a null in the
final column of the table. Omitting the foreign key constraint from the table
definition and leaving the txtidx typed column in does indeed enable the entire
data set to be loaded in a few minutes, with the memory footprint staying short
of 6MB, with and without the load being done within a transaction.

So the slowness and the large memory usage were due to the foreign key. I note
that the SQL reference manual for 7.2.1 says about SET CONSTRAINTS:
'Currently, only foreign key constraints are affected by this setting. Check
and unique constraints are always effectively initially immediate not
deferrable.' I see from the CREATE TABLE page that although foreign keys are
the only constraints that currently accept the deferrable/not deferrable
setting, it is NOT DEFERRABLE that is the default. This was my problem: I
didn't read that page, only the SET CONSTRAINTS page, and jumped to the
assumption that foreign keys defaulted to deferrable.

A case of read TFM. Thanks for the help though, folks.

--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants


From: Francisco Reyes <lists(at)natserv(dot)com>
To: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY performance
Date: 2002-04-19 16:08:58
Message-ID: 20020419115726.D76398-100000@zoraida.natserv.net
Lists: pgsql-admin pgsql-general

On Sat, 13 Apr 2002, Nigel J. Andrews wrote:
> > I'm trying to COPY in a table of 1 million rows. The table is created by:
> I forgot to mention that even with 99% of the CPU it still takes a pretty long
> time to load 300,000 rows, it's still working on the second batch and I think I
> kicked the first off 30-60 minutes ago.
>
> Also:
>
> =>select version();
> version
> ----------------------------------------------------------------------
> PostgreSQL 7.2.1 on i386-unknown-freebsd3.3, compiled by GCC 2.7.2.3

Coming late to the party. :-)
How about telling us about your complete setup?
SYS V memory settings.
Type of disks (SCSI, rpm, etc).

I merge 4.5M records in less than 20 minutes. The file loaded is about
160MB. The machine is a Pentium 4, 2GB RAM, 2 SCSI disks for the database
(10K rpm), base directory on one disk and logs on another.