COPY FROM command v8.1.4

Lists: pgsql-admin
From: "Mr(dot) Dan" <bitsandbytes88(at)hotmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Cc: bitsandbytes88(at)hotmail(dot)com
Subject: COPY FROM command v8.1.4
Date: 2006-09-11 18:55:26
Message-ID: BAY116-F16BCF6E15545D8EFE3AD3FD12A0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

We are having some serious problems with the PostgreSQL COPY FROM command
and we have no clue what to do with this at this point. I think we need to
ask for 'guru' help in diagnosing this problem. Here is exactly what we are
doing:

We have a table with 12028587 records in it.
We do a COPY TO file of the records for this table.
We count the number of records in the file, it has 12028587 lines.
We do a COPY FROM that file into a table with identical structure as the
first table.
We count the number of records in the second table and there are only
12028538 records.
We do another COPY FROM same file onto a different database on an identical
table and this time there were 12028487 records in the count.

Another symptom of this problem is that when we do a pg_dump to file and
then a pg_restore from that file (only on large databases 20+ Gb) the
restore also loses about 100-300 records in the largest tables (the table in
question has 12+ and 6+ million records). The commands used were:

pg_dump -p <port> -U postgres -Fc -f <file> <db_name>
pg_restore -p <post> -U postgres -d <db> <file>

We are using PostgreSQL version 8.1.4 with SLES (suse enterprise linux with
SP3 - 9.3) running on 2 Opteron chips with 16 GB of memory.

Any help in diagnosing or resolving this problem is GREATLY appreciated.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mr(dot) Dan" <bitsandbytes88(at)hotmail(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: COPY FROM command v8.1.4
Date: 2006-09-11 19:12:16
Message-ID: 5558.1158001936@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

"Mr. Dan" <bitsandbytes88(at)hotmail(dot)com> writes:
> We have a table with 12028587 records in it.
> We do a COPY TO file of the records for this table.
> We count the number of records in the file, it has 12028587 lines.
> We do a COPY FROM that file into a table with identical structure as the
> first table.
> We count the number of records in the second table and there are only
> 12028538 records.

Try dumping the second table with COPY TO and diff'ing the dump files
to get more detail about what's missing.

If you repeat the entire experiment, are the same records missing each
time? I'm wondering about flaky hardware as much as anything.

regards, tom lane


From: "Peter Childs" <peterachilds(at)gmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Fwd: COPY FROM command v8.1.4
Date: 2006-09-12 09:17:04
Message-ID: a2de01dd0609120217ncfd2c21l7cdefa5d3cad9154@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Sorry Tom, Somone had not put the reply to header in correctly......

---------- Forwarded message ----------
From: Peter Childs <peterachilds(at)gmail(dot)com>
Date: 12-Sep-2006 10:16
Subject: Re: [ADMIN] COPY FROM command v8.1.4
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>

On 11/09/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Mr. Dan" <bitsandbytes88(at)hotmail(dot)com> writes:
> > We have a table with 12028587 records in it.
> > We do a COPY TO file of the records for this table.
> > We count the number of records in the file, it has 12028587 lines.
> > We do a COPY FROM that file into a table with identical structure as the
> > first table.
> > We count the number of records in the second table and there are only
> > 12028538 records.
>
> Try dumping the second table with COPY TO and diff'ing the dump files
> to get more detail about what's missing.
>
> If you repeat the entire experiment, are the same records missing each
> time? I'm wondering about flaky hardware as much as anything.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
> message can get through to the mailing list cleanly
>

I presume your doing a select count(*) from table not reading the
stats from vacuum verbose or pg_class....
I know it sounds stupid but its a common mistake and vacuum is only a guess

Peter Childs


From: "Mr(dot) Dan" <bitsandbytes88(at)hotmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Cc: bitsandbytes88(at)hotmail(dot)com
Subject: COPY FROM command v8.1.4
Date: 2006-09-12 17:38:19
Message-ID: BAY116-F146A38326C5630EA33B36BD12B0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin


Hi,

Thanks for this quick response.
##Try dumping the second table with COPY TO and diff'ing the dump files to
get more detail about what's missing.##

We tried 3 times from a local file system, instead of the NFS mount (to try
to rule out hardware)

1st attempt: the data copied over fine
2nd attempt: there were 50 records missing
3rd attempt: there were 100 records missing.

Looking back and comparing the data, we realized that its always missing
rows in 25 (24+1) rows sequences or we're calling it a block. So, there were
2 blocks of 25 missing in the 2nd attempt and 4 blocks of 25 missing in the
third. That's the only thing consistent in doing it from anywhere (NFS
mounts or the local filessystem). We've had the same 'sporadic' results on 3
different servers, so that may rule out memory. Our local Reiser FS is in
formated in 64k blocks and our NFS mounts are 4k blocks.

In summary, we are very concerned and have no idea of what direction to go
with this.

~DjK

If you repeat the entire experiment, are the same records missing each time?
I'm wondering about flaky hardware as much as anything.

>From: "Mr. Dan" <bitsandbytes88(at)hotmail(dot)com>
>To: pgsql-admin(at)postgresql(dot)org
>CC: bitsandbytes88(at)hotmail(dot)com
>Subject: COPY FROM command v8.1.4
>Date: Mon, 11 Sep 2006 14:55:26 -0400
>
>We are having some serious problems with the PostgreSQL COPY FROM command
>and we have no clue what to do with this at this point. I think we need to
>ask for 'guru' help in diagnosing this problem. Here is exactly what we are
>doing:
>
>We have a table with 12028587 records in it.
>We do a COPY TO file of the records for this table.
>We count the number of records in the file, it has 12028587 lines.
>We do a COPY FROM that file into a table with identical structure as the
>first table.
>We count the number of records in the second table and there are only
>12028538 records.
>We do another COPY FROM same file onto a different database on an identical
>table and this time there were 12028487 records in the count.
>
>Another symptom of this problem is that when we do a pg_dump to file and
>then a pg_restore from that file (only on large databases 20+ Gb) the
>restore also loses about 100-300 records in the largest tables (the table
>in question has 12+ and 6+ million records). The commands used were:
>
>pg_dump -p <port> -U postgres -Fc -f <file> <db_name>
>pg_restore -p <post> -U postgres -d <db> <file>
>
>We are using PostgreSQL version 8.1.4 with SLES (suse enterprise linux with
>SP3 - 9.3) running on 2 Opteron chips with 16 GB of memory.
>
>Any help in diagnosing or resolving this problem is GREATLY appreciated.
>
>


From: "Mr(dot) Dan" <bitsandbytes88(at)hotmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: COPY FROM command v8.1.4
Date: 2006-09-12 17:42:13
Message-ID: BAY116-F356B7D6515964308838106D12B0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

No, there are different blocks of records missing every time.

>
>
>If you repeat the entire experiment, are the same records missing each
>time? I'm wondering about flaky hardware as much as anything.
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mr(dot) Dan" <bitsandbytes88(at)hotmail(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: COPY FROM command v8.1.4
Date: 2006-09-12 21:27:52
Message-ID: 21688.1158096472@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

"Mr. Dan" <bitsandbytes88(at)hotmail(dot)com> writes:
> Looking back and comparing the data, we realized that its always missing
> rows in 25 (24+1) rows sequences or we're calling it a block. So, there were
> 2 blocks of 25 missing in the 2nd attempt and 4 blocks of 25 missing in the
> third.

This is pretty confusing. You mean that groups of 25 adjacent rows were
missing in the output? What's the (24+1) supposed to convey? Are these
groups aligned on 25-row boundaries in the whole output (ie, are there
an exact multiple of 25 rows before each dropped group)?

How are you doing the copies, exactly? SQL COPY command, psql \copy,
something else?

regards, tom lane


From: "Mr(dot) Dan" <bitsandbytes88(at)hotmail(dot)com>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: COPY FROM command v8.1.4
Date: 2006-09-13 15:57:14
Message-ID: BAY116-F350F077F2714231F971BD4D1280@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

>This is pretty confusing. You mean that groups of 25 adjacent rows were
>missing in the output?
Yes, isn't that interesting? It's always in a group of 25 rows. This is
always random. Say 800,000 to 800,025 out of 12 million represents one
random group of 25 rows.

>What's the (24+1) supposed to convey?
One time we thought we counted 24 rows. (in a row - yes adjacent)

>groups aligned on 25-row boundaries in the whole output (ie, are there
>an exact multiple of 25 rows before each dropped group)?
Yes

>How are you doing the copies, exactly? SQL COPY command, psql \copy,
>something else?
We've tried SQL COY and psql \copy and always get random results - 0,1, or 2
blocks of 25 rows missing.

Is there a pre-compiled version of postgres we can try ? Where could I
download that? We built postgres with GNU make 3.80, the documentation says
3.76 and higher is ok.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mr(dot) Dan" <bitsandbytes88(at)hotmail(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: COPY FROM command v8.1.4
Date: 2006-09-13 16:50:26
Message-ID: 13194.1158166226@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

"Mr. Dan" <bitsandbytes88(at)hotmail(dot)com> writes:
>> How are you doing the copies, exactly? SQL COPY command, psql \copy,
>> something else?

> We've tried SQL COY and psql \copy and always get random results - 0,1, or 2
> blocks of 25 rows missing.

Hmph. If it happens with a SQL COPY command then psql seems to be off
the hook, and that also eliminates some theories about dropped TCP
packets and such.

Would you check back in the source table for the COPY and see what the
ctid values are for the missing rows? I'm wondering about a pattern
like "the dropped rows of a group are all on the same disk page", ie,
what's being missed is one whole page at a time.

If that's what's happening, the only very plausible theory I can think
of is that your disk drive is sometimes glitching and returning a page
of all-zeroes instead of what it should return. Postgres will not
complain about this in normal operation (because there are legitimate
error-recovery scenarios where a zero page can be in a table); it'll
just treat the page as empty. VACUUM will complain though, so the next
step would be to set up a test table by copying your large table and
then repeatedly run plain VACUUM on the test table. If you get sporadic
warnings "relation foo page N is uninitialized --- fixing" then we have
the smoking gun. Don't run this test directly on a valuable table, as
each such message would mean you just lost another page of data :-(

FWIW, I spent several hours yesterday evening copying 6GB tables around
to see if I could reproduce any such problem, and I couldn't...

regards, tom lane


From: Ray Stell <stellr(at)cns(dot)vt(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Mr(dot) Dan" <bitsandbytes88(at)hotmail(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject: Re: COPY FROM command v8.1.4
Date: 2006-09-13 22:54:38
Message-ID: 20060913225438.GB2690@cns.vt.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

You said "local Reiser FS." Maybe repeat on one of the others, Ext3/JFS?

Tom asked about hardware issues, is there nothing in syslog
that relates to the timing of the event? I don't recall you
responding in public to this. Maybe I missed it.

Just musing...

On Wed, Sep 13, 2006 at 12:50:26PM -0400, Tom Lane wrote:
> "Mr. Dan" <bitsandbytes88(at)hotmail(dot)com> writes:
> >> How are you doing the copies, exactly? SQL COPY command, psql \copy,
> >> something else?
>
> > We've tried SQL COY and psql \copy and always get random results - 0,1, or 2
> > blocks of 25 rows missing.
>
> Hmph. If it happens with a SQL COPY command then psql seems to be off
> the hook, and that also eliminates some theories about dropped TCP
> packets and such.
>
> Would you check back in the source table for the COPY and see what the
> ctid values are for the missing rows? I'm wondering about a pattern
> like "the dropped rows of a group are all on the same disk page", ie,
> what's being missed is one whole page at a time.
>
> If that's what's happening, the only very plausible theory I can think
> of is that your disk drive is sometimes glitching and returning a page
> of all-zeroes instead of what it should return. Postgres will not
> complain about this in normal operation (because there are legitimate
> error-recovery scenarios where a zero page can be in a table); it'll
> just treat the page as empty. VACUUM will complain though, so the next
> step would be to set up a test table by copying your large table and
> then repeatedly run plain VACUUM on the test table. If you get sporadic
> warnings "relation foo page N is uninitialized --- fixing" then we have
> the smoking gun. Don't run this test directly on a valuable table, as
> each such message would mean you just lost another page of data :-(
>
> FWIW, I spent several hours yesterday evening copying 6GB tables around
> to see if I could reproduce any such problem, and I couldn't...
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq

--


From: "Mr(dot) Dan" <bitsandbytes88(at)hotmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: COPY FROM command v8.1.4
Date: 2006-09-14 13:14:29
Message-ID: BAY116-F12040DAA30AA163F06B71CD1290@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Hi Ray,

I've tried "it" on both WAFL and Reiser filesystems and gotten either 0,1,
or 2 blocks of 25 rows missing. I haven't tried it on ext3 or JFS. No, I
haven't seen any problems in the syslogs and I haven't seen any drops on the
network in regards to WAFL. Reiser is the local filesystem.

I'm planning to verify the ctid values today in the missing rows and get
back to Tom with information ASAP.

~DjK


From: "Robert Reeves" <robert(dot)reeves(at)quest(dot)com>
To: <pgsql-admin(at)postgresql(dot)org>
Subject: Re: COPY FROM command v8.1.4
Date: 2006-09-19 00:46:00
Message-ID: 0BAEFFA7AFADFD4497F50DD093687E07016FB908@melmbxw01.prod.quest.corp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Hi,
Any update on this ?

--robert

-----Original Message-----
From: pgsql-admin-owner(at)postgresql(dot)org
[mailto:pgsql-admin-owner(at)postgresql(dot)org] On Behalf Of Mr. Dan
Sent: Thursday, 14 September 2006 11:14 PM
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: [ADMIN] COPY FROM command v8.1.4

Hi Ray,

I've tried "it" on both WAFL and Reiser filesystems and gotten either
0,1,
or 2 blocks of 25 rows missing. I haven't tried it on ext3 or JFS. No,
I
haven't seen any problems in the syslogs and I haven't seen any drops on
the
network in regards to WAFL. Reiser is the local filesystem.

I'm planning to verify the ctid values today in the missing rows and get

back to Tom with information ASAP.

~DjK

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match


From: "Mr(dot) Dan" <bitsandbytes88(at)hotmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: COPY FROM command v8.1.4
Date: 2006-09-19 20:01:41
Message-ID: BAY116-F31E8D9802EEB1FA62E37A6D1220@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin


Hi Tom,

Sorry - to all my groupies out there for the time delay :) -

It's a rather time consuming endeavor.

Ok the ctid numbers did all seem to match for the group of 25 rows that
disappear.

For example., ctiid (146649,1) to (146649,25) represented a group or block
of 25 rows that go missing. I have 44 total columns including the system
objects. In case it helps, here's my data types.

I'm not following exactly what these results mean. Are you saying because
it has the same ctid that it's a hardware problem? I would find that
difficult to believe because I've reproduced this on 2 different file
systems WAFL and REISER FS with totally different hard drives.

varchar(14) NOT NULL,
varchar(9) NOT NULL,
varchar(4) NOT NULL,
varchar(14),
varchar(25) NOT NULL,
char(1) NOT NULL,
char(4) NOT NULL,
char(1) NOT NULL,
char(1) NOT NULL,
char(3) NOT NULL,
char(1) NOT NULL,
char(2),
char(12),
umeric(7,3) NOT NULL,
char(1) NOT NULL,
char(1) NOT NULL,
varchar(8) NOT NULL,
char(3) NOT NULL,
char(2) NOT NULL,
char(3) NOT NULL,
char(3) NOT NULL,
int4 NOT NULL,
int4 NOT NULL,
char(1) NOT NULL,
timestamp,
timestamp,
timestamp NOT NULL,
timestamp NOT NULL,
int4,
char(5),
varchar(14),
int4,
char(1),
numeric(12,3),
numeric(12,3),
char(1),
char(1),
nuumeric(4,3)


From: Ray Stell <stellr(at)cns(dot)vt(dot)edu>
To: "Mr(dot) Dan" <bitsandbytes88(at)hotmail(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: COPY FROM command v8.1.4
Date: 2006-09-20 12:51:17
Message-ID: 20060920125117.GB16077@cns.vt.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin


Could you please keep this inline. Is there some reason the community
should not be able to read about debug logic? I asked this question
last week, no reply. Now, Mr. Dan is asking the same.

'What is the logic being followed here by ctid verification? You collect
from the original db the locations of the rows that come up missing on
the target db. How does that lead to an understanding of where the rows
"went?"'

Not knowing how pg technology would be used to trace what is going on
is limiting. If you could move to solaris 10 and use dtrace to turn on
reporting based on some logic that draws you near to the "problem" rows,
that might make the trace data more manageable. Also, it might point
to/away from an os problem. I don't see how else you can determine what
is going on, but I sure would like to hear about the tools that are used
for this purpose in pg. Please let us be informed.

On Tue, Sep 19, 2006 at 04:01:41PM -0400, Mr. Dan wrote:
>
> Hi Tom,
>
> Sorry - to all my groupies out there for the time delay :) -
>
> It's a rather time consuming endeavor.
>
> Ok the ctid numbers did all seem to match for the group of 25 rows that
> disappear.
>
> For example., ctiid (146649,1) to (146649,25) represented a group or block
> of 25 rows that go missing. I have 44 total columns including the system
> objects. In case it helps, here's my data types.
>
> I'm not following exactly what these results mean. Are you saying because
> it has the same ctid that it's a hardware problem? I would find that
> difficult to believe because I've reproduced this on 2 different file
> systems WAFL and REISER FS with totally different hard drives.
>
>
> varchar(14) NOT NULL,
> varchar(9) NOT NULL,
> varchar(4) NOT NULL,
> varchar(14),
> varchar(25) NOT NULL,
> char(1) NOT NULL,
> char(4) NOT NULL,
> char(1) NOT NULL,
> char(1) NOT NULL,
> char(3) NOT NULL,
> char(1) NOT NULL,
> char(2),
> char(12),
> umeric(7,3) NOT NULL,
> char(1) NOT NULL,
> char(1) NOT NULL,
> varchar(8) NOT NULL,
> char(3) NOT NULL,
> char(2) NOT NULL,
> char(3) NOT NULL,
> char(3) NOT NULL,
> int4 NOT NULL,
> int4 NOT NULL,
> char(1) NOT NULL,
> timestamp,
> timestamp,
> timestamp NOT NULL,
> timestamp NOT NULL,
> int4,
> char(5),
> varchar(14),
> int4,
> char(1),
> numeric(12,3),
> numeric(12,3),
> char(1),
> char(1),
> nuumeric(4,3)
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org

--