Quick Links

Re: 8.3 / 8.2.6 restore comparison

Lists:	pgsql-hackers

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	8.3 / 8.2.6 restore comparison
Date:	2008-02-07 06:52:13
Message-ID:	20080206225213.3910a2f4@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello,

I have been testing a migration for a week now trying to get it into a
reasonable state. This is what we have:

Restore file 220G

8.2.6 and 8.3.0 are configured identically:

shared_buffers = 8000MB
work_mem = 32MB
maintenance_work_mem = 512MB
fsync = off
full_page_writes = off
checkpoint_segments = 300
synchronous_commit = off (8.3)
wal_writer_delay = off (8.3)
autovacuum = off

8.2.6 after 2 hours has restored 41GB.
8.3.0 after 2.5 hours had restored 38GB.

Originally I was thinking that 8.2.6 was stomping 8.3. However I am
thinking that the reduction in the tuple header sizes for 8.3 means
that yes I restored 38GB, it is actually *more* data than 8.2.6. Does
that seem accurate to everyone else? If so what can we do to speed this
up? We are certainly *not* saturating the disk (16 spindles SCSI).

I am thinking the way we are going to need to do this is to have an
extended outage and write a custom script to do a concurrent dump and
load. (no in this case slony is not an option).

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-07 14:47:08
Message-ID:	47AB19EC.2070703@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Joshua D. Drake wrote:
> Hello,
>
> I have been testing a migration for a week now trying to get it into a
> reasonable state. This is what we have:
>
> Restore file 220G
>
> 8.2.6 and 8.3.0 are configured identically:
>
> shared_buffers = 8000MB
> work_mem = 32MB
> maintenance_work_mem = 512MB
> fsync = off
> full_page_writes = off
> checkpoint_segments = 300
> synchronous_commit = off (8.3)
> wal_writer_delay = off (8.3)
> autovacuum = off
>
> 8.2.6 after 2 hours has restored 41GB.
> 8.3.0 after 2.5 hours had restored 38GB.
>
> Originally I was thinking that 8.2.6 was stomping 8.3. However I am
> thinking that the reduction in the tuple header sizes for 8.3 means
> that yes I restored 38GB, it is actually *more* data than 8.2.6. Does
> that seem accurate to everyone else? If so what can we do to speed this
> up? We are certainly *not* saturating the disk (16 spindles SCSI).
>
> I am thinking the way we are going to need to do this is to have an
> extended outage and write a custom script to do a concurrent dump and
> load. (no in this case slony is not an option).
>
>

I just tested a ~110GB load. On our modest backup server, 8.2 yesterday
did the data load (i.e. the COPY steps) in 1h57m. Today, 8.3 on
identical data and settings took 1h42m. Relation size is down by about
10% too, which is very nice, and probably accounts for the load time
improvement.

cheers

andrew

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-07 15:55:17
Message-ID:	20080207075517.1b05b9a6@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 07 Feb 2008 09:47:08 -0500
Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

>
> > Restore file 220G
> >
> > 8.2.6 and 8.3.0 are configured identically:
> >
> > shared_buffers = 8000MB
> > work_mem = 32MB
> > maintenance_work_mem = 512MB
> > fsync = off
> > full_page_writes = off
> > checkpoint_segments = 300
> > synchronous_commit = off (8.3)
> > wal_writer_delay = off (8.3)
> > autovacuum = off
> >
> > 8.2.6 after 2 hours has restored 41GB.
> > 8.3.0 after 2.5 hours had restored 38GB.
>
> I just tested a ~110GB load. On our modest backup server, 8.2
> yesterday did the data load (i.e. the COPY steps) in 1h57m. Today,
> 8.3 on identical data and settings took 1h42m. Relation size is down
> by about 10% too, which is very nice, and probably accounts for the
> load time improvement.

Ergghh o.k. I am definitely missing something in the environment. By
your numbers I should be well over 100GB restored at 2.5 hours. I am
not. I am only 38GB in.

What type of IO do you have on that machine? What type of CPU and RAM?

Sincerely,

Joshua D. Drake

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-07 16:11:32
Message-ID:	20080207161132.GI5031@tamriel.snowman.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

* Joshua D. Drake (jd(at)commandprompt(dot)com) wrote:
> Ergghh o.k. I am definitely missing something in the environment. By
> your numbers I should be well over 100GB restored at 2.5 hours. I am
> not. I am only 38GB in.

I'm guessing you've checked this, so don't shoot me if you have, but....
How was the "restore file" built? Does it create the indexes, primary
keys, FKs, whatever, before loading the data? That'd slow things down
tremendously.. Or if it's creating them while loading the data, there
would be large pauses while it's building the indexes...

Enjoy,

Stephen

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-07 16:20:49
Message-ID:	47AB2FE1.50506@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Joshua D. Drake wrote:
> On Thu, 07 Feb 2008 09:47:08 -0500
> Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
>
>>> Restore file 220G
>>>
>>> 8.2.6 and 8.3.0 are configured identically:
>>>
>>> shared_buffers = 8000MB
>>> work_mem = 32MB
>>> maintenance_work_mem = 512MB
>>> fsync = off
>>> full_page_writes = off
>>> checkpoint_segments = 300
>>> synchronous_commit = off (8.3)
>>> wal_writer_delay = off (8.3)
>>> autovacuum = off
>>>
>>> 8.2.6 after 2 hours has restored 41GB.
>>> 8.3.0 after 2.5 hours had restored 38GB.
>>>
>> I just tested a ~110GB load. On our modest backup server, 8.2
>> yesterday did the data load (i.e. the COPY steps) in 1h57m. Today,
>> 8.3 on identical data and settings took 1h42m. Relation size is down
>> by about 10% too, which is very nice, and probably accounts for the
>> load time improvement.
>>
>
> Ergghh o.k. I am definitely missing something in the environment. By
> your numbers I should be well over 100GB restored at 2.5 hours. I am
> not. I am only 38GB in.
>
> What type of IO do you have on that machine? What type of CPU and RAM?
>

2Ghz Xeon dual core
16 Gb RAM
HW RAID0 data store - not sure how many spindles

I didn't touch any of the 8.3-only config stuff. I have work_mem set a
lot higher than you do, though - not sure if that makes any difference
to a bulk load.

This is not a very heavy duty box.

Note: a full restore takes much longer than this - it is almost all
taken up building indexes. I will be testing that over the weekend,
probably.

cheers

andrew

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-07 16:35:17
Message-ID:	20080207083517.1fc29ece@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 07 Feb 2008 11:20:49 -0500
Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

>
>
> Joshua D. Drake wrote:
> > On Thu, 07 Feb 2008 09:47:08 -0500
> > Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> >
> >
> >>> Restore file 220G
> >>>
> >>> 8.2.6 and 8.3.0 are configured identically:
> >>>
> >>> shared_buffers = 8000MB
> >>> work_mem = 32MB
> >>> maintenance_work_mem = 512MB
> >>> fsync = off
> >>> full_page_writes = off
> >>> checkpoint_segments = 300
> >>> synchronous_commit = off (8.3)
> >>> wal_writer_delay = off (8.3)
> >>> autovacuum = off
> >>>
> >>> 8.2.6 after 2 hours has restored 41GB.
> >>> 8.3.0 after 2.5 hours had restored 38GB.
> >>>
> >> I just tested a ~110GB load. On our modest backup server, 8.2
> >> yesterday did the data load (i.e. the COPY steps) in 1h57m. Today,
> >> 8.3 on identical data and settings took 1h42m. Relation size is
> >> down by about 10% too, which is very nice, and probably accounts
> >> for the load time improvement.
> >>
> >
> > Ergghh o.k. I am definitely missing something in the environment. By
> > your numbers I should be well over 100GB restored at 2.5 hours. I am
> > not. I am only 38GB in.
> >
> > What type of IO do you have on that machine? What type of CPU and
> > RAM?
>
> 2Ghz Xeon dual core
> 16 Gb RAM
> HW RAID0 data store - not sure how many spindles
>
> I didn't touch any of the 8.3-only config stuff. I have work_mem set
> a lot higher than you do, though - not sure if that makes any
> difference to a bulk load.
>
> This is not a very heavy duty box.

Depends on how you look at it. Restores are traditionally CPU bound if
you have any kind of IO. If you have RAID 0 you have some IO available
to you. It bothers me that the environment I am testing which has 16
spindles (8 per volume) is garnering such miserable performance in
comparison.

In my environment we aren't able to push the I/O at all (wait times <
3%) and yet these are Opterons that we have which have great memory
bandwidth. 8.2 did use more I/O than 8.3 but I think that is just
architectural with the change to 8.3 in general.

Joshua D. Drake
--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Stephen Frost <sfrost(at)snowman(dot)net>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-07 16:49:01
Message-ID:	20080207084901.7fcf418a@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 7 Feb 2008 11:11:32 -0500
Stephen Frost <sfrost(at)snowman(dot)net> wrote:

> * Joshua D. Drake (jd(at)commandprompt(dot)com) wrote:
> > Ergghh o.k. I am definitely missing something in the environment. By
> > your numbers I should be well over 100GB restored at 2.5 hours. I am
> > not. I am only 38GB in.
>
> I'm guessing you've checked this, so don't shoot me if you have,
> but....

Uhh yeah :)

>How was the "restore file" built? Does it create the
> indexes, primary keys, FKs, whatever, before loading the data?
> That'd slow things down tremendously.. Or if it's creating them
> while loading the data, there would be large pauses while it's
> building the indexes...
>

Absolutely correct. it would. This dump was created using pg_dumpall.

Sincerely,

Joshua D. Drake

From:	Greg Smith <gsmith(at)gregsmith(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-07 16:54:07
Message-ID:	Pine.GSO.4.64.0802071139470.10403@westnet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 6 Feb 2008, Joshua D. Drake wrote:

> 8.2.6 after 2 hours has restored 41GB.

I've been doing a long bulk import job recently (COPY) on a box with more
spindles than yours (but with a dumb controller) and I too am stuck at
that speed; I calculate a consistant 19.6GB/hour. The actual disk I/O is
very low as that works out to only 5.7MB/s of progress. Mine was
bottlenecked by capacity of a single CPU (4X Opteron system). I think
this is one of those barriers it's hard to crack without a caching
controller, for reasons I haven't figured out completely yet.

> I am thinking the way we are going to need to do this is to have an
> extended outage and write a custom script to do a concurrent dump and
> load.

If you look at the -performance list this week I've been yelping about
this issue and trying to figure out how to setup a useful multi-CPU loader
for cases like these.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Greg Smith <gsmith(at)gregsmith(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-07 17:31:42
Message-ID:	20080207093142.64a3b5ff@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 7 Feb 2008 11:54:07 -0500 (EST)
Greg Smith <gsmith(at)gregsmith(dot)com> wrote:

> On Wed, 6 Feb 2008, Joshua D. Drake wrote:
>
> > 8.2.6 after 2 hours has restored 41GB.
>
> I've been doing a long bulk import job recently (COPY) on a box with
> more spindles than yours (but with a dumb controller) and I too am
> stuck at that speed; I calculate a consistant 19.6GB/hour.

Bingo! O.k. that is ~ I am pushing. So I am not alone in this world.

> The
> actual disk I/O is very low as that works out to only 5.7MB/s of
> progress. Mine was bottlenecked by capacity of a single CPU (4X
> Opteron system). I think this is one of those barriers it's hard to
> crack without a caching controller, for reasons I haven't figured out
> completely yet.

O.k. we have 8X but it is also Opteron.

>
> > I am thinking the way we are going to need to do this is to have an
> > extended outage and write a custom script to do a concurrent dump
> > and load.
>
> If you look at the -performance list this week I've been yelping
> about this issue and trying to figure out how to setup a useful
> multi-CPU loader for cases like these.

I would think that for the custom format this would be reasonably
simple. It is certainly something we need to figure out. The idea that
PostgreSQL can only restore 20G an hour on a box that can actually
write 120+ Megs a second (random) is pathetic.

I know Luke has mentioned some issues in the past as well around CPU
boundness with an upper limit of 300M/s (IIRC) but even that doesn't
equate to what is going on here as we are not getting anywhere near
that.

Joshua D. Drake

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-07 18:47:22
Message-ID:	8001.1202410042@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> I know Luke has mentioned some issues in the past as well around CPU
> boundness with an upper limit of 300M/s (IIRC) but even that doesn't
> equate to what is going on here as we are not getting anywhere near
> that.

Some vmstat and oprofile investigation seems called for. Right now
we're just guessing about what the bottleneck is.

regards, tom lane

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-08 00:09:56
Message-ID:	20080207160956.6f223e2f@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 07 Feb 2008 13:47:22 -0500
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> > I know Luke has mentioned some issues in the past as well around CPU
> > boundness with an upper limit of 300M/s (IIRC) but even that doesn't
> > equate to what is going on here as we are not getting anywhere near
> > that.
>
> Some vmstat and oprofile investigation seems called for. Right now
> we're just guessing about what the bottleneck is.

vmstat -adSK 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free inact active si so bi bo in cs us sy id wa
0 2 800 15853104 7150484 8581708 0 0 1571 387 0 0 10 7 71 12
2 0 800 15668016 7335564 8581456 0 0 12982 24382 2699 1421 11 2 82 5
1 0 800 15488688 7513572 8583016 0 0 13006 24351 2601 1322 10 2 82 6
1 0 800 15296024 7704488 8584160 0 0 13494 23324 2430 1135 11 2 84 3
1 0 800 15017240 7984204 8582308 0 0 17219 29910 2632 1427 10 3 86 1
1 2 800 14754224 8239160 8583344 0 0 13710 41932 2666 1371 8 2 80 9
1 3 800 14590216 8390680 8594228 0 0 8112 33920 2361 1351 5 2 72 20
1 0 800 14435000 8562800 8585540 0 0 9138 23773 3114 1461 6 2 75 17
2 0 800 14147160 8848424 8587072 0 0 14967 35206 3156 2675 10 4 76 10
1 0 800 13850560 9141480 8588632 0 0 15454 42181 3047 2633 10 4 72 14
1 1 800 13765824 9212296 8603744 0 0 4805 33795 2903 6312 4 9 68 20
1 0 800 13552088 9436156 8594284 0 0 11916 20255 2505 1695 7 3 79 11
1 0 800 13375152 9612124 8595336 0 0 11829 26774 2644 1359 11 2 83 4
1 0 800 13301232 9684104 8596356 0 0 4747 21765 2173 829 12 1 82 5
1 0 800 13294680 9693060 8597360 0 0 386 1786 1732 587 12 0 87 0
1 0 800 13284448 9702212 8598452 0 0 410 1871 1939 878 13 0 86 1
1 0 800 13273056 9711180 8599452 0 0 386 1870 1996 907 12 0 87 0
1 0 800 13235488 9748400 8600492 0 0 1824 1862 2128 1107 12 1 86 1
1 0 800 13019720 9962004 8602264 0 0 17773 19770 2632 1596 11 2 81 6
1 0 800 12832288 10143020 8603912 0 0 18359 19553 2627 1586 11 2 78 8
1 0 800 12665952 10313812 8605088 0 0 17102 17550 2587 1582 11 2 83 5
1 0 800 12491488 10486568 8606292 0 0 16795 17964 2514 1488 11 2 85 2
1 0 800 12317592 10659160 8607484 0 0 17103 18222 2543 1531 11 2 82 5
2 0 800 12137840 10836760 8608636 0 0 17562 17909 2526 1496 11 2 82 6

This is 8.2.6, 64bit.

Sincerely,

Joshua D. Drake

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-08 00:20:26
Message-ID:	19181.1202430026@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Some vmstat and oprofile investigation seems called for. Right now
>> we're just guessing about what the bottleneck is.

> vmstat -adSK 5
> ...

Looks to me like you're disk-bound (and your kernel is pretty bad at
distinguishing "idle" from "disk wait" states). Plain COPY into an
index-less table ought to be writing fairly linearly, so I'm surprised
you aren't able to get more out of the disk. Have you got WAL on
a separate spindle from the database?

regards, tom lane

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-08 00:37:39
Message-ID:	20080207163739.208b0d88@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 07 Feb 2008 19:20:26 -0500
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Looks to me like you're disk-bound (and your kernel is pretty bad at
> distinguishing "idle" from "disk wait" states). Plain COPY into an
> index-less table ought to be writing fairly linearly, so I'm surprised
> you aren't able to get more out of the disk. Have you got WAL on
> a separate spindle from the database?

I didn't, but do now and am testing. The way this is currently designed
is:

/data1 (8 disks RAID 10)
/data2 (8 disks RAID 10)
/pg_xlogs (2 disks RAID 1)

/data1 is what is primarily written against for the first couple of
hours and then it will switch to data2 because of table spaces. However
either way, we should expect (I would think) to see at least 100 megs a
second on an 8 Disk RAID 10. It is SCSI.

vmstat -adSK 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free inact active si so bi bo in cs us sy id wa
1 1 160 6534048 23711184 1434532 0 0 1571 387 0 0 10 7 71 12
1 1 160 6206560 23898644 1573680 0 0 18168 23851 2684 1647 11 3 84 2
1 2 160 5887072 24080840 1711060 0 0 17655 58395 3322 1762 11 3 74 11
2 2 160 5546160 24275268 1856108 0 0 18681 61834 3302 1838 11 3 74 11
1 2 160 5363432 24383836 1933028 0 0 13043 52353 3247 1578 12 2 74 11
1 1 160 5295336 24423764 1963808 0 0 4997 39778 2865 1310 12 1 75 12
1 0 160 5284184 24426444 1971608 0 0 385 6967 2087 955 13 1 84 3
1 0 160 5274608 24429180 1978236 0 0 384 1009 2030 958 13 0 87 0
1 0 160 5265408 24431976 1984688 0 0 410 946 1780 656 12 0 87 0
1 1 160 5224912 24453296 2003700 0 0 1998 23168 2419 1163 12 1 79 8
1 2 160 5005048 24605300 2069236 0 0 18270 41342 2942 1721 12 2 76 10
1 2 160 4816016 24748120 2117208 0 0 19014 37165 3014 1702 12 2 72 15
1 1 160 4618064 24894476 2167412 0 0 19091 15226 2517 1626 12 2 83 3
1 1 160 4423376 25039072 2218324 0 0 18603 10398 2585 1644 12 2 84 3
1 1 160 4223568 25187592 2268816 0 0 19475 10326 2511 1666 12 2 84 2
1 1 160 4023056 25337080 2319844 0 0 19450 10601 2462 1543 12 2 84 2
1 1 160 3821328 25487664 2370328 0 0 19834 10657 2454 1540 12 2 84 2
1 1 160 3617616 25640832 2420820 0 0 20398 11202 2536 1678 12 2 84 2
1 2 160 3410040 25795488 2473800 0 0 20090 29063 2756 1680 12 2 76 10
1 1 160 3189064 25957688 2531996 0 0 20321 13745 2523 1595 12 2 83 4

Sincerely,

Joshua D. Drake

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-08 01:39:04
Message-ID:	20080207173904.7b974273@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 7 Feb 2008 16:37:39 -0800
"Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:

> I didn't, but do now and am testing. The way this is currently
> designed is:
>
> /data1 (8 disks RAID 10)
> /data2 (8 disks RAID 10)
> /pg_xlogs (2 disks RAID 1)
>
> /data1 is what is primarily written against for the first couple of
> hours and then it will switch to data2 because of table spaces.
> However either way, we should expect (I would think) to see at least
> 100 megs a second on an 8 Disk RAID 10. It is SCSI.
>

Following up with this... 22G in one hour, with xlogs on a different
partition. Just looking at we are averaging 3-5% IOWait, further we are
only writing ~ 2Megs a second.

This is frustrating.

Sincerely,

Joshua D. Drake

From:	Mark Wong <markwkm(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-08 01:41:48
Message-ID:	20080207174148.3e08b9cf.markwkm@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 07 Feb 2008 13:47:22 -0500
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

I also recommend looking into partition alignment on the disks. At
least to quantify how much the storage subsystem suffers from
mis-aligned partitions. It could be that it doesn't.

Regards,
Mark

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-08 02:04:44
Message-ID:	20603.1202436284@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Following up with this... 22G in one hour, with xlogs on a different
> partition. Just looking at we are averaging 3-5% IOWait, further we are
> only writing ~ 2Megs a second.

It might be interesting to capture some strace output and get a sense of
what's actually being read and written. It seems to me that the bad
I/O rate has to indicate that we're seeking all over the place. We
should try to determine exactly what's being touched in what order.

regards, tom lane

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-08 02:32:41
Message-ID:	20080207183241.0bafd8b6@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 07 Feb 2008 21:04:44 -0500
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> > Following up with this... 22G in one hour, with xlogs on a different
> > partition. Just looking at we are averaging 3-5% IOWait, further we
> > are only writing ~ 2Megs a second.
>
> It might be interesting to capture some strace output and get a sense
> of what's actually being read and written. It seems to me that the
> bad I/O rate has to indicate that we're seeking all over the place.
> We should try to determine exactly what's being touched in what order.

Do you want counts or actual output?

Joshua D. Drake

>
> regards, tom lane
>
> ---------------------------(end of
> broadcast)--------------------------- TIP 4: Have you searched our
> list archives?
>
> http://archives.postgresql.org
>

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-08 02:40:37
Message-ID:	20080207184037.34cb0ace@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 7 Feb 2008 18:32:41 -0800
"Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:

>
> Do you want counts or actual output?

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
58.22 1.597638 33 47795 write
31.86 0.874104 23 38024 recvfrom
4.43 0.121559 92 1319 munmap
4.17 0.114445 3 37906 lseek
0.70 0.019341 13 1440 brk
0.56 0.015402 12 1316 mmap
0.04 0.000991 26 38 open
0.01 0.000298 8 38 close
0.01 0.000156 4 38 time
0.00 0.000008 8 1 semop
------ ----------- ----------- --------- --------- ----------------
100.00 2.743942 127915 total

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-19 21:21:46
Message-ID:	20080219132146.05c9f29a@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 7 Feb 2008 18:40:37 -0800
"Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:

> On Thu, 7 Feb 2008 18:32:41 -0800
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:
>
> >
> > Do you want counts or actual output?
>
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 58.22 1.597638 33 47795 write
> 31.86 0.874104 23 38024 recvfrom
> 4.43 0.121559 92 1319 munmap
> 4.17 0.114445 3 37906 lseek
> 0.70 0.019341 13 1440 brk
> 0.56 0.015402 12 1316 mmap
> 0.04 0.000991 26 38 open
> 0.01 0.000298 8 38 close
> 0.01 0.000156 4 38 time
> 0.00 0.000008 8 1 semop
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 2.743942 127915 total
>

Were their any thoughts on this? I am also finding that backing up this
database is rudely slow with the same type of odd metrics (almost zero
(or zero) iowait). We can saturate a CPU but the CPU is certainly not
pushing the data to disk as fast as it could.

Joshua D. Drake

- --
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHu0hqATb/zqfZUUQRAm77AJ41yWT4f6UNiMwyHtcq9GASK51uDgCgpKzf
lu14eCbSDl2v2Rsq40zww8E=
=DshT
-----END PGP SIGNATURE-----

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-19 21:36:48
Message-ID:	20080219133648.6697af28@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, 19 Feb 2008 13:21:46 -0800
"Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:

> Were their any thoughts on this? I am also finding that backing up
> this database is rudely slow with the same type of odd metrics
> (almost zero (or zero) iowait). We can saturate a CPU but the CPU is
> certainly not pushing the data to disk as fast as it could.

Further on this. We have tested on RHEL4 and RHEL5. Their are two
machines, each with 32GB of ram. I have four of these in the RHEL 5
machine:

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 8216
stepping : 2
cpu MHz : 2411.132
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm
extapic cr8_legacy bogomips : 4823.59 TLB size : 1024 4K
pages clflush size : 64 cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

Here is the mini specs:

# 5U Q524 - Quad Opteron 24 SCSI
# Tyan S4885G3NR 800 Series CPU Opteron 800
# AMD Opteron 880 - 2.4GHz 2Core x 4
# 32GB - DDR ECC REG 400MHz (16x2GB) x 1 (16 x 2GB 3200 ECC REG Smart
Modular (32GB) # HD 1: 73GB SCSI 15K RPM x 24
# PCI 1: MegaRaid SCSI 320-2E - 2ch, U320, PCI Express, 128MB
# MegaRaid LSIiTBBU01 Battery - Order #: LSI00009
# PCI 2: MegaRaid SCSI 320-2E - 2ch, U320, PCI Express, 128MB
# MegaRaid LSIiTBBU01 Battery - Order #: LSI00009
# DVD-ROM/Sony 3.5 Inch Floppy Drive

The configuration is:

/ RAID 1
/ xlogs RAID 1
/data1 10 drives RAID 10
/data2 10 drives RAID 10

The thing that is frustrating here, is it appears that PostgreSQL just
can't utilize the hardware. I *know* it can because we have larger
machines in production that use PostgreSQL happily. However when I have
220G backups taking 8 hours and restores taking 11 hours, I begin to
wonder where the bottleneck is.

Assuming 25 megs a second per drive (random write) on data1 and data2
we should be pushing 250M a second. Just to be insanely conservative
let's cut that in half to 125M per second. That means every 10 seconds
we should do ~ 1G. That means every minute we should to ~ 6G, which
means 360G an hour.

Granted we have index creation and everything else going on but 11
hours and no IO chewing?

As a note these are reproducible on both machines regardless of RHEL5
or RHEL4.

I know there are much stats here but I have provided them in previous
posts on this thread. Perhaps someone sees a red flag in the hardware?

Sincerely,

Joshua D. Drake
- --
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHu0vyATb/zqfZUUQRAiuYAJ9ut6i/cPv2MYc8RO2+wNw09M5/WwCfUaGY
sAkFt+S14i0kFMn6mz9juBw=
=TNys
-----END PGP SIGNATURE-----

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-20 01:36:45
Message-ID:	20080219173645.2aac1bd2@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, 19 Feb 2008 13:36:48 -0800
"Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:

Hello,

Some more testing on this. This time (using 8.3) I modified my restore
process to use multiple processes by manipulating TOC files. I used
three processes for the data copies, two processes for the pk creation,
two process for normal indexes and two processes for constraint
creation.

The machine averaged 40-60MB/s write versus the pathetic ~ 2/3 MB/s on
a single thread. It had an average I/O wait of < 10%. Lastly it
restored 57G of data 1.25 hours. Under my single thread testing 57G
would have taken ~ 3 hours.

I am pretty sure I can make it faster too as I wasn't balancing with
tablespaces nor did I move the xlogs off.

IMO this pretty much proves that we need to seriously look at a multi
connection restores. I can't imagine a situation where we look at
people that have dual and quad cores on their desktops and say... sorry
we can't use that to help you get your data quicker.

Sincerely,

Joshua D. Drake

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHu4QvATb/zqfZUUQRAhyZAJ9U468fVDm8ww/2TrjDt6gM2wtlhwCffYYq
KJEsKpvRm6efiMQ+uAn/cs4=
=ZEGc
-----END PGP SIGNATURE-----

From:	Luke Lonergan <llonergan(at)greenplum(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-20 02:54:38
Message-ID:	C3E0D66E.54519%llonergan@greenplum.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Bah.

It's the stuff in the format translation path and conversion to/from datums
that is the bottleneck.

We sped up COPY TO recently by a factor of 10 using similar approaches to
what we did for COPY FROM in the past. There's a format conversion that is
the culprit.

We routinely get about 12 MB/s of heap insertion rate per CPU core and it's
CPU bound.

You can peek in on what's happening using gstack on Linux, or the gdb
"attach and print stacktrace" approach for a crude profile.

- Luke

On 2/19/08 1:36 PM, "Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Tue, 19 Feb 2008 13:21:46 -0800
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:
>
>> Were their any thoughts on this? I am also finding that backing up
>> this database is rudely slow with the same type of odd metrics
>> (almost zero (or zero) iowait). We can saturate a CPU but the CPU is
>> certainly not pushing the data to disk as fast as it could.
>
> Further on this. We have tested on RHEL4 and RHEL5. Their are two
> machines, each with 32GB of ram. I have four of these in the RHEL 5
> machine:
>
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 15
> model : 65
> model name : Dual-Core AMD Opteron(tm) Processor 8216
> stepping : 2
> cpu MHz : 2411.132
> cache size : 1024 KB
> physical id : 0
> siblings : 2
> core id : 0
> cpu cores : 2
> fpu : yes
> fpu_exception : yes
> cpuid level : 1
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
> fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm
> extapic cr8_legacy bogomips : 4823.59 TLB size : 1024 4K
> pages clflush size : 64 cache_alignment : 64
> address sizes : 40 bits physical, 48 bits virtual
> power management: ts fid vid ttp tm stc
>
> Here is the mini specs:
>
> # 5U Q524 - Quad Opteron 24 SCSI
> # Tyan S4885G3NR 800 Series CPU Opteron 800
> # AMD Opteron 880 - 2.4GHz 2Core x 4
> # 32GB - DDR ECC REG 400MHz (16x2GB) x 1 (16 x 2GB 3200 ECC REG Smart
> Modular (32GB) # HD 1: 73GB SCSI 15K RPM x 24
> # PCI 1: MegaRaid SCSI 320-2E - 2ch, U320, PCI Express, 128MB
> # MegaRaid LSIiTBBU01 Battery - Order #: LSI00009
> # PCI 2: MegaRaid SCSI 320-2E - 2ch, U320, PCI Express, 128MB
> # MegaRaid LSIiTBBU01 Battery - Order #: LSI00009
> # DVD-ROM/Sony 3.5 Inch Floppy Drive
>
> The configuration is:
>
> / RAID 1
> / xlogs RAID 1
> /data1 10 drives RAID 10
> /data2 10 drives RAID 10
>
> The thing that is frustrating here, is it appears that PostgreSQL just
> can't utilize the hardware. I *know* it can because we have larger
> machines in production that use PostgreSQL happily. However when I have
> 220G backups taking 8 hours and restores taking 11 hours, I begin to
> wonder where the bottleneck is.
>
> Assuming 25 megs a second per drive (random write) on data1 and data2
> we should be pushing 250M a second. Just to be insanely conservative
> let's cut that in half to 125M per second. That means every 10 seconds
> we should do ~ 1G. That means every minute we should to ~ 6G, which
> means 360G an hour.
>
> Granted we have index creation and everything else going on but 11
> hours and no IO chewing?
>
> As a note these are reproducible on both machines regardless of RHEL5
> or RHEL4.
>
> I know there are much stats here but I have provided them in previous
> posts on this thread. Perhaps someone sees a red flag in the hardware?
>
>
> Sincerely,
>
> Joshua D. Drake
> - --
> The PostgreSQL Company since 1997: http://www.commandprompt.com/
> PostgreSQL Community Conference: http://www.postgresqlconference.org/
> Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
> PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.6 (GNU/Linux)
>
> iD8DBQFHu0vyATb/zqfZUUQRAiuYAJ9ut6i/cPv2MYc8RO2+wNw09M5/WwCfUaGY
> sAkFt+S14i0kFMn6mz9juBw=
> =TNys
> -----END PGP SIGNATURE-----
>
---------------------------(end of broadcast)---------------------------
TIP
> 9: In versions below 8.0, the planner will ignore your desire to
choose
> an index scan if your joining column's datatypes do not
match

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-22 17:26:06
Message-ID:	20080222092606.0debcf59@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, 19 Feb 2008 18:54:38 -0800
Luke Lonergan <llonergan(at)greenplum(dot)com> wrote:

> Bah.
>
> It's the stuff in the format translation path and conversion to/from
> datums that is the bottleneck.

So how do we fix that? This performance is pretty anemic.

>
> We sped up COPY TO recently by a factor of 10 using similar
> approaches to what we did for COPY FROM in the past. There's a
> format conversion that is the culprit.

Any chance that code (in whatever shape) could be released for review
of applicability to core?

>
> We routinely get about 12 MB/s of heap insertion rate per CPU core
> and it's CPU bound.

That's embarrassing on today's hardware don't you think? Or am I missing
something entirely?

>
> You can peek in on what's happening using gstack on Linux, or the gdb
> "attach and print stacktrace" approach for a crude profile.
>

Yeah, that... is beyond my abilities. Well reading it is anyway. I can
provide any information people want though. Tom? Greg? Andrew? Somebody?
What information do you want from me to help you track this down?

Sincerely,

Joshua D. Drake

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHvwWuATb/zqfZUUQRAucvAKCAXMlS25oLrTD5bbdIuy6/qq61OgCgn4NG
4I2V5WgSMgr24TAM3CsIO50=
=AaB/
-----END PGP SIGNATURE-----

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-22 17:39:57
Message-ID:	5641.1203701997@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Yeah, that... is beyond my abilities. Well reading it is anyway. I can
> provide any information people want though. Tom? Greg? Andrew? Somebody?
> What information do you want from me to help you track this down?

The vmstat output you showed before said user CPU time was only around
12%, which seems to destroy Luke's thesis that data conversion time
is the problem. IIRC we were speculating that data was being written
in a pattern that required a lot of seeking thus ruining throughput,
but we didn't have any hard evidence of that. Did you do the strace'ing
I suggested?

regards, tom lane

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-22 17:46:51
Message-ID:	20080222094651.6cd38379@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 22 Feb 2008 12:39:57 -0500
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> > Yeah, that... is beyond my abilities. Well reading it is anyway. I
> > can provide any information people want though. Tom? Greg? Andrew?
> > Somebody? What information do you want from me to help you track
> > this down?
>
> The vmstat output you showed before said user CPU time was only around
> 12%, which seems to destroy Luke's thesis that data conversion time
> is the problem. IIRC we were speculating that data was being written
> in a pattern that required a lot of seeking thus ruining throughput,
> but we didn't have any hard evidence of that. Did you do the
> strace'ing I suggested?

Yes, I asked if you wanted counts or the whole output. You didn't
answer :). I provided the counts.

http://archives.postgresql.org/pgsql-hackers/2008-02/msg00359.php

I won't be able to run another test until this weekend :( but anything
I have is yours.

Joshua D. Drake

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHvwqLATb/zqfZUUQRAl8tAJ0VmS6ffaFBq1a/UUYZxX9F84vcFgCgkhgl
+UF5Zp59H/JIJa1/ZSKvuC4=
=xw/k
-----END PGP SIGNATURE-----

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-22 17:54:00
Message-ID:	6104.1203702840@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> IIRC we were speculating that data was being written
>> in a pattern that required a lot of seeking thus ruining throughput,
>> but we didn't have any hard evidence of that. Did you do the
>> strace'ing I suggested?

> Yes, I asked if you wanted counts or the whole output. You didn't
> answer :). I provided the counts.

Uh, sorry. Counts are useless here, we need to see the sequence of
write locations to find out if there's a lot of nonconsecutive
writes happening. BTW, the strace had better run across the whole
PG process tree --- it's quite possible that there's some interaction
between the bgwriter and the backend doing COPY, for instance.

regards, tom lane

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-22 18:04:49
Message-ID:	20080222100449.3859ee1b@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 22 Feb 2008 12:54:00 -0500
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> IIRC we were speculating that data was being written
> >> in a pattern that required a lot of seeking thus ruining
> >> throughput, but we didn't have any hard evidence of that. Did you
> >> do the strace'ing I suggested?
>
> > Yes, I asked if you wanted counts or the whole output. You didn't
> > answer :). I provided the counts.
>
> Uh, sorry. Counts are useless here, we need to see the sequence of
> write locations to find out if there's a lot of nonconsecutive
> writes happening. BTW, the strace had better run across the whole
> PG process tree --- it's quite possible that there's some interaction
> between the bgwriter and the backend doing COPY, for instance.

So you want an strace on postmaster not the connection?

Sincerely,

Joshua D. Drake

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHvw7DATb/zqfZUUQRAoB0AJ9motp+YpW8A4jlu03VsiaLHRIxAACfQ+tO
ZDXUqBcR5e7xsgPnKju9AZc=
=q3FH
-----END PGP SIGNATURE-----

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 05:17:44
Message-ID:	13112.1203743864@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> ... BTW, the strace had better run across the whole
>> PG process tree --- it's quite possible that there's some interaction
>> between the bgwriter and the backend doing COPY, for instance.

> So you want an strace on postmaster not the connection?

You need to launch the postmaster about like this:

strace -f postmaster ...postmaster args... 2>strace.out

regards, tom lane

From:	Greg Smith <gsmith(at)gregsmith(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 06:31:48
Message-ID:	Pine.GSO.4.64.0802230114480.21179@westnet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 22 Feb 2008, Tom Lane wrote:

> Counts are useless here, we need to see the sequence of write locations
> to find out if there's a lot of nonconsecutive writes happening.

How were you planning to analyze the strace output to quantify that? I
would think you'd need to parse the file to grab those locations then
generate statistics about seeking from them. Those strace files will be
huge and I'd expect Joshua would need to do something like that on his
side before the results will be short enough to share.

> it's quite possible that there's some interaction between the bgwriter
> and the backend doing COPY

It might be worthwhile to turn off the background writer for these runs,
both to see if it changes anything and to simplify the output.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Smith <gsmith(at)gregsmith(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 06:53:41
Message-ID:	14043.1203749621@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> On Fri, 22 Feb 2008, Tom Lane wrote:
>> Counts are useless here, we need to see the sequence of write locations
>> to find out if there's a lot of nonconsecutive writes happening.

> How were you planning to analyze the strace output to quantify that?

I didn't really have any preconceived ideas about that. I just want to
see some raw data to see if something shows up.

regards, tom lane

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 17:53:19
Message-ID:	47C05D8F.1000104@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>> Yeah, that... is beyond my abilities. Well reading it is anyway. I can
>> provide any information people want though. Tom? Greg? Andrew? Somebody?
>> What information do you want from me to help you track this down?
>
> The vmstat output you showed before said user CPU time was only around
> 12%, which seems to destroy Luke's thesis that data conversion time
> is the problem. IIRC we were speculating that data was being written
> in a pattern that required a lot of seeking thus ruining throughput,
> but we didn't have any hard evidence of that. Did you do the strace'ing
> I suggested?

Joshua said this is an 8 core box - so 12% would about equal to one core
being maxed out completely CPU wise which seems to correlate fairly well
with my experience (COPY being CPU bottlenecked on fast hardware without
any real culprit to find in profiles).

Stefan

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 18:06:16
Message-ID:	47C06098.5020808@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stefan Kaltenbrunner wrote:
> Tom Lane wrote:
>> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>>> Yeah, that... is beyond my abilities. Well reading it is anyway. I can
>>> provide any information people want though. Tom? Greg? Andrew? Somebody?
>>> What information do you want from me to help you track this down?
>>
>> The vmstat output you showed before said user CPU time was only around
>> 12%, which seems to destroy Luke's thesis that data conversion time
>> is the problem. IIRC we were speculating that data was being written
>> in a pattern that required a lot of seeking thus ruining throughput,
>> but we didn't have any hard evidence of that. Did you do the strace'ing
>> I suggested?
>
>
> Joshua said this is an 8 core box - so 12% would about equal to one core
> being maxed out completely CPU wise which seems to correlate fairly well
> with my experience (COPY being CPU bottlenecked on fast hardware without
> any real culprit to find in profiles).

Right. I mentioned previously that I was able to gain what could be
considered barely acceptable performance break breaking up a custom dump
into multiple pieces and restoring across multiple connections.

I have Alvaro reviewing the strace. I can't post it without some cleansing.

Sincerely,

Joshua D. Drake

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 18:30:21
Message-ID:	4091.1203791421@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> Tom Lane wrote:
>> The vmstat output you showed before said user CPU time was only around
>> 12%, which seems to destroy Luke's thesis that data conversion time
>> is the problem.

> Joshua said this is an 8 core box - so 12% would about equal to one core
> being maxed out completely CPU wise

Oh, good point. That should be fairly easy to confirm with "top" I
would think.

regards, tom lane

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 18:46:37
Message-ID:	47C06A0D.6070305@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> Tom Lane wrote:
>>> The vmstat output you showed before said user CPU time was only around
>>> 12%, which seems to destroy Luke's thesis that data conversion time
>>> is the problem.
>
>> Joshua said this is an 8 core box - so 12% would about equal to one core
>> being maxed out completely CPU wise
>
> Oh, good point. That should be fairly easy to confirm with "top" I
> would think.

mpstat -P ALL 5

01:43:33 PM CPU %user %nice %sys %iowait %irq %soft %steal
%idle intr/s
01:43:38 PM all 12.70 0.00 0.15 0.05 0.00 0.00 0.00
87.10 1322.60
01:43:38 PM 0 99.60 0.00 0.40 0.00 0.00 0.00 0.00
0.00 1000.40
01:43:38 PM 1 0.20 0.00 0.00 0.40 0.00 0.00 0.00
99.40 322.00
01:43:38 PM 2 0.00 0.00 0.20 0.00 0.00 0.00 0.00
99.80 0.00
01:43:38 PM 3 2.00 0.00 0.20 0.00 0.00 0.00 0.00
97.60 0.00
01:43:38 PM 4 0.00 0.00 0.20 0.00 0.00 0.00 0.00
100.00 0.00
01:43:38 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.00 0.00
01:43:38 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.00 0.00
01:43:38 PM 7 0.00 0.00 0.20 0.00 0.00 0.00 0.00
100.00 0.00

01:43:38 PM CPU %user %nice %sys %iowait %irq %soft %steal
%idle intr/s
01:43:43 PM all 13.10 0.00 0.68 0.00 0.00 0.00 0.00
86.22 1328.20
01:43:43 PM 0 95.60 0.00 4.40 0.00 0.00 0.00 0.00
0.00 1000.40
01:43:43 PM 1 0.00 0.00 0.00 0.00 0.00 0.20 0.00
100.00 327.60
01:43:43 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.00 0.00
01:43:43 PM 3 9.00 0.00 1.20 0.00 0.00 0.00 0.00
89.80 0.00
01:43:43 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.00 0.00
01:43:43 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.00 0.00
01:43:43 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.00 0.00
01:43:43 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.00 0.00

01:43:43 PM CPU %user %nice %sys %iowait %irq %soft %steal
%idle intr/s
01:43:48 PM all 13.48 0.00 1.68 0.00 0.00 0.00 0.00
84.85 1345.29
01:43:48 PM 0 87.58 0.00 11.02 0.00 0.00 0.00 0.00
1.40 1002.40
01:43:48 PM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.00 343.09
01:43:48 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.20 0.00
01:43:48 PM 3 20.44 0.00 2.20 0.00 0.00 0.00 0.00
77.56 0.00
01:43:48 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.20 0.00
01:43:48 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.20 0.00
01:43:48 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.20 0.00
01:43:48 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00
100.20 0.00

iostat -k 5:

avg-cpu: %user %nice %system %iowait %steal %idle
13.48 0.00 2.23 0.07 0.00 84.23

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 36.87 0.00 1443.69 0 7204
sdb 0.60 0.00 21.64 0 108
sdc 0.00 0.00 0.00 0 0
sdd 0.00 0.00 0.00 0 0
hda 0.00 0.00 0.00 0 0

avg-cpu: %user %nice %system %iowait %steal %idle
13.28 0.00 2.15 1.33 0.00 83.25

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 168.66 0.00 9947.31 0 49836
sdb 0.00 0.00 0.00 0 0
sdc 0.00 0.00 0.00 0 0
sdd 0.00 0.00 0.00 0 0
hda 0.00 0.00 0.00 0 0

avg-cpu: %user %nice %system %iowait %steal %idle
13.57 0.00 2.32 0.05 0.00 84.06

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 27.05 0.00 1425.25 0 7112
sdb 0.00 0.00 0.00 0 0
sdc 0.00 0.00 0.00 0 0
sdd 0.00 0.00 0.00 0 0
hda 0.00 0.00 0.00 0 0

vmstat -adps -S M 5
procs -----------memory---------- ---swap-- -----io---- --system--
-----cpu------
r b swpd free inact active si so bi bo in cs us sy
id wa st
1 0 0 12793 14385 4301 0 0 12 101 4 12 0 0
99 0 0
1 0 0 12694 14421 4364 0 0 0 19943 1740 435 13 2
85 0 0
1 0 0 12606 14444 4429 0 0 0 28070 1839 449 12 2
84 2 0
1 0 0 12508 14469 4503 0 0 0 9188 1501 424 13 1
85 0 0
1 0 0 12402 14498 4579 0 0 0 20746 1734 446 13 2
85 0 0

Joshua D. Drake

>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate
>

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 18:52:34
Message-ID:	4413.1203792754@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> Oh, good point. That should be fairly easy to confirm with "top" I
>> would think.

> mpstat -P ALL 5

> 01:43:38 PM CPU %user %nice %sys %iowait %irq %soft %steal
> %idle intr/s
> 01:43:43 PM all 13.10 0.00 0.68 0.00 0.00 0.00 0.00
> 86.22 1328.20
> 01:43:43 PM 0 95.60 0.00 4.40 0.00 0.00 0.00 0.00
> 0.00 1000.40

OK, so it is CPU-bound after all. Do you have oprofile or anything
equivalent?

regards, tom lane

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 18:59:20
Message-ID:	47C06D08.2090708@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>> Tom Lane wrote:
>>> Oh, good point. That should be fairly easy to confirm with "top" I
>>> would think.
>
>> mpstat -P ALL 5
>
>> 01:43:38 PM CPU %user %nice %sys %iowait %irq %soft %steal
>> %idle intr/s
>> 01:43:43 PM all 13.10 0.00 0.68 0.00 0.00 0.00 0.00
>> 86.22 1328.20
>> 01:43:43 PM 0 95.60 0.00 4.40 0.00 0.00 0.00 0.00
>> 0.00 1000.40
>
> OK, so it is CPU-bound after all. Do you have oprofile or anything
> equivalent?

I am at your disposal. If you need oprofile tell me the command and I
will get you the output.

Joshua D. Drake

>
> regards, tom lane
>

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 19:22:23
Message-ID:	47C0726F.4010207@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Joshua D. Drake wrote:

>> OK, so it is CPU-bound after all. Do you have oprofile or anything
>> equivalent?
>
> I am at your disposal. If you need oprofile tell me the command and I
> will get you the output.

(thanks to stefan for helping me get this output)

opreport -l /usr/bin/postgres
CPU: AMD64 processors, speed 2411.14 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
47654 15.5165 DoCopy
36231 11.7971 CopyReadLine
25046 8.1551 heap_formtuple
24137 7.8592 XLogInsert
13829 4.5028 InputFunctionCall
9618 3.1317 ParseDateTime
9521 3.1001 DecodeDateTime
9446 3.0757 pg_next_dst_boundary
7651 2.4912 DecodeDate
6948 2.2623 AllocSetAlloc
6290 2.0481 ExecConstraints
5462 1.7785 DecodeNumber
3287 1.0703 pg_atoi
3035 0.9882 hash_search_with_hash_value
2866 0.9332 heap_insert
2786 0.9071 MemoryContextCreate
2771 0.9023 hash_any
2727 0.8879 slot_deform_tuple
2639 0.8593 DecodeTime
2517 0.8196 SearchCatCache
2512 0.8179 LWLockAcquire
2362 0.7691 varchar_input
2273 0.7401 timestamptz_in
2267 0.7382 AllocSetDelete
2203 0.7173 .plt
2031 0.6613 int4in
1936 0.6304 CopyGetData
1865 0.6073 tm2timestamp
1846 0.6011 MemoryContextAllocZero
1753 0.5708 ReadBuffer_common
1732 0.5640 ExecMakeFunctionResultNoSets
1658 0.5399 RelationGetBufferForTuple
1655 0.5389 varcharin
1640 0.5340 MemoryContextAlloc
1496 0.4871 AllocSetFree
1476 0.4806 slot_getattr
1466 0.4773 DetermineTimeZoneOffset
1460 0.4754 timestamp_in
1418 0.4617 AllocSetReset
1403 0.4568 LWLockRelease
1398 0.4552 PageAddItem
1384 0.4506 date2j
1370 0.4461 DecodeTimezone
1112 0.3621 boolin
1074 0.3497 slot_attisnull
1049 0.3416 SPI_connect
990 0.3224 AllocSetContextCreate
986 0.3210 PinBuffer
897 0.2921 pq_getmessage
830 0.2703 ExecQual
830 0.2703 SearchSysCache
830 0.2703 pg_strncasecmp
828 0.2696 SPI_finish
777 0.2530 FunctionCall2
758 0.2468 AdjustTimestampForTypmod
746 0.2429 DirectFunctionCall1
725 0.2361 CopyLoadRawBuf
723 0.2354 FunctionCall1
719 0.2341 MarkBufferDirty
694 0.2260 heap_attisnull
686 0.2234 MemoryContextDelete
677 0.2204 pq_getbytes
670 0.2182 float8in
663 0.2159 resetStringInfo
657 0.2139 heap_getsysattr
619 0.2016 CatalogCacheComputeHashValue
608 0.1980 ReleaseBuffer
608 0.1980 ResourceOwnerEnlargeBuffers
605 0.1970 ExecEvalRelabelType
590 0.1921 ReleaseCatCache
551 0.1794 RelationPutHeapTuple
522 0.1700 string_hash
514 0.1674 pfree
506 0.1648 pq_copymsgbytes
505 0.1644 ResourceOwnerForgetCatCacheRef
486 0.1582 pq_getbyte
483 0.1573 Int64GetDatum
476 0.1550 IsToastNamespace
471 0.1534 CacheInvalidateHeapTuple
469 0.1527 hash_search
468 0.1524 hash_uint32
453 0.1475 MemoryContextDeleteChildren
443 0.1442 PageGetHeapFreeSpace
432 0.1407 _SPI_begin_call
428 0.1394 textout
422 0.1374 ResourceOwnerForgetBuffer
421 0.1371 ExecEvalScalarVar
420 0.1368 LockBuffer
393 0.1280 tag_hash
388 0.1263 hashoid
374 0.1218 PageGetFreeSpace
373 0.1215 Float8GetDatum
362 0.1179 UnpinBuffer
355 0.1156 DLMoveToFront
354 0.1153 appendBinaryStringInfo
353 0.1149 string_compare
349 0.1136 isTempToastNamespace
337 0.1097 get_hash_value
335 0.1091 check_stack_depth
331 0.1078 IsSystemRelation
327 0.1065 ExecStoreTuple
322 0.1048 enlargeStringInfo
319 0.1039 BufTableLookup
313 0.1019 ItemPointerEquals
311 0.1013 oideq
308 0.1003 BufferGetBlockNumber
297 0.0967 UnlockReleaseBuffer
291 0.0948 MemoryContextReset
284 0.0925 dt2local
280 0.0912 GetCurrentTransactionIdIfAny
280 0.0912 ReleaseSysCache
271 0.0882 BufTableHashCode
264 0.0860 ResourceOwnerEnlargeCatCacheRefs
259 0.0843 GetCurrentTransactionId
259 0.0843 pg_detoast_datum_packed
252 0.0821 pgstat_count_heap_insert
239 0.0778 ReadBuffer
222 0.0723 IsSystemNamespace
220 0.0716 LockAcquire
219 0.0713 ResourceOwnerRememberCatCacheRef
201 0.0654 GetDatabaseEncoding
182 0.0593 ExecARInsertTriggers
171 0.0557 ResourceOwnerRememberBuffer
157 0.0511 StartBufferIO
147 0.0479 XLogWrite
127 0.0414 AdvanceXLInsertBuffer
101 0.0329 SyncOneBuffer
83 0.0270 mdextend
79 0.0257 LockRelease
65 0.0212 PinBuffer_Locked
64 0.0208 FileWrite
64 0.0208 ForwardFsyncRequest
63 0.0205 FileSeek
62 0.0202 GetCurrentSubTransactionId
60 0.0195 secure_read
55 0.0179 StrategyGetBuffer
50 0.0163 FlushBuffer
49 0.0160 _mdfd_getseg
47 0.0153 UnlockRelationForExtension
46 0.0150 mdwrite
45 0.0147 CleanUpLock
44 0.0143 TerminateBufferIO
42 0.0137 BgBufferSync
40 0.0130 mdopen
37 0.0120 pq_recvbuf
36 0.0117 mdnblocks
36 0.0117 prepare_for_client_read
35 0.0114 PageInit
33 0.0107 RecordAndGetPageWithFreeSpace
32 0.0104 AllocSetInit
30 0.0098 LockRelationForExtension
30 0.0098 lookup_fsm_page_entry
28 0.0091 register_dirty_segment
27 0.0088 XLogBackgroundFlush
25 0.0081 FileAccess
25 0.0081 OutputFunctionCall
25 0.0081 find_free_space
22 0.0072 GrantLockLocal
21 0.0068 pg_usleep
20 0.0065 RememberFsyncRequest
18 0.0059 UnGrantLock
17 0.0055 create_fsm_rel
16 0.0052 ProcQueueInit
16 0.0052 SHMQueueInit
16 0.0052 XLogFlush
16 0.0052 pgstat_start
15 0.0049 SHMQueueDelete
12 0.0039 SHMQueueInsertBefore
12 0.0039 _mdnblocks
11 0.0036 BufTableInsert
10 0.0033 WalWriterMain
10 0.0033 smgrextend
9 0.0029 AbsorbFsyncRequests
9 0.0029 pgstat_send_bgwriter
8 0.0026 BufTableDelete
8 0.0026 Delete
7 0.0023 LockTagHashCode
6 0.0020 BackgroundWriterMain
6 0.0020 RemoveLocalLock
6 0.0020 smgropen
5 0.0016 LockCheckConflicts
5 0.0016 smgrnblocks
5 0.0016 smgrwrite
4 0.0013 CheckArchiveTimeout
4 0.0013 PostmasterIsAlive
4 0.0013 client_read_ended
3 9.8e-04 GrantLock
3 9.8e-04 PGSemaphoreUnlock
3 9.8e-04 issue_xlog_fsync
2 6.5e-04 BgWriterNap
2 6.5e-04 Insert
2 6.5e-04 StrategySyncStart
2 6.5e-04 force_statwrite
2 6.5e-04 pg_fdatasync
1 3.3e-04 CatalogCacheCreateEntry
1 3.3e-04 CommitTransaction
1 3.3e-04 GetCurrentTransactionNestLevel
1 3.3e-04 GetSnapshotData
1 3.3e-04 PGSemaphoreLock
1 3.3e-04 ScanKeyEntryInitializeWithInfo
1 3.3e-04 SysLoggerMain
1 3.3e-04 _bt_compare
1 3.3e-04 hash_seq_search
1 3.3e-04 nocachegetattr
1 3.3e-04 pgstat_setheader
1 3.3e-04 pgstat_write_statsfile

>
> Joshua D. Drake
>
>>
>> regards, tom lane
>>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 19:29:30
Message-ID:	5484.1203794970@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> opreport -l /usr/bin/postgres
> CPU: AMD64 processors, speed 2411.14 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
> unit mask of 0x00 (No unit mask) count 100000
> samples % symbol name
> 47654 15.5165 DoCopy
> 36231 11.7971 CopyReadLine
> 25046 8.1551 heap_formtuple
> 24137 7.8592 XLogInsert
> 13829 4.5028 InputFunctionCall
> 9618 3.1317 ParseDateTime
> 9521 3.1001 DecodeDateTime
> 9446 3.0757 pg_next_dst_boundary
> 7651 2.4912 DecodeDate
> 6948 2.2623 AllocSetAlloc
> 6290 2.0481 ExecConstraints
> 5462 1.7785 DecodeNumber
> 3287 1.0703 pg_atoi

Hmm, so what's the declaration of the table you're copying into?
Obviously there's some datetime column(s) in it ...

regards, tom lane

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 19:34:28
Message-ID:	47C07544.6040203@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>> opreport -l /usr/bin/postgres
>> CPU: AMD64 processors, speed 2411.14 MHz (estimated)
>> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
>> unit mask of 0x00 (No unit mask) count 100000
>> samples % symbol name
>> 47654 15.5165 DoCopy
>> 36231 11.7971 CopyReadLine
>> 25046 8.1551 heap_formtuple
>> 24137 7.8592 XLogInsert
>> 13829 4.5028 InputFunctionCall
>> 9618 3.1317 ParseDateTime
>> 9521 3.1001 DecodeDateTime
>> 9446 3.0757 pg_next_dst_boundary
>> 7651 2.4912 DecodeDate
>> 6948 2.2623 AllocSetAlloc
>> 6290 2.0481 ExecConstraints
>> 5462 1.7785 DecodeNumber
>> 3287 1.0703 pg_atoi
>
> Hmm, so what's the declaration of the table you're copying into?
> Obviously there's some datetime column(s) in it ...

Let me restart the load. I don't have any statement logging turned on.
However there is a likelyhood of multiple timestamps.

Joshua D. Drake

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-23 19:37:01
Message-ID:	47C075DD.4040809@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I have 100 of these tables (identical).

Joshua D. Drake

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 00:43:18
Message-ID:	47C0BDA6.2050502@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Joshua D. Drake wrote:
> Joshua D. Drake wrote:
>
>>> OK, so it is CPU-bound after all. Do you have oprofile or anything
>>> equivalent?
>>
>> I am at your disposal. If you need oprofile tell me the command and I
>> will get you the output.
>
> (thanks to stefan for helping me get this output)
>
> opreport -l /usr/bin/postgres
> CPU: AMD64 processors, speed 2411.14 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
> unit mask of 0x00 (No unit mask) count 100000
> samples % symbol name
> 47654 15.5165 DoCopy
> 36231 11.7971 CopyReadLine
> 25046 8.1551 heap_formtuple
> 24137 7.8592 XLogInsert
> 13829 4.5028 InputFunctionCall
> 9618 3.1317 ParseDateTime
> 9521 3.1001 DecodeDateTime
>>>> 9446 3.0757 pg_next_dst_boundary

This function should've become faster with the tzcode update last week,
though since it's only taking 3% of the time in your case, you won't notice.

This profile looks very similar to what I've seen in my tests. DoCopy
tends to be at the top. Digging deeper into that, I believe that much of
the time attributed to DoCopy is actually spent in
CopyReadAttributesText, but since it's inlined oprofile can't show it
separately.

Incidentally, I've been working on a patch to speed up CopyReadLine. I
was going to run some more tests first, but since we're talking about
it, I guess I should just post the patch. I'll post to pgsql-patches
shortly.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 01:45:51
Message-ID:	20080223174551.3747cfc6@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 24 Feb 2008 00:43:18 +0000
"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> wrote:

> Incidentally, I've been working on a patch to speed up CopyReadLine.
> I was going to run some more tests first, but since we're talking
> about it, I guess I should just post the patch. I'll post to
> pgsql-patches shortly.
>

On your post to patches you mentioned only about a 5% improvement.
Don't get me wrong, 5% is 5% and I respect it greatly but as far as I
can tell we are about 300% behind the curve.

My tests were maxing out at ~22G an hour. On hardware that can do
in 360G an hour and that is assuming > 50% overhead between OS, libs,
etc... I have no choice but to conclude we have a much, much deeper and
fundamental issue going on with COPY. I am inspired by Itagaki Takahiro
and his batch update of indexes which should help greatly overall but
doesn't help my specific issue.

Forgive me for not being a C programmer and Alvaro is not online so I
would vet these questions with him first.

I know that copy is in theory a bulk loader but, when performing the
readline how many lines are we reading? Do we read up to 8192? Or do we
shove in say 8megs of data before we invoke DoCopy?

I am just curious if there is some simple low hanging fruit that is
possibly missing.

Sincerely,

Joshua D. Drake

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 09:39:42
Message-ID:	47C13B5E.3020604@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Joshua D. Drake wrote:
> On Sun, 24 Feb 2008 00:43:18 +0000
> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> wrote:
>
>> Incidentally, I've been working on a patch to speed up CopyReadLine.
>> I was going to run some more tests first, but since we're talking
>> about it, I guess I should just post the patch. I'll post to
>> pgsql-patches shortly.
>
> On your post to patches you mentioned only about a 5% improvement.
> Don't get me wrong, 5% is 5% and I respect it greatly but as far as I
> can tell we are about 300% behind the curve.

Yeah. Looking at the profile, the time is spent really all over the
place. There's no one clear bottleneck to focus on. I think we could do
a few more ~5% improvements, but

At some point, I think we have to bite the bullet and find a way to use
multiple CPUs for a single load. I don't have any good ideas or plans
for that, but hopefully someone does.

> My tests were maxing out at ~22G an hour. On hardware that can do
> in 360G an hour and that is assuming > 50% overhead between OS, libs,
> etc... I have no choice but to conclude we have a much, much deeper and
> fundamental issue going on with COPY. I am inspired by Itagaki Takahiro
> and his batch update of indexes which should help greatly overall but
> doesn't help my specific issue.

Yep, the index build idea is an I/O improvement, not a CPU one.

> Forgive me for not being a C programmer and Alvaro is not online so I
> would vet these questions with him first.
>
> I know that copy is in theory a bulk loader but, when performing the
> readline how many lines are we reading? Do we read up to 8192? Or do we
> shove in say 8megs of data before we invoke DoCopy?

We read 64 KB at a time, and then CopyReadLineText returns one line at a
time from that buffer.

Looking at your profile more, and after the memchr patch, the "raw input
side" of copy, consisting of reading the data from disk in 64KB blocks,
splitting that into lines, and splitting lines into columns, still takes
~20% of the CPU time. I suspect CopyReadAttributesText is the biggest
culprit there.

You could avoid the ~8% spent in XLogInsert in PostgreSQL 8.3, by
creating the table (or truncating it) in the same transaction with the COPY.

After that, heap_formtuple is high on the list. I wonder if we could do
something about that.

> I am just curious if there is some simple low hanging fruit that is
> possibly missing.

I don't see any piece of code that's causing problems. We can shave off
a few percents here and there I think, but don't expect a 300%
improvement anytime soon. A few ideas I've thought about are:

- use a specialized version of strtol, for base 10. That won't help on
your table, but I've seen strtol consume a significant amount of time on
tables with numeric/integer columns.

- Instead of pallocing and memcpying the text fields, leave a little bit
of room between fields in the attribute_buf, and write the varlen header
there directly. This might help you since your table has a lot of text
fields.

- Instead of the normal PG function calling conventions, provide
specialized fastpath input functions for the most common data types.
InputFunctionCall consumed 4.5% of the CPU time in your profile.

- Use a simpler memory context implementation, that's like a stack with
no pfree support, for the per-tuple context.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 09:44:39
Message-ID:	47C13C87.9040402@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> I don't see any piece of code that's causing problems.

Meant to say: I don't see any *single* piece of code that's causing the
problems...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	andy <andy(at)squeakycode(dot)net>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 14:21:18
Message-ID:	47C17D5E.7080305@squeakycode.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

There was talk elsewhere about making pg_dump/restore smarter. It could
create tables, then COPY, and create the indexes last. Add to that
pg_restore using multiple connections and you'd have it. One connection
could do a COPY, then a second connection could be created to start the
CREATE INDEX's for that table, while the first connection went on to
COPY the next table.

Or something like that...

-Andy

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 17:32:31
Message-ID:	5097.1203874351@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> At some point, I think we have to bite the bullet and find a way to use
> multiple CPUs for a single load. I don't have any good ideas or plans
> for that, but hopefully someone does.

As already mentioned upthread, we could do that today, with zero backend
changes, by making pg_restore drive multiple sessions. Now there are
scenarios where this wouldn't help too much --- eg, a database with only
one enormous table. We couldn't parallelize the loading of that table,
although we could parallelize creation of its indexes. But for an
example such as JD is complaining about, we ought to have no trouble
thrashing his disks into the ground ;-)

What you would need for this is an -Fc or -Ft dump, because a plain
script output from pg_dump doesn't carry any dependency information,
much less any index of where in the file different bits of data are.
Just armwaving, I envision a multiprocess pg_restore like this:

* one controller process that tracks the not-yet-loaded
TOC items in the dump
* N child processes that just transmit a selected TOC item
to a connected backend, and then ask the controller
what to do next

The controller would need to use the dependency information to avoid,
eg, handing out a CREATE INDEX command before the parent table was
created and loaded.

One issue is that this couldn't use "-1" single-transaction restoring,
since obviously each child would need its own transaction, and
furthermore would have to commit before going back to the controller
for more work (since dependent TOC items might well get loaded by
a different child later). That defeats a couple of optimizations that
Simon put in recently. The one for no XLOG during COPY is not too
hard to see how to re-enable, but I'm not sure what else there was.

Most likely, the bottleneck with this sort of thing would be multiple
parallel reads from the pg_dump archive file. Possibly the controller
process could be taught to schedule COPY and CREATE INDEX operations
so that not too many processes are trying to read lots of archive
data at the same time.

regards, tom lane

From:	Hannes Dorbath <light(at)theendofthetunnel(dot)de>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 17:32:31
Message-ID:	47C1AA2F.4070807@theendofthetunnel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Greg Smith <gsmith(at)gregsmith(dot)com> writes:
>> On Fri, 22 Feb 2008, Tom Lane wrote:
>>> Counts are useless here, we need to see the sequence of write locations
>>> to find out if there's a lot of nonconsecutive writes happening.
>
>> How were you planning to analyze the strace output to quantify that?
>
> I didn't really have any preconceived ideas about that. I just want to
> see some raw data to see if something shows up.

Isn't blktrace the tool to get that kind of information? Anyway, as the
following threads point out the problems seems to be somewhere else..

--
Best regards,
Hannes Dorbath

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 17:40:11
Message-ID:	47C1ABFB.1050006@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> Joshua D. Drake wrote:
>> On Sun, 24 Feb 2008 00:43:18 +0000
>> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> wrote:

>> I know that copy is in theory a bulk loader but, when performing the
>> readline how many lines are we reading? Do we read up to 8192? Or do we
>> shove in say 8megs of data before we invoke DoCopy?
>
> We read 64 KB at a time, and then CopyReadLineText returns one line at a
> time from that buffer.

O.k. I am sure I am oversimplifying things but why are we returning one
line at a time? That reads expensive to me. Just following the general,
don't do inserts one at a time, do them in batch idea for example.

I would also question the 64KB at a time. Why not a 1024KB (arbitrary)
at a time? Is it a resource issue? In the old days when we actually had
people trying to run postgresql on 128 and 256 megs of ram, o.k. but now?

> Looking at your profile more, and after the memchr patch, the "raw input
> side" of copy, consisting of reading the data from disk in 64KB blocks,
> splitting that into lines, and splitting lines into columns, still takes
> ~20% of the CPU time. I suspect CopyReadAttributesText is the biggest
> culprit there.

In reading:

http://doxygen.postgresql.org/backend_2commands_2copy_8c-source.html

It looks like CopyReadAttributesText is used as part of the column
breakup. It also appears that this happens "before" insert right? So if
that is the case we are going to pay an additional penalty on the data
checking.

>
> You could avoid the ~8% spent in XLogInsert in PostgreSQL 8.3, by
> creating the table (or truncating it) in the same transaction with the
> COPY.

In the same transaction? Oh that's interesting. So that might be a TODO
right there. Change pg_dump so it does:

create,copy,create,copy,index

Instead of:

create,create,copy,copy,index

>
> After that, heap_formtuple is high on the list. I wonder if we could do
> something about that.

Just from the name I am going to guess this is where we turn it into
something that is actually injected into PostgreSQL.

> I don't see any piece of code that's causing problems. We can shave off
> a few percents here and there I think, but don't expect a 300%
> improvement anytime soon. A few ideas I've thought about are:

Well don't get me wrong, I am not expecting miracles here. I am just
confounded at the complete lack of performance in this arena. I don't
think a lot of people recognize what a significant issue this is since
we don't have in place backups.

>
> - use a specialized version of strtol, for base 10. That won't help on
> your table, but I've seen strtol consume a significant amount of time on
> tables with numeric/integer columns.
>
> - Instead of pallocing and memcpying the text fields, leave a little bit
> of room between fields in the attribute_buf, and write the varlen header
> there directly. This might help you since your table has a lot of text
> fields.
>
> - Instead of the normal PG function calling conventions, provide
> specialized fastpath input functions for the most common data types.
> InputFunctionCall consumed 4.5% of the CPU time in your profile.
>
> - Use a simpler memory context implementation, that's like a stack with
> no pfree support, for the per-tuple context.
>

By my calculations you are presenting a possibility of at least ~ 30%
improvement. That is significant in my book. Hopefully as we explore
these options we will find others.

Sincerely,

Joshua D. Drake

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 17:47:15
Message-ID:	47C1ADA3.60104@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
>> At some point, I think we have to bite the bullet and find a way to use
>> multiple CPUs for a single load. I don't have any good ideas or plans
>> for that, but hopefully someone does.
>
> As already mentioned upthread, we could do that today, with zero backend
> changes, by making pg_restore drive multiple sessions. Now there are
> scenarios where this wouldn't help too much --- eg, a database with only
> one enormous table. We couldn't parallelize the loading of that table,
> although we could parallelize creation of its indexes. But for an
> example such as JD is complaining about, we ought to have no trouble
> thrashing his disks into the ground ;-)

Bring it on ! but I would note that with the current pg_restore I was
not able to trash my disks. I only used four backends (I could have used
8) but even with that, I was only doing ~ 45M a second. So if I double,
I still have bandwidth. It would still be an huge improvement though.

>
> What you would need for this is an -Fc or -Ft dump, because a plain
> script output from pg_dump doesn't carry any dependency information,

I would focus on -Fc. With the limitations of -Ft this would be a good
way to start phasing -Ft out.

> much less any index of where in the file different bits of data are.
> Just armwaving, I envision a multiprocess pg_restore like this:
>
> * one controller process that tracks the not-yet-loaded
> TOC items in the dump
> * N child processes that just transmit a selected TOC item
> to a connected backend, and then ask the controller
> what to do next

> Most likely, the bottleneck with this sort of thing would be multiple
> parallel reads from the pg_dump archive file. Possibly the controller
> process could be taught to schedule COPY and CREATE INDEX operations
> so that not too many processes are trying to read lots of archive
> data at the same time.

A less hacker and more DBA bottleneck will be to limit the number of
backends being created for restore. We don't really want to have more
than one backend per CPU, otherwise we just start switching.

Sincerely,

Joshua D. Drake

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 18:06:49
Message-ID:	47C1B239.2040105@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Joshua D. Drake wrote:
> Heikki Linnakangas wrote:
>>
>> We read 64 KB at a time, and then CopyReadLineText returns one line
>> at a time from that buffer.
>
> O.k. I am sure I am oversimplifying things but why are we returning
> one line at a time? That reads expensive to me. Just following the
> general, don't do inserts one at a time, do them in batch idea for
> example.

Quite simply because one line corresponds to one record. And yes, I
believe you are oversimplifying, or under several misapprehensions about
what can be done at this level.

>
> I would also question the 64KB at a time. Why not a 1024KB (arbitrary)
> at a time? Is it a resource issue? In the old days when we actually
> had people trying to run postgresql on 128 and 256 megs of ram, o.k.
> but now?

It would be simple enough to change. Try it and see if it actually makes
a difference. All you have to change is the define of RAW_BUF_SIZE.

>
> In reading:
>
> http://doxygen.postgresql.org/backend_2commands_2copy_8c-source.html
>
> It looks like CopyReadAttributesText is used as part of the column
> breakup. It also appears that this happens "before" insert right? So
> if that is the case we are going to pay an additional penalty on the
> data checking.
>

What? I don't understand what you are talking about.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 18:11:02
Message-ID:	5577.1203876662@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Joshua D. Drake wrote:
>> I would also question the 64KB at a time. Why not a 1024KB (arbitrary)
>> at a time? Is it a resource issue? In the old days when we actually
>> had people trying to run postgresql on 128 and 256 megs of ram, o.k.
>> but now?

> It would be simple enough to change. Try it and see if it actually makes
> a difference. All you have to change is the define of RAW_BUF_SIZE.

Seems unlikely that making it bigger than (a fraction of) L2 cache
would be a smart move.

regards, tom lane

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 18:22:48
Message-ID:	47C1B5F8.9090708@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan wrote:
>
>
> Joshua D. Drake wrote:
>> Heikki Linnakangas wrote:
>>>
>>> We read 64 KB at a time, and then CopyReadLineText returns one line
>>> at a time from that buffer.
>>
>> O.k. I am sure I am oversimplifying things but why are we returning
>> one line at a time? That reads expensive to me. Just following the
>> general, don't do inserts one at a time, do them in batch idea for
>> example.
>
> Quite simply because one line corresponds to one record. And yes, I
> believe you are oversimplifying, or under several misapprehensions about
> what can be done at this level.

Well I find without questioning, I won't get the answers so :).

> It would be simple enough to change. Try it and see if it actually
makes
> a difference. All you have to change is the define of RAW_BUF_SIZE.
>

Fair enough. May I assume this is the only place I need to change it?

http://doxygen.postgresql.org/backend_2commands_2copy_8c-source.html#l00158

>> It looks like CopyReadAttributesText is used as part of the column
>> breakup. It also appears that this happens "before" insert right? So
>> if that is the case we are going to pay an additional penalty on the
>> data checking.
>>
>
> What? I don't understand what you are talking about.

Data checking on insert to the DB itself. I have no doubt that I may be
wrong on this.

Joshua D. Drake

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-24 18:24:35
Message-ID:	47C1B663.2060303@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> Joshua D. Drake wrote:
>>> I would also question the 64KB at a time. Why not a 1024KB (arbitrary)
>>> at a time? Is it a resource issue? In the old days when we actually
>>> had people trying to run postgresql on 128 and 256 megs of ram, o.k.
>>> but now?
>
>> It would be simple enough to change. Try it and see if it actually makes
>> a difference. All you have to change is the define of RAW_BUF_SIZE.
>
> Seems unlikely that making it bigger than (a fraction of) L2 cache
> would be a smart move.

O.k. these CPUs have 1meg of L2 so I will try with 512k.

Joshua D. Drake

>
> regards, tom lane
>

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-25 19:36:56
Message-ID:	1203968216.7878.52.camel@dogma.ljc.laika.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2008-02-24 at 09:47 -0800, Joshua D. Drake wrote:
> A less hacker and more DBA bottleneck will be to limit the number of
> backends being created for restore. We don't really want to have more
> than one backend per CPU, otherwise we just start switching.

Are you sure that it would always be a loss?

If there is any significant I/O latency for a single backend, it seems
like a context switch could be a win for processor utilization. It might
not be a win overall, but at least potentially a win.

Regards,
Jeff Davis

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-25 20:05:24
Message-ID:	20080225120524.4911ce08@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, 25 Feb 2008 11:36:56 -0800
Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

>
> If there is any significant I/O latency for a single backend, it seems
> like a context switch could be a win for processor utilization. It
> might not be a win overall, but at least potentially a win.

Do we want a 20% potential win or an 80% potential win?

I would personally rather keep it simple, hard core, and data shoving
as possible without any issue with scheduling etc..

Sincerely,

Joshua D. Drake

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHwx+EATb/zqfZUUQRAvOgAJ4vWCO74XzXy9Pbzqz3otWoqKI3HgCfRwUI
ZLd0SOgf5jnInZvOxCS+iNU=
=Syk1
-----END PGP SIGNATURE-----

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-25 20:17:10
Message-ID:	1203970630.7878.56.camel@dogma.ljc.laika.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-02-25 at 12:05 -0800, Joshua D. Drake wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Mon, 25 Feb 2008 11:36:56 -0800
> Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> >
> > If there is any significant I/O latency for a single backend, it seems
> > like a context switch could be a win for processor utilization. It
> > might not be a win overall, but at least potentially a win.
>
> Do we want a 20% potential win or an 80% potential win?
>
> I would personally rather keep it simple, hard core, and data shoving
> as possible without any issue with scheduling etc..
>

Just a thought. After it's actually implemented it won't be hard to see
if it's a win.

Regards,
Jeff Davis

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-25 20:28:22
Message-ID:	20080225122822.3fe6e355@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, 25 Feb 2008 12:17:10 -0800
Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> > I would personally rather keep it simple, hard core, and data
> > shoving as possible without any issue with scheduling etc..
> >
>
> Just a thought. After it's actually implemented it won't be hard to
> see if it's a win.

Yep :) but as a note:

I am currently testing on the data set that is giving us all these
issues. Previously we were pushing ~ 22G an hour over a single thread.
I am currently pushing ~ 28G every 16 minutes over 6 threads.

With 30-40% IO wait.

Sincerely,

Joshua D. Drake

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHwyToATb/zqfZUUQRAsDzAJ0cZEujQIW1SQ9Wd1nd1jWRVWy09ACgpryh
SJENqCnmwKoSMF5fSHBRtsg=
=hVeo
-----END PGP SIGNATURE-----

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-25 22:05:58
Message-ID:	1203977158.7878.59.camel@dogma.ljc.laika.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-02-25 at 12:28 -0800, Joshua D. Drake wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Mon, 25 Feb 2008 12:17:10 -0800
> Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
>
> > > I would personally rather keep it simple, hard core, and data
> > > shoving as possible without any issue with scheduling etc..
> > >
> >
> > Just a thought. After it's actually implemented it won't be hard to
> > see if it's a win.
>
> Yep :) but as a note:
>
> I am currently testing on the data set that is giving us all these
> issues. Previously we were pushing ~ 22G an hour over a single thread.
> I am currently pushing ~ 28G every 16 minutes over 6 threads.
>
> With 30-40% IO wait.

That begs the question: what about 12 threads?

Regards,
Jeff Davis

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-25 22:11:16
Message-ID:	20080225141116.231efb4e@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, 25 Feb 2008 14:05:58 -0800
Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> > Yep :) but as a note:
> >
> > I am currently testing on the data set that is giving us all these
> > issues. Previously we were pushing ~ 22G an hour over a single
> > thread. I am currently pushing ~ 28G every 16 minutes over 6
> > threads.
> >
> > With 30-40% IO wait.
>
> That begs the question: what about 12 threads?

That is exactly what Alvaro said :P, let me finish this one first.

Joshua D. Drake

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHwz0EATb/zqfZUUQRAo1HAJkB58g/gkCWPTqSqjNzrcxGZ4eiNACgi7Va
gZGMboxrPwV4euv67anSyfo=
=9S0a
-----END PGP SIGNATURE-----

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-26 05:18:57
Message-ID:	20080225211857.0db12f74@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 25 Feb 2008 14:11:16 -0800
"Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:

O.k. with 24 connections 3.5 hours. Testing with 12 now. 6 never
finished due to a bug.

Observations:

As simple as this solution is, it is not eloquent nor is it smart.
Using this method, if you have a 100GB table (which is very common)
you are still bound in a bad way by a single connection and you are
holding up everyone else.

Sincerely,

Joshua D. Drake

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-26 06:25:09
Message-ID:	20080225222509.01164ff7@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 25 Feb 2008 22:29:32 -0800
Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> For me it would still be very helpful. If that 100GB table has several
> indexes, particularly on localized text, that can take a lot of
> processor time to rebuild (even for a substantially smaller dataset,
> like in the "7 hour restore" thread). It seems like a no-brainer to be
> able to utilize all available cores.

Oh, I agree that we should be using all cores. I would argue that we
should have been doing that for years now but more importantly to me is
that pg_restore even single threaded is slow.

>
> I think we should consider all of these pg_restore improvements,
> because they're merely simplifying the DBA's job. Currently, to get
> these benefits, I have to organize and parallelize the restore
> manually.

Definitely.

>
> Actually, the tests you're running are helping me as much as any
> pg_restore changes might anyway. I don't mind a small amount of extra
> work to dump/restore, but other users might get a bad impression of
> PostgreSQL if they don't know how to make it perform to their
> expectations.

Certainly but having to hand roll this is bad. It presents us in a
decidedly hackish light.

Sincerely,

Joshua D. Drake

>
> Regards,
> Jeff Davis
>

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-26 06:29:32
Message-ID:	1204007372.10798.47.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2008-02-25 at 21:18 -0800, Joshua D. Drake wrote:
> As simple as this solution is, it is not eloquent nor is it smart.
> Using this method, if you have a 100GB table (which is very common)
> you are still bound in a bad way by a single connection and you are
> holding up everyone else.

In your case I can see your point.

For me it would still be very helpful. If that 100GB table has several
indexes, particularly on localized text, that can take a lot of
processor time to rebuild (even for a substantially smaller dataset,
like in the "7 hour restore" thread). It seems like a no-brainer to be
able to utilize all available cores.

I think one big improvement is to break it into steps as Simon suggests
here:
http://archives.postgresql.org/pgsql-hackers/2008-02/msg00205.php

and my idea to further break it down:
http://archives.postgresql.org/pgsql-hackers/2008-02/msg00699.php

I think we should consider all of these pg_restore improvements, because
they're merely simplifying the DBA's job. Currently, to get these
benefits, I have to organize and parallelize the restore manually.

Actually, the tests you're running are helping me as much as any
pg_restore changes might anyway. I don't mind a small amount of extra
work to dump/restore, but other users might get a bad impression of
PostgreSQL if they don't know how to make it perform to their
expectations.

Regards,
Jeff Davis

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-02-26 15:43:23
Message-ID:	20080226074323.5e8de2f7@jd-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 25 Feb 2008 22:25:09 -0800
"Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:

24 connections: 3.5 hours
12 connections: 4.5 hours

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	multi-worker pg_restore was: 8.3 / 8.2.6 restore comparison
Date:	2008-02-26 23:17:18
Message-ID:	20080226151718.18c66011@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, 26 Feb 2008 07:43:23 -0800
"Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:

> On Mon, 25 Feb 2008 22:25:09 -0800
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:
>
> 24 connections: 3.5 hours
> 12 connections: 4.5 hours
>

24 connections RAW_BUF_SIZE 524288 3.5 hours

However one observation that I am going to (try) to test is that we are
spending a lot of time waiting for the last thread to finish. I
wouldn't be surprised if I could cut the time in half if I could script
this out to do:

table.data->table.pk->table.index
table.data->table.pk->table.index
table.data->table.pk->table.index
table.data->table.pk->table.index
constraints

versus

table.data
table.data
table.data
table.pk
table.pk
table.pk
table.index

...

Joshua D. Drake

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHxJ3+ATb/zqfZUUQRAlURAKCtOSbHoWTNSkzV0U48Ib0P/8SSNgCdFeWK
v6Q1l5BBGxqqtPu1UMhA81E=
=VA2+
-----END PGP SIGNATURE-----

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: multi-worker pg_restore was: 8.3 / 8.2.6 restore comparison
Date:	2008-02-26 23:39:53
Message-ID:	10055.1204069193@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> However one observation that I am going to (try) to test is that we are
> spending a lot of time waiting for the last thread to finish.

IOW you haven't balanced the work given to each thread very well?
Or is there something else happening?

How exactly are you allocating tasks to threads in this prototype,
anyway?

regards, tom lane

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: multi-worker pg_restore was: 8.3 / 8.2.6 restore comparison
Date:	2008-02-26 23:50:28
Message-ID:	20080226155028.6e82be92@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, 26 Feb 2008 18:39:53 -0500
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> > However one observation that I am going to (try) to test is that we
> > are spending a lot of time waiting for the last thread to finish.
>
> IOW you haven't balanced the work given to each thread very well?
> Or is there something else happening?
>
> How exactly are you allocating tasks to threads in this prototype,
> anyway?
>

Right there is no balance here. Let me explain what I did. I performed
a pg_restore -l to get the TOC file. I then broke that file up into
five other files.

prefix = schema (minus indexes, constraints)
data = data
pk = primary keys
index = indexes
triggers_constraints = well triggers and cosntraints (foreign keys in
this instance)

The first step of the script loads prefix. It then splits the data
file into -n- number of files and launches -n- number of
pg_restore processes with -L.

It runs through all data, then starts on pk in the exact same manner
and then indxex etc...

Everything moves along each step very quickly until the last thread. So
I could burn through all of data in 60 minutes except for the last
three tables in the 24th file which will take 30 minutes on their own
(arbitrary numbers). While I am waiting on the last 3 tables, nothing
is happening. We are in a holding pattern for the 24 connection pk load
to start. What should happen is as each TABLE DATA line finishes the
appropriate CONSTRAINT (pk) and INDEX are also built.

The two single threaded processes although you could probably make them
multi threaded are prefix and triggers constraints.

Sincerely,

Joshua D. Drake

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHxKXGATb/zqfZUUQRAjePAJ9xx6ea+Vo4J5T3CxLYRfKj2Cm1gQCeOjbU
DHcAzEsVpedyUnJjUuL7DI8=
=ynM0
-----END PGP SIGNATURE-----

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: multi-worker pg_restore was: 8.3 / 8.2.6 restore comparison
Date:	2008-02-27 00:03:57
Message-ID:	10396.1204070637@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> How exactly are you allocating tasks to threads in this prototype,
>> anyway?

> Right there is no balance here. Let me explain what I did. I performed
> a pg_restore -l to get the TOC file. I then broke that file up into
> five other files.

> prefix = schema (minus indexes, constraints)
> data = data
> pk = primary keys
> index = indexes
> triggers_constraints = well triggers and cosntraints (foreign keys in
> this instance)

> The first step of the script loads prefix. It then splits the data
> file into -n- number of files and launches -n- number of
> pg_restore processes with -L.

> It runs through all data, then starts on pk in the exact same manner
> and then indxex etc...

So you have four serialization points not just one; at each one the
slowest subtask forces everyone else to wait, even if there's work that
could potentially be done on other tables. This is fine for a
quick-and-dirty proof of concept but it's certainly not how we'd want to
implement the real thing. But I doubt you can get much further without
putting some actual dependency awareness into it.

regards, tom lane

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: multi-worker pg_restore was: 8.3 / 8.2.6 restore comparison
Date:	2008-02-27 00:24:01
Message-ID:	20080226162401.1a6f70a1@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, 26 Feb 2008 19:03:57 -0500
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> So you have four serialization points not just one; at each one the
> slowest subtask forces everyone else to wait, even if there's work
> that could potentially be done on other tables. This is fine for a
> quick-and-dirty proof of concept but it's certainly not how we'd want
> to implement the real thing. But I doubt you can get much further
> without putting some actual dependency awareness into it.

Exactly. I have gone as far as I actually can without quite a bit of
brain trauma and hand holding. I could probably prototype it in perl
fairly easy but C is a complete no go for me.

I hope the numbers were at least helpful.

Sincerely,

Joshua D. Drake

>
> regards, tom lane
>
> ---------------------------(end of
> broadcast)--------------------------- TIP 7: You can help support the
> PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHxK2jATb/zqfZUUQRArgSAJ9z6f4dJLZ1FSbU6ISpIKXaIePzJQCePRY9
8PUIRudPtAlEBH1ivDMJOC4=
=afNY
-----END PGP SIGNATURE-----

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-03-04 01:40:02
Message-ID:	200803040140.m241e2A09710@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki, are there any TODO items here?

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Joshua D. Drake wrote:
> > On Sun, 24 Feb 2008 00:43:18 +0000
> > "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> wrote:
> >
> >> Incidentally, I've been working on a patch to speed up CopyReadLine.
> >> I was going to run some more tests first, but since we're talking
> >> about it, I guess I should just post the patch. I'll post to
> >> pgsql-patches shortly.
> >
> > On your post to patches you mentioned only about a 5% improvement.
> > Don't get me wrong, 5% is 5% and I respect it greatly but as far as I
> > can tell we are about 300% behind the curve.
>
> Yeah. Looking at the profile, the time is spent really all over the
> place. There's no one clear bottleneck to focus on. I think we could do
> a few more ~5% improvements, but
>
> At some point, I think we have to bite the bullet and find a way to use
> multiple CPUs for a single load. I don't have any good ideas or plans
> for that, but hopefully someone does.
>
> > My tests were maxing out at ~22G an hour. On hardware that can do
> > in 360G an hour and that is assuming > 50% overhead between OS, libs,
> > etc... I have no choice but to conclude we have a much, much deeper and
> > fundamental issue going on with COPY. I am inspired by Itagaki Takahiro
> > and his batch update of indexes which should help greatly overall but
> > doesn't help my specific issue.
>
> Yep, the index build idea is an I/O improvement, not a CPU one.
>
> > Forgive me for not being a C programmer and Alvaro is not online so I
> > would vet these questions with him first.
> >
> > I know that copy is in theory a bulk loader but, when performing the
> > readline how many lines are we reading? Do we read up to 8192? Or do we
> > shove in say 8megs of data before we invoke DoCopy?
>
> We read 64 KB at a time, and then CopyReadLineText returns one line at a
> time from that buffer.
>
> Looking at your profile more, and after the memchr patch, the "raw input
> side" of copy, consisting of reading the data from disk in 64KB blocks,
> splitting that into lines, and splitting lines into columns, still takes
> ~20% of the CPU time. I suspect CopyReadAttributesText is the biggest
> culprit there.
>
> You could avoid the ~8% spent in XLogInsert in PostgreSQL 8.3, by
> creating the table (or truncating it) in the same transaction with the COPY.
>
> After that, heap_formtuple is high on the list. I wonder if we could do
> something about that.
>
> > I am just curious if there is some simple low hanging fruit that is
> > possibly missing.
>
> I don't see any piece of code that's causing problems. We can shave off
> a few percents here and there I think, but don't expect a 300%
> improvement anytime soon. A few ideas I've thought about are:
>
> - use a specialized version of strtol, for base 10. That won't help on
> your table, but I've seen strtol consume a significant amount of time on
> tables with numeric/integer columns.
>
> - Instead of pallocing and memcpying the text fields, leave a little bit
> of room between fields in the attribute_buf, and write the varlen header
> there directly. This might help you since your table has a lot of text
> fields.
>
> - Instead of the normal PG function calling conventions, provide
> specialized fastpath input functions for the most common data types.
> InputFunctionCall consumed 4.5% of the CPU time in your profile.
>
> - Use a simpler memory context implementation, that's like a stack with
> no pfree support, for the per-tuple context.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
> message can get through to the mailing list cleanly

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Bruce Momjian" <bruce(at)momjian(dot)us>
Cc:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-03-04 10:20:31
Message-ID:	47CD226F.6020302@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Bruce Momjian wrote:
> Heikki, are there any TODO items here?

Hmm. You could put an entry there for "profile and optimize COPY", with
the below list of ideas as a starting point. It's more about profiling
and performance testing than coding.

> ---------------------------------------------------------------------------
>
> Heikki Linnakangas wrote:
>> Joshua D. Drake wrote:
>>> On Sun, 24 Feb 2008 00:43:18 +0000
>>> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> wrote:
>>>
>>>> Incidentally, I've been working on a patch to speed up CopyReadLine.
>>>> I was going to run some more tests first, but since we're talking
>>>> about it, I guess I should just post the patch. I'll post to
>>>> pgsql-patches shortly.
>>> On your post to patches you mentioned only about a 5% improvement.
>>> Don't get me wrong, 5% is 5% and I respect it greatly but as far as I
>>> can tell we are about 300% behind the curve.
>> Yeah. Looking at the profile, the time is spent really all over the
>> place. There's no one clear bottleneck to focus on. I think we could do
>> a few more ~5% improvements, but
>>
>> At some point, I think we have to bite the bullet and find a way to use
>> multiple CPUs for a single load. I don't have any good ideas or plans
>> for that, but hopefully someone does.
>>
>>> My tests were maxing out at ~22G an hour. On hardware that can do
>>> in 360G an hour and that is assuming > 50% overhead between OS, libs,
>>> etc... I have no choice but to conclude we have a much, much deeper and
>>> fundamental issue going on with COPY. I am inspired by Itagaki Takahiro
>>> and his batch update of indexes which should help greatly overall but
>>> doesn't help my specific issue.
>> Yep, the index build idea is an I/O improvement, not a CPU one.
>>
>>> Forgive me for not being a C programmer and Alvaro is not online so I
>>> would vet these questions with him first.
>>>
>>> I know that copy is in theory a bulk loader but, when performing the
>>> readline how many lines are we reading? Do we read up to 8192? Or do we
>>> shove in say 8megs of data before we invoke DoCopy?
>> We read 64 KB at a time, and then CopyReadLineText returns one line at a
>> time from that buffer.
>>
>> Looking at your profile more, and after the memchr patch, the "raw input
>> side" of copy, consisting of reading the data from disk in 64KB blocks,
>> splitting that into lines, and splitting lines into columns, still takes
>> ~20% of the CPU time. I suspect CopyReadAttributesText is the biggest
>> culprit there.
>>
>> You could avoid the ~8% spent in XLogInsert in PostgreSQL 8.3, by
>> creating the table (or truncating it) in the same transaction with the COPY.
>>
>> After that, heap_formtuple is high on the list. I wonder if we could do
>> something about that.
>>
>>> I am just curious if there is some simple low hanging fruit that is
>>> possibly missing.
>> I don't see any piece of code that's causing problems. We can shave off
>> a few percents here and there I think, but don't expect a 300%
>> improvement anytime soon. A few ideas I've thought about are:
>>
>> - use a specialized version of strtol, for base 10. That won't help on
>> your table, but I've seen strtol consume a significant amount of time on
>> tables with numeric/integer columns.
>>
>> - Instead of pallocing and memcpying the text fields, leave a little bit
>> of room between fields in the attribute_buf, and write the varlen header
>> there directly. This might help you since your table has a lot of text
>> fields.
>>
>> - Instead of the normal PG function calling conventions, provide
>> specialized fastpath input functions for the most common data types.
>> InputFunctionCall consumed 4.5% of the CPU time in your profile.
>>
>> - Use a simpler memory context implementation, that's like a stack with
>> no pfree support, for the per-tuple context.
>>
>> --
>> Heikki Linnakangas
>> EnterpriseDB http://www.enterprisedb.com
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 1: if posting/reading through Usenet, please send an appropriate
>> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
>> message can get through to the mailing list cleanly
>

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-03-06 20:02:16
Message-ID:	20080306120216.08d4382b@commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, 3 Mar 2008 20:40:02 -0500 (EST)
Bruce Momjian <bruce(at)momjian(dot)us> wrote:

As a closing note on this from my side. When I ran the restore with 90
connections (as a stress test) it only took 20 minutes longer than the
24 connections.

Joshua D. Drake

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFH0E3IATb/zqfZUUQRAjn3AJ9d0IpBn9pd16AYFWT3qs2frS51lQCfcOIN
uQvwc263TjLMZOYzx7Ktbpo=
=6xOV
-----END PGP SIGNATURE-----

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: 8.3 / 8.2.6 restore comparison
Date:	2008-04-24 17:32:55
Message-ID:	200804241732.m3OHWtN21739@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Added to TODO:

> o Impove COPY performance
>
> http://archives.postgresql.org/pgsql-hackers/2008-02/msg00954.php

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Bruce Momjian wrote:
> > Heikki, are there any TODO items here?
>
> Hmm. You could put an entry there for "profile and optimize COPY", with
> the below list of ideas as a starting point. It's more about profiling
> and performance testing than coding.
>
> > ---------------------------------------------------------------------------
> >
> > Heikki Linnakangas wrote:
> >> Joshua D. Drake wrote:
> >>> On Sun, 24 Feb 2008 00:43:18 +0000
> >>> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> wrote:
> >>>
> >>>> Incidentally, I've been working on a patch to speed up CopyReadLine.
> >>>> I was going to run some more tests first, but since we're talking
> >>>> about it, I guess I should just post the patch. I'll post to
> >>>> pgsql-patches shortly.
> >>> On your post to patches you mentioned only about a 5% improvement.
> >>> Don't get me wrong, 5% is 5% and I respect it greatly but as far as I
> >>> can tell we are about 300% behind the curve.
> >> Yeah. Looking at the profile, the time is spent really all over the
> >> place. There's no one clear bottleneck to focus on. I think we could do
> >> a few more ~5% improvements, but
> >>
> >> At some point, I think we have to bite the bullet and find a way to use
> >> multiple CPUs for a single load. I don't have any good ideas or plans
> >> for that, but hopefully someone does.
> >>
> >>> My tests were maxing out at ~22G an hour. On hardware that can do
> >>> in 360G an hour and that is assuming > 50% overhead between OS, libs,
> >>> etc... I have no choice but to conclude we have a much, much deeper and
> >>> fundamental issue going on with COPY. I am inspired by Itagaki Takahiro
> >>> and his batch update of indexes which should help greatly overall but
> >>> doesn't help my specific issue.
> >> Yep, the index build idea is an I/O improvement, not a CPU one.
> >>
> >>> Forgive me for not being a C programmer and Alvaro is not online so I
> >>> would vet these questions with him first.
> >>>
> >>> I know that copy is in theory a bulk loader but, when performing the
> >>> readline how many lines are we reading? Do we read up to 8192? Or do we
> >>> shove in say 8megs of data before we invoke DoCopy?
> >> We read 64 KB at a time, and then CopyReadLineText returns one line at a
> >> time from that buffer.
> >>
> >> Looking at your profile more, and after the memchr patch, the "raw input
> >> side" of copy, consisting of reading the data from disk in 64KB blocks,
> >> splitting that into lines, and splitting lines into columns, still takes
> >> ~20% of the CPU time. I suspect CopyReadAttributesText is the biggest
> >> culprit there.
> >>
> >> You could avoid the ~8% spent in XLogInsert in PostgreSQL 8.3, by
> >> creating the table (or truncating it) in the same transaction with the COPY.
> >>
> >> After that, heap_formtuple is high on the list. I wonder if we could do
> >> something about that.
> >>
> >>> I am just curious if there is some simple low hanging fruit that is
> >>> possibly missing.
> >> I don't see any piece of code that's causing problems. We can shave off
> >> a few percents here and there I think, but don't expect a 300%
> >> improvement anytime soon. A few ideas I've thought about are:
> >>
> >> - use a specialized version of strtol, for base 10. That won't help on
> >> your table, but I've seen strtol consume a significant amount of time on
> >> tables with numeric/integer columns.
> >>
> >> - Instead of pallocing and memcpying the text fields, leave a little bit
> >> of room between fields in the attribute_buf, and write the varlen header
> >> there directly. This might help you since your table has a lot of text
> >> fields.
> >>
> >> - Instead of the normal PG function calling conventions, provide
> >> specialized fastpath input functions for the most common data types.
> >> InputFunctionCall consumed 4.5% of the CPU time in your profile.
> >>
> >> - Use a simpler memory context implementation, that's like a stack with
> >> no pfree support, for the per-tuple context.
> >>
> >> --
> >> Heikki Linnakangas
> >> EnterpriseDB http://www.enterprisedb.com
> >>
> >> ---------------------------(end of broadcast)---------------------------
> >> TIP 1: if posting/reading through Usenet, please send an appropriate
> >> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
> >> message can get through to the mailing list cleanly
> >
>
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +