Re: Restore v. Running COPY/INDEX seperatly

Lists: pgsql-general
From: Benjamin Arai <me(at)benjaminarai(dot)com>
To: PostgreSQL <pgsql-general(at)postgresql(dot)org>
Subject: Restore v. Running COPY/INDEX seperatly
Date: 2007-08-26 22:04:49
Message-ID: 4825E439-3559-465A-8719-666E75872B91@benjaminarai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Hi,

So, I built my tables which contains a TSearch2 field by

1. Create table without indexes
2. COPY data into table
3. ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
4. UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
5. Index all the fields including the TSearch2 field

The process takes several days.

In contrast, if I backup the table and restore it to a new table it
takes a fraction of the time as running the above operation
manually. I am building my indexes at the end but I think the step 4
may be causing uneeded overhead. Can I somehow just copy data into
the idxFTI field during the copy process? Is there anything else I
can do to get my loading process to perform similar to backup/restore?

Does pg_dump also dump the indexes? That would explain why it is so
much faster...

Benjamin


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Benjamin Arai <me(at)benjaminarai(dot)com>
Cc: PostgreSQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: Restore v. Running COPY/INDEX seperatly
Date: 2007-08-26 22:16:11
Message-ID: Pine.LNX.4.64.0708270216001.2767@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Sun, 26 Aug 2007, Benjamin Arai wrote:

> Hi,
>
> So, I built my tables which contains a TSearch2 field by
>
> 1. Create table without indexes
> 2. COPY data into table
> 3. ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
> 4. UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);

vacuum here

> 5. Index all the fields including the TSearch2 field
>
> The process takes several days.
>
> In contrast, if I backup the table and restore it to a new table it takes a
> fraction of the time as running the above operation manually. I am building
> my indexes at the end but I think the step 4 may be causing uneeded overhead.
> Can I somehow just copy data into the idxFTI field during the copy process?
> Is there anything else I can do to get my loading process to perform similar
> to backup/restore?
>
> Does pg_dump also dump the indexes? That would explain why it is so much
> faster...
>
> Benjamin
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Benjamin Arai" <me(at)benjaminarai(dot)com>, "PostgreSQL" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Restore v. Running COPY/INDEX seperatly
Date: 2007-08-27 01:05:06
Message-ID: 874pilyegt.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

"Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su> writes:

> On Sun, 26 Aug 2007, Benjamin Arai wrote:
>
>> Hi,
>>
>> So, I built my tables which contains a TSearch2 field by
>>
>> 1. Create table without indexes
>> 2. COPY data into table
>> 3. ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
>> 4. UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
>
> vacuum here

Or you could do something tricky and do the update like this which would avoid
the need to vacuum:

ALTER TABLE tblMessages ALTER COLUMN idxFTI TYPE tsvector USING to_tsvector('default, strMesage);

This only works because ALTER TABLE rewrites the table from scratch any time
it does an operation like this. Don't try this if there are transactions
working against the table at the same time (such as a pg_dump!).

Or you could set up a trigger to generate the tsvector when you first load the
data instead of adding it later.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "Benjamin Arai" <me(at)benjaminarai(dot)com>, "PostgreSQL" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Restore v. Running COPY/INDEX seperatly
Date: 2007-08-27 03:43:05
Message-ID: 12760.1188186185@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> On Sun, 26 Aug 2007, Benjamin Arai wrote:
>>> So, I built my tables which contains a TSearch2 field by
>>> 1. Create table without indexes
>>> 2. COPY data into table
>>> 3. ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
>>> 4. UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);

> Or you could set up a trigger to generate the tsvector when you first
> load the data instead of adding it later.

You're going to want such a trigger anyway, so installing it before the
COPY step seems like the Obviously Right Thing. Any other approach
implies rewriting the entire table after you've loaded it, with no
compensating advantage that I can see.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Benjamin Arai <me(at)benjaminarai(dot)com>, PostgreSQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: Restore v. Running COPY/INDEX seperatly
Date: 2007-08-27 04:30:27
Message-ID: 20070827043027.GA18738@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Tom Lane wrote:
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> >> On Sun, 26 Aug 2007, Benjamin Arai wrote:
> >>> So, I built my tables which contains a TSearch2 field by
> >>> 1. Create table without indexes
> >>> 2. COPY data into table
> >>> 3. ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
> >>> 4. UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
>
> > Or you could set up a trigger to generate the tsvector when you first
> > load the data instead of adding it later.
>
> You're going to want such a trigger anyway, so installing it before the
> COPY step seems like the Obviously Right Thing. Any other approach
> implies rewriting the entire table after you've loaded it, with no
> compensating advantage that I can see.

Isn't the main speed advantage of the dump the fact that the
to_tsvector() results already come in the COPY data? The dump already
comes with the idxFTI column contents, instead of having to generate it
from scratch. That would depend on how expensive that function is, of
course.

--
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
"PHP is what I call the "Dumb Monkey" language. [A]ny dumb monkey can code
something in PHP. Python takes actual thought to produce something useful."
(J. Drake)


From: Benjamin Arai <me(at)benjaminarai(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "PostgreSQL" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Restore v. Running COPY/INDEX seperatly
Date: 2007-08-27 06:03:47
Message-ID: 75C64535-CA65-4982-8972-0740F73E4B25@benjaminarai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Why is a trigger faster than doing a ALTER after table is created? I
thought a trigger would be slower because it would be invoked every
iteration (a new row is inserted) during the COPY process.

Benjamin

On Aug 26, 2007, at 8:43 PM, Tom Lane wrote:

> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>>> On Sun, 26 Aug 2007, Benjamin Arai wrote:
>>>> So, I built my tables which contains a TSearch2 field by
>>>> 1. Create table without indexes
>>>> 2. COPY data into table
>>>> 3. ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
>>>> 4. UPDATE tblMessages SET idxFTI=to_tsvector('default',
>>>> strMessage);
>
>> Or you could set up a trigger to generate the tsvector when you first
>> load the data instead of adding it later.
>
> You're going to want such a trigger anyway, so installing it before
> the
> COPY step seems like the Obviously Right Thing. Any other approach
> implies rewriting the entire table after you've loaded it, with no
> compensating advantage that I can see.
>
> regards, tom lane
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Benjamin Arai <me(at)benjaminarai(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "PostgreSQL" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Restore v. Running COPY/INDEX seperatly
Date: 2007-08-27 14:59:26
Message-ID: 278.1188226766@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Benjamin Arai <me(at)benjaminarai(dot)com> writes:
> Why is a trigger faster than doing a ALTER after table is created? I
> thought a trigger would be slower because it would be invoked every
> iteration (a new row is inserted) during the COPY process.

Yeah, you'd have the trigger overhead, but the above argument ignores
the costs of the full-table UPDATE --- not to mention the VACUUM
you'll need after the UPDATE to clean up the dead rows.

regards, tom lane


From: Benjamin Arai <me(at)benjaminarai(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "PostgreSQL" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Restore v. Running COPY/INDEX seperatly
Date: 2007-10-15 19:34:36
Message-ID: 48C98F54-E517-4208-80C7-D92856C55190@benjaminarai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

In what order should I :

- COPY data
- Create indexes
- Create Trigger
- Vaccum

?

Currently I am:

1. Create table
2 . Create trigger for updates
3. Create indexes including gin
4. Vaccum

Benjamin

On Aug 27, 2007, at 7:59 AM, Tom Lane wrote:

> Benjamin Arai <me(at)benjaminarai(dot)com> writes:
>> Why is a trigger faster than doing a ALTER after table is created? I
>> thought a trigger would be slower because it would be invoked every
>> iteration (a new row is inserted) during the COPY process.
>
> Yeah, you'd have the trigger overhead, but the above argument ignores
> the costs of the full-table UPDATE --- not to mention the VACUUM
> you'll need after the UPDATE to clean up the dead rows.
>
> regards, tom lane
>