Re: performance issue using DBI

From: nikolaus(at)dilger(dot)cc
To: nnolst(at)hotmail(dot)com
Cc: pgsql-admin(at)postgresql(dot)org, pgsql-general(at)postgresql(dot)org
Subject: Re: performance issue using DBI
Date: 2002-06-07 03:43:14
Message-ID: 20020606204319.4378.h015.c001.wm@mail.dilger.cc.criticalpath.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-general

Nicolas,

The more data your tables contain the more work
PostgreSQL needs to do. Adding indexes on your tables
makes the INSERTS slower due to additional overhead.
Indexes speed up SELECT but slow down INSERT.
Therefore OLTP systems tend to have few indexes and
data warehouses many.

Instead of adding your data one line at a time I would
use a staging area to speed things up as well as set
operations instead of processing one item at a time in
a loop.

Your raw data obviously contains the following columns:
remote_ip, phone_type, phone_number url, timestamp

So create a table raw_data with the above columns.
Truncate it before each data load.

As first step of your daily processing load your raw
data into the new table.

Then find out all unique sessions within your raw data:
CREATE TABLE unique_sessions AS
SELECT DISTINCT remote_ip, phone_type, phone_number
FROM raw_data;

Now find all the new sessions:
CREATE TABLE new_sessions AS
SELECT remote_ip, phone_type, phone_number
FROM unique_sessions
MINUS
SELECT remote_ip, phone_type, phone_number
FROM sessions;

Finally insert the new sessions into the sessions table:
INSERT INTO sessions
SELECT nextval('sessin_id'), remote_ip, phone_type,
phone_number
FROM new_sessions;

Now insert into the actions table:
INSERT INTO actions
SELECT s.session_id, r.url, r.timestamp
FROM raw_data r, sessions s
WHERE s. remote_ip = r. remote_ip
AND s.phone_type = r. phone_type
AND s.phone_number =r. phone_number

As the last step clean up the staging tables no longer
needed:
DROP TABLE unique_sessions;
DROP TABLE new_sessions;
TRUNCATE TABLE raw_data;

Regards,
Nikolaus

On Thu, 06 June 2002, "Nicolas Nolst" wrote

>
> <html><div style='background-color:'><DIV></DIV>
> <DIV></DIV>Hi all,<BR><BR>I have developped a perl
script to populate a database with two tables:
sessions<BR>and actions.<BR><BR>the table actions
contains the following columns: session_id, url,
timestamp.<BR>The column session_id references to the
table sessions.<BR><BR>the table sessions contains the
following columns: session_id,
remote_ip,<BR>phone_type, phone_number. The column
session_id is serial.<BR><BR>The lines of the table
actions which are part of the same session have the
same<BR>session_id.<BR><BR>There are then more lines in
the table actions than in the table
session.<BR><BR><BR>To fill the two tables, I first
need to know if the session already exists for
a<BR>certain phone_type, a certain remote_ip and a
certain
phone_number:<BR><BR>SELECT&nbsp;session_id&nbsp;FROM
sessions&nbsp;WHERE (phone_number = ?)&nbsp;AND
(remote_ip&nbsp; = ?)<BR>AND (phone_type =
?)&nbsp;ORDER BY&nbsp;session_id;<BR><BR>I also need to
apply a criteria to&nbsp;know if I have to&nbsp;add a
new entry in the<BR>table sessions or
not:<BR><BR>SELECT&nbsp;(max(timestamp) +
?)&lt;?&nbsp;FROM actions&nbsp;WHERE (session_id =
?);<BR><BR><BR><BR>If the session already exists I add
a line in the table actions with a INSERT<BR><BR>If the
session doesn't exist or if the criteria is true, I add
a line in the<BR>table sessions with an INSERT and then
add&nbsp; a line with a INSERT in the table actions (I
use nextval and currval).<BR><BR>I have put indexes on
sessions(session_id),
sessions(msisdn),<BR>actions(session_id) and actions(timestamp). I
process one log file of about 20000 lines every day.
All the lines are processed in one transaction
(autocommit set to 0).<BR><BR>My problem is that
populating my database is slower&nbsp; when the data
gets bigger<BR>and the performance falls dramatically.
I thought that is would be improve with<BR>my indexes
but the problem still persists.<BR><BR>Could you please
give me some clues that could solve this
issue.<BR><BR>Thanks.<BR><BR><BR><BR>Nicolas Nolst
> <DIV></DIV>
> <DIV></DIV><IMG
src="http://www.3dflags/World/Gif/belgium_gs.gif">
> <DIV></DIV></div><br clear=all><hr>MSN Photos is the
easiest way to share and print your photos: <a
href='http://g.msn.com/1HM500901/157'>Click
Here</a><br></html>

Browse pgsql-admin by date

  From Date Subject
Next Message Ferdinand Smit 2002-06-07 12:20:46 Postgres network preformance
Previous Message Robson Martins 2002-06-06 20:43:28 What err ???

Browse pgsql-general by date

  From Date Subject
Next Message Joel Burton 2002-06-07 03:54:40 Re: Aliias names in select criteria
Previous Message Steven Vajdic 2002-06-07 02:26:39 Re: [HACKERS] PostgreSQL and Windows2000 and defunct processes