Re: EXTERNAL storage and substring on long strings

From: Scott Cain <cain(at)cshl(dot)org>
To: PgSQL Performance ML <pgsql-performance(at)postgresql(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: EXTERNAL storage and substring on long strings
Date: 2003-08-04 15:25:36
Message-ID: 1060010735.1433.26.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance pgsql-sql

Hello,

Note: there is a SQL question way at the bottom of this narrative :-)

Last week I asked about doing substring operations on very long strings
(>10 million characters). I was given a suggestion to use EXTERNAL
storage on the column via the ALTER TABLE ... SET STORAGE command. In
one test case, the performance of substring actually got worse using
EXTERNAL storage.

In an effort to find the best way to do this operation, I decided to
look at what is my "worst case" scenario: the DNA sequence for human
chromosome 1, which is about 250 million characters long (previous
strings where about 20 million characters long). I wrote a perl script
to do several substring operations over this very long string, with
substring lengths varying between 1000 and 40,000 characters spread out
over various locations along the string. While EXTENDED storage won in
this case, it was a hollow victory: 38 seconds per operation versus 40
seconds, both of which are way too long to for an interactive
application.

Time for a new method. A suggestion from my boss was to "shred" the DNA
into smallish chunks and a column giving offsets from the beginning of
the string, so that it can be reassembled when needed. Here is the test
table:

string=> \d dna
Table "public.dna"
Column | Type | Modifiers
---------+---------+-----------
foffset | integer |
pdna | text |
Indexes: foffset_idx btree (foffset)

In practice, there would also be a foreign key column to give the
identifier of the dna. Then I wrote the following function (here's the
SQL part promised above):

CREATE OR REPLACE FUNCTION dna_string (integer, integer) RETURNS TEXT AS '
DECLARE
smin ALIAS FOR $1;
smax ALIAS FOR $2;
longdna TEXT := '''';
dna_row dna%ROWTYPE;
dnastring TEXT;
firstchunk INTEGER;
lastchunk INTEGER;
in_longdnastart INTEGER;
in_longdnalen INTEGER;
chunksize INTEGER;
BEGIN
SELECT INTO chunksize min(foffset) FROM dna WHERE foffset>0;
firstchunk := chunksize*(smin/chunksize);
lastchunk := chunksize*(smax/chunksize);

in_longdnastart := smin % chunksize;
in_longdnalen := smax - smin + 1;

FOR dna_row IN
SELECT * FROM dna
WHERE foffset >= firstchunk AND foffset <= lastchunk
ORDER BY foffset
LOOP

longdna := longdna || dna_row.pdna;
END LOOP;

dnastring := substring(longdna FROM in_longdnastart FOR in_longdnalen);

RETURN dnastring;
END;
' LANGUAGE 'plpgsql';

So here's the question: I've never written a plpgsql function before, so
I don't have much experience with it; is there anything obviously wrong
with this function, or are there things that could be done better? At
least this appears to work and is much faster, completing substring
operations like above in about 0.27 secs (that's about two orders of
magnitude improvement!)

Thanks,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D. cain(at)cshl(dot)org
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Fernando Papa 2003-08-04 15:26:38 Re: I can't wait too much: Total runtime 432478.44 msec
Previous Message Manfred Koizar 2003-08-04 15:17:06 Re: I can't wait too much: Total runtime 432478.44 msec

Browse pgsql-sql by date

  From Date Subject
Next Message Tom Lane 2003-08-04 15:53:43 Re: [SQL] EXTERNAL storage and substring on long strings
Previous Message Christoph Haller 2003-08-04 09:26:08 Re: Problem with looping on a table function result