Re: Performance problems testing with Spamassassin 3.1.0

From: Matthew Schumacher <matt(dot)s(at)aptalaska(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Date: 2005-08-04 08:13:55
Message-ID: 42F1CE43.8040801@aptalaska.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Okay,

Here is the status of the SA updates and a question:

Michael got SA changed to pass an array of tokens to the proc so right
there we gained a ton of performance due to connections and transactions
being grouped into one per email instead of one per token.

Now I am working on making the proc even faster. Since we have all of
the tokens coming in as an array, it should be possible to get this down
to just a couple of queries.

I have the proc using IN and NOT IN statements to update everything at
once from a temp table, but it progressively gets slower because the
temp table is growing between vacuums. At this point it's slightly
slower than the old update or else insert on every token.

What I really want to do is have the token array available as a record
so that I can query against it, but not have it take up the resources of
a real table. If I could copy from an array into a record then I can
even get rid of the loop. Anyone have any thoughts on how to do this?

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
intokenary BYTEA[],
inspam_count INTEGER,
inham_count INTEGER,
inatime INTEGER)
RETURNS VOID AS '
DECLARE
_token BYTEA;
BEGIN

for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
LOOP
_token := intokenary[i];
INSERT INTO bayes_token_tmp VALUES (_token);
END LOOP;

UPDATE
bayes_token
SET
spam_count = greatest_int(spam_count + inspam_count, 0),
ham_count = greatest_int(ham_count + inham_count , 0),
atime = greatest_int(atime, 1000)
WHERE
id = inuserid
AND
(token) IN (SELECT intoken FROM bayes_token_tmp);

UPDATE
bayes_vars
SET
token_count = token_count + (SELECT count(intoken) FROM
bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)),
newest_token_age = greatest_int(newest_token_age, inatime),
oldest_token_age = least_int(oldest_token_age, inatime)
WHERE
id = inuserid;

INSERT INTO
bayes_token
SELECT
inuserid,
intoken,
inspam_count,
inham_count,
inatime
FROM
bayes_token_tmp
WHERE
(inspam_count > 0 OR inham_count > 0)
AND
(intoken) NOT IN (SELECT token FROM bayes_token);

delete from bayes_token_tmp;

RETURN;
END;
' LANGUAGE 'plpgsql';

CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
RETURNS INTEGER
IMMUTABLE STRICT
AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
LANGUAGE SQL;

CREATE OR REPLACE FUNCTION least_int (integer, integer)
RETURNS INTEGER
IMMUTABLE STRICT
AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
LANGUAGE SQL;

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message prasanna s 2005-08-04 08:24:32 Indexed views.
Previous Message John A Meinel 2005-08-03 15:07:47 Re: Is There A Windows Version of Performance Tuning Documents?