Re: Performance problems testing with Spamassassin 3.1.0

From: John A Meinel <john(at)arbash-meinel(dot)com>
To: Matthew Schumacher <matt(dot)s(at)aptalaska(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Date: 2005-08-04 14:10:03
Message-ID: 42F221BB.5000702@arbash-meinel.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Matthew Schumacher wrote:
> Okay,
>
> Here is the status of the SA updates and a question:
>
> Michael got SA changed to pass an array of tokens to the proc so right
> there we gained a ton of performance due to connections and transactions
> being grouped into one per email instead of one per token.
>
> Now I am working on making the proc even faster. Since we have all of
> the tokens coming in as an array, it should be possible to get this down
> to just a couple of queries.
>
> I have the proc using IN and NOT IN statements to update everything at
> once from a temp table, but it progressively gets slower because the
> temp table is growing between vacuums. At this point it's slightly
> slower than the old update or else insert on every token.

I recommend that you drop and re-create the temp table. There is no
reason to have it around, considering you delete and re-add everything.
That means you never have to vacuum it, since it always only contains
the latest rows.

>
> What I really want to do is have the token array available as a record
> so that I can query against it, but not have it take up the resources of
> a real table. If I could copy from an array into a record then I can
> even get rid of the loop. Anyone have any thoughts on how to do this?
>

My one question here, is the inspam_count and inham_count *always* the
same for all tokens? I would have thought each token has it's own count.
Anyway, there are a few lines I would change:

>
> CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
> intokenary BYTEA[],
> inspam_count INTEGER,
> inham_count INTEGER,
> inatime INTEGER)
> RETURNS VOID AS '
> DECLARE
> _token BYTEA;
> BEGIN
>

-- create the table at the start of the procedure
CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
-- You might also add primary key if you are going to be adding
-- *lots* of entries, but it sounds like you are going to have
-- less than 1 page, so it doesn't matter

> for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
> LOOP
> _token := intokenary[i];
> INSERT INTO bayes_token_tmp VALUES (_token);
> END LOOP;
>
> UPDATE
> bayes_token
> SET
> spam_count = greatest_int(spam_count + inspam_count, 0),
> ham_count = greatest_int(ham_count + inham_count , 0),
> atime = greatest_int(atime, 1000)
> WHERE
> id = inuserid
> AND

-- (token) IN (SELECT intoken FROM bayes_token_tmp);
EXISTS (SELECT token FROM bayes_token_tmp
WHERE intoken=token LIMIT 1);

-- I would also avoid your intoken (NOT) IN (SELECT token FROM
-- bayes_token) There are a few possibilities, but to me
-- as your bayes_token table becomes big, this will start
-- to be the slow point

-- Rather than doing 2 NOT IN queries, it *might* be faster to do
DELETE FROM bayes_token_tmp
WHERE NOT EXISTS (SELECT token FROM bayes_token
WHERE token=intoken);

>
> UPDATE
> bayes_vars
> SET

-- token_count = token_count + (SELECT count(intoken) FROM
-- bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)),
token_count = token_count + (SELECT count(intoken)
FROM bayes_token_tmp)

-- You don't need the where NOT IN, since we already removed those rows

> newest_token_age = greatest_int(newest_token_age, inatime),
> oldest_token_age = least_int(oldest_token_age, inatime)
> WHERE
> id = inuserid;
>
> INSERT INTO
> bayes_token
> SELECT
> inuserid,
> intoken,
> inspam_count,
> inham_count,
> inatime
> FROM
> bayes_token_tmp
> WHERE
> (inspam_count > 0 OR inham_count > 0)

-- AND
-- (intoken) NOT IN (SELECT token FROM bayes_token);

-- You don't need either of those lines, again because we already
-- filtered

-- delete from bayes_token_tmp;
-- And rather than deleting all of the entries just
DROP TABLE bayes_token_tmp;

>
> RETURN;
> END;
> ' LANGUAGE 'plpgsql';
>
> CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
> RETURNS INTEGER
> IMMUTABLE STRICT
> AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
> LANGUAGE SQL;
>
> CREATE OR REPLACE FUNCTION least_int (integer, integer)
> RETURNS INTEGER
> IMMUTABLE STRICT
> AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
> LANGUAGE SQL;
>

So to clarify, here is my finished function:
------------------------------------
CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
intokenary BYTEA[],
inspam_count INTEGER,
inham_count INTEGER,
inatime INTEGER)
RETURNS VOID AS '
DECLARE
_token BYTEA;
BEGIN

CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
LOOP
_token := intokenary[i];
INSERT INTO bayes_token_tmp VALUES (_token);
END LOOP;

UPDATE
bayes_token
SET
spam_count = greatest_int(spam_count + inspam_count, 0),
ham_count = greatest_int(ham_count + inham_count , 0),
atime = greatest_int(atime, 1000)
WHERE
id = inuserid
AND
EXISTS (SELECT token FROM bayes_token_tmp
WHERE intoken=token LIMIT 1);

DELETE FROM bayes_token_tmp
WHERE NOT EXISTS (SELECT token FROM bayes_token
WHERE token=intoken);

UPDATE
bayes_vars
SET
token_count = token_count + (SELECT count(intoken)
FROM bayes_token_tmp),
newest_token_age = greatest_int(newest_token_age, inatime),
oldest_token_age = least_int(oldest_token_age, inatime)
WHERE
id = inuserid;

INSERT INTO
bayes_token
SELECT
inuserid,
intoken,
inspam_count,
inham_count,
inatime
FROM
bayes_token_tmp
WHERE
(inspam_count > 0 OR inham_count > 0)

DROP TABLE bayes_token_tmp;

RETURN;
END;
' LANGUAGE 'plpgsql';

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2005-08-04 14:37:02 Re: Performance problems testing with Spamassassin 3.1.0
Previous Message Laszlo Hornyak 2005-08-04 08:52:09 Re: Indexed views.