Re: Caching Python modules

From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Caching Python modules
Date: 2011-08-17 12:44:00
Message-ID: 4854F53C-381A-4D22-B60E-18C997C43A85@cybertec.at
Lists: pgsql-hackers

On Aug 17, 2011, at 2:19 PM, Jan Urbański wrote:

> On 17/08/11 14:09, PostgreSQL - Hans-Jürgen Schönig wrote:
>> CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$
>>
>> from SecondCorpus import SecondCorpus
>> from SecondDocument import SecondDocument
>>
>> i am doing some intense text mining here.
>> the problem is: is it possible to cache those imported modules from one function call to the next?
>> GD works nicely for variables but can this actually be done with imported modules as well?
>> the import takes around 95% of the total time, so it is definitely something that should go away somehow.
>> i have checked the docs but i am none the wiser.
>
> After a module is imported in a backend, it stays in the interpreter's
> sys.modules dictionary and importing it again will not cause the module
> Python code to be executed.
>
> As long as you are using the same backend you should be able to call
> add_to_corpus repeatedly and the import statements should take a long
> time only the first time you call them.
>
> This simple test demonstrates it:
>
> $ cat /tmp/slow.py
> import time
> time.sleep(5)
>
> $ PYTHONPATH=/tmp/ bin/postgres -p 5433 -D data/
> LOG: database system was shut down at 2011-08-17 14:16:18 CEST
> LOG: database system is ready to accept connections
>
> $ bin/psql -p 5433 postgres
> Timing is on.
> psql (9.2devel)
> Type "help" for help.
>
> postgres=# select slow();
> slow
> ------
>
> (1 row)
>
> Time: 5032.835 ms
> postgres=# select slow();
> slow
> ------
>
> (1 row)
>
> Time: 1.051 ms
>
> Cheers,
> Jan

hello jan …

the code is actually like this …
the first function is called once per backend. it compiles some fairly fat in-memory structures …
this takes around 2 secs or so … but this is fine and not an issue.

-- setup the environment
CREATE OR REPLACE FUNCTION textprocess.setup_sentiment(pypath text, lang text) RETURNS void AS $$
import sys
sys.path.append(pypath)
sys.path.append(pypath + "/external")

from SecondCorpus import SecondCorpus
import const

GD['path_to_classes'] = pypath
GD['corpus'] = SecondCorpus(lang)
GD['lang'] = lang

return;
$$ LANGUAGE 'plpythonu' STABLE;
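
(for context, a session kicks this off once right after connecting - something along these lines, where the path and 'en' are just placeholder values:)

-- called once per backend / connection; the path and language are placeholders
SELECT textprocess.setup_sentiment('/opt/textprocess/classes', 'en');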

this is called more frequently ...

-- add a document to the corpus
CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$

from SecondCorpus import SecondCorpus
from SecondDocument import SecondDocument

doc1 = SecondDocument(GD['corpus'].senti_provider, lang, t)
doc1.create_sentences()
GD['corpus'].add_document(doc1)
GD['corpus'].process()
return doc1.total_score
$$ LANGUAGE 'plpythonu' STABLE;
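
(given what you said about sys.modules, the two imports above should be basically free after the first call anyway; as an experiment i could also stash the class object in GD during setup and drop the per-call import completely - a rough, untested sketch:)

-- requires two extra lines in setup_sentiment:
--     from SecondDocument import SecondDocument
--     GD['SecondDocument'] = SecondDocument

CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$

SecondDocument = GD['SecondDocument']    # class object cached by setup_sentiment, no import here

doc1 = SecondDocument(GD['corpus'].senti_provider, lang, t)
doc1.create_sentences()
GD['corpus'].add_document(doc1)
GD['corpus'].process()
return doc1.total_score
$$ LANGUAGE 'plpythonu' STABLE;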

the point here actually is: if i use these classes in a normal python command line program, this routine does not look like an issue at all …
creating the document object and doing the magic in there is not a problem …

on the SQL side the same thing is already fairly heavy for some reason ...

 funcid | schemaname  |    funcname     | calls | total_time | self_time | ?column?
--------+-------------+-----------------+-------+------------+-----------+----------
 235287 | textprocess | setup_sentiment |    54 |     100166 |    100166 |     1854
 235288 | textprocess | add_to_corpus   |   996 |     438909 |    438909 |      440

(the last column is total_time / calls, i.e. the average time per call in ms)

looks like an afternoon with some more low-level tools :(.
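
(before reaching for the really low-level stuff, one quick thing to try is wrapping the body in cProfile and pushing the stats out through plpy.notice - a rough sketch, the _profiled name is just for illustration:)

CREATE OR REPLACE FUNCTION textprocess.add_to_corpus_profiled(lang text, t text) RETURNS float4 AS $$
# same body as add_to_corpus, but wrapped in cProfile
import cProfile, pstats, StringIO
from SecondDocument import SecondDocument

pr = cProfile.Profile()
pr.enable()

doc1 = SecondDocument(GD['corpus'].senti_provider, lang, t)
doc1.create_sentences()
GD['corpus'].add_document(doc1)
GD['corpus'].process()

pr.disable()

# report the 15 most expensive calls (by cumulative time) as a NOTICE
out = StringIO.StringIO()
pstats.Stats(pr, stream=out).sort_stats('cumulative').print_stats(15)
plpy.notice(out.getvalue())

return doc1.total_score
$$ LANGUAGE 'plpythonu' STABLE;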

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
