From: | PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at> |
---|---|
To: | Jan Urbański <wulczer(at)wulczer(dot)org> |
Cc: | pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Caching Python modules |
Date: | 2011-08-17 12:44:00 |
Message-ID: | 4854F53C-381A-4D22-B60E-18C997C43A85@cybertec.at |
Lists: | pgsql-hackers |
On Aug 17, 2011, at 2:19 PM, Jan Urbański wrote:
> On 17/08/11 14:09, PostgreSQL - Hans-Jürgen Schönig wrote:
>> CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$
>>
>> from SecondCorpus import SecondCorpus
>> from SecondDocument import SecondDocument
>>
>> I am doing some intense text mining here.
>> The question is: can those imported modules be cached from one function call to the next?
>> GD works nicely for variables, but can this be done with imported modules as well?
>> The import takes around 95% of the total time, so it is definitely something that should go away.
>> I have checked the docs but am none the wiser.
>
> After a module is imported in a backend, it stays in the interpreter's
> sys.modules dictionary and importing it again will not cause the module
> Python code to be executed.
>
> As long as you are using the same backend you should be able to call
> add_to_corpus repeatedly and the import statements should take a long
> time only the first time you call them.
>
> This simple test demonstrates it:
>
> $ cat /tmp/slow.py
> import time
> time.sleep(5)
>
> $ PYTHONPATH=/tmp/ bin/postgres -p 5433 -D data/
> LOG: database system was shut down at 2011-08-17 14:16:18 CEST
> LOG: database system is ready to accept connections
>
> $ bin/psql -p 5433 postgres
> Timing is on.
> psql (9.2devel)
> Type "help" for help.
>
> postgres=# select slow();
> slow
> ------
>
> (1 row)
>
> Time: 5032.835 ms
> postgres=# select slow();
> slow
> ------
>
> (1 row)
>
> Time: 1.051 ms
>
> Cheers,
> Jan
Hello Jan,
The code actually looks like this.
The first function is called once per backend. It builds some fairly large in-memory structures.
This takes around 2 seconds, but that is fine and not an issue.
-- setup the environment
CREATE OR REPLACE FUNCTION textprocess.setup_sentiment(pypath text, lang text) RETURNS void AS $$
import sys
sys.path.append(pypath)
sys.path.append(pypath + "/external")
from SecondCorpus import SecondCorpus
import const
GD['path_to_classes'] = pypath
GD['corpus'] = SecondCorpus(lang)
GD['lang'] = lang
return;
$$ LANGUAGE 'plpythonu' STABLE;
This one is called more frequently:
-- add a document to the corpus
CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$
from SecondCorpus import SecondCorpus
from SecondDocument import SecondDocument
doc1 = SecondDocument(GD['corpus'].senti_provider, lang, t)
doc1.create_sentences()
GD['corpus'].add_document(doc1)
GD['corpus'].process()
return doc1.total_score
$$ LANGUAGE 'plpythonu' STABLE;
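The two import lines at the top of add_to_corpus run on every call. Jan's point about sys.modules can be checked outside PostgreSQL with a minimal sketch like the one below. It is standalone Python 3, not PL/Python, and the module name slow_demo and its 0.2 second sleep are hypothetical stand-ins for a slow module such as SecondCorpus.

```python
import importlib
import os
import sys
import tempfile
import time

# Hypothetical stand-in for a module that is slow to import.
MODULE_SOURCE = "import time\ntime.sleep(0.2)\n"


def timed_import(name):
    """Import `name` and return (module, elapsed seconds)."""
    start = time.perf_counter()
    module = importlib.import_module(name)
    return module, time.perf_counter() - start


def demo():
    """Import the slow module twice and compare the timings."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "slow_demo.py"), "w") as fh:
            fh.write(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            first_mod, first = timed_import("slow_demo")
            # The second import is served from sys.modules,
            # so the module body (and its sleep) does not run again.
            second_mod, second = timed_import("slow_demo")
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("slow_demo", None)
    return first, second, first_mod is second_mod


if __name__ == "__main__":
    first, second, same = demo()
    print("first import:  %.3f s" % first)
    print("second import: %.3f s" % second)
    print("same module object:", same)
```

If the second import is as cheap inside a backend as it is here, the repeated import statements are unlikely to account for the time spent in add_to_corpus, and the cost would have to come from the calls that follow them.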
The point is: when I use the classes in a normal Python command-line program, this routine does not look like a problem.
Creating the document object and doing the processing in there is cheap.
On the SQL side, however, it is already fairly heavy for some reason:
 funcid | schemaname  |    funcname     | calls | total_time | self_time | ?column?
--------+-------------+-----------------+-------+------------+-----------+----------
 235287 | textprocess | setup_sentiment |    54 |     100166 |    100166 |     1854
 235288 | textprocess | add_to_corpus   |   996 |     438909 |    438909 |      440
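The unnamed last column is not explained in the message. The numbers are consistent with total_time divided by calls using integer division, which is an inference and not something the message states. The sketch below redoes that arithmetic in Python; the units are presumably milliseconds if the figures come from pg_stat_user_functions, but that is also an assumption.

```python
# Rows copied from the statistics output above: (funcname, calls, total_time).
rows = [
    ("setup_sentiment", 54, 100166),
    ("add_to_corpus", 996, 438909),
]


def per_call(total_time, calls):
    """Average time per call, truncated like integer division in SQL."""
    return total_time // calls


# Average cost of one call to each function.
averages = {name: per_call(total, calls) for name, calls, total in rows}

# Fraction of the combined time spent in each function.
grand_total = sum(total for _, _, total in rows)
share = {name: total / grand_total for name, _, total in rows}

if __name__ == "__main__":
    for name, calls, total in rows:
        print("%-16s avg=%5d share=%.1f%%" % (name, averages[name], 100 * share[name]))
```

Under those assumptions, add_to_corpus averages about 440 units per call and accounts for roughly 81% of the combined time of the two functions.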
Looks like an afternoon with some lower-level tools is ahead :(.
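Before reaching for lower-level tools, one possible first step is to time each stage of the function body separately. The helper below is a hypothetical sketch, not part of the original code: StageTimer and the stage names are made up for illustration, and it is written as standalone Python 3.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # stage name -> total seconds
        self.counts = defaultdict(int)    # stage name -> number of runs

    @contextmanager
    def stage(self, name):
        """Time the enclosed block and add it to the stage's running total."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return (stage, total seconds) pairs, slowest stage first."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("import"):
        import json  # stands in for the SecondDocument import
    with timer.stage("work"):
        time.sleep(0.05)  # stands in for create_sentences() and process()
    for name, seconds in timer.report():
        print("%-8s %.4f s" % (name, seconds))
```

Wrapping the import lines, the SecondDocument construction, create_sentences(), add_document() and process() in separate stages would show which of them dominates. Note that process() runs over the whole corpus in GD on every call, so its cost may grow as documents are added; that is a guess to be confirmed by measurement, not a finding.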
many thanks,
hans
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de