From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Caching Python modules
Date: 2011-08-17 12:09:16
Message-ID: 682AACCD-6E75-4A1F-9A19-16FDE8BAC922@cybertec.at
Lists: pgsql-hackers
hello …
i have just run into a nasty problem (or maybe a missing feature) with PL/PythonU …
consider:
-- add a document to the corpus
CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$
    from SecondCorpus import SecondCorpus
    from SecondDocument import SecondDocument

    doc1 = SecondDocument(GD['corpus'].senti_provider, lang, t)
    doc1.create_sentences()
    GD['corpus'].add_document(doc1)
    GD['corpus'].process()
    return doc1.total_score
$$ LANGUAGE plpythonu STABLE;
i am doing some intense text mining here.
the problem is: is it possible to cache those imported modules from one function call to the next?
GD works nicely for variables, but can this actually be done with imported modules as well?
the import takes around 95% of the total time, so it is definitely something that should go away somehow.
i have checked the docs but i am none the wiser.
many thanks,
hans
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Caching Python modules
Date: 2011-08-17 12:19:02
Message-ID: 4E4BB1B6.5030101@wulczer.org
Lists: pgsql-hackers
On 17/08/11 14:09, PostgreSQL - Hans-Jürgen Schönig wrote:
> CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$
>
> from SecondCorpus import SecondCorpus
> from SecondDocument import SecondDocument
>
> i am doing some intense text mining here.
> the problem is: is it possible to cache those imported modules from one function call to the next?
> GD works nicely for variables, but can this actually be done with imported modules as well?
> the import takes around 95% of the total time, so it is definitely something that should go away somehow.
> i have checked the docs but i am none the wiser.
After a module is imported in a backend, it stays in the interpreter's
sys.modules dictionary and importing it again will not cause the module
Python code to be executed.
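(In plain Python terms, the caching described here can be seen directly; a minimal sketch, independent of PL/Python, using the standard-library json module as a stand-in for any slow import:)

```python
import sys

# First import executes the module body; the module then lives in sys.modules.
import json
assert 'json' in sys.modules

# A repeated import is only a dictionary lookup in sys.modules, not a
# re-execution of the module code.
before = sys.modules['json']
import json
assert sys.modules['json'] is before  # same module object, not reloaded
```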
As long as you are using the same backend you should be able to call
add_to_corpus repeatedly and the import statements should take a long
time only the first time you call them.
This simple test demonstrates it:
$ cat /tmp/slow.py
import time
time.sleep(5)
$ PYTHONPATH=/tmp/ bin/postgres -p 5433 -D data/
LOG: database system was shut down at 2011-08-17 14:16:18 CEST
LOG: database system is ready to accept connections
$ bin/psql -p 5433 postgres
Timing is on.
psql (9.2devel)
Type "help" for help.
postgres=# select slow();
slow
------
(1 row)
Time: 5032.835 ms
postgres=# select slow();
slow
------
(1 row)
Time: 1.051 ms
Cheers,
Jan
From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Caching Python modules
Date: 2011-08-17 12:20:55
Message-ID: 4E4BB227.1000108@wulczer.org
Lists: pgsql-hackers
On 17/08/11 14:19, Jan Urbański wrote:
> On 17/08/11 14:09, PostgreSQL - Hans-Jürgen Schönig wrote:
>> CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$
>>
>> from SecondCorpus import SecondCorpus
>> from SecondDocument import SecondDocument
>>
>> i am doing some intense text mining here.
>> the problem is: is it possible to cache those imported modules from one function call to the next?
>> GD works nicely for variables, but can this actually be done with imported modules as well?
>> the import takes around 95% of the total time, so it is definitely something that should go away somehow.
>> i have checked the docs but i am none the wiser.
>
> After a module is imported in a backend, it stays in the interpreter's
> sys.modules dictionary and importing it again will not cause the module
> Python code to be executed.
>
> As long as you are using the same backend you should be able to call
> add_to_corpus repeatedly and the import statements should take a long
> time only the first time you call them.
>
> This simple test demonstrates it:
>
> [example missing the slow() function code]
Oops, forgot to show the CREATE statement of the test function:
postgres=# create or replace function slow() returns void language
plpythonu as $$ import slow $$;
Jan
From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Caching Python modules
Date: 2011-08-17 12:44:00
Message-ID: 4854F53C-381A-4D22-B60E-18C997C43A85@cybertec.at
Lists: pgsql-hackers
On Aug 17, 2011, at 2:19 PM, Jan Urbański wrote:
> On 17/08/11 14:09, PostgreSQL - Hans-Jürgen Schönig wrote:
>> CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$
>>
>> from SecondCorpus import SecondCorpus
>> from SecondDocument import SecondDocument
>>
>> i am doing some intense text mining here.
>> the problem is: is it possible to cache those imported modules from one function call to the next?
>> GD works nicely for variables, but can this actually be done with imported modules as well?
>> the import takes around 95% of the total time, so it is definitely something that should go away somehow.
>> i have checked the docs but i am none the wiser.
>
> After a module is imported in a backend, it stays in the interpreter's
> sys.modules dictionary and importing it again will not cause the module
> Python code to be executed.
>
> As long as you are using the same backend you should be able to call
> add_to_corpus repeatedly and the import statements should take a long
> time only the first time you call them.
>
> This simple test demonstrates it:
>
> $ cat /tmp/slow.py
> import time
> time.sleep(5)
>
> $ PYTHONPATH=/tmp/ bin/postgres -p 5433 -D data/
> LOG: database system was shut down at 2011-08-17 14:16:18 CEST
> LOG: database system is ready to accept connections
>
> $ bin/psql -p 5433 postgres
> Timing is on.
> psql (9.2devel)
> Type "help" for help.
>
> postgres=# select slow();
> slow
> ------
>
> (1 row)
>
> Time: 5032.835 ms
> postgres=# select slow();
> slow
> ------
>
> (1 row)
>
> Time: 1.051 ms
>
> Cheers,
> Jan
hello jan …
the code actually looks like this …
the first function is called once per backend. it builds some fairly fat in-memory structures …
this takes around 2 secs or so, but that is fine and not an issue.
-- setup the environment
CREATE OR REPLACE FUNCTION textprocess.setup_sentiment(pypath text, lang text) RETURNS void AS $$
    import sys
    sys.path.append(pypath)
    sys.path.append(pypath + "/external")

    from SecondCorpus import SecondCorpus
    import const

    GD['path_to_classes'] = pypath
    GD['corpus'] = SecondCorpus(lang)
    GD['lang'] = lang
    return
$$ LANGUAGE plpythonu STABLE;
this is called more frequently ...
-- add a document to the corpus
CREATE OR REPLACE FUNCTION textprocess.add_to_corpus(lang text, t text) RETURNS float4 AS $$
    from SecondCorpus import SecondCorpus
    from SecondDocument import SecondDocument

    doc1 = SecondDocument(GD['corpus'].senti_provider, lang, t)
    doc1.create_sentences()
    GD['corpus'].add_document(doc1)
    GD['corpus'].process()
    return doc1.total_score
$$ LANGUAGE plpythonu STABLE;
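(since the per-call imports are the suspected cost, one variant worth noting, a sketch not taken from the thread, is to cache the class objects themselves in GD during setup, so the hot path needs no import statement at all. the snippet below simulates PL/Python's per-backend GD dictionary with a plain dict; SecondDocument here is a hypothetical stand-in for the real class:)

```python
# Sketch: simulating PL/Python's per-backend GD dictionary with a plain dict.
GD = {}

class SecondDocument:  # hypothetical stand-in for the real class
    def __init__(self, provider, lang, text):
        self.text = text
        self.total_score = float(len(text))  # dummy scoring for the sketch

def setup_sentiment(lang):
    # done once per backend: cache the class object itself, not just data
    GD['SecondDocument'] = SecondDocument
    GD['lang'] = lang

def add_to_corpus(lang, text):
    Doc = GD['SecondDocument']  # plain dictionary lookup, no import machinery
    return Doc(None, lang, text).total_score
```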
the point here actually is: if i use the classes in a normal python command-line program, this routine does not look like an issue.
creating the document object and doing the magic in there is not a problem at all …
on the SQL side, though, this is already fairly heavy for some reason ...
funcid | schemaname | funcname | calls | total_time | self_time | ?column?
--------+-------------+-----------------+-------+------------+-----------+----------
235287 | textprocess | setup_sentiment | 54 | 100166 | 100166 | 1854
235288 | textprocess | add_to_corpus | 996 | 438909 | 438909 | 440
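(the unlabeled "?column?" above is presumably total_time / calls, i.e. average milliseconds per call; an assumption, but integer division reproduces the values shown:)

```python
# Reproduce the unlabeled column from the pg_stat_user_functions output:
# integer division of total_time (ms) by calls matches the values shown.
rows = [
    ("setup_sentiment", 54, 100166),
    ("add_to_corpus", 996, 438909),
]
for funcname, calls, total_time in rows:
    print(funcname, total_time // calls)
# prints: setup_sentiment 1854, then: add_to_corpus 440
```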
looks like an afternoon with some more low-level tools :(.
many thanks,
hans
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de