WIP: preloading of ispell dictionary

Lists: pgsql-hackers
From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: WIP: preloading of ispell dictionary
Date: 2010-03-19 10:38:58
Message-ID: 162867791003190338r1c1efa1doea253d4f3c0d4582@mail.gmail.com

Hello

I wrote a small patch that allows preloading of a selected ispell
dictionary. It solves the problem of slow tsearch initialisation with
some language configurations.

This patch is the simplest approach - simpler than the variant with shared
memory - and it is usable on the Linux platform.

I noticed some issues with access to different kinds of memory :(.
Local memory is the best, then shared memory, and then virtual
memory. Queries with a preloaded dictionary are about 20% slower (but
still good enough). It depends on the platform (and the language, of
course) - I am afraid that this module doesn't help on MS Windows.

Tested on 64bit Fedora Linux - probably on 32bit these issues will be smaller.

I would like to add this patch to the next commitfest.

Can somebody test it on different platforms and with languages other than Czech?

Regards
Pavel Stehule

Attachment Content-Type Size
preload.diff application/octet-stream 22.2 KB

From: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: preloading of ispell dictionary
Date: 2010-03-23 00:57:31
Message-ID: 20100323095730.992C.52131E4D@oss.ntt.co.jp


Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:

> I wrote a small patch that allows preloading of a selected ispell
> dictionary. It solves the problem of slow tsearch initialisation with
> some language configurations.
>
> I am afraid that this module doesn't help on MS Windows.

I think it should work on all platforms if we include it into the core.
We should continue to research shared memory or mmap approaches.

The fundamental issue seems to be in the slow initialization of
dictionaries. If so, how about adding a precompile tool to convert
a dictionary into a binary file, and each backend simply mmap it?

BTW, SimpleAllocContextCreate() is not used at all in the patch.
Do you still need it?

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: preloading of ispell dictionary
Date: 2010-03-23 06:07:54
Message-ID: 4BA85ABA.1070604@enterprisedb.com

Takahiro Itagaki wrote:
> Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>
>> I wrote a small patch that allows preloading of a selected ispell
>> dictionary. It solves the problem of slow tsearch initialisation with
>> some language configurations.
>>
>> I am afraid that this module doesn't help on MS Windows.
>
> I think it should work on all platforms if we include it into the core.

It will work, as in it will compile and run. It just won't be any
faster. I think that's enough, otherwise you could argue that we
shouldn't have the shared_preload_libraries option at all because it won't
help on Windows.

> The fundamental issue seems to be in the slow initialization of
> dictionaries. If so, how about adding a precompile tool to convert
> a dictionary into a binary file, and each backend simply mmap it?

Yeah, that would be better.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: preloading of ispell dictionary
Date: 2010-03-23 07:42:40
Message-ID: 162867791003230042h24b3cf97r2a08239082dafa4e@mail.gmail.com

2010/3/23 Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>:
>
> Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>
>> I wrote a small patch that allows preloading of a selected ispell
>> dictionary. It solves the problem of slow tsearch initialisation with
>> some language configurations.
>>
>> I am afraid that this module doesn't help on MS Windows.
>
> I think it should work on all platforms if we include it into the core.
> We should continue to research shared memory or mmap approaches.
>
> The fundamental issue seems to be in the slow initialization of
> dictionaries. If so, how about adding a precompile tool to convert
> a dictionary into a binary file, and each backend simply mmap it?

It means loading about 25MB from disk for the first tsearch query in
every session - sorry, I don't believe that can be good.

>
> BTW, SimpleAllocContextCreate() is not used at all in the patch.
> Do you still need it?
>

Yes - I needed it. Without the Simple Allocator the cz configuration takes
48MB. Only a few parts have to be supported by the Simple Allocator -
the others have no significant impact - so I didn't uglify more of the
code. In my first patch I verified that the dictionary data are read-only,
so I was motivated to use the Simple Allocator everywhere. It is not
necessary for the preload method.

Pavel

> Regards,
> ---
> Takahiro Itagaki
> NTT Open Source Software Center
>
>
>


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: preloading of ispell dictionary
Date: 2010-03-23 07:52:14
Message-ID: b0f3f5a11003230052j613d92a3p7bc3c73e50de94ae@mail.gmail.com

2010/3/23 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:

> 2010/3/23 Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>:
>
>> The fundamental issue seems to be in the slow initialization of
>> dictionaries. If so, how about adding a precompile tool to convert
>> a dictionary into a binary file, and each backend simply mmap it?
>
> It means loading about 25MB from disk for the first tsearch query in
> every session - sorry, I don't believe that can be good.

The operating system's VM subsystem should make that a non-problem.
"Loading" is also not the word I would use to indicate what mmap does.

Nicolas


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: preloading of ispell dictionary
Date: 2010-03-23 08:04:50
Message-ID: 162867791003230104p6ff8d946yd6b97c47f660fc6c@mail.gmail.com

2010/3/23 Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>:
> 2010/3/23 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:
>
>> 2010/3/23 Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>:
>>
>>> The fundamental issue seems to be in the slow initialization of
>>> dictionaries. If so, how about adding a precompile tool to convert
>>> a dictionary into a binary file, and each backend simply mmap it?
>>
>> It means loading about 25MB from disk for the first tsearch query in
>> every session - sorry, I don't believe that can be good.
>
> The operating system's VM subsystem should make that a non-problem.
> "Loading" is also not the word I would use to indicate what mmap does.

Maybe we can do some manipulation inside memory - I don't have any
knowledge of mmap. With the Simple Allocator we can have the dictionary
data as one block. The pointers are a problem, but I believe they can be
replaced by offsets.
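
A minimal sketch of how offsets could replace pointers inside one block,
assuming a simple chain of word nodes. The names (DictNode, DictBlock,
dict_append and so on) are hypothetical and do not come from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical layout: every link is a byte offset from the start of the
 * block, so the block stays valid at whatever address it ends up.
 * Offset 0 is reserved to mean "no node".
 */
typedef struct
{
    uint32_t    next;       /* offset of the next node, 0 = end of chain */
    uint32_t    len;        /* length of the word stored after the header */
} DictNode;

typedef struct
{
    char       *base;       /* start of the single block */
    size_t      used;       /* bytes handed out so far */
    size_t      size;       /* total bytes in the block */
} DictBlock;

/* Allocate one zeroed block; returns 0 on success, -1 on failure. */
static int
dict_init(DictBlock *blk, size_t size)
{
    assert(size >= 8);
    blk->base = calloc(1, size);    /* malloc-family memory is aligned */
    blk->used = 8;                  /* skip offset 0 so it can act as NULL */
    blk->size = size;
    return blk->base != NULL ? 0 : -1;
}

/* Append a word that links to 'next'; returns its offset, 0 if full. */
static uint32_t
dict_append(DictBlock *blk, const char *word, uint32_t next)
{
    size_t      len = strlen(word);
    size_t      need = (sizeof(DictNode) + len + 1 + 7) & ~(size_t) 7;
    uint32_t    off = (uint32_t) blk->used;
    DictNode   *node;

    if (need > blk->size - blk->used)
        return 0;
    node = (DictNode *) (blk->base + off);
    node->next = next;
    node->len = (uint32_t) len;
    memcpy((char *) node + sizeof(DictNode), word, len + 1);
    blk->used += need;
    return off;
}

/* Resolve an offset against whichever base address the block has now. */
static const char *
dict_word(const void *base, uint32_t off)
{
    return (const char *) base + off + sizeof(DictNode);
}

static uint32_t
dict_next(const void *base, uint32_t off)
{
    return ((const DictNode *) ((const char *) base + off))->next;
}
```

Because nothing in the block holds an absolute address, the same bytes can
be copied, written to a file, or mapped at a different address in another
process and still be walked with dict_word() and dict_next().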

Personally I dislike the idea of a dictionary precompiler - it is one more
application to maintain and maybe not necessary. And you would still
need another application for loading.

p.s. I am able to serialise the Czech dictionary, because it uses only simple regexps.

Pavel

>
> Nicolas
>


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: preloading of ispell dictionary
Date: 2010-03-24 05:14:55
Message-ID: 4BA99FCF.1080801@postnewspapers.com.au

Pavel Stehule wrote:

> Personally I dislike the idea of a dictionary precompiler - it is one more
> application to maintain and maybe not necessary.

That's the sort of thing that can be done when first required by any
backend and the results saved in a file for other backends to mmap().
It'd probably want to be opened r/w access-exclusive initially, then
re-opened read-only access-shared when ready for use.
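
One way the "first backend builds it, the others reuse it" step could look
on POSIX is sketched below. It is only an illustration under stated
assumptions: cache_build() and the ".building" suffix are hypothetical, and
an exclusive create followed by an atomic rename() stands in for the
exclusive-then-shared reopening described above.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Publish 'data' as the cache file 'path' unless it already exists.
 * Returns 0 when a complete file is in place, -1 otherwise (the caller
 * would then fall back to parsing the dictionary itself).
 */
static int
cache_build(const char *path, const void *data, size_t len)
{
    char        tmp[1024];
    const char *p = data;
    int         n;
    int         fd;

    assert(path != NULL && data != NULL);
    if (access(path, F_OK) == 0)
        return 0;               /* somebody already built it */
    n = snprintf(tmp, sizeof tmp, "%s.building", path);
    if (n < 0 || n >= (int) sizeof tmp)
        return -1;

    /* O_EXCL lets exactly one process become the builder. */
    fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    while (len > 0)
    {
        ssize_t     written = write(fd, p, len);

        if (written <= 0)
            goto fail;
        p += written;
        len -= (size_t) written;
    }
    if (fsync(fd) != 0)
        goto fail;
    close(fd);

    /* rename() is atomic, so readers never see a half-written file. */
    if (rename(tmp, path) != 0)
    {
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

A builder that crashes would leave a stale ".building" file behind, which
is one more reason to clear the cache directory at postmaster start.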

My only concern would be that the cache would want to be forcibly
cleared at postmaster start, so that "restart the postmaster" fixes any
messed-up-cache issues that might arise (not that they should) without
people having to go rm'ing in the datadir. Even if Pg never has any bugs
that result in bad cache files, the file system / bad memory / cosmic
rays / etc can still mangle a cache file.

BTW, mmap() isn't an issue on Windows:
http://msdn.microsoft.com/en-us/library/aa366556%28VS.85%29.aspx
It's spelled CreateFileMapping, but otherwise is fairly similar, and is
perfect for this sort of use.
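
A rough sketch of a read-only mapping helper that covers both spellings.
map_readonly() is a hypothetical name, error handling is minimal, and the
Windows branch has not been exercised here, so treat it as an outline only:

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

/* Map a whole file read-only; returns NULL on failure or empty file. */
static const void *
map_readonly(const char *path, size_t *len)
{
    const void *view = NULL;

    assert(path != NULL && len != NULL);
#ifdef _WIN32
    HANDLE          file;
    HANDLE          mapping;
    LARGE_INTEGER   size;

    file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return NULL;
    if (GetFileSizeEx(file, &size) && size.QuadPart > 0)
    {
        mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
        if (mapping != NULL)
        {
            view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
            CloseHandle(mapping);   /* the view keeps the mapping alive */
            if (view != NULL)
                *len = (size_t) size.QuadPart;
        }
    }
    CloseHandle(file);
#else
    struct stat st;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) == 0 && st.st_size > 0)
    {
        void       *p = mmap(NULL, (size_t) st.st_size, PROT_READ,
                             MAP_SHARED, fd, 0);

        if (p != MAP_FAILED)
        {
            view = p;
            *len = (size_t) st.st_size;
        }
    }
    close(fd);                      /* the mapping survives the close */
#endif
    return view;
}
```

Every backend that maps the same file this way shares the same physical
pages, which is where the memory saving would come from.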

A shared read-only mapping of processed-and-cached tsearch2 dictionaries
would save a HUGE amount of memory if many backends were using tsearch2
at the same time. I'd make a big difference here.

--
Craig Ringer


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: preloading of ispell dictionary
Date: 2010-03-24 07:39:34
Message-ID: 162867791003240039x6221f139v8912b49f28bbe072@mail.gmail.com

2010/3/24 Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>:
> Pavel Stehule wrote:
>
>> Personally I dislike the idea of a dictionary precompiler - it is one more
>> application to maintain and maybe not necessary.
>
> That's the sort of thing that can be done when first required by any
> backend and the results saved in a file for other backends to mmap().
> It'd probably want to be opened r/w access-exclusive initially, then
> re-opened read-only access-shared when ready for use.
>
> My only concern would be that the cache would want to be forcibly
> cleared at postmaster start, so that "restart the postmaster" fixes any
> messed-up-cache issues that might arise (not that they should) without
> people having to go rm'ing in the datadir. Even if Pg never has any bugs
> that result in bad cache files, the file system / bad memory / cosmic
> rays / etc can still mangle a cache file.
>
> BTW, mmap() isn't an issue on Windows:
>  http://msdn.microsoft.com/en-us/library/aa366556%28VS.85%29.aspx
> It's spelled CreateFileMapping, but otherwise is fairly similar, and is
> perfect for this sort of use.
>
> A shared read-only mapping of processed-and-cached tsearch2 dictionaries
> would save a HUGE amount of memory if many backends were using tsearch2
> at the same time. I'd make a big difference here.
>

If you know this area well, please enhance my first patch. I am not
able to argue against Tom, who has a clear opinion on this patch :(

Pavel

> --
> Craig Ringer
>


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: preloading of ispell dictionary
Date: 2010-03-24 14:01:15
Message-ID: 201003241401.o2OE1Fj22523@momjian.us

Pavel Stehule wrote:
> 2010/3/24 Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>:
> > Pavel Stehule wrote:
> >
> >> Personally I dislike the idea of a dictionary precompiler - it is one more
> >> application to maintain and maybe not necessary.
> >
> > That's the sort of thing that can be done when first required by any
> > backend and the results saved in a file for other backends to mmap().
> > It'd probably want to be opened r/w access-exclusive initially, then
> > re-opened read-only access-shared when ready for use.
> >
> > My only concern would be that the cache would want to be forcibly
> > cleared at postmaster start, so that "restart the postmaster" fixes any
> > messed-up-cache issues that might arise (not that they should) without
> > people having to go rm'ing in the datadir. Even if Pg never has any bugs
> > that result in bad cache files, the file system / bad memory / cosmic
> > rays / etc can still mangle a cache file.
> >
> > BTW, mmap() isn't an issue on Windows:
> > http://msdn.microsoft.com/en-us/library/aa366556%28VS.85%29.aspx
> > It's spelled CreateFileMapping, but otherwise is fairly similar, and is
> > perfect for this sort of use.
> >
> > A shared read-only mapping of processed-and-cached tsearch2 dictionaries
> > would save a HUGE amount of memory if many backends were using tsearch2
> > at the same time. I'd make a big difference here.
> >
>
> If you know this area well, please enhance my first patch. I am not
> able to argue against Tom, who has a clear opinion on this patch :(

Should we add a TODO?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: preloading of ispell dictionary
Date: 2010-03-24 14:14:40
Message-ID: 162867791003240714r3a936b71m7a860cd184978681@mail.gmail.com

2010/3/24 Bruce Momjian <bruce(at)momjian(dot)us>:
> Pavel Stehule wrote:
>> 2010/3/24 Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>:
>> > Pavel Stehule wrote:
>> >
>> >> Personally I dislike the idea of a dictionary precompiler - it is one more
>> >> application to maintain and maybe not necessary.
>> >
>> > That's the sort of thing that can be done when first required by any
>> > backend and the results saved in a file for other backends to mmap().
>> > It'd probably want to be opened r/w access-exclusive initially, then
>> > re-opened read-only access-shared when ready for use.
>> >
>> > My only concern would be that the cache would want to be forcibly
>> > cleared at postmaster start, so that "restart the postmaster" fixes any
>> > messed-up-cache issues that might arise (not that they should) without
>> > people having to go rm'ing in the datadir. Even if Pg never has any bugs
>> > that result in bad cache files, the file system / bad memory / cosmic
>> > rays / etc can still mangle a cache file.
>> >
>> > BTW, mmap() isn't an issue on Windows:
>> > http://msdn.microsoft.com/en-us/library/aa366556%28VS.85%29.aspx
>> > It's spelled CreateFileMapping, but otherwise is fairly similar, and is
>> > perfect for this sort of use.
>> >
>> > A shared read-only mapping of processed-and-cached tsearch2 dictionaries
>> > would save a HUGE amount of memory if many backends were using tsearch2
>> > at the same time. I'd make a big difference here.
>> >
>>
>> If you know this area well, please enhance my first patch. I am not
>> able to argue against Tom, who has a clear opinion on this patch :(
>
> Should we add a TODO?

Why not?

Pavel
>
> --
>  Bruce Momjian  <bruce(at)momjian(dot)us>        http://momjian.us
>  EnterpriseDB                             http://enterprisedb.com
>
>  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
>


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: preloading of ispell dictionary
Date: 2010-03-24 14:41:21
Message-ID: 201003241441.o2OEfMt04925@momjian.us

Pavel Stehule wrote:
> 2010/3/24 Bruce Momjian <bruce(at)momjian(dot)us>:
> > Pavel Stehule wrote:
> >> 2010/3/24 Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>:
> >> > Pavel Stehule wrote:
> >> >
> >> >> Personally I dislike the idea of a dictionary precompiler - it is one more
> >> >> application to maintain and maybe not necessary.
> >> >
> >> > That's the sort of thing that can be done when first required by any
> >> > backend and the results saved in a file for other backends to mmap().
> >> > It'd probably want to be opened r/w access-exclusive initially, then
> >> > re-opened read-only access-shared when ready for use.
> >> >
> >> > My only concern would be that the cache would want to be forcibly
> >> > cleared at postmaster start, so that "restart the postmaster" fixes any
> >> > messed-up-cache issues that might arise (not that they should) without
> >> > people having to go rm'ing in the datadir. Even if Pg never has any bugs
> >> > that result in bad cache files, the file system / bad memory / cosmic
> >> > rays / etc can still mangle a cache file.
> >> >
> >> > BTW, mmap() isn't an issue on Windows:
> >> > http://msdn.microsoft.com/en-us/library/aa366556%28VS.85%29.aspx
> >> > It's spelled CreateFileMapping, but otherwise is fairly similar, and is
> >> > perfect for this sort of use.
> >> >
> >> > A shared read-only mapping of processed-and-cached tsearch2 dictionaries
> >> > would save a HUGE amount of memory if many backends were using tsearch2
> >> > at the same time. I'd make a big difference here.
> >> >
> >>
> >> If you know this area well, please enhance my first patch. I am not
> >> able to argue against Tom, who has a clear opinion on this patch :(
> >
> > Should we add a TODO?
>
> > Why not?

OK, what would the TODO text be?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: preloading of ispell dictionary
Date: 2010-03-24 14:46:20
Message-ID: 21212.1269441980@sss.pgh.pa.us

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> OK, what would the TODO text be?

I think there are really two tasks here:

* preprocess the textual dictionary definition files into something
that can be slurped directly into memory;

* use mmap() instead of read() to read preprocessed files into memory,
on machines where such a syscall is available.

There would be considerable gain from task #1 even without mmap.

regards, tom lane
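
To make the first task concrete, here is a small sketch of slurping a
preprocessed file with plain read(), with no mmap involved. The file layout
(DictFileHeader, DICT_MAGIC, DICT_VERSION) and the function names are
assumptions made up for illustration; the thread does not define a format.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define DICT_MAGIC   0x44494354u    /* hypothetical file signature */
#define DICT_VERSION 1u             /* hypothetical format version */

typedef struct
{
    uint32_t    magic;
    uint32_t    version;
    uint64_t    payload_len;        /* bytes of dictionary data that follow */
} DictFileHeader;

/* Read exactly 'len' bytes; 0 on success, -1 on error or short file. */
static int
read_full(int fd, void *buf, size_t len)
{
    char       *p = buf;

    while (len > 0)
    {
        ssize_t     n = read(fd, p, len);

        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Load the payload into malloc'd memory; NULL if the file is unusable. */
static void *
dict_slurp(const char *path, uint64_t *len)
{
    DictFileHeader hdr;
    void       *buf = NULL;
    int         fd;

    assert(path != NULL && len != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (read_full(fd, &hdr, sizeof hdr) == 0 &&
        hdr.magic == DICT_MAGIC && hdr.version == DICT_VERSION &&
        hdr.payload_len > 0 && hdr.payload_len <= SIZE_MAX)
    {
        buf = malloc((size_t) hdr.payload_len);
        if (buf != NULL &&
            read_full(fd, buf, (size_t) hdr.payload_len) != 0)
        {
            free(buf);
            buf = NULL;
        }
        if (buf != NULL)
            *len = hdr.payload_len;
    }
    close(fd);
    return buf;
}
```

The header check lets a backend reject a stale or foreign file and fall
back to parsing the textual dictionary. Swapping the malloc-and-read step
for a mapping would be the second task and would not change the format.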