Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )

Lists: pgsql-generalpgsql-hackers
From: Christian Kienle <christian-kienle(at)losg(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: Photos of PostgreSQL booth
Date: 2003-12-25 22:15:21
Message-ID: 200312252315.21431.christian-kienle@losg.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hello...

well perhaps it is very OT...
Here are two pictures of the PostgreSQL booth on the Linux World Expo &
Conference in Frankfurt 2003.

I took the photos and had a talk to the two Postgre guys ;D
They were really friendly.

http://www.qtforum.org/test/IMG_1583.JPG
http://www.qtforum.org/test/IMG_1584.JPG

Have fun.

Christian

--
Linux is like a wigwam - no gates, no windows and an apache inside.


From: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
To: Christian Kienle <christian-kienle(at)losg(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Photos of PostgreSQL booth
Date: 2003-12-25 22:25:34
Message-ID: 20031225182527.I91759@ganymede.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


whose manning the booth?

On Thu, 25 Dec 2003, Christian Kienle wrote:

> Hello...
>
> well perhaps it is very OT...
> Here are two pictures of the PostgreSQL booth on the Linux World Expo &
> Conference in Frankfurt 2003.
>
> I took the photos and had a talk to the two Postgre guys ;D
> They were really friendly.
>
> http://www.qtforum.org/test/IMG_1583.JPG
> http://www.qtforum.org/test/IMG_1584.JPG
>
> Have fun.
>
> Christian
>
> --
> Linux is like a wigwam - no gates, no windows and an apache inside.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
> joining column's datatypes do not match
>

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664


From: Michael Meskes <meskes(at)postgresql(dot)org>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
Cc: Christian Kienle <christian-kienle(at)losg(dot)org>, pgsql-general(at)postgresql(dot)org
Subject: Re: Photos of PostgreSQL booth
Date: 2003-12-26 18:00:19
Message-ID: 20031226180019.GA1935@1
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Thu, Dec 25, 2003 at 06:25:34PM -0400, Marc G. Fournier wrote:
> whose manning the booth?

The one one the photos is Noel Koethe. He manned the booth together with
Peter Eisentraut and me.

Michael
--
Michael Meskes
Email: Michael at Fam-Meskes dot De
ICQ: 179140304, AIM/Yahoo: michaelmeskes, Jabber: meskes(at)jabber(dot)org
Go SF 49ers! Go Rhein Fire! Use Debian GNU/Linux! Use PostgreSQL!


From: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
To: Michael Meskes <meskes(at)postgresql(dot)org>
Cc: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>, Christian Kienle <christian-kienle(at)losg(dot)org>, pgsql-general(at)postgresql(dot)org
Subject: Re: Photos of PostgreSQL booth
Date: 2003-12-26 18:04:09
Message-ID: 20031226140340.U35354@ganymede.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Fri, 26 Dec 2003, Michael Meskes wrote:

> On Thu, Dec 25, 2003 at 06:25:34PM -0400, Marc G. Fournier wrote:
> > whose manning the booth?
>
> The one one the photos is Noel Koethe. He manned the booth together with
> Peter Eisentraut and me.

One thought for future booths would be to have a "group photo" done that
we can add to the web site ... ? :)

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664


From: Michael Meskes <meskes(at)postgresql(dot)org>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
Cc: Michael Meskes <meskes(at)postgresql(dot)org>, Christian Kienle <christian-kienle(at)losg(dot)org>, pgsql-general(at)postgresql(dot)org
Subject: Re: Photos of PostgreSQL booth
Date: 2003-12-26 18:17:11
Message-ID: 20031226181711.GB1935@1
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Fri, Dec 26, 2003 at 02:04:09PM -0400, Marc G. Fournier wrote:
> One thought for future booths would be to have a "group photo" done that
> we can add to the web site ... ? :)

Yes, good idea. :-)

Michael
--
Michael Meskes
Email: Michael at Fam-Meskes dot De
ICQ: 179140304, AIM/Yahoo: michaelmeskes, Jabber: meskes(at)jabber(dot)org
Go SF 49ers! Go Rhein Fire! Use Debian GNU/Linux! Use PostgreSQL!


From: Eric Ridge <ebr(at)tcdi(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: fulltext searching via a custom index type
Date: 2003-12-26 19:54:56
Message-ID: 61356C28-37DD-11D8-A406-000A95BB5944@tcdi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

(I started with this addressed to the -general list, but it just
doesn't seem like a general usage topic. If -hackers is the wrong
place, please point me in the right direction).

I've been working on a custom index type (access method) that allows
postgres to do fulltext searching (including phrase and proximity) that
I believe is more flexible and natural than "tsearch2" (no offense to
the tsearch2 guys... it's been a source of inspiration). I also plan
on providing "term browsing" and hit-position information.

I'm using Xapian (www.xapian.org) as the backend fulltext engine, and
I'm trying to learn more of the innards of postgres as I go. I've only
spent about 18-24 hours on this, so it's nowhere near complete, and I
really hesitate to even mention it in a public forum. But if in the
end it doesn't suck it might make a great contrib/ package. *shrug*
who knows?

In a nutshell, I've implemented all the necessary access method
functions to teach postgres about a new index type named "xapian", and
so far, everything is really great. It works just like any other index
type would:

create table test (stuff varchar(255), more_stuff text);
create index idxstuff on test using xapian (stuff);
create index idxmore_stuff on test using xapian (more_stuff);

insert into test (stuff, more_stuff) values ('this is stuff', 'this is
more stuff');
insert into test (stuff, more_stuff) values ('my name is eric ridge',
'i like to drink beer');

select * from test where stuff => 'stuff' or more_stuff => '"drink
beer"';
stuff | more_stuff
-----------------------+----------------------
this is stuff | this is more stuff
my name is eric ridge | i like to drink beer
(2 rows)

All this aside, I've got some questions related to indexes and query
plans.

1) Is it possible for an access method to receive some kind of "DROP
INDEX" notification? As far as I can tell, the answer is "no", but it
can't hurt to ask.

2) When does the query planner decide that it needs to "Filter"
results, and is there any way to turn this off or otherwise fool the
query planner into NOT doing Filters (even if only for certain
operators)?

For example, this query:
select * from test where stuff => 'stuff' AND NOT more_stuff =>
'"drink beer"';
has this plan:
Index Scan using idxstuff on test (cost=0.00..-0.98 rows=250
width=177)
Index Cond: ((stuff)::text => 'stuff'::text)
Filter: (NOT (more_stuff => '"drink beer"'::text))

In this case, postgres is forced to re-parse the contents of the
"more_stuff" field (via the defined procedure for the => operator) for
every row returned by the previous index scan, just so it can determine
if the field contains the phrase 'drink beer' or not. Since so much
overhead is involved here, it would be cool if postgres could somehow
do another index scan.

Maybe there's some way for the operator function to know not only the
Datum value, but the actual field (and ItemPointer)?

3) How does one get the $PGDATA directory? getenv() doesn't seem ideal
since PGDATA can be specified as a command-line argument.

4) Can a function be registered as part of a transaction, pre-commit --
so the function can have an opportunity to abort the transaction. I've
seen RegisterEOXactCallback, but it's not quite what I'm looking for.

5) are there discussions in the archives of this list (or other pgsql-
lists) that discuss fulltext searching that y'all think are worth
reading?

thanks in advance for your time and input!

eric


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Eric Ridge <ebr(at)tcdi(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fulltext searching via a custom index type
Date: 2003-12-26 20:22:49
Message-ID: 27861.1072470169@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Eric Ridge <ebr(at)tcdi(dot)com> writes:
> 1) Is it possible for an access method to receive some kind of "DROP
> INDEX" notification?

No.

> select * from test where stuff => 'stuff' AND NOT more_stuff =>
> '"drink beer"';
> has this plan:

To do better here, you'd need to invent a "not-=>' operator, so that the
above could be simplified to, say,

select * from test where stuff => 'stuff' AND more_stuff ~=> '"drink beer"';

and then define both => and ~=> as members of your index opclass, and
then build a two-column index on (stuff, more_stuff) ... whereupon
the planner would pass your index AM both clauses and it would be up
to you to do something intelligent. You might want to back off a little
on how much of this you really want to do in a first cut.

> 3) How does one get the $PGDATA directory?

DataDir. Why should you care? An index AM that wants to know this is
probably broken IMHO; it's certainly trying to do something that's
outside the charter of index AMs, and is likely to cause lots of
headaches.

> 4) Can a function be registered as part of a transaction, pre-commit --
> so the function can have an opportunity to abort the transaction.

Why would that be a good idea? When exactly would you expect it to be
called (relative to the other ninety-nine things that happen at commit)?
How are you going to recover if something else aborts the transaction,
either before or after your function runs?

I get the impression from your questions that you are trying to make an
end run around the transaction mechanism. This is almost certainly
doomed to failure. You need to find a way of storing all your data
within the ordinary index structure.

regards, tom lane


From: Eric Ridge <ebr(at)tcdi(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fulltext searching via a custom index type
Date: 2003-12-26 20:53:34
Message-ID: 924A5437-37E5-11D8-A406-000A95BB5944@tcdi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Dec 26, 2003, at 3:22 PM, Tom Lane wrote:
>> 3) How does one get the $PGDATA directory?
>
> DataDir. Why should you care? An index AM that wants to know this is
> probably broken IMHO; it's certainly trying to do something that's
> outside the charter of index AMs, and is likely to cause lots of
> headaches.

Xapian has it's own storage subsystem, and that's what I'm using to
store the index... not using anything internal to postgres (although
this could change). As such, Xapian needs to know *where* to save its
indexes... $PGDATA seemed like a good place to start.

>> 4) Can a function be registered as part of a transaction, pre-commit
>> --
>> so the function can have an opportunity to abort the transaction.
>
> Why would that be a good idea? When exactly would you expect it to be
> called (relative to the other ninety-nine things that happen at
> commit)?
> How are you going to recover if something else aborts the transaction,
> either before or after your function runs?

I don't really have an answer to this. :)

> I get the impression from your questions that you are trying to make an
> end run around the transaction mechanism.

Perhaps. I'm actually fishing for ideas to bridge xapian's transaction
facilities to postgres. Your comment confirms my suspicions that it
just ain't gunna work out.

> This is almost certainly doomed to failure. You need to find a way of
> storing all your data
> within the ordinary index structure.

You are probably right. :) I'm just playing around right now. I do
appreciate your response, it's given me a lot to think about.

eric


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Eric Ridge <ebr(at)tcdi(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fulltext searching via a custom index type
Date: 2003-12-26 21:04:34
Message-ID: 3261.1072472674@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Eric Ridge <ebr(at)tcdi(dot)com> writes:
> Xapian has it's own storage subsystem, and that's what I'm using to
> store the index... not using anything internal to postgres (although
> this could change).

I would say you have absolutely zero chance of making it work that way.
You will not be able to get it to interoperate reliably with
transactions, checkpointing, or WAL replay; to say nothing of features
we might add in the future, such as tablespaces and point-in-time recovery.
You need to migrate all the data into the Postgres storage mechanism.

It might be worth pointing out here than an index AM is not bound to use
exactly the typical Postgres page layout. I think you probably do have
to use the standard page header, but the page contents don't have to
look like tuples if you don't want 'em to. For precedent see the hash
index AM, which stores ordinary index tuples on some index pages but
uses other pages for its own purposes.

regards, tom lane


From: Eric Ridge <ebr(at)tcdi(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fulltext searching via a custom index type
Date: 2003-12-26 21:38:03
Message-ID: C9103CEC-37EB-11D8-A406-000A95BB5944@tcdi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Dec 26, 2003, at 4:04 PM, Tom Lane wrote:

> Eric Ridge <ebr(at)tcdi(dot)com> writes:
>> Xapian has it's own storage subsystem, and that's what I'm using to
>> store the index... not using anything internal to postgres (although
>> this could change).
>
> I would say you have absolutely zero chance of making it work that way.

thanks for the encouragement! :)

> You will not be able to get it to interoperate reliably with
> transactions, checkpointing, or WAL replay; to say nothing of features
> we might add in the future, such as tablespaces and point-in-time
> recovery.
> You need to migrate all the data into the Postgres storage mechanism.

And these are the things I'm struggling with now. The basic indexing
and searching currently works flawlessly, but the moment another user
connects up, everything goes to hell.

> It might be worth pointing out here than an index AM is not bound to
> use
> exactly the typical Postgres page layout. I think you probably do have
> to use the standard page header, but the page contents don't have to
> look like tuples if you don't want 'em to. For precedent see the hash
> index AM, which stores ordinary index tuples on some index pages but
> uses other pages for its own purposes.

That's useful information. Thanks. I've been using the hash AM as my
guide b/c it's not as complex as the other index types (atleast on the
public interface side). Obviously, I'm trying to save the time and
energy of re-inventing the wheel when it comes full text indexing and
searching. Xapian is an awesome standalone engine (and it's amazingly
fast too!), so it seemed like a good place to start. It's backend
storage subsystem is pluggable, and after our little exchange here
today, I'm now considering writing a postgres backend for Xapian.

I assume the doc chapter on Page Files and the various storage-related
README files are good places for more information. Any other tips or
pointers?

eric


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Eric Ridge <ebr(at)tcdi(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fulltext searching via a custom index type
Date: 2003-12-26 22:10:04
Message-ID: 7728.1072476604@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Eric Ridge <ebr(at)tcdi(dot)com> writes:
> I assume the doc chapter on Page Files and the various storage-related
> README files are good places for more information. Any other tips or
> pointers?

Not right offhand, but feel free to ask questions when you get stuck.
Also don't forget that there's a wealth of info in source code comments;
you should always try looking for existing code that does something
similar to what you want to do.

BTW, one problem with using the hash AM as a model is that it's not yet
WAL-aware. The btree AM is the only one that's really WAL-ified yet,
so you need to look at that if you want to think now about how to make
your code safe for WAL. I think you'd probably be okay to postpone this
consideration for later, but keep in mind that all your operations that
change the contents of index files should be divisible into bite-size
operations that can be logged as WAL entries. Each "bite-size
operation" has to leave the index in a legal state, too, so there's a
limit as to how small you can subdivide.

regards, tom lane


From: "Peter Eisentraut" <peter_e(at)gmx(dot)net>
To: <christian-kienle(at)losg(dot)org>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Photos of PostgreSQL booth
Date: 2003-12-28 14:29:17
Message-ID: 1198.62.156.29.36.1072621757.squirrel@new.host.name
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Christian Kienle writes:
> Here are two pictures of the PostgreSQL booth on the Linux World Expo &
> Conference in Frankfurt 2003.

There are more photos here:

http://developer.postgresql.org/~petere/past-events/lwe2003-pictures/


From: Gaetano Mendola <mendola(at)bigfoot(dot)com>
To: pgsql-general(at)postgresql(dot)org
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject: Re: Photos of PostgreSQL booth
Date: 2003-12-28 21:05:18
Message-ID: 3FEF458E.4080300@bigfoot.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Peter Eisentraut wrote:

> Christian Kienle writes:
>
>>Here are two pictures of the PostgreSQL booth on the Linux World Expo &
>>Conference in Frankfurt 2003.
>
>
> There are more photos here:
>
> http://developer.postgresql.org/~petere/past-events/lwe2003-pictures/

No one will be at the next fosdem ?

Gaetano


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Gaetano Mendola <mendola(at)bigfoot(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject: Re: Photos of PostgreSQL booth
Date: 2003-12-28 21:56:09
Message-ID: 200312282156.hBSLu9K18335@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Gaetano Mendola wrote:
> Peter Eisentraut wrote:
>
> > Christian Kienle writes:
> >
> >>Here are two pictures of the PostgreSQL booth on the Linux World Expo &
> >>Conference in Frankfurt 2003.
> >
> >
> > There are more photos here:
> >
> > http://developer.postgresql.org/~petere/past-events/lwe2003-pictures/
>
> No one will be at the next fosdem ?

They haven't invited us. Peter was going to check but I haven't heard
back from him.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Eric B(dot)Ridge <ebr(at)tcdi(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Postgres + Xapian (was Re: fulltext searching via a custom index type )
Date: 2004-01-02 04:19:07
Message-ID: CE6BCC23-3CDA-11D8-BF11-000A95D98B3E@tcdi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Dec 26, 2003, at 4:04 PM, Tom Lane wrote:
> Eric Ridge <ebr(at)tcdi(dot)com> writes:
>> Xapian has it's own storage subsystem, and that's what I'm using to
>> store the index... not using anything internal to postgres (although
>> this could change).
>
> I would say you have absolutely zero chance of making it work that way.

I still think this is one of the best quotes I've heard in awhile. :)

> It might be worth pointing out here than an index AM is not bound to
> use
> exactly the typical Postgres page layout.

Thanks again for this little bit of info. It was just enough to get me
thinking about how to make "it work".

Xapian is basically a big btree index, except it's 5 btree indexes.
One for terms, one for posts (terms with positions), one for positions,
one for arbitrary document values, and one for the documents
themselves. Each index is made up of 3 physical files on disk. All
told there's 17 files for a single Xapian index (15 db files, a
versioninfo file, and a lock file).

I couldn't think of a way to create a whole new database type for
Xapian that could deal with managing 5 btree indexes inside of Postgres
(other than using tables w/ standard postgres btree index on certain
fields), so instead, I dug into Xapian and abstracted out it's
filesystem i/o (open, read, write, etc).

(as an aside, I did spend some time pondering ways to adapt Postgres'
nbtree AM to handle this, but I just don't understand how it works)

Once I had Xapian's filesystem i/o encapsulated into a nice little C++
class, I embarked on creating a mini "filesystem" ontop of Postgres'
storage subsystem. In essence, I've now got a Postgres access method
that mirrors the basics of a filesystem, from creating/open files to
reading from and writing to them, in addition to truncation and
deletion.

After that, it was just a matter of the glue code to teach Xapian to
use this "filesystem" for all its filesystem i/o, and voila!, Xapian
works ontop of Postgres' storage subsystem and I didn't have to rewrite
Xapian from scratch. And surprisingly, despite the additional overhead
of this filesystem abstraction layer, it's still very fast... esp. once
Buffers get cached.

I've still got more work to do (like dealing with locking and general
concurrency issues, not to mention bugs I haven't found yet), but it's
working *really* well in a single-user environment.

So here's the important question: How stupid is this?

I've done some benchmarking against tsearch2. Attached are the queries
and execution times on my dual 2gig G5 w/ 2gig ram.

The table contains 51,160 records. It's every text file contained on
my computer (which includes multiple copies of all my java projects).
All told, it's 337,343,569 bytes of data, with an average file size of
6,594 bytes. The Xapian operator is "=>", and tsearch2's operator is
"@@". I ran each query 6 times, and just took the best execution time.

It's also worth noting that my document parser is much different than
tsearch2's. I'm splitting words on non-alphanumerics (and currently am
not using stopwords), and it seems that tsearch2 tries to do something
more intelligent, so the # of results returned vary widely between
tsearch2 and Xapian. I'm not offering an opinion on which way is
"better".

I've got a few more questions about transactions, locking, and a few
other things, but I just thought I'd throw this out as a status report
and to see if there's any kind of reaction.

thanks for your time.

eric

Attachment Content-Type Size
query_timings.txt text/plain 1.6 KB

From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: "Eric B(dot)Ridge" <ebr(at)tcdi(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )
Date: 2004-01-02 21:54:29
Message-ID: 20040102215429.GA507@dcc.uchile.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Thu, Jan 01, 2004 at 11:19:07PM -0500, Eric B.Ridge wrote:

> I couldn't think of a way to create a whole new database type for
> Xapian that could deal with managing 5 btree indexes inside of Postgres
> (other than using tables w/ standard postgres btree index on certain
> fields), so instead, I dug into Xapian and abstracted out it's
> filesystem i/o (open, read, write, etc).
>
> (as an aside, I did spend some time pondering ways to adapt Postgres'
> nbtree AM to handle this, but I just don't understand how it works)

I think your approach is too ugly. You will have tons of problems the
minute you start thinking about concurrency (unless you want to allow
only a single user accessing the index) and recovery (unless you want to
force users to REINDEX when the system crashes).

I think one way of attacking the problem would be using the existing
nbtree by allowing it to store the five btrees. First read the README
in the nbtree dir, and then poke at the metapage's only structure. You
will see that it has a BlockNumber to the root page of the index. Try
modifying that to make it have a BlockNumber to every index's root page.
You will have to provide ways to access each root page and maybe other
nonstandard things (such as telling the root split operation what root
page are you going to split), but you will get recovery and concurrency
(at least to a point) for free.

Hope this helps,

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La espina, desde que nace, ya pincha" (Proverbio africano)


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: "Eric B(dot)Ridge" <ebr(at)tcdi(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a
Date: 2004-01-03 06:49:23
Message-ID: 3FF665F3.6090108@familyhealth.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

> I think one way of attacking the problem would be using the existing
> nbtree by allowing it to store the five btrees. First read the README
> in the nbtree dir, and then poke at the metapage's only structure. You
> will see that it has a BlockNumber to the root page of the index. Try
> modifying that to make it have a BlockNumber to every index's root page.
> You will have to provide ways to access each root page and maybe other
> nonstandard things (such as telling the root split operation what root
> page are you going to split), but you will get recovery and concurrency
> (at least to a point) for free.

Why not write it using the GiST interface - that is after all the entire
point of GiST...

Chris


From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: "Eric B(dot)Ridge" <ebr(at)tcdi(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )
Date: 2004-01-03 16:12:05
Message-ID: 20040103161205.GA10977@dcc.uchile.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Sat, Jan 03, 2004 at 02:49:23PM +0800, Christopher Kings-Lynne wrote:
> >I think one way of attacking the problem would be using the existing
> >nbtree by allowing it to store the five btrees.
>
> Why not write it using the GiST interface - that is after all the entire
> point of GiST...

Well, the project is to adapt Postgres as a Xapian backend ... I think
it is much more straightforward to use the BTrees that Xapian already
uses, than to use GiST. Plus, GiST already has an interface for
fulltext indexing, doesn't it?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La victoria es para quien se atreve a estar solo"


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, "Eric B(dot)Ridge" <ebr(at)tcdi(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a
Date: 2004-01-03 16:15:57
Message-ID: Pine.GSO.4.58.0401031912091.11643@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Sat, 3 Jan 2004, Christopher Kings-Lynne wrote:

> > I think one way of attacking the problem would be using the existing
> > nbtree by allowing it to store the five btrees. First read the README
> > in the nbtree dir, and then poke at the metapage's only structure. You
> > will see that it has a BlockNumber to the root page of the index. Try
> > modifying that to make it have a BlockNumber to every index's root page.
> > You will have to provide ways to access each root page and maybe other
> > nonstandard things (such as telling the root split operation what root
> > page are you going to split), but you will get recovery and concurrency
> > (at least to a point) for free.
>
> Why not write it using the GiST interface - that is after all the entire
> point of GiST...

I think this would be the best approach, we have already btree_gist which
is a gist realization of btree. We started working on concurrency and recovery
for GiST and hope to get it working for 7.5. Another work we'd like to
work is extending of GiST interface, so other important ADT's like digital
trees, quad-tree, s-tree etc could be implemented. But we decided to get
concurrency and recovery works at first.

>
> Chris
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: Eric Ridge <ebr(at)tcdi(dot)com>
To: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )
Date: 2004-01-05 16:00:36
Message-ID: 4CF83855-3F98-11D8-ADB4-000A95BB5944@tcdi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Jan 2, 2004, at 4:54 PM, Alvaro Herrera wrote:
> I think your approach is too ugly. You will have tons of problems the
> minute you start thinking about concurrency (unless you want to allow
> only a single user accessing the index)

It might be ugly, but it's very fast. Surprisingly fast, actually.

Concerning concurrency, Xapian internally supports multiple readers and
only 1 concurrent writer. So the locking requirements should be far
less complex than a true concurrent solution. Now, I'm not arguing
that this ideal, but if Xapian is a search engine you're interested in,
then you've already made up your mind that you're willing to deal with
1 writer at a time.

However, Xapian does have built-in support for searching multiple
databases at once. One thought I've had is to simply create a new
1-document database on every INSERT/UPDATE beyond the initial CREATE
INDEX. Then whenever you do an index scan, tell Xapian to use all the
little databases that exist in the index. This would give some bit of
concurrency. Then on VACUUM (or FULL), all these little databases
could be merged back into the main index.

> and recovery (unless you want to force users to REINDEX when the
> system crashes).

I don't yet understand how the WAL stuff works. I haven't looked at
the API's yet, but if something you can record is "write these bytes to
this BlockNumber at this offset", or if you can say, "index Tuple X
from Relation Y", then it seems like recovery is still possible.

If ya can't do any of that, then I need to go look at WAL further.

> I think one way of attacking the problem would be using the existing
> nbtree by allowing it to store the five btrees. First read the README
> in the nbtree dir, and then poke at the metapage's only structure. You
> will see that it has a BlockNumber to the root page of the index.

Right, I had gotten this far in my investigation already. The daunting
thing about trying to use the nbtree code, is the a code itself. It's
very complex. Plus, I just don't know how well the rest of Xapian
would respond to all of a sudden having a concurrent backend. It's
likely that it would make no difference, but it's just an unknown to me
at this time.

> Try modifying that to make it have a BlockNumber to every index's root
> page.
> You will have to provide ways to access each root page and maybe other
> nonstandard things (such as telling the root split operation what root
> page are you going to split), but you will get recovery and concurrency
> (at least to a point) for free.

And I'm not convinced that recovery and concurrency would be "for free"
in this case either. The need to keep essentially 5 different trees in
sync greatly complicates the concurrency issue, I would think.

thanks for your time!

eric


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Gaetano Mendola <mendola(at)bigfoot(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Photos of PostgreSQL booth
Date: 2004-01-05 19:15:09
Message-ID: 200401052015.09833.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Bruce Momjian wrote:
> Gaetano Mendola wrote:
> > No one will be at the next fosdem ?
>
> They haven't invited us. Peter was going to check but I haven't
> heard back from him.

I will be in attendance, but as far as I know there isn't going to be
anyone holding a talk about PostgreSQL or something like that.


From: Eric Ridge <ebr(at)tcdi(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )
Date: 2004-01-08 23:16:28
Message-ID: B021131D-4230-11D8-ADB4-000A95BB5944@tcdi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Thanks to everyone that provided input about this idea. After taking
everything into consideration and talking with Olly Betts from the
Xapian project, what I'm going to do is implement a btree that supports
multiple roots and an "tag" value of arbitrary length for each item in
the tree. Then implement a new Xapian backend that uses it. In the
end, it'll be much more efficient and much less dirty than this silly
filesystem thing I have now. And hopefully, concurrency and WAL will
be easier to deal with too. Having the existing nbtree AM as guide
will be very useful.

I have one issue that could potentially make all of this completely
useless, and it's related to Postgres deciding to use a Filter instead
of an Index Scan.

I asked this question last week, and Tom Lane responded with a
solution, but I don't think I explained myself very well. And now that
I've got a bit more experience with some of the PG internals, maybe I
can ask the question more intelligently.

Given this query:
select id from table where document_text ==> 'food' and title ==>
'water';

Postgres generates this plan:
Index Scan using idxtitle on table (cost=0.00..4.01 rows=1 width=8)
Index Cond: (title ==> 'water'::text)
Filter: (document_text ==> 'food'::text)

The problem is, the "document_text" column is *really* big. Average
value length is 171k.

With this query plan, my operator procedure is forced to re-parse the
document_text column from each row returned by the index scan against
the title column, and do a bunch of Xapian tricks for each row. The
overhead is huge.

The composite index solution doesn't seem ideal here because there's
absolutely no way I can anticipate every combination of fields a user
might choose to search.

Forgetting about composite indexes for a moment, is postgres even
capable of doing 2 index scans in this situation? Does it know how do
take the intersection of two scans?

My AM's cost estimate function literally sets the selectivity,
correlation, total cost, and startup cost values to zero (and I've
tried tons of combinations of really big, really small, and really
negative values). My thought behind setting them all to zero was that
if the cost function basically says, "There is no cost", I could fool
Postgres into wanting to use the index everywhere it can. Sadly, this
doesn't work out.

Now, I realize I can fix my cost estimate function to return better
costs for the title and document_text fields so that PG will instead
decide to index scan on document_text, but what I really want to do is
make PG do an index scan for each field.

Can PG even do this?

I'd appreciate any thoughts on this. Hopefully, I'm just missing
something really obvious.

thanks!

eric


From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Eric Ridge <ebr(at)tcdi(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a
Date: 2004-01-09 13:31:50
Message-ID: 1073655109.3143.18.camel@fuji.krosing.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Eric Ridge kirjutas R, 09.01.2004 kell 01:16:

> Forgetting about composite indexes for a moment, is postgres even
> capable of doing 2 index scans in this situation? Does it know how do
> take the intersection of two scans?

Not currently, but it has been in TODO for quite some time:

* Use bitmaps to fetch heap pages in sequential order [performance]
* Use bitmaps to combine existing indexes [performance]

---------------
Hannu


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Hannu Krosing <hannu(at)tm(dot)ee>
Cc: Eric Ridge <ebr(at)tcdi(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a
Date: 2004-01-12 23:47:36
Message-ID: 200401122347.i0CNla928182@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hannu Krosing wrote:
> Eric Ridge kirjutas R, 09.01.2004 kell 01:16:
>
> > Forgetting about composite indexes for a moment, is postgres even
> > capable of doing 2 index scans in this situation? Does it know how do
> > take the intersection of two scans?
>
> Not currently, but it has been in TODO for quite some time:
>
> * Use bitmaps to fetch heap pages in sequential order [performance]
> * Use bitmaps to combine existing indexes [performance]

And this the foundation of bitmapped indexes, as I understand it, and is
used for "star" joins, popular with data warehousing.

Rather than use those terms, I spelled out the actual behavior we want.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a
Date: 2004-01-13 04:32:15
Message-ID: 87zncsqxio.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:

> Hannu Krosing wrote:
> > Eric Ridge kirjutas R, 09.01.2004 kell 01:16:
> >
> > > Forgetting about composite indexes for a moment, is postgres even
> > > capable of doing 2 index scans in this situation? Does it know how do
> > > take the intersection of two scans?
> >
> > Not currently, but it has been in TODO for quite some time:
> >
> > * Use bitmaps to fetch heap pages in sequential order [performance]
> > * Use bitmaps to combine existing indexes [performance]
>
> And this the foundation of bitmapped indexes, as I understand it, and is
> used for "star" joins, popular with data warehousing.

Actually I think the thinking with those two TODO items is to use these
algorithms in conjunction with regular indexes. Doing regular index scans in
parallel, construct bitmaps, perform logical operations to join them, and then
do the table scan.

It could be done in conjunction with a scheme for storing bitmap indexes or
not. Probably not though.

--
greg