Re: Searching BLOB - Lucene setup & problem

Lists: pgsql-general
From: "James Watson" <jdwatson1(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Searching BLOB
Date: 2006-06-12 22:18:29
Message-ID: 8f38f800606121518i12bc010dn7d11fb8f4601444f@mail.gmail.com

Hi,
I am not 100% sure what the best solution would be, so I was hoping
someone could point me in the right direction.

I usually develop with MS tools, such as .NET, ASP, SQL Server, etc.,
but I really want to expand my skillset and learn as much about
PostgreSQL as possible.

What I need to do is design a DB that will index and store
approximately 300 Word docs, each no more than 1MB in size. Users
need to be able to search the Word documents for keywords/phrases to
identify which one to use.

So, I need to write 2 web interfaces: a front end and a back end. The
front end is for the users who will search for their documents, and the
back end is for an admin person to upload new/amended documents to the
DB to be searchable.

Now, I could do this with the usual MS tools that I work with, using
BLOBs and the built-in full-text searching that comes with SQL Server,
but I don't have these to work with at the moment. I am working with
Postgres & JSP pages.

What I was hoping someone could help me out with was identifying the
best possible solution to use.

1. How can I store the Word docs in the DB? Would it be best to use a
BLOB data type?

2. Does Postgres support full-text searching of a Word document once it
is loaded into the BLOB column, and how would this work? Would I have to
unload each BLOB object and convert it back to text to search, or does
Postgres have the ability to complete the full-text search of a BLOB,
like MSSQL Server & Oracle do?

3. Is there a way to export the Word doc from the BLOB column and dump
it into a PDF format (I guess I am asking if someone has seen or
written a PDF generator script/stored procedure for Postgres)?

If someone could help me out, it would be greatly appreciated.

cheers,
James


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: James Watson <jdwatson1(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Searching BLOB
Date: 2006-06-13 12:17:19
Message-ID: 448EACCF.9000800@phlo.org

James Watson wrote:
> What I was hoping someone could help me out with was identifying the
> best possible solution to use.
>
> 1. How can I store the Word docs in the DB? Would it be best to use a
> BLOB data type?
You can use the column type "bytea", which can store (nearly) arbitrary
amounts of binary data.

> 2. Does Postgres support full-text searching of a Word document once it
> is loaded into the BLOB column, and how would this work? Would I have to
> unload each BLOB object and convert it back to text to search, or does
> Postgres have the ability to complete the full-text search of a BLOB,
> like MSSQL Server & Oracle do?
Postgres has full-text indexing support; look for tsearch2 in the
contrib modules. A bytea column is basically used like a string, so
there is no need to load/unload the blob.
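
As a rough illustration of the bytea approach from a Java client, here is a
minimal JDBC sketch. It is not from the original discussion: the class name
DocStore, the table "docs" and its columns are all hypothetical, and it
assumes the plain text of each document has already been extracted by some
other tool, since a full-text index works on text rather than on the raw
Word file. It also needs the PostgreSQL JDBC driver on the classpath to
actually connect.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch only: the table "docs" and its columns are hypothetical, e.g.
//   CREATE TABLE docs (name text PRIMARY KEY, content bytea, body text);
class DocStore {
    static final String INSERT_SQL =
        "INSERT INTO docs (name, content, body) VALUES (?, ?, ?)";

    /** Stores the raw file bytes plus separately extracted plain text. */
    static int store(Connection conn, String name, byte[] content, String body)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, name);
            ps.setBytes(2, content);   // byte[] is sent to the bytea column
            ps.setString(3, body);     // text that a full-text index could cover
            return ps.executeUpdate();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: DocStore <jdbc-url> <doc-file> <text-file>");
            return;
        }
        byte[] content = Files.readAllBytes(Paths.get(args[1]));
        String body = new String(Files.readAllBytes(Paths.get(args[2])), "UTF-8");
        try (Connection conn = DriverManager.getConnection(args[0])) {
            System.out.println("rows inserted: " + store(conn, args[1], content, body));
        }
    }
}
```

Keeping the extracted text in its own column is only one possible layout; it
keeps the binary original intact while giving the full-text index something
it can tokenize.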

PostgreSQL also has the concept of a LOB (large object) as a distinct entity.
Accessing LOBs needs special support from your client library
(standard libpq provides that support, of course). They have the advantage
that you can open/seek/close them like a regular file. The disadvantage
is that you can't store them in columns: they are referenced via OIDs, and
you need to store those OIDs. You also can't put triggers on LOBs, and
I'm not sure how transaction-safe they are.

> 3. Is there a way to export the Word doc from the BLOB column and dump
> it into a PDF format (I guess I am asking if someone has seen or
> written a PDF generator script/stored procedure for Postgres)?
You can use Java as a backend language with PostgreSQL (google for pljava),
so you can do pretty much whatever you can do with Java.

greetings, Florian Pflug


From: "John Sidney-Woollett" <johnsw(at)wardbrook(dot)com>
To: "James Watson" <jdwatson1(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Searching BLOB
Date: 2006-06-13 12:28:18
Message-ID: 14323.195.152.219.3.1150201698.squirrel@mercury.wardbrook.com

Save yourself some effort and use Lucene to index a directory of your 300
Word documents. I'm pretty sure that Lucene includes an extension to read
Word documents, and you can use PDFBox to read/write PDF files. Marrying
the searching and displaying of results to your web application should be
trivial since you want to use Java anyway. Lucene has full character
set support and is blindingly fast.

If you're looking for a solution to this problem using Postgres, then
you'll be creating a ton of extra work for yourself. If you want to
learn more about Postgres, then maybe it'll be worthwhile.

John


From: "jdwatson1(at)gmail(dot)com" <jdwatson1(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Searching BLOB - Lucene setup & problem
Date: 2006-06-14 03:31:02
Message-ID: 1150255862.298762.146760@u72g2000cwu.googlegroups.com

Hi John,
I have had a read through the Lucene website
(http://lucene.apache.org/java/docs/index.html) and it sounds pretty
good to me. I should be able to use this in conjunction with my JSP
pages.

This may sound quite dumb to anyone who develops in Java, but I need a
little help setting up the demo on my Windows XP machine. I have
installed JDK 1.5.0_07 and Tomcat, and I can confirm that it
is all up and running correctly, as I have already written a few simple
JSP pages.

I have downloaded the Lucene package, extracted it to my C:\ drive,
and followed the steps on the demo page:
http://lucene.apache.org/java/docs/demo.html

But when I try to run "java org.apache.lucene.demo.IndexFiles
c:\lucene-2.0.0\src" from the cmd prompt, I get the following error:

"Exception in thread 'main' java.lang.NoClassDefFoundError:
org/apache/lucene/analysis/Analyzer"

I am not sure why this is coming up. I have followed the instructions
on the demo page on the web.

The only thing I can think of is that I may have my "CLASSPATH" incorrect.
Can someone help me out with a basic description of what the classpath
is and where I should point the classpath environment variable to?

Once I have that correct, I think that I may be able to run the demo.

Thanks for any help you can provide.

James


From: John Sidney-Woollett <johnsw(at)wardbrook(dot)com>
To: "jdwatson1(at)gmail(dot)com" <jdwatson1(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Searching BLOB - Lucene setup & problem
Date: 2006-06-16 06:38:13
Message-ID: 449251D5.7080908@wardbrook.com

This is a bit off topic for the Postgres list... ;)

Make sure you explicitly include the name of the Lucene jar file in your
command line invocation, and any other directories that are required
(normally your current working directory), so for Windows you'd use
something like

java -cp .;{pathto}\lucene-1.4.3.jar YourJavaApp

When you use Lucene in your webapp, include the Lucene jar file in
{tomcat_home}\common\lib or the WEB-INF\lib directory under your webapp.
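
If it is unclear whether the jar is actually being picked up, a small
diagnostic along these lines may help. This is only a sketch: the class name
ClasspathCheck is made up for illustration, and the default class it looks
for (org.apache.lucene.analysis.Analyzer) is assumed to be the one the demo
failed to load. It prints the classpath the JVM was started with and reports
whether a given class can be located on it.

```java
// Minimal classpath diagnostic; ClasspathCheck is a hypothetical helper name.
class ClasspathCheck {
    /** Returns true if the named class can be located by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The effective classpath, whether set via -cp or the CLASSPATH variable
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        String name = args.length > 0 ? args[0] : "org.apache.lucene.analysis.Analyzer";
        System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found"));
    }
}
```

Run it with the same -cp value you intend to use for the demo; if the class
is reported as not found, the jar path in -cp is the first thing to check.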

Hope that helps.

John
