Multiple DB join

From: Sumeet Ambre <sambre(at)indiana(dot)edu>
To: pgsql-sql(at)postgresql(dot)org
Subject: Multiple DB join
Date: 2006-08-14 21:26:10
Message-ID: 44E0EA72.2090505@indiana.edu
Lists: pgsql-sql

Hi All,

I have a database which consists of 20 million records, and I've split
it up into 6-7 databases. I have a base database consisting of the ids
that link all the databases together. I perform searches on this single
base table; the search returns ids that point to records in the other
databases I split off. Now I need to retrieve those records. Is there a
way to join tables from multiple databases, the way we can join multiple
tables in a single database?

Thanks,
SA.


From: "Shoaib Mir" <shoaibmir(at)gmail(dot)com>
To: "Sumeet Ambre" <sambre(at)indiana(dot)edu>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-14 22:04:35
Message-ID: bf54be870608141504m428b3af1tf1706fb8ed537d7e@mail.gmail.com
Lists: pgsql-sql

I think using the contrib module 'dblink' (
http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/dblink/) can help
you here.
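
Something along these lines might do it; a rough, untested sketch (the
connection string, remote table, and column list are made up and would
have to match your setup):

SELECT b.doc_id, r.title
FROM base b
JOIN dblink('dbname=docs_a',
            'SELECT doc_id, title FROM records')
     AS r(doc_id integer, title text)
  ON r.doc_id = b.doc_id
WHERE b.db_name = 'docs_a';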

Thanks,
--
Shoaib Mir
EnterpriseDB (www.enterprisedb.com)

On 8/15/06, Sumeet Ambre <sambre(at)indiana(dot)edu> wrote:
>
> Hi All,
>
> I have a database which consists of 20 million records, and I've split
> it up into 6-7 databases. I have a base database consisting of the ids
> that link all the databases together. I perform searches on this single
> base table; the search returns ids that point to records in the other
> databases I split off. Now I need to retrieve those records. Is there a
> way to join tables from multiple databases, the way we can join multiple
> tables in a single database?
>
> Thanks,
> SA.
>


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-15 10:36:24
Message-ID: 20060815103624.GA21634@phlogiston.dyndns.org
Lists: pgsql-sql

On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
> Hi All,
>
> I have a database which consists of 20 million records, and I've split
> it up into 6-7 databases.

You can do this (as someone suggested with dblink), but I'm wondering
why the split? 20 million records isn't very big.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
Unfortunately reformatting the Internet is a little more painful
than reformatting your hard drive when it gets out of whack.
--Scott Morris


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-15 11:38:32
Message-ID: 20060815113832.GA99592@winnie.fuhr.org
Lists: pgsql-sql

On Tue, Aug 15, 2006 at 06:36:24AM -0400, Andrew Sullivan wrote:
> On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
> > I have a database which consists of 20 million records, and I've split
> > it up into 6-7 databases.
>
> You can do this (as someone suggested with dblink), but I'm wondering
> why the split? 20 million records isn't very big.

And why multiple databases instead of multiple schemas within the
same database? Or even all data in the same schema? Is there a
reason for the segregation?

--
Michael Fuhr


From: Jorge Godoy <jgodoy(at)gmail(dot)com>
To: Michael Fuhr <mike(at)fuhr(dot)org>
Cc: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-15 12:01:03
Message-ID: 87fyfy8ayo.fsf@gmail.com
Lists: pgsql-sql

Michael Fuhr <mike(at)fuhr(dot)org> writes:

> And why multiple databases instead of multiple schemas within the
> same database? Or even all data in the same schema? Is there a
> reason for the segregation?

I can think of spreading out processing requirements as one reason, and
distributing load as another. Disk space can be solved with new disks
and tablespaces, but we can't yet distribute load across several servers
without partitioning.

--
Jorge Godoy <jgodoy(at)gmail(dot)com>


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: Jorge Godoy <jgodoy(at)gmail(dot)com>
Cc: Michael Fuhr <mike(at)fuhr(dot)org>, Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-15 13:18:20
Message-ID: 20060815131820.GA21939@phlogiston.dyndns.org
Lists: pgsql-sql

On Tue, Aug 15, 2006 at 09:01:03AM -0300, Jorge Godoy wrote:
> I can think of spreading out processing requirements as one reason, and
> distributing load as another. Disk space can be solved with new disks
> and tablespaces, but we can't yet distribute load across several servers
> without partitioning.

The cost of inter-node communication isn't nothing, though. It
strikes me as at least possible that the overhead of dblink is going
to be larger than whatever gains one makes from adding a new server.
For only 20M rows, I find it pretty hard to believe the gain is going
to be demonstrable.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
Everything that happens in the world happens at some place.
--Jane Jacobs


From: Jorge Godoy <jgodoy(at)gmail(dot)com>
To: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc: Michael Fuhr <mike(at)fuhr(dot)org>, pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-15 13:25:00
Message-ID: 87ejvi6sib.fsf@ieee.org
Lists: pgsql-sql

Andrew Sullivan <ajs(at)crankycanuck(dot)ca> writes:

> On Tue, Aug 15, 2006 at 09:01:03AM -0300, Jorge Godoy wrote:
>> I can think of spreading out processing requirements as one reason, and
>> distributing load as another. Disk space can be solved with new disks
>> and tablespaces, but we can't yet distribute load across several servers
>> without partitioning.
>
> The cost of inter-node communication isn't nothing, though. It
> strikes me as at least possible that the overhead of dblink is going
> to be larger than whatever gains one makes from adding a new server.
> For only 20M rows, I find it pretty hard to believe the gain is going
> to be demonstrable.

I totally agree with you. But when your database grows, it might start
getting interesting.

I'd like to see some sort of data partitioning in PostgreSQL.

--
Jorge Godoy <jgodoy(at)gmail(dot)com>


From: Sumeet Ambre <sambre(at)indiana(dot)edu>
To: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-15 14:11:41
Message-ID: 44E1D61D.4020400@indiana.edu
Lists: pgsql-sql

Andrew Sullivan wrote:
> On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
>
>> Hi All,
>>
>> I have a database which consists of 20 million records, and I've split
>> it up into 6-7 databases.
>>
>
> You can do this (as someone suggested with dblink), but I'm wondering
> why the split? 20 million records isn't very big.
>
> A
>
The database is designed this way because our organization wants to
split different datasets into different entities, and they may run a
separate instance of Postgres for each dataset. The problem is that one
of the databases consists of 14 million records, and when I query the
base database, which consists of 20 million records, the query runs
painfully slowly. Below is the sample schema for my base table and the
query I try to run on it; a single query takes more than 2-3 minutes.
Is there any way I could speed this up?

sample schema for my base table
---------------------------------

doc_id | title | authors | pub_year | abs | db_name | ...

In the above schema, the field db_name names the other database that
contains the whole record. I'm trying to run a query that searches on
title, author names, pub_year, or abstract simultaneously.

e.g. SELECT * FROM base WHERE ((title ILIKE '%something%') AND (authors
~* '.*something.*') AND (db_name='something'));

Any suggestions for speeding up this query?

Thanks,
Sumeet.


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-15 14:17:12
Message-ID: 20060815141712.GB21939@phlogiston.dyndns.org
Lists: pgsql-sql

On Tue, Aug 15, 2006 at 10:11:41AM -0400, Sumeet Ambre wrote:
> >
> The database is designed this way because our organization wants to
> split different datasets into different entities, and they may run a
> separate instance of Postgres for each dataset.

It's this "different instances" thing that I'm having a tough time
with. Is this because they want to be able to query local things
when disconnected or something? I can think of applications for
this, for sure, I'm just suggesting that you make sure you're not
building an optimisation that is (1) premature and (2) possibly a
pessimal operation.

> records, the query runs painfully slowly. Below is the sample schema
> for my base table and the query I try to run on it; a single query
> takes more than 2-3 minutes. Is there any way I could speed this up?

The problem is not your design, nor even the size of the data
exactly, but the query:

> e.g. SELECT * FROM base WHERE ((title ILIKE '%something%') AND (authors
> ~* '.*something.*') AND (db_name='something'));

You have two initially-unbound search terms there: ILIKE '%' and ~*
'.*' are automatically seqscans, because you have nowhere in the
index to start. If you really want to do this kind of unbound-string
query, you need to look into full text search. The above approach is
never going to be fast.
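
You can see this with EXPLAIN; whatever indexes exist, a plan for the
unanchored pattern degenerates to something like the following
(illustrative output only):

EXPLAIN SELECT * FROM base WHERE title ILIKE '%something%';
-- Seq Scan on base  (cost/row estimates elided)
--   Filter: (title ~~* '%something%'::text)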

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.
--Dennis Ritchie


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-15 14:17:43
Message-ID: 20060815141743.GC21939@phlogiston.dyndns.org
Lists: pgsql-sql

On Tue, Aug 15, 2006 at 10:25:00AM -0300, Jorge Godoy wrote:
> I'd like to see some sort of data partitioning in PostgreSQL.

Sure, I think everybody would. I think it's among the more
interesting problems we have.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
The plural of anecdote is not data.
--Roger Brinner


From: Sumeet <asumeet(at)gmail(dot)com>
To: "Andrew Sullivan" <ajs(at)crankycanuck(dot)ca>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-15 14:35:36
Message-ID: 7539aebb0608150735l34e59630m25487549c8e6de2d@mail.gmail.com
Lists: pgsql-sql

On 8/15/06, Andrew Sullivan <ajs(at)crankycanuck(dot)ca> wrote:
>
> On Tue, Aug 15, 2006 at 10:11:41AM -0400, Sumeet Ambre wrote:
> > >
> > The database is designed this way because our organization wants to
> > split different datasets into different entities, and they may run a
> > separate instance of Postgres for each dataset.
>
> It's this "different instances" thing that I'm having a tough time
> with. Is this because they want to be able to query local things
> when disconnected or something? I can think of applications for
> this, for sure, I'm just suggesting that you make sure you're not
> building an optimisation that is (1) premature and (2) possibly a
> pessimal operation.

The reason for splitting the dbs into different instances is that if
one of the Postgres instances on the server goes down for some reason,
it doesn't affect the other instances running on the same server. Even
I don't know the reasoning behind this approach, but I have to deal
with it.

> > records, the query runs painfully slowly. Below is the sample schema
> > for my base table and the query I try to run on it; a single query
> > takes more than 2-3 minutes. Is there any way I could speed this up?
>
> The problem is not your design, nor even the size of the data
> exactly, but the query:
>
> > e.g. SELECT * FROM base WHERE ((title ILIKE '%something%') AND (authors
> > ~* '.*something.*') AND (db_name='something'));
>
> You have two initially-unbound search terms there: ILIKE '%' and ~*
> '.*' are automatically seqscans, because you have nowhere in the
> index to start. If you really want to do this kind of unbound-string
> query, you need to look into full text search. The above approach is
> never going to be fast.
>

I previously thought of doing full text search indexing, but my
intuition was that full text search is for fields with very large
strings; in my case the strings are never above 200 chars in length.
Would full text search give me the performance we need? I also tried
combined indexes on title and authors in my base table; would indexes
of any kind help me in this case?

Thanks for your prompt replies,
Sumeet.

--
Thanks,
Sumeet.


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-15 14:45:39
Message-ID: 20060815144539.GE21939@phlogiston.dyndns.org
Lists: pgsql-sql

On Tue, Aug 15, 2006 at 10:35:36AM -0400, Sumeet wrote:
>
> The reason for splitting the dbs into different instances is that if
> one of the Postgres instances on the server goes down for some reason,
> it doesn't affect the other instances running on the same server. Even
> I don't know the reasoning behind this approach, but I have to deal
> with it.

I certainly sympathise with the situation where managers decide to
solve problems that never happen. Given that these are on the same
server, it is _way_ more likely that you'll introduce a problem due
to running several instances of the back end than that one instance
of Postgres will magically die (especially in such a way that other
instances will continue to work). But if Some Important Person
decreed it, you have my sympathy.

> I previously thought of doing full text search indexing, but my
> intuition was that full text search is for fields with very large
> strings; in my case the strings are never above 200 chars in length.

No, it's for data where you're going to be searching in random pieces
of the text. I think you should look at tsearch2, probably.
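
A rough sketch of what that looks like (assuming the tsearch2 contrib
module is installed; the column and index names are invented):

ALTER TABLE base ADD COLUMN ts tsvector;
UPDATE base SET ts = to_tsvector('default',
    coalesce(title, '') || ' ' || coalesce(authors, ''));
CREATE INDEX base_ts_idx ON base USING gist (ts);

-- searches then go through the index via the @@ operator:
SELECT * FROM base WHERE ts @@ to_tsquery('default', 'something');

tsearch2 also ships a trigger function to keep the column current on
INSERT and UPDATE.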

If you had any way to enforce bounded searches, it'd be a different
matter: strings with initial matches but an unbound end are fast.
(You can do it the other way, too, by some tricks with reversing the
strings.)
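
For instance (another untested sketch; note there is no built-in
reverse() function here, so you'd have to define one and mark it
IMMUTABLE before indexing on it):

-- prefix searches are fast with a btree index (text_pattern_ops if
-- your locale isn't C):
CREATE INDEX base_title_idx ON base (title text_pattern_ops);
SELECT * FROM base WHERE title LIKE 'something%';

-- suffix searches via the reversal trick; 'gnihtemos' is
-- reverse('something'):
CREATE INDEX base_title_rev_idx ON base (reverse(title) text_pattern_ops);
SELECT * FROM base WHERE reverse(title) LIKE 'gnihtemos%';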

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun


From: Sumeet <asumeet(at)gmail(dot)com>
To: "Andrew Sullivan" <ajs(at)crankycanuck(dot)ca>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-18 15:07:24
Message-ID: 7539aebb0608180807j4ca4da62l4e464f4ea4eaa984@mail.gmail.com
Lists: pgsql-sql

On 8/15/06, Andrew Sullivan <ajs(at)crankycanuck(dot)ca> wrote:

> > I previously thought of doing full text search indexing, but my
> > intuition was that full text search is for fields with very large
> > strings; in my case the strings are never above 200 chars in length.
>
> No, it's for data where you're going to be searching in random pieces
> of the text. I think you should look at tsearch2, probably.

Will full text search indexing help me achieve good speed when
searching keywords?

> If you had any way to enforce bounded searches, it'd be a different
> matter: strings with initial matches but an unbound end are fast.
> (You can do it the other way, too, by some tricks with reversing the
> strings.)

Can someone please elaborate a little on ways to enforce bounded
searches? I'm basically trying a simple search, i.e. matching author
names a user enters into an interface against the 20 million records in
my db. Can anyone suggest a good way to perform this kind of search?

Thanks,
Sumeet.


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: Sumeet <asumeet(at)gmail(dot)com>
Cc: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>, pgsql-sql(at)postgresql(dot)org
Subject: Re: Multiple DB join
Date: 2006-08-18 15:20:19
Message-ID: 20060818152019.GB31921@phlogiston.dyndns.org
Lists: pgsql-sql

On Fri, Aug 18, 2006 at 11:07:24AM -0400, Sumeet wrote:
>
> Will full text search indexing help me achieve good speed when
> searching keywords?

I think this depends on how you use it.

> Can someone please elaborate a little on ways to enforce bounded
> searches? I'm basically trying a simple search, i.e. matching author
> names a user enters into an interface against the 20 million records
> in my db. Can anyone suggest a good way to perform this kind of
> search?

The thing is, if they're keywords, why are you treating them as
fragments? Traditionally, keywords are not substrings, but full
barewords. A bareword match should be fast, because it's looking for
the whole string. So you shouldn't need the "%" characters.
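
That is, a whole-value match can use an ordinary btree index (a sketch;
the index is assumed, the names come from earlier in the thread):

CREATE INDEX base_authors_idx ON base (authors);
SELECT * FROM base WHERE authors = 'some author';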

Maybe the problem is that you have something like the following. If
your data is stored like this:

subject | keyword
subject1 | keyword1 keyword2 keyword3

and you want every subject that matches on keyword2, then you have to
search it with something like SELECT subject ... WHERE keyword LIKE
'%keyword2%'. The reason you have to do that is that your data is badly
normalised. Is that it?
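
A normalised layout would look roughly like this (names invented): one
row per subject/keyword pair, so the lookup becomes an exact, indexable
match:

CREATE TABLE subject_keyword (
    subject text NOT NULL,
    keyword text NOT NULL,
    PRIMARY KEY (subject, keyword)
);
CREATE INDEX subject_keyword_kw_idx ON subject_keyword (keyword);

SELECT subject FROM subject_keyword WHERE keyword = 'keyword2';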

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
Information security isn't a technological problem. It's an economics
problem.
--Bruce Schneier