Re: Graph datatype addition

Lists: pgsql-hackers
From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Graph datatype addition
Date: 2013-04-28 05:06:18
Message-ID: 4257ED30-050C-451B-AE76-8263E0B375E9@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

Inspired by the awesome work done by Oleg sir in HStore, I have been thinking about making a graph data type as an extension in postgres.

I have been reading about graph databases and how storing data in graphs can lead to some really awesome functionalities such as social network analysis, recommender systems, network management.

Essentially, connected data can be represented effectively as a single data item, which can be used for further analysis.

This is in line with my agenda of adding more analytics functionalities in postgres.

I have been thinking about designing the data type as adjacency sets, associating the adjacency list for each node as a value with node identifier as the key.

This should be able to build over HStore, using HStore to associate adjacency list with its node identifier.

The format could be:

<node identifier> => <adjacency list> <node identifier> => <adjacency list> '\0'

So,

"node1" => "node2/node3/node4","node2" => "node1/node5/node6","node3" => "node1/node4/node5" '\0'

This can be for unweighted graphs, we can add support for weighted graphs as well.

Thoughts/comments/advice please?

Regards,

Atri

Sent from my iPad


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 00:41:23
Message-ID: CA+Tgmoaycjunt4=emMXtp0eN8bV4nmz2qz_8+jnzyiGi188qRA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Apr 28, 2013 at 1:06 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
> Inspired by the awesome work done by Oleg sir in HStore, I have been thinking about making a graph data type as an extension in postgres.
>
> I have been reading about graph databases and how storing data in graphs can lead to some really awesome functionalities such as social network analysis, recommender systems, network management.
>
> Essentially, connected data can be represented effectively as a single data item, which can be used for further analysis.
>
> This is in line with my agenda of adding more analytics functionalities in postgres.
>
> I have been thinking about designing the data type as adjacency sets, associating the adjacency list for each node as a value with node identifier as the key.
>
> This should be able to build over HStore, using HStore to associate adjacency list with its node identifier.
>
> The format could be:
>
> <node identifier> => <adjacency list> <node identifier> => <adjacency list> '\0'
>
> So,
>
> "node1" => "node2/node3/node4","node2" => "node1/node5/node6","node3" => "node1/node4/node5" '\0'
>
> This can be for unweighted graphs, we can add support for weighted graphs as well.
>
> Thoughts/comments/advice please?

It's probably pretty easy to add this, but I think the question is
what would make it better than storing the same representation in a
text field. Obviously you get validation that the input is in the
correct format, but you could do that with a CHECK constraint, too, or
otherwise handle it in the application. So I think the really
interesting question is: what operations would you support on this
data type?

One of the problems you're likely to run into if you store the whole
graph as a single object is that it may make many of the things you
want to do with it not very efficient.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 05:55:27
Message-ID: CAOeZVicYU4pCC6wCrMEzaEPw-aAPUo0UeLPQ49kMajH2y3uosg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> It's probably pretty easy to add this, but I think the question is
> what would make it better than storing the same representation in a
> text field.

I completely agree. The main point in making a new datatype would be
to add support for operations that are normally done with graphs.

>Obviously you get validation that the input is in the
> correct format, but you could do that with a CHECK constraint, too, or
> otherwise handle it in the application. So I think the really
> interesting question is: what operations would you support on this
> data type?

I can think of the standard tasks, i.e. searching if two nodes are
connected or not,adding new nodes and edges, traversing the adjacency
lists of nodes.

If we add support for weighted graphs, we can probably add support for
some common graph algorithms, such as Djikstra's algorithm, Bellman
Ford algorithm, a MST making algorithm, network flow algorithms.

The main idea is to allow user to work with graphs pretty easily, and
allow the user to use the data present in his database to make graphs
and then process them.

> One of the problems you're likely to run into if you store the whole
> graph as a single object is that it may make many of the things you
> want to do with it not very efficient.

Yes, I agree. On further thought, I believe it would be more of a pain
if we stick to representing the whole thing as one.Rather,making
multiple components will be more flexible and modular, and allow us to
modify different components of the same graph without modifying or
interfering with other components of the graph.

I will think of a new design. I am still thinking of using HStore to
store adjacency lists. This should have good performance for access of
lists and similar tasks, IMO.

Regards,

Atri

--
Regards,

Atri
*l'apprenant*


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 13:56:40
Message-ID: CAHyXU0wf1AVhUB2FzAEBaKB9nWu2L2zGgQa8+msOYG3D_4kszA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 29, 2013 at 12:55 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
>> It's probably pretty easy to add this, but I think the question is
>> what would make it better than storing the same representation in a
>> text field.
>
> I completely agree. The main point in making a new datatype would be
> to add support for operations that are normally done with graphs.
>
>
>>Obviously you get validation that the input is in the
>> correct format, but you could do that with a CHECK constraint, too, or
>> otherwise handle it in the application. So I think the really
>> interesting question is: what operations would you support on this
>> data type?
>
> I can think of the standard tasks, i.e. searching if two nodes are
> connected or not,adding new nodes and edges, traversing the adjacency
> lists of nodes.
>
> If we add support for weighted graphs, we can probably add support for
> some common graph algorithms, such as Djikstra's algorithm, Bellman
> Ford algorithm, a MST making algorithm, network flow algorithms.
>
> The main idea is to allow user to work with graphs pretty easily, and
> allow the user to use the data present in his database to make graphs
> and then process them.
>
>> One of the problems you're likely to run into if you store the whole
>> graph as a single object is that it may make many of the things you
>> want to do with it not very efficient.
>
> Yes, I agree. On further thought, I believe it would be more of a pain
> if we stick to representing the whole thing as one.Rather,making
> multiple components will be more flexible and modular, and allow us to
> modify different components of the same graph without modifying or
> interfering with other components of the graph.
>
> I will think of a new design. I am still thinking of using HStore to
> store adjacency lists. This should have good performance for access of
> lists and similar tasks, IMO.

This is an interesting idea. Historically I've always decomposed
graphs into relational structures because that's the only practical
way to query them. Graphs are not currently able to be transported
out of the database currently via JSON so one of the areas to focus
your research will be how the client will consume the data.
libpqtypes is one way to do it, but that will really restrict you
audience so you'll probably need a rich set of functions present the
internal data (just like hstore).

Another area to focus research will be on searchability: how to use
GIST/GIN indexes to pull data out via an internal query string. An
overview of the current GIST based type implementations (like ltree)
couldn't hurt.

merlin


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 14:25:11
Message-ID: CAOeZVif1funfFesvahT-JocBuKU=-CsX+CfbHYUziaw6qXuTOQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
> This is an interesting idea. Historically I've always decomposed
> graphs into relational structures because that's the only practical
> way to query them. Graphs are not currently able to be transported
> out of the database currently via JSON so one of the areas to focus
> your research will be how the client will consume the data.
> libpqtypes is one way to do it, but that will really restrict you
> audience so you'll probably need a rich set of functions present the
> internal data (just like hstore).

I completely agree. Initially, I was thinking of exposing the data to
user via HStore. But now, after Robert's suggestions, I think it will
be better to have an alternate representation. JSON seems to be an
excellent idea for that.

I am thinking of having the functions for working on the graph present
inside the data type itself.

> Another area to focus research will be on searchability: how to use
> GIST/GIN indexes to pull data out via an internal query string. An
> overview of the current GIST based type implementations (like ltree)

I will definitely research that. Actually, I have never looked at
GIST/GIN code before, so I have no idea how they can be used here.

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 14:50:45
Message-ID: CAHyXU0z1M0q=awLRpyKUXx4x5NrJp4tC2m4bitxUhv3n8LhpOQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 29, 2013 at 9:25 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
>>
>> This is an interesting idea. Historically I've always decomposed
>> graphs into relational structures because that's the only practical
>> way to query them. Graphs are not currently able to be transported
>> out of the database currently via JSON so one of the areas to focus
>> your research will be how the client will consume the data.
>> libpqtypes is one way to do it, but that will really restrict you
>> audience so you'll probably need a rich set of functions present the
>> internal data (just like hstore).
>
> I completely agree. Initially, I was thinking of exposing the data to
> user via HStore. But now, after Robert's suggestions, I think it will
> be better to have an alternate representation. JSON seems to be an
> excellent idea for that.

I don't agree with this; JSON is not really designed to store graphs.
You will probably need a customized internal representation, just like
hstore, that expresses a graph like structure.

This is not a trivial project.

merlin


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 14:53:58
Message-ID: CAOeZViehAMVYpb0+wAMgU6ZmmNjHuSrZu9p+WTxAp7-aoT=JsQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
> I don't agree with this; JSON is not really designed to store graphs.
> You will probably need a customized internal representation, just like
> hstore, that expresses a graph like structure.

Yes, we will have a custom internal representation. I was thinking of
ways to export the graph into user parsable type, hence JSON.

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Misa Simic <misa(dot)simic(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 16:42:49
Message-ID: CAH3i69nxF-g+ef17mEpAaECVjLRpuTLsYyO476BjyVarw0oB5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Atri,

What is an example of custom internal representation and its JSON
representation (though and JSON and HStore represent its value as text)?

I also think that the key question is: "what operations would you support
on this
data type?"

Or what kind of problems it will solve? (what can't be solved now - or can
now - but new type will allow the better way...)

Thanks,

Misa

2013/4/29 Atri Sharma <atri(dot)jiit(at)gmail(dot)com>

> >
> > I don't agree with this; JSON is not really designed to store graphs.
> > You will probably need a customized internal representation, just like
> > hstore, that expresses a graph like structure.
>
> Yes, we will have a custom internal representation. I was thinking of
> ways to export the graph into user parsable type, hence JSON.
>
>
>
> Regards,
>
> Atri
>
>
> --
> Regards,
>
> Atri
> l'apprenant
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


From: Misa Simic <misa(dot)simic(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 16:47:53
Message-ID: CAH3i69m7DvG5jKYS76EcwhiBRcEb2qs1x=+iUqA9+z-iHqPtLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Merlin,

" Graphs are not currently able to be transported
out of the database currently via JSON"

What does it mean?

(I probably dont understand graphs well - but from my point of view - any
data can be transported out of DB via JSON)

Thanks,

Misa

2013/4/29 Merlin Moncure <mmoncure(at)gmail(dot)com>

> On Mon, Apr 29, 2013 at 12:55 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
> >> It's probably pretty easy to add this, but I think the question is
> >> what would make it better than storing the same representation in a
> >> text field.
> >
> > I completely agree. The main point in making a new datatype would be
> > to add support for operations that are normally done with graphs.
> >
> >
> >>Obviously you get validation that the input is in the
> >> correct format, but you could do that with a CHECK constraint, too, or
> >> otherwise handle it in the application. So I think the really
> >> interesting question is: what operations would you support on this
> >> data type?
> >
> > I can think of the standard tasks, i.e. searching if two nodes are
> > connected or not,adding new nodes and edges, traversing the adjacency
> > lists of nodes.
> >
> > If we add support for weighted graphs, we can probably add support for
> > some common graph algorithms, such as Djikstra's algorithm, Bellman
> > Ford algorithm, a MST making algorithm, network flow algorithms.
> >
> > The main idea is to allow user to work with graphs pretty easily, and
> > allow the user to use the data present in his database to make graphs
> > and then process them.
> >
> >> One of the problems you're likely to run into if you store the whole
> >> graph as a single object is that it may make many of the things you
> >> want to do with it not very efficient.
> >
> > Yes, I agree. On further thought, I believe it would be more of a pain
> > if we stick to representing the whole thing as one.Rather,making
> > multiple components will be more flexible and modular, and allow us to
> > modify different components of the same graph without modifying or
> > interfering with other components of the graph.
> >
> > I will think of a new design. I am still thinking of using HStore to
> > store adjacency lists. This should have good performance for access of
> > lists and similar tasks, IMO.
>
> This is an interesting idea. Historically I've always decomposed
> graphs into relational structures because that's the only practical
> way to query them. Graphs are not currently able to be transported
> out of the database currently via JSON so one of the areas to focus
> your research will be how the client will consume the data.
> libpqtypes is one way to do it, but that will really restrict you
> audience so you'll probably need a rich set of functions present the
> internal data (just like hstore).
>
> Another area to focus research will be on searchability: how to use
> GIST/GIN indexes to pull data out via an internal query string. An
> overview of the current GIST based type implementations (like ltree)
> couldn't hurt.
>
> merlin
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Misa Simic <misa(dot)simic(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 19:00:16
Message-ID: CAOeZVieLXpq0mDX5LqKjhxdDCrASrvCbNpajzNM=oWjv8jaxmQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 29, 2013 at 10:12 PM, Misa Simic <misa(dot)simic(at)gmail(dot)com> wrote:
> Hi Atri,
>
> What is an example of custom internal representation and its JSON
> representation (though and JSON and HStore represent its value as text)?
>
> I also think that the key question is: "what operations would you support
> on this
> data type?"
>
> Or what kind of problems it will solve? (what can't be solved now - or can
> now - but new type will allow the better way...)
>
> Thanks,
>
> Misa
>
>

Hi Misa,

Thanks for thinking it through.

I have not thought about it yet(I was going with the HStore
representation till the moment, which I showed in my first mail in
this thread) I believe that something on these lines could be done:

Entity 1:

Node: Node1

Adjacency list: node2, node3, node4

Entity 2:

Node: Node 2

Adjacency list: node1, node5

Entity 3:

Node: Node 3

Adjacency list: node1, node4

Adjacency list sets:

"Node1"=>"Entity1","Node2"=>"Entity2","Node3"=>"Entity3"

I mentioned the potential operations we could have in a previous
mail.Specifically,

I can think of the standard tasks, i.e. searching if two nodes are
connected or not,adding new nodes and edges, traversing the adjacency
lists of nodes.

If we add support for weighted graphs, we can probably add support for
some common graph algorithms, such as Djikstra's algorithm, Bellman
Ford algorithm, a MST making algorithm, network flow algorithms.

The main idea is to allow user to work with graphs pretty easily, and
allow the user to use the data present in his database to make graphs
and then process them.

I think we find work arounds or make shifts at the moment if we need
to use graphs in our database in postgres. If we have a datatype
itself, with support for commonly used operations built inside the
type itself, that will greatly simplify user's tasks, and open up a
whole new avenue of applications for us, such as recommender systems,
social network analysis, or anything that can be done with graphs.

Regards,

Atri
--
Regards,

Atri
l'apprenant


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Misa Simic <misa(dot)simic(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 19:20:46
Message-ID: 689CA744-6648-4AFD-BF8C-0FE6E4CA00EB@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Apr29, 2013, at 21:00 , Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
> I think we find work arounds or make shifts at the moment if we need
> to use graphs in our database in postgres. If we have a datatype
> itself, with support for commonly used operations built inside the
> type itself, that will greatly simplify user's tasks, and open up a
> whole new avenue of applications for us, such as recommender systems,
> social network analysis, or anything that can be done with graphs.

Usually though, you'd be interested a large graphs which include
information for lots of records (e.g., nodes are individual users,
or products, or whatever). A graph datatype is not well suited for
that, because it'd store each graph as a single value, and updating
the graph would mean rewriting that whole value. If you're e.g. doing
social network analysis, and each new edge between two users requires
you to pull the whole graph from disk, update it, and write it back,
you'll probably hit problems once you reach a few hundred users or
so… Which really isn't a lot for that kind of application.

I'd love to see more support for those kinds of queries in postgres,
(although WITH RECURSIVE already was a *huge* improvement in this
area!). But storing each graph as a graph type would do isn't the
way forward, IMHO.

best regards,
Florian Pflug


From: Misa Simic <misa(dot)simic(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 19:55:13
Message-ID: CAH3i69krnkzgTZgR4MDuPGEVHzEXc7U0zQvOmkfgwg0jWzXVGw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Monday, April 29, 2013, Atri Sharma wrote:

> On Mon, Apr 29, 2013 at 10:12 PM, Misa Simic <misa(dot)simic(at)gmail(dot)com<javascript:;>>
> wrote:
> > Hi Atri,
> >
> > What is an example of custom internal representation and its JSON
> > representation (though and JSON and HStore represent its value as text)?
> >
> > I also think that the key question is: "what operations would you
> support
> > on this
> > data type?"
> >
> > Or what kind of problems it will solve? (what can't be solved now - or
> can
> > now - but new type will allow the better way...)
> >
> > Thanks,
> >
> > Misa
> >
> >
>
> Hi Misa,
>
> Thanks for thinking it through.
>
> I have not thought about it yet(I was going with the HStore
> representation till the moment, which I showed in my first mail in
> this thread) I believe that something on these lines could be done:
>
> Entity 1:
>
> Node: Node1
>
> Adjacency list: node2, node3, node4
>
> Entity 2:
>
> Node: Node 2
>
> Adjacency list: node1, node5
>
> Entity 3:
>
> Node: Node 3
>
> Adjacency list: node1, node4
>
> Adjacency list sets:
>
> "Node1"=>"Entity1","Node2"=>"Entity2","Node3"=>"Entity3"
>
> I mentioned the potential operations we could have in a previous
> mail.Specifically,
>
> I can think of the standard tasks, i.e. searching if two nodes are
> connected or not,adding new nodes and edges, traversing the adjacency
> lists of nodes.
>
> If we add support for weighted graphs, we can probably add support for
> some common graph algorithms, such as Djikstra's algorithm, Bellman
> Ford algorithm, a MST making algorithm, network flow algorithms.
>
> The main idea is to allow user to work with graphs pretty easily, and
> allow the user to use the data present in his database to make graphs
> and then process them.
>
> I think we find work arounds or make shifts at the moment if we need
> to use graphs in our database in postgres. If we have a datatype
> itself, with support for commonly used operations built inside the
> type itself, that will greatly simplify user's tasks, and open up a
> whole new avenue of applications for us, such as recommender systems,
> social network analysis, or anything that can be done with graphs.
>
>
Hm...

Have you considered maybe ltree datatype?

To me all described sounds solveable on pure sql way ( + ltree datatype to
help with indexes and performance as materialised path to avoid recursive
query all the time...)

Though would be nice to see something new what would simplify the tasks...

Cheers,

Misa


From: Jim Nasby <jim(at)nasby(dot)net>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Misa Simic <misa(dot)simic(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 19:55:15
Message-ID: 517ED023.7080001@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 4/29/13 2:20 PM, Florian Pflug wrote:
> On Apr29, 2013, at 21:00 , Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
>> I think we find work arounds or make shifts at the moment if we need
>> to use graphs in our database in postgres. If we have a datatype
>> itself, with support for commonly used operations built inside the
>> type itself, that will greatly simplify user's tasks, and open up a
>> whole new avenue of applications for us, such as recommender systems,
>> social network analysis, or anything that can be done with graphs.
>
> Usually though, you'd be interested a large graphs which include
> information for lots of records (e.g., nodes are individual users,
> or products, or whatever). A graph datatype is not well suited for
> that, because it'd store each graph as a single value, and updating
> the graph would mean rewriting that whole value. If you're e.g. doing
> social network analysis, and each new edge between two users requires
> you to pull the whole graph from disk, update it, and write it back,
> you'll probably hit problems once you reach a few hundred users or
> so… Which really isn't a lot for that kind of application.
>
> I'd love to see more support for those kinds of queries in postgres,
> (although WITH RECURSIVE already was a *huge* improvement in this
> area!). But storing each graph as a graph type would do isn't the
> way forward, IMHO.

My $0.02:

I believe it would be best to largely separate the questions of storage and access. Partly because of Florian's concern that you'd frequently want only one representation of the whole graph, but also because the actual storage interface does NOT have to be user friendly if we have a good access layer. In particular, if rows had a low overhead, we'd probably just store graphs that way. That's obviously not the case in PG, so is there some kind of hybrid approach we could use? Perhaps sections of a graph could be stored with one piece of MVCC overhead per section?

That's why I think separating access from storage is going to be very important; if we do that up-front, we can change the storage latter as we get real experience with this.

Second, we should consider how much the access layer should build on WITH RECURSIVE and the like. Being able to detect specific use patterns of CTE/WITH RECURSIVE seems like it could add a lot of value; but I also worry that it's way to magical to be practical.
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-29 20:18:12
Message-ID: CAFNqd5W2+iTAK5y-907LXr-Kuqf9-GWMk8q-kr8XLrzBX-FGpw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 29, 2013 at 10:50 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

> On Mon, Apr 29, 2013 at 9:25 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
> >>
> >> This is an interesting idea. Historically I've always decomposed
> >> graphs into relational structures because that's the only practical
> >> way to query them. Graphs are not currently able to be transported
> >> out of the database currently via JSON so one of the areas to focus
> >> your research will be how the client will consume the data.
> >> libpqtypes is one way to do it, but that will really restrict you
> >> audience so you'll probably need a rich set of functions present the
> >> internal data (just like hstore).
> >
> > I completely agree. Initially, I was thinking of exposing the data to
> > user via HStore. But now, after Robert's suggestions, I think it will
> > be better to have an alternate representation. JSON seems to be an
> > excellent idea for that.
>
> I don't agree with this; JSON is not really designed to store graphs.
> You will probably need a customized internal representation, just like
> hstore, that expresses a graph like structure.
>
> This is not a trivial project.
>

Not trivial, indeed.

I see there being two directions where a data type goes.

1. We created JSON and XML types as ways of storing data that has a robust
validation system.

They're still, in a sense, just "plain old text", but it's "plain old text"
that the user can be certain satisfies the respective rules for
representations.

2. Some types support special operations to allow the data to be queried
in novel ways.

That's NOT the case, at this point, for JSON or XML.

But it certainly IS the case for Jeff Davis' "range types", which expose
access to some new sorts of data validation and indexing.

It is true for the inet type, which behaves rather differently from our
other types.

It is true for the tsearch indexes, that enable interesting random access
within some large "blobs" of stored data.

I'm not sure quite what we *would* want as the merits of graph-related
types.

I suspect that the best answer is NOT one where a graph is represented as a
value in a table; that has the implication that modifying The Graph
requires altering a single tuple, and that seems likely to become a
horrible bottleneck. I'm suspicious that using HSTORE leads down that
path, where you'll have a database that has a table with just one tuple in
it, and where it's nearly impossible to alter that tuple.

I'm having a hard time thinking about what it looks like to have a graph as
table except to effectively compose the graph as a set of nodes, one tuple
per node, and I'm not sure that a new data type has anything to contribute
to that.

The one place where I *could* see a special type having a contribution is
for there to be a data type that can contain an arbitrary number of links.
That means you have one tuple per node, and, instead of needing a tuple for
each link between nodes, you have one attribute indicating *all* the
links. (And "interesting" is for that one attribute to be usable for
foreign key purposes.) That has a hard time scaling in cases where nodes
are over-connected, which is, broadly speaking, an acceptable sort of
scenario.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


From: Любен Каравелов <karavelov(at)mail(dot)bg>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Graph datatype addition
Date: 2013-04-30 00:33:45
Message-ID: c9035ab40e3ccd8d6599e025ce688bba.mailbg@mail.bg
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


----- Цитат от Christopher Browne (cbbrowne(at)gmail(dot)com), на 29.04.2013 в 23:18 -----
>
> The one place where I *could* see a special type having a contribution
> is for there to be a data type that can contain an arbitrary number of
> links. That means you have one tuple per node, and, instead of
> needing a tuple for each link between nodes, you have one attribute
> indicating *all* the links. (And "interesting" is for that one
> attribute to be usable for foreign key purposes.) That has a hard
> time scaling in cases where nodes are over-connected, which is,
> broadly speaking, an acceptable sort of scenario.
> ...

Hello,

From the start of the discussion I was trying to get what this graph
data type should be... I could not grasp it.

With the current postgres, in the most simple case we could do something
like:

create table node (
node_id serial primary key,
...
);
create table edge(
from integer references node,
to integer[] -- each element references node
);

With the addition of foreign keys constraint on arrays elements (that I
understand is work in progress), we could guarantee referential
integrity of the graph - I really hope that it will be ready for 9.4.

Without the array elements foreign keys constraints, if we have to
guarantee the integrity we could do:

create table edge(
from integer referecens node,
to integer references node,
weight real,
...
);

What is missing are some algorithms. I have personaly implemented some
algorithms using recursive queries and it is doable (most of my
experience was with Oracle but lately I have put in production some
postgresql schemas with recursive queries that are working just fine).
For large scale eigen values factorisation (think pagerank) this
sparse matrix form is reasonable data organisation (though I have
doubts that the best place to run this job is in the database)

Best regards
luben


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Любен Каравелов <karavelov(at)mail(dot)bg>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Graph datatype addition
Date: 2013-04-30 01:18:08
Message-ID: 517F1BD0.2090805@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 30/04/13 12:33, Любен Каравелов wrote:
> ----- Цитат от Christopher Browne (cbbrowne(at)gmail(dot)com), на 29.04.2013 в 23:18 -----
>> The one place where I *could* see a special type having a contribution
>> is for there to be a data type that can contain an arbitrary number of
>> links. That means you have one tuple per node, and, instead of
>> needing a tuple for each link between nodes, you have one attribute
>> indicating *all* the links. (And "interesting" is for that one
>> attribute to be usable for foreign key purposes.) That has a hard
>> time scaling in cases where nodes are over-connected, which is,
>> broadly speaking, an acceptable sort of scenario.
>> ...
>
> Hello,
>
> From the start of the discussion I was trying to get what this graph
> data type should be... I could not grasp it.
>
> With the current postgres, in the most simple case we could do something
> like:
>
> create table node (
> node_id serial primary key,
> ...
> );
> create table edge(
> from integer references node,
> to integer[] -- each element references node
> );
>
> With the addition of foreign keys constraint on arrays elements (that I
> understand is work in progress), we could guarantee referential
> integrity of the graph - I really hope that it will be ready for 9.4.
>
> Without the array elements foreign keys constraints, if we have to
> guarantee the integrity we could do:
>
> create table edge(
> from integer referecens node,
> to integer references node,
> weight real,
> ...
> );
>
> What is missing are some algorithms. I have personaly implemented some
> algorithms using recursive queries and it is doable (most of my
> experience was with Oracle but lately I have put in production some
> postgresql schemas with recursive queries that are working just fine).
> For large scale eigen values factorisation (think pagerank) this
> sparse matrix form is reasonable data organisation (though I have
> doubts that the best place to run this job is in the database)
>
> Best regards
> luben
>
>
For directed graphs, where each node has a set of lines each with their
own weight, I would suggest:

create table node
(
id int primary key,
to_node_id integer[], -- references node(id)
weight float[], -- weight of line
...
);

So if nodeA has a line going to nodeB - then you can't go from nodeB
back to nodeA, unless nodeB has nodeA as a destination for one of its
lines. Would be best to ensure that the arrays have the same length!

This way lines can also have asymmetric weights.

Though it would be better to have something like:
create table node
(
id int primary key,
edges directed_line[],
...
);

Where using a 'directed_line' instance could be an object in its own
right - with functions to extract start/end nodes along with the weight
and to add/remove (destination_node, weight) pairs. Yet the system could
internally just store the start node once for an array of directed_lines
(though something other than array would be more appropriate).

Cheers,
Gavin


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-30 03:51:56
Message-ID: 1367293916.32604.10.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2013-04-28 at 22:55 -0700, Atri Sharma wrote:
> If we add support for weighted graphs, we can probably add support for
> some common graph algorithms, such as Djikstra's algorithm, Bellman
> Ford algorithm, a MST making algorithm, network flow algorithms.

You might find that pgRouting already does much of this.


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-30 05:28:39
Message-ID: CAJKUy5hQb=-1cq_de8Oih75zG5sY-46s6_owYnWJcgoutXmf7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 29, 2013 at 10:51 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On Sun, 2013-04-28 at 22:55 -0700, Atri Sharma wrote:
>> If we add support for weighted graphs, we can probably add support for
>> some common graph algorithms, such as Djikstra's algorithm, Bellman
>> Ford algorithm, a MST making algorithm, network flow algorithms.
>
> You might find that pgRouting already does much of this.
>

actually, i was going to suggest to Atri to take a look at that,
pgrouting is currently in develop of v2.0 which will rewrite some
parts (including some of the algorithms).

Maybe this datatype could fit better as part of pgrouting

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Misa Simic <misa(dot)simic(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-30 08:49:52
Message-ID: CAOeZVid-cbBPnc_mE4-csegjaVdqQxtHUfaYcSRoQVAsn-Y=fQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Usually though, you'd be interested a large graphs which include
> information for lots of records (e.g., nodes are individual users,
> or products, or whatever). A graph datatype is not well suited for
> that, because it'd store each graph as a single value, and updating
> the graph would mean rewriting that whole value. If you're e.g. doing
> social network analysis, and each new edge between two users requires
> you to pull the whole graph from disk, update it, and write it back,
> you'll probably hit problems once you reach a few hundred users or
> so… Which really isn't a lot for that kind of application.

Yes, I agree. Hence, I suggested keeping the nodes and links separate.
So, if you need to add a new edge, you just need to update the
adjacency lists.

What I think will work is more of a 'logical' graph i.e. Consider a
tuple in some table. To add it to a new/existing graph as a node, we
just need to add an adjacency list for it. For each edge that connects
the tuple to some other node, we add the corresponding entry in the
list.

Essentially, the idea is again, to separate the nodes and links,
hence, the actual node is accessed pretty infrequently as compared to
the links. Since they are separate, there is no need to pull out the
entire graph for updating a single edge. This was the fundamental
problem with my HStore based design.

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Misa Simic <misa(dot)simic(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-30 08:52:01
Message-ID: CAOeZVif3Ri9jnWUh2KAeh613SUsvTaVbtsG7HKj6G9sUa48_Fw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Have you considered maybe ltree datatype?
>
> To me all described sounds solveable on pure sql way ( + ltree datatype to
> help with indexes and performance as materialised path to avoid recursive
> query all the time...)

This may build over the existing solutions itself. I will have to check that.

> Though would be nice to see something new what would simplify the tasks...

Exactly.A specialized data type with built in support for common
operations will be helpful,IMHO.

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Misa Simic <misa(dot)simic(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-30 08:59:13
Message-ID: CAOeZVie7X9b=9DZtcKwuCHsrUmeFXeb+o2+EMjwEZSw+L=8-Qw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I believe it would be best to largely separate the questions of storage and
> access. Partly because of Florian's concern that you'd frequently want only
> one representation of the whole graph, but also because the actual storage
> interface does NOT have to be user friendly if we have a good access layer.
> In particular, if rows had a low overhead, we'd probably just store graphs
> that way. That's obviously not the case in PG, so is there some kind of
> hybrid approach we could use? Perhaps sections of a graph could be stored
> with one piece of MVCC overhead per section?

Yes, I agree. We can probably store some(persistent) data in a
table,and the 'hot' parts of the datatype(like adjacency lists) could
be stored locally in a data structure which allows fast and low
overhead accesses and updates.

I completely agree with separating storage and access layers.
Supporting Florian's concern, I would also agree with your point of
abstraction. Users should not concern themselves with backend storage
details.

> That's why I think separating access from storage is going to be very
> important; if we do that up-front, we can change the storage latter as we
> get real experience with this.

+1.

> Second, we should consider how much the access layer should build on WITH
> RECURSIVE and the like. Being able to detect specific use patterns of
> CTE/WITH RECURSIVE seems like it could add a lot of value; but I also worry
> that it's way to magical to be practical.

I thought of a frequent access patterns mining tool recently. We can
store the recent accesses and apply a pattern mining algorithm on that
dataset(I was thinking of FP Growth algorithm). But again, it has its
own set of issues and is a project in itself.

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-30 12:27:44
Message-ID: CAOeZViciPsGCU1deKQKDbv4vF1gp=iXZ4PB=+y2Qv_4RjCtDvA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> 1. We created JSON and XML types as ways of storing data that has a robust
> validation system.
>
> They're still, in a sense, just "plain old text", but it's "plain old text"
> that the user can be certain satisfies the respective rules for
> representations.

Yes, although, I feel that we can use JSON to port a graph's data out
of the database.

> I suspect that the best answer is NOT one where a graph is represented as a
> value in a table; that has the implication that modifying The Graph requires
> altering a single tuple, and that seems likely to become a horrible
> bottleneck. I'm suspicious that using HSTORE leads down that path, where
> you'll have a database that has a table with just one tuple in it, and where
> it's nearly impossible to alter that tuple.

I agree, hence, I have dropped the idea of building it completely over HStore.

> I'm having a hard time thinking about what it looks like to have a graph as
> table except to effectively compose the graph as a set of nodes, one tuple
> per node, and I'm not sure that a new data type has anything to contribute
> to that.

I can think of three points:

1) We can maintain pointers to tuples which are nodes in a graph
instead of copying over the actual data. This idea is based on the
assumption that node data access will be very specific i.e. accessing
actual data of nodes happens very less, for most analysis we are more
interested in the linkage.

2) In continuation of the above point, the more frequently accessed
areas are the link details, or in my current plan, the adjacency
lists. This could be maintained in a data structure by the data type,
with low overheads of updating, traversal and insertion.

3) The data type shall have in built support for common operations on
graphs. The user can call them directly and hence shall be saved the
overhead of designing his own data structures and functions.

> The one place where I *could* see a special type having a contribution is
> for there to be a data type that can contain an arbitrary number of links.
> That means you have one tuple per node, and, instead of needing a tuple for
> each link between nodes, you have one attribute indicating *all* the links.
> (And "interesting" is for that one attribute to be usable for foreign key
> purposes.) That has a hard time scaling in cases where nodes are
> over-connected, which is, broadly speaking, an acceptable sort of scenario.

Yes, that one attribute can be the adjacency set of adjacency lists.
For efficient access, we can store pointers to adjacency lists in hash
maps,with node identifier as keys(maybe).

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-04-30 12:34:35
Message-ID: CAOeZVidMPYEu58CVRgWL7X4=HFT2dTFCN_s4Ap_EeME9SC0SBA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> actually, i was going to suggest to Atri to take a look at that,
> pgrouting is currently in develop of v2.0 which will rewrite some
> parts (including some of the algorithms).
>
> Maybe this datatype could fit better as part of pgrouting

Thanks, I will have a look at pgRouting.Although, in a short look, I
feel that the purposes of pgrouting and the proposed data type are
different, even though they support similar functions.

One good idea can be to refer to pgRouting's functions when
implementing similar functions for the proposed datatype.

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>
Subject: Re: Graph datatype addition
Date: 2013-05-01 09:02:26
Message-ID: CAOeZViesPgpHe8tNaxgCthoMjZ2Mur_XR5nkBVG7xbw5UsJA2w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

Please find a probable prototype for the same:

struct GraphNode
{
Oid NodeOid; // Oid of the row which is the node here. We will
store an identifier to it here rather than the complete row(with data)
itself.
AdjacencyList *list; // Pointer to the node's adjacency list.
};

struct AdjacencyList
{
Oid[] neighbours_list;
};

struct AdjacencyList is probably the 'hottest' data structure in our
entire implementation. We can think of making a cache of recently
accessed struct AdjacencyList instances, or the AdjacencyList(s) of
the neighbours of the recently accessed nodes, because, they are most
likely to be accessed in near future. Advice here, please?

So.

struct AdjacencyCache
{
Oid[] cache_values;
};

push and pop functions for AdjacencyCache follow.

We need a replacement and invalidation algorithm for the cache. I feel
a version of LRU should be good here.

I have not given a prototype for operations and algorithm implementations.

I feel,as suggested by Peter and Jaime, we can look at pgRouting code
for algorithm implementations.

Florian's concerns are mitigated here to some extent,IMO. Since the
nodes and linkings are loosely coupled, and not represented as a
single representation, updating or changing of any part or adding a
new edge is no longer an expensive operation, as it only requires a
lookup of GraphNode and then its AdjacencyList. If we use the cache as
well, it will further reduce the lookup costs.

I have not yet thought of the user visible layer as suggested by Jim.
Probably. once we are ok with the internal layer, we can move to the
user visible layer.

Advice/Comments/Feedback please?

Regards,

Atri

--
Regards,

Atri
l'apprenant

On Tue, Apr 30, 2013 at 10:58 AM, Jaime Casanova <jaime(at)2ndquadrant(dot)com> wrote:
> On Mon, Apr 29, 2013 at 10:51 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>> On Sun, 2013-04-28 at 22:55 -0700, Atri Sharma wrote:
>>> If we add support for weighted graphs, we can probably add support for
>>> some common graph algorithms, such as Djikstra's algorithm, Bellman
>>> Ford algorithm, a MST making algorithm, network flow algorithms.
>>
>> You might find that pgRouting already does much of this.
>>
>
> actually, i was going to suggest to Atri to take a look at that,
> pgrouting is currently in develop of v2.0 which will rewrite some
> parts (including some of the algorithms).
>
> Maybe this datatype could fit better as part of pgrouting
>
> --
> Jaime Casanova www.2ndQuadrant.com
> Professional PostgreSQL: Soporte 24x7 y capacitación
> Phone: +593 4 5107566 Cell: +593 987171157

--
Regards,

Atri
l'apprenant


From: Misa Simic <misa(dot)simic(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Jaime Casanova <jaime(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>
Subject: Re: Graph datatype addition
Date: 2013-05-01 23:03:39
Message-ID: CAH3i69nz0Ve_S1k9OAe6=eLWSUek5uUkefWmEL4q=Sx=570HZQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wednesday, May 1, 2013, Atri Sharma wrote:

> Hi all,
>
> Please find a probable prototype for the same:
>
> struct GraphNode
> {
> Oid NodeOid; // Oid of the row which is the node here. We will
> store an identifier to it here rather than the complete row(with data)
> itself.
> AdjacencyList *list; // Pointer to the node's adjacency list.
> };
>
> struct AdjacencyList
> {
> Oid[] neighbours_list;
> };
>
> struct AdjacencyList is probably the 'hottest' data structure in our
> entire implementation. We can think of making a cache of recently
> accessed struct AdjacencyList instances, or the AdjacencyList(s) of
> the neighbours of the recently accessed nodes, because, they are most
> likely to be accessed in near future. Advice here, please?
>
> So.
>
> struct AdjacencyCache
> {
> Oid[] cache_values;
> };
>
> push and pop functions for AdjacencyCache follow.
>
> We need a replacement and invalidation algorithm for the cache. I feel
> a version of LRU should be good here.
>
> I have not given a prototype for operations and algorithm implementations.
>
> I feel,as suggested by Peter and Jaime, we can look at pgRouting code
> for algorithm implementations.
>
> Florian's concerns are mitigated here to some extent,IMO. Since the
> nodes and linkings are loosely coupled, and not represented as a
> single representation, updating or changing of any part or adding a
> new edge is no longer an expensive operation, as it only requires a
> lookup of GraphNode and then its AdjacencyList. If we use the cache as
> well, it will further reduce the lookup costs.
>
> I have not yet thought of the user visible layer as suggested by Jim.
> Probably. once we are ok with the internal layer, we can move to the
> user visible layer.
>
> Advice/Comments/Feedback please?
>
>
Honestly - I think I dont understand proposal...

Datatypes - are about values - what will be stored in that column in a
table....

Datatype - cant have any clue about "rows"

How I understand what you described - you can achieve the same with pure
SQL - struct are equvalent to graph tables... Instead od Oid column will
store PKs of nodes table...


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Misa Simic <misa(dot)simic(at)gmail(dot)com>
Cc: Jaime Casanova <jaime(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>
Subject: Re: Graph datatype addition
Date: 2013-05-02 02:28:36
Message-ID: 01F3CF3B-D69A-456E-BAE2-BC8DA508EA33@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Sent from my iPad

On 02-May-2013, at 4:33, Misa Simic <misa(dot)simic(at)gmail(dot)com> wrote:

>
>
> On Wednesday, May 1, 2013, Atri Sharma wrote:
>> Hi all,
>>
>> Please find a probable prototype for the same:
>>
>> struct GraphNode
>> {
>> Oid NodeOid; // Oid of the row which is the node here. We will
>> store an identifier to it here rather than the complete row(with data)
>> itself.
>> AdjacencyList *list; // Pointer to the node's adjacency list.
>> };
>>
>> struct AdjacencyList
>> {
>> Oid[] neighbours_list;
>> };
>>
>> struct AdjacencyList is probably the 'hottest' data structure in our
>> entire implementation. We can think of making a cache of recently
>> accessed struct AdjacencyList instances, or the AdjacencyList(s) of
>> the neighbours of the recently accessed nodes, because, they are most
>> likely to be accessed in near future. Advice here, please?
>>
>> So.
>>
>> struct AdjacencyCache
>> {
>> Oid[] cache_values;
>> };
>>
>> push and pop functions for AdjacencyCache follow.
>>
>> We need a replacement and invalidation algorithm for the cache. I feel
>> a version of LRU should be good here.
>>
>> I have not given a prototype for operations and algorithm implementations.
>>
>> I feel,as suggested by Peter and Jaime, we can look at pgRouting code
>> for algorithm implementations.
>>
>> Florian's concerns are mitigated here to some extent,IMO. Since the
>> nodes and linkings are loosely coupled, and not represented as a
>> single representation, updating or changing of any part or adding a
>> new edge is no longer an expensive operation, as it only requires a
>> lookup of GraphNode and then its AdjacencyList. If we use the cache as
>> well, it will further reduce the lookup costs.
>>
>> I have not yet thought of the user visible layer as suggested by Jim.
>> Probably. once we are ok with the internal layer, we can move to the
>> user visible layer.
>>
>> Advice/Comments/Feedback please?
>
> Honestly - I think I dont understand proposal...
>
> Datatypes - are about values - what will be stored in that column in a table....
>
> Datatype - cant have any clue about "rows"
>
> How I understand what you described - you can achieve the same with pure SQL - struct are equvalent to graph tables... Instead od Oid column will store PKs of nodes table...
>

Yes, I agree.I need to think more.

Let me get back with a deeper proposal.

Regards,

Atri


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Misa Simic <misa(dot)simic(at)gmail(dot)com>
Cc: Jaime Casanova <jaime(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>
Subject: Re: Graph datatype addition
Date: 2013-05-08 18:40:48
Message-ID: CAOeZVie7enzrn-vV0m7nTw3HiR0CexQMs9r5CrCJok6O=o3gvw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 2, 2013 at 7:58 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
>
>
> Sent from my iPad
>
> On 02-May-2013, at 4:33, Misa Simic <misa(dot)simic(at)gmail(dot)com> wrote:
>
>
>
> On Wednesday, May 1, 2013, Atri Sharma wrote:
>>
>> Hi all,
>>
>> Please find a probable prototype for the same:
>>
>> struct GraphNode
>> {
>> Oid NodeOid; // Oid of the row which is the node here. We will
>> store an identifier to it here rather than the complete row(with data)
>> itself.
>> AdjacencyList *list; // Pointer to the node's adjacency list.
>> };
>>
>> struct AdjacencyList
>> {
>> Oid[] neighbours_list;
>> };
>>
>> struct AdjacencyList is probably the 'hottest' data structure in our
>> entire implementation. We can think of making a cache of recently
>> accessed struct AdjacencyList instances, or the AdjacencyList(s) of
>> the neighbours of the recently accessed nodes, because, they are most
>> likely to be accessed in near future. Advice here, please?
>>
>> So.
>>
>> struct AdjacencyCache
>> {
>> Oid[] cache_values;
>> };
>>
>> push and pop functions for AdjacencyCache follow.
>>
>> We need a replacement and invalidation algorithm for the cache. I feel
>> a version of LRU should be good here.
>>
>> I have not given a prototype for operations and algorithm implementations.
>>
>> I feel,as suggested by Peter and Jaime, we can look at pgRouting code
>> for algorithm implementations.
>>
>> Florian's concerns are mitigated here to some extent,IMO. Since the
>> nodes and linkings are loosely coupled, and not represented as a
>> single representation, updating or changing of any part or adding a
>> new edge is no longer an expensive operation, as it only requires a
>> lookup of GraphNode and then its AdjacencyList. If we use the cache as
>> well, it will further reduce the lookup costs.
>>
>> I have not yet thought of the user visible layer as suggested by Jim.
>> Probably. once we are ok with the internal layer, we can move to the
>> user visible layer.
>>
>> Advice/Comments/Feedback please?
>>
>
> Honestly - I think I dont understand proposal...
>
> Datatypes - are about values - what will be stored in that column in a
> table....
>
> Datatype - cant have any clue about "rows"
>
> How I understand what you described - you can achieve the same with pure SQL
> - struct are equvalent to graph tables... Instead od Oid column will store
> PKs of nodes table...
>
>
> Yes, I agree.I need to think more.
>
> Let me get back with a deeper proposal.
>
> Regards,
>
> Atri

Hi all,

In continuation of the above discussion,I have been thinking about the
design of the data type. I have been thinking on the lines of making a
multi dimensional data structure for the same:

Specifically, I have been thinking about making multi lists for
representing data. After some research, I think that the following
data structure may help:

Each node will be represented as:

[Down Pointer][Atom][Right Pointer]

Suppose, a graph is like(sorry for the bad drawing):

B
/
A D
\ /
C
\
E

can be represented as:
C's data E's data
D's data
^ ^
^
A's data
[|][1][---------------------->[|][1][------------------>[|][1][NULL]
^ ^
[|][1][------------------>[|][0][--------------------->[|][1][NULL]
^
B's data

Essentially, the Atom flag denotes if the node has any out edges from
it. If it has no out edge, ATOM is 0 and Down Pointer points to an
auxiliary structure which holds the node's data(hence, the data can be
of arbitrary size).

If the node has some out degree, then, those nodes are added to a new
sublist which starts from the node which spawns those nodes.Node's
down pointer points to the head of the new sublist.

Essentially, a sublist has all the nodes directly spawning from the
head node of the sublist.

This approach has multiple advantages in term of memory and
efficiency. Also, isolating sub graphs based on some criteria is
pretty efficient, which is good for many analytics based operations.

Access time is restricted as well.

Thoughts/Comments/Feedback please?

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Jim Nasby <jim(at)nasby(dot)net>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Misa Simic <misa(dot)simic(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject: Re: Graph datatype addition
Date: 2013-05-08 19:07:45
Message-ID: 518AA281.5030400@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5/8/13 1:40 PM, Atri Sharma wrote:
> On Thu, May 2, 2013 at 7:58 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
>>
>>
>> Sent from my iPad
>>
>> On 02-May-2013, at 4:33, Misa Simic <misa(dot)simic(at)gmail(dot)com> wrote:
>>
>>
>>
>> On Wednesday, May 1, 2013, Atri Sharma wrote:
>>>
>>> Hi all,
>>>
>>> Please find a probable prototype for the same:
>>>
>>> struct GraphNode
>>> {
>>> Oid NodeOid; // Oid of the row which is the node here. We will
>>> store an identifier to it here rather than the complete row(with data)
>>> itself.
>>> AdjacencyList *list; // Pointer to the node's adjacency list.
>>> };
>>>
>>> struct AdjacencyList
>>> {
>>> Oid[] neighbours_list;
>>> };
>>>
>>> struct AdjacencyList is probably the 'hottest' data structure in our
>>> entire implementation. We can think of making a cache of recently
>>> accessed struct AdjacencyList instances, or the AdjacencyList(s) of
>>> the neighbours of the recently accessed nodes, because, they are most
>>> likely to be accessed in near future. Advice here, please?
>>>
>>> So.
>>>
>>> struct AdjacencyCache
>>> {
>>> Oid[] cache_values;
>>> };
>>>
>>> push and pop functions for AdjacencyCache follow.
>>>
>>> We need a replacement and invalidation algorithm for the cache. I feel
>>> a version of LRU should be good here.
>>>
>>> I have not given a prototype for operations and algorithm implementations.
>>>
>>> I feel,as suggested by Peter and Jaime, we can look at pgRouting code
>>> for algorithm implementations.
>>>
>>> Florian's concerns are mitigated here to some extent,IMO. Since the
>>> nodes and linkings are loosely coupled, and not represented as a
>>> single representation, updating or changing of any part or adding a
>>> new edge is no longer an expensive operation, as it only requires a
>>> lookup of GraphNode and then its AdjacencyList. If we use the cache as
>>> well, it will further reduce the lookup costs.
>>>
>>> I have not yet thought of the user visible layer as suggested by Jim.
>>> Probably. once we are ok with the internal layer, we can move to the
>>> user visible layer.
>>>
>>> Advice/Comments/Feedback please?
>>>
>>
>> Honestly - I think I dont understand proposal...
>>
>> Datatypes - are about values - what will be stored in that column in a
>> table....
>>
>> Datatype - cant have any clue about "rows"
>>
>> How I understand what you described - you can achieve the same with pure SQL
>> - struct are equvalent to graph tables... Instead od Oid column will store
>> PKs of nodes table...
>>
>>
>> Yes, I agree.I need to think more.
>>
>> Let me get back with a deeper proposal.
>>
>> Regards,
>>
>> Atri
>
> Hi all,
>
> In continuation of the above discussion,I have been thinking about the
> design of the data type. I have been thinking on the lines of making a
> multi dimensional data structure for the same:
>
> Specifically, I have been thinking about making multi lists for
> representing data. After some research, I think that the following
> data structure may help:
>
> Each node will be represented as:
>
> [Down Pointer][Atom][Right Pointer]
>
> Suppose, a graph is like(sorry for the bad drawing):
>
> B
> /
> A D
> \ /
> C
> \
> E
>
> can be represented as:
> C's data E's data
> D's data
> ^ ^
> ^
> A's data
> [|][1][---------------------->[|][1][------------------>[|][1][NULL]
> ^ ^
> [|][1][------------------>[|][0][--------------------->[|][1][NULL]
> ^
> B's data
>
>
> Essentially, the Atom flag denotes if the node has any out edges from
> it. If it has no out edge, ATOM is 0 and Down Pointer points to an
> auxiliary structure which holds the node's data(hence, the data can be
> of arbitrary size).
>
> If the node has some out degree, then, those nodes are added to a new
> sublist which starts from the node which spawns those nodes.Node's
> down pointer points to the head of the new sublist.
>
> Essentially, a sublist has all the nodes directly spawning from the
> head node of the sublist.
>
> This approach has multiple advantages in term of memory and
> efficiency. Also, isolating sub graphs based on some criteria is
> pretty efficient, which is good for many analytics based operations.
>
> Access time is restricted as well.
>
> Thoughts/Comments/Feedback please?

Your second drawing didn't really make any sense to me. :(

I do think it would be most productive to focus on what the API for dealing with graph data would look like before trying to handle the storage aspect. The storage is potentially dirt-simple, as others have shown. The only challenge would be efficiency, but it's impossible to discuss efficiency without some clue of how the data will be accessed. Frankly, for the first round of this I think it would be best if the storage really was just some raw tables. Once something is available people will start figuring out how to use it, and where the API needs to be improved.
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Misa Simic <misa(dot)simic(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject: Re: Graph datatype addition
Date: 2013-05-08 19:16:59
Message-ID: CAOeZVieAph_ZNXepCOrfaWSsr80NooEHiocbtDGxpyN5A98S1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
> Your second drawing didn't really make any sense to me. :(
>
> I do think it would be most productive to focus on what the API for dealing
> with graph data would look like before trying to handle the storage aspect.
> The storage is potentially dirt-simple, as others have shown. The only
> challenge would be efficiency, but it's impossible to discuss efficiency
> without some clue of how the data will be accessed. Frankly, for the first
> round of this I think it would be best if the storage really was just some
> raw tables. Once something is available people will start figuring out how
> to use it, and where the API needs to be improved.
>
> --

Thanks for your reply.

Yes,my drawing sucks.heh.

Ok,I agree. I was pretty perked up about efficiency in storage, hence
started designing.

Sketching out an API in terms of functionalities will require a
different viewpoint. I think make, insert, search, delete
functionalities would be straightly exposed to the user.Then,
functionalities to isolate sub graphs based on some criterion/criteria
and implementations of standard graph algorithms(BFS,DFS,Djikstra's
algorithm) can be exposed through single functions.

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Misa Simic <misa(dot)simic(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Graph datatype addition
Date: 2013-05-09 13:36:59
Message-ID: CAHyXU0yMs_9qr63mA=39h=4TpaUt0zHmYYwKSUzUi2iXt1ZZOA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, May 8, 2013 at 2:16 PM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
>>
>> Your second drawing didn't really make any sense to me. :(
>>
>> I do think it would be most productive to focus on what the API for dealing
>> with graph data would look like before trying to handle the storage aspect.
>> The storage is potentially dirt-simple, as others have shown. The only
>> challenge would be efficiency, but it's impossible to discuss efficiency
>> without some clue of how the data will be accessed. Frankly, for the first
>> round of this I think it would be best if the storage really was just some
>> raw tables. Once something is available people will start figuring out how
>> to use it, and where the API needs to be improved.
>>
>> --
>
> Thanks for your reply.
>
> Yes,my drawing sucks.heh.
>
> Ok,I agree. I was pretty perked up about efficiency in storage, hence
> started designing.

This is the wrong place to start. A proposed API will help people
understand the use cases you're trying to solve (if any) that are
insufficiently covered by existing paradigms.

merlin