Network Structure and Communities¶
Networks of very different types have many things in common:
- the mathematical graph model
- typical structural properties
- fundamental network (data) analysis problems, e.g.:
- community detection
- node classification and link prediction
Network science investigates modeling and analysis problems for networks in general
Degree distribution¶
A small network and its degree distribution:
Degree distribution in Movielens user-movie rating graph (user nodes only):
Citation Network
Source: Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/index.html
Nodes: 34546 papers in the Arxiv High Energy Physics category
Edges: 421578 citation links between the papers (directed)
Power Law¶
Observation: in many real networks there is a linear relationship
between the logarithms of the degree d, and its relative frequency f (d). Thus:
\sim : proportional
Erdös-Rényi random graph model¶
For all pairs of nodes, v,w: edge v\to w is included in the graph with probability p (same for all pairs)
Degree distribution in an ER-random graph:
- Not like the Power Law
- Real networks evolve by more complex mechanisms than represented by ER model
Diameters and Distances¶
Milgram's Experiment
- How many links via personal acquaintance are needed to connect a random pair of US citizens?
- Approximate measurement by passing a message from random starting persons in Omaha, Nebraska to target person in Boston, Massachusetts.
- Total number of initiated chains: 160. No. of completed chains: 44.
S. Milgram: The small-world problem. Psychology Today, 1967
Diameter and Distance¶
- dist(u,v) length (number of edges) of shortest path connecting u and v
- diameter of graph \max_{u,v\in G}dist(u,v)
Distance statistics in some real networks, collected from Stanford Large Network Dataset Collection http://snap.stanford.edu:
- 90'th percentile effective diameter:
- 90 percent of pairs of nodes are connected with a dist of this.
- says parameter on the figure which is a mistake
Clustering Coefficients¶
Two graphs with 20 nodes and 35 edges:
Intuition/empirical observation:
- real networks exhibit significant clustering:
- two acquaintances of mine are more likely to know each other than two random people.
How to measure the amount of clustering?
Consider G=(V,E) undirected. For v \in V with d(v) \geq 2:
- \tau_3(v):= \frac{d(v)(d(v)-1)}{2}
- number of pairs of neighbors of v
- \tau_\triangle := |\{\{u,w\}:\{u,v\}, \{w,v\},\{u,w\} \in E\}|
- number of triangles containing v
Local clustering coefficient:
- cl(v):= {\tau_\triangle(v) \over \tau_3(v)} (for v with degree(v) \geq 2)
Global clustering coefficient:
- cl(G):= {1\over|V'|} \sum_{v\in V'} cl(v), where V':= \{v \in V(G) | d(v) \geq 2\}
Clustering coefficients and comparison with
Communities¶
Zachary Karate Club
Nodes: members of a karate club at a U.S. university.
Edges: social interactions outside the club.
How the network looked at some point in time:
Eventually, the club broke up into two. Membership in new clubs indicated by color.
What are Communities¶
Intuitive
Informally, a community in a network is a group of nodes with greater ties internally than to the rest of the network. [Parathasarathy et al., 2011]
The actors in a network tend to form groups of closely-knit connections. The groups are also called communities, clusters, cohesive subgroups or modules in different context. Roughly speaking, individuals interact more frequently within a group than between groups. [Tang et al., 2010]
Communities and Clustering Coefficient
- Large value of cl(v): v is embedded in a closely connected neighborhood
- Large value of cl(G): most nodes lie within a closely connected component
Graph Clustering¶
A clustering of G= (V,E) is a partitioning \mathcal C = \{C_1, \dots, C_k\} of V
Method Skeleton
- Pick a quality measure for clusterings
- Find an algorithm to (approximately) determine the clustering with optimal quality
Cut¶
Quality of a 2-clustering (aka. a cut):
- number of edges that connects vertices in different clusters
- generalizes to weighted edges, and k-clusterings
Minimum weight cuts can be computed efficiently (but not for the more general k > 2 case)
Problem: optimal clusterings may contain very small clusters:
Quality Measure - Kernighan, Lin 1970¶
Quality Measure: Cut weight, subject to size constraints n_1 ≤|C_i |≤ n_2 for all i.
Algorithmic paradigm: greedy optimization, based on local modification operations (“shifting method” [Brandes et al., Ch.8]):
Kernighan, Lin local modifications:
Swap cluster membership of pairs of nodes. Gain (decrease in cut weight) from swapping u, v:
- (2-1) + (3-2)=2
Swaps maintain size constraints
B.W. Kernighan, S. Lin: An efficient heuristic procedure for partitioning graphs. Bell System Tech. J., 1970
Quality Measure - Newman, Girvan 2004¶
Quality Measure: Modularity (see below). No strong link between quality measure and algorithmic approach.
Algorithmic paradigm: divisive hierarchical clustering
Newman, Girvan:
- Division in line 3 is effected by sequence of edge deletions
- Not only internal edges of C_i will be deleted.
M.E.J. Newman and M. Girvan: Finding and evaluating community structure in networks. Physical Review, 2004
Edge Betweenness¶
Intuition: edges connecting communities will be used on many shortest paths connecting nodes:
Many more shortest path traverse edge e_1 than e_2
Shortest path betweenness:
Other formalizations of "betweenness": random walk betweenness, current-flow betweenness.
Newman-Girvan Algorithm¶
Line 11 can be implemented in time O(mn) (modification of standard shortest path computations by breadth first search)
Total time: O(m^2n)
Modularity¶
Newman-Girvan algorithm produces a hierarchy of n clusterings. Which one is most meaningful?
- e_{i,j} = e_{j,i}: proportion of edges that connect nodes in clusters i and j
- Example: e_{1,3} = 2/24,\quad e_{1,1}=6/24
- a_i := e_{i,i} + \sum_{j \neq i} e_{i,j}/2
- \sum_i a_i = 1
- Example: a_1 = 6/24 + 1/48 + 2/48
In matrix form with marginal sums:
a_i: normalized sum of degrees of nodes in cluster i
Modularity Random Graph Model¶
Given:
- n nodes partitioned into k clusters C_1, \dots, C_k
- Number m of edges
- Probabilities a_1,\dots a_k (or degrees d_1,\dots, d_n of all nodes)
construct random graph by selecting for each of the m edges randomly start and end node by
- randomly picking a cluster C_i according to probability a_i, and then select a node in C_i with uniform probability distribution
or
- randomly pick a node with a probability proportional to the degree of the node
Either way: the expected proportion of edges inside cluster C_i is a_i^2
- Note: multiple edges possible; cf. PA model
The modularity score compares the actual proportion of intra-cluster edges with the expected number under the random graph model:
Newman-Girvan Zachary Result¶
Modularity Optimization¶
Can one try to optimize modularity directly?
- Exact solution: NP Hard
- Approximation heuristics: greedy agglomerative hierarchical clustering (no quality of approximation guarantees)
Modularity optimal clustering of Zachary (Q = 0.419):
[U. Brandes et al.: On Modularity Clustering. IEEE Transactions on Knowledge and Data Engineering, 2008.]