For undirected graphs, finding connected components is a simple matter of running a DFS from each not-yet-visited node and marking every newly reached node as belonging to the same component.
An undirected graph is connected if there exists a path between any two of its nodes, and disconnected otherwise.
Let’s take for instance the following graph
Since the graph is disconnected, it has multiple connected components. The following code finds the connected components of the undirected graph
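The original listing is not reproduced here; a minimal sketch consistent with the output below could look like this (the edge list is an assumption reconstructed from that output: one component {0, 1} and one component {2, 3, 4, 5}):

```python
def connected_components(n, edges):
    """Label every node with the index of the first node discovered
    in its component (iterative DFS over an adjacency list)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)  # undirected: insert the edge in both directions
    component = [-1] * n
    for start in range(n):
        if component[start] != -1:
            continue  # already reached from an earlier start node
        stack = [start]
        component[start] = start
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if component[v] == -1:
                    component[v] = start
                    stack.append(v)
    return component

if __name__ == "__main__":
    # Hypothetical edges matching the output shown below
    comp = connected_components(6, [(0, 1), (2, 3), (3, 4), (4, 5)])
    for node, c in enumerate(comp):
        print(f"Node {node} is in component {c}")
```

Each component is labelled with the index of the node the DFS started from, which is why the two components in the output are numbered 0 and 2.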
Node 0 is in component 0
Node 1 is in component 0
Node 2 is in component 2
Node 3 is in component 2
Node 4 is in component 2
Node 5 is in component 2
Directed graphs
In the case of directed graphs an additional distinction has to be made:
As before, a graph is said to be disconnected if there is no path from a start vertex to some other vertex, not even when ignoring the edge directions (see the next point).
A directed graph is said to be weakly connected if ignoring all the directions (i.e. just considering the graph as an undirected one) results in a connected graph.
If the graph isn’t weakly connected, its weakly connected components can be found as in the undirected case, i.e. by simply marking every node encountered from a starting vertex as being in the same component.
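A sketch of this idea: given the arcs of a directed graph, insert each one in both adjacency lists and reuse the same component-labelling DFS as in the undirected case.

```python
def weakly_connected_components(n, arcs):
    """Find weakly connected components of a directed graph by
    ignoring arc directions."""
    adj = [[] for _ in range(n)]
    for u, v in arcs:
        adj[u].append(v)
        adj[v].append(u)  # treat the directed arc as an undirected edge
    component = [-1] * n
    for start in range(n):
        if component[start] != -1:
            continue
        stack = [start]
        component[start] = start
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if component[w] == -1:
                    component[w] = start
                    stack.append(w)
    return component
```

For example, the directed graph with arcs 0→1 and 2→1 is not strongly connected, but it forms a single weakly connected component.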
A directed graph is said to be strongly connected if from any node there’s always a path to reach any other one. This is the criterion city roads should be designed with.
An example is the following graph
which, due to a missing path from 3 to any other node of the main cycle, is not strongly connected.
The way to check whether a graph is strongly connected is to run a DFS from a starting node and check that every node can be reached. Then invert every edge in the graph and perform the same DFS from the same starting node again. If the graph is strongly connected, every node will be reachable from the starting one and will also be capable of reaching it.
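The two-pass check above can be sketched as follows (node 0 is used as the arbitrary starting node):

```python
def is_strongly_connected(n, arcs):
    """Strong-connectivity check: every node must be reachable from
    node 0 in the graph, and also in the transposed graph (which means
    node 0 is reachable from every node)."""
    adj = [[] for _ in range(n)]
    radj = [[] for _ in range(n)]  # transposed graph: every arc inverted
    for u, v in arcs:
        adj[u].append(v)
        radj[v].append(u)

    def reaches_all(graph):
        seen = [False] * n
        seen[0] = True
        stack = [0]
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
        return all(seen)

    return reaches_all(adj) and reaches_all(radj)
```

A 3-cycle 0→1→2→0 passes both DFS runs, while the chain 0→1→2 fails the second one because node 0 cannot be reached back.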
Regarding identifying the strongly connected components, let’s take the following graph as an example. The start node is marked with a double circle
One of the most famous algorithms for finding the SCCs (strongly connected components) of a graph (although not the only one, cf. Cormen et al.) was discovered by Tarjan. The indices of the nodes in the picture above represent discovery times, i.e. the order in which the nodes were discovered.
In the graph above the DFS proceeds in the following way:
0 is discovered
1 is discovered
2 is discovered
3 is discovered
a back edge to 0 is discovered. Node 3 is assigned the minimum between its own component value (initially 3, i.e. just itself) and the back edge target’s discovery time (0). Therefore 3 gets 0 as its component.
Backtracking from 3, node 2 takes the minimum between its component value 2 and 3’s result: 0.
The process starts again from 4 being discovered.
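The original listing is not reproduced here; a sketch of Tarjan’s algorithm consistent with the walkthrough above could look like this (the edge list is an assumption reconstructed from the discovery order and the output below):

```python
def tarjan_scc(n, arcs):
    """Tarjan's SCC algorithm. `low[u]` plays the role of the
    'component' field described above: the earliest discovery time
    reachable from u."""
    adj = [[] for _ in range(n)]
    for u, v in arcs:
        adj[u].append(v)
    disc = [-1] * n          # discovery times (-1 = undiscovered)
    low = [0] * n
    on_stack = [False] * n
    stack = []
    sccs = []
    timer = 0

    def dfs(u):
        nonlocal timer
        disc[u] = low[u] = timer
        timer += 1
        stack.append(u)
        on_stack[u] = True
        for v in adj[u]:
            if disc[v] == -1:              # tree edge
                dfs(v)
                low[u] = min(low[u], low[v])
            elif on_stack[v]:              # back edge: use v's discovery time
                low[u] = min(low[u], disc[v])
        if low[u] == disc[u]:              # u is the root of an SCC: pop it
            comp = []
            while True:
                v = stack.pop()
                on_stack[v] = False
                comp.append(v)
                if v == u:
                    break
            sccs.append(comp)

    for u in range(n):
        if disc[u] == -1:
            dfs(u)
    return sccs

if __name__ == "__main__":
    # Hypothetical arcs: the cycle 0-1-2-3, the cycle 4-5-6-7,
    # and an arc 0->4 linking the two
    arcs = [(0, 1), (1, 2), (2, 3), (3, 0),
            (0, 4), (4, 5), (5, 6), (6, 7), (7, 4)]
    for comp in tarjan_scc(8, arcs):
        print("Found SCC with nodes: {", *comp, "}")
```

Note that the SCC of the second cycle is emitted first: its root (node 4) finishes before node 0 does.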
Output:
Found SCC with nodes: { 7 6 5 4 }
Found SCC with nodes: { 3 2 1 0 }
It has to be noted that the back-edge update line
would also work if the back edge target’s component value were used in the comparison rather than its discovery time. The reason the discovery time is used instead is that the back edge target’s component value might not yet have been updated to its definitive value. For consistency, the component field should indicate the earliest ancestor reachable from a node: although, as stated, the algorithm would continue to work, the data in the individual component fields would be unreliable.
The complexity of the algorithm above, since it mainly performs a DFS, is \( O(V+E) \), where \( V \) is the number of nodes of the graph and \( E \) the number of edges. DFS’s complexity, in fact, can be thought of as node1 + edges_from_node1 + node2 + edges_from_node2 + ... and rewritten as (node1 + node2 + node3 + node4 + ...) + (edges_from_node1 + edges_from_node2 + ...), i.e. \( O(V+E) \). Notice that this complexity assumes an adjacency list as the data structure (both nodes and edges are proportional to the input size). If we had to traverse an entire row to get the edges of a node (i.e. an adjacency matrix), the complexity would be \( O(V^2) \).
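The rewriting above can be expressed compactly in summation form:

```latex
\[
\underbrace{\sum_{v \in V} 1}_{\text{node visits}}
\;+\;
\underbrace{\sum_{v \in V} \operatorname{outdeg}(v)}_{\text{edge scans}}
\;=\; |V| + |E| \;=\; O(V + E)
\]
```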