Web crawling / Graph traversal
Andy Novocin
Daily Challenge
Given an undirected graph give an algorithm to convert from:
a) Adjacency Matrix to Adjacency List
Daily Challenge cont.
b) From Adjacency List to Incidence matrix. An Incidence matrix is a \( |V| \times |E| \) matrix where \(M[i,j] = 1\) if vertex \(i\) is part of edge \(j\) (for some edge numbering system).
Daily Challenge cont.
c) From incidence matrix to adjacency list.
Pedagogical Reminder
The only way to progress in a STEM subject is through wrestling with concepts.
You can fake it for a long time thanks to the internet... slow down and build skill.
Graph Traversal
We want to visit every node (every edge).
How do we avoid returning to old places?
How do we do this efficiently?
Breadcrumbs
Three states:
Undiscovered (starting state)
Discovered (seen once from somewhere)
Processed (completely drained of data)
Properties
A winning algorithm here must be:
Systematic
Efficient
What's wrong with right hand on the wall?
Today BFS next time DFS
BFS = Breadth-First-Search
DFS = Depth-First-Search
Same outcome, completely different applications
Pseudo-code
BFS(G,s):
for v in G.V:
v.state = "undiscovered"
v.parent = NULL
s.state = "discovered"
s.d = 0
Q.enqueue(s)
while not Q.isEmpty:
u = Q.dequeue()
//TODO: process u here
for v in u.neighbors:
//TODO: process edge (u,v) here
if v.state == "undiscovered":
v.state = "discovered"
v.parent = u
v.d = u.d + 1
Q.enqueue(v)
u.state = "processed"
ICU1: BFS(G,a) vs BFS(G,i);
ICU2
How many times is each edge visited?
What is the complexity of BFS?
As a vehicle for riding the rails
Many algorithms are just traversal algorithms with extra processing.
When you discover a vertex feel free to call a vertex sub-routine.
When you know a vertex and a parent then call an edge sub-routine.
When you are done with a vertex you can call a clean-up function.
Subtle Points (parent)
Parent is not well defined on a graph.
Our traversal creates a parent to children relationship
We've converted the graph into a tree.
ICU3: Tree from a; Tree from i
Subtle Points (queue)
If we merely change our data container from Queue to Stack we get Depth-First-Search.
We rely on the ordering of FIFO to get some unique properties
Subtle Points (bit arrays, properties, or other)
If you choose the wrong way of "coloring" the points you lose speed.
BFS tree == Shortest Path?
Work from leaf to root using parent
pointers.
This gives a backwards path.
I claim it is the shortest path between the two.
Reason: BFS works radially
I will process all of the vertices of distance 1.
The nodes which are discovered after them must not have been distance 1.
If they are one away from distance 1 then they must be distance 2.
Limitations
Only works when you start BFS from one of the two points.
So far, only works when unweighted.
Quick exercise: Use our tree to get shortest_path from \(a\) to \(f\).
Connected Components
Can I get from every vertex to every other vertex?
Visually, this seems simple (undirected).
Important question (power grids, emergency traffic redirect, google maps)
Doable with BFS
c = 0
for v in vertices:
if v.state == "undiscovered":
c += 1
BFS(G,v) // component c
ICU4 Can this be two-colored?
Bipartite Graph Detection
for v in vertices:
v.color = "none"
s = vertices[0]
s.color = "color1"
bipartite = True
BFS(G,s) //using extra sub-routines
def process_edge(u,v):
if (u.color == v.color):
bipartite = False
v.color = opposite_color(u.color)