Web crawling / Graph traversal

Andy Novocin

Daily Challenge


Given an undirected graph give an algorithm to convert from:


a) Adjacency Matrix to Adjacency List

Daily Challenge cont.


b) From Adjacency List to Incidence matrix. An Incidence matrix is a \( |V| \times |E| \) matrix where \(M[i,j] = 1\) if vertex \(i\) is part of edge \(j\) (for some edge numbering system).

Daily Challenge cont.


c) From incidence matrix to adjacency list.

Advanced Challenge


Programming assignment 3 is up and pertinent.

Pedagogical Reminder


The only way to progress in a STEM subject is through wrestling with concepts.


You can fake it for a long time thanks to the internet... slow down and build skill.

Graph Traversal


We want to visit every node (every edge).


How do we avoid returning to old places?


How do we do this efficiently?

Here's Johnny!

Breadcrumbs


Three states:


Undiscovered (starting state)


Discovered (seen once from somewhere)


Processed (completely drained of data)

Properties


A winning algorithm here must be:


Systematic


Efficient


What's wrong with right hand on the wall?

Today BFS next time DFS


BFS = Breadth-First-Search


DFS = Depth-First-Search


Same outcome, completely different applications

Pseudo-code

BFS(G,s):
  for v in G.V:
      v.state = "undiscovered"
      v.parent = NULL
  s.state = "discovered"
  s.d = 0
  Q.enqueue(s)
  while not Q.isEmpty:
    u = Q.dequeue()
    //TODO: process u here
    for v in u.neighbors:
      //TODO: process edge (u,v) here
      if v.state == "undiscovered":
        v.state = "discovered"
        v.parent = u
        v.d = u.d + 1
        Q.enqueue(v)
    u.state = "processed"
    

ICU1: BFS(G,a) vs BFS(G,i);

ICU2


How many times is each edge visited?


What is the complexity of BFS?

As a vehicle for riding the rails


Many algorithms are just traversal algorithms with extra processing.


When you discover a vertex feel free to call a vertex sub-routine.


When you know a vertex and a parent then call an edge sub-routine.


When you are done with a vertex you can call a clean-up function.

Subtle Points (parent)


Parent is not well defined on a graph.


Our traversal creates a parent to children relationship


We've converted the graph into a tree.

ICU3: Tree from a; Tree from i

Subtle Points (queue)


If we merely change our data container from Queue to Stack we get Depth-First-Search.


We rely on the ordering of FIFO to get some unique properties

Subtle Points (bit arrays, properties, or other)


If you choose the wrong way of "coloring" the points you lose speed.

BFS tree == Shortest Path?


Work from leaf to root using parent pointers.


This gives a backwards path.


I claim it is the shortest path between the two.

Reason: BFS works radially


I will process all of the vertices of distance 1.


The nodes which are discovered after them must not have been distance 1.


If they are one away from distance 1 then they must be distance 2.

Limitations


Only works when you start BFS from one of the two points.


So far, only works when unweighted.


Quick exercise: Use our tree to get shortest_path from \(a\) to \(f\).

Connected Components


Can I get from every vertex to every other vertex?


Visually, this seems simple (undirected).


Important question (power grids, emergency traffic redirect, google maps)

Google Maps

Doable with BFS

c = 0
for v in vertices:
  if v.state == "undiscovered":
    c += 1
    BFS(G,v) // component c
    

Strong/Weak Connections

ICU4 Can this be two-colored?

Bipartite Graph Detection

for v in vertices:
  v.color = "none"
s = vertices[0]
s.color = "color1"
bipartite = True
BFS(G,s) //using extra sub-routines
def process_edge(u,v):
  if (u.color == v.color):
    bipartite = False
  v.color = opposite_color(u.color)
  

ICU5 Now use algorithm