Data Structures and Algorithms wrap-up

Andy Novocin

Best-Day Algorithm(s)


Everyone write down a 1-100 score for their day


Algo 1: Create a group of 3-5 neighbors, pick the highest


Selected person look around and find 3-5 other winners, pick the highest


One last time should do it.


How many comparisons were done?

Worst(Best)-Day Algorithm(s)


Algo 2: Everyone that put less than 50 raise their hand


Less than 25


Repeat until min is found


How do we analyze this?

Average-Day Algorithm(s)


Algo 3: Everyone round your score to the nearest multiple of 25


For value in \( 0, 25, 50, 75, 100 \) we count hands


Do the math.


How much did this cost?

Daily Problem


3-12 Suppose you are given an input set \(S\) for \(n\) numbers, and a black box that if given any sequence of real numbers and an integer \( k \) instantly and correctly answers whether there is a subset of the input sequence with sum exactly \(k\). Show how to use the black box \( \mathcal{O}(n) \) times to find a subset of \( S \) that adds up to \(k\).

Sample HW1 Problem solved


2-41 Prove that the binary representation of \( n \geq 1 \) has \( \lfloor \log_{2}{n} \rfloor + 1 \) bits.


Also, if you come to me with a question equivalent to: "Can I get away with less effort?" then I will get preachy!

Scheduler for Programming Groups


Link on Sakai


cs.prof.ninja/programs


Here in slides

More Data Structures


Priority Queue


Another API style data structure

Definition


Insert(x) Assume that x has a key, k


Find-Min() and Find-Max() return pointer to winner


Delete-Min() and Delete-Max() remove the highest or lowest.

PQ Usage


How is this different than other DSs?


Whenever you need to work through your options in sensible order.


Scheduling OS tasks, Evaluating job candidates, (Dating)

Heap Preview


Of course, like all API-style structures, can be implemented in many ways.


This is the natural place to use a heap.


A heap is a type of compactified almost search tree


Each node dominates its children so root is the winner

More next week

PQ Challenge


Implement a normal Queue (FIFO) using a Priority Queue


Implement a Stack (LIFO) using a Priority Queue

Hashing!


The three most important algorithms at Yahoo! are hashing, hashing, and hashing.
Udi Manber (while chief scientist at Yahoo!)

As of October 13, 2010, he is also responsible for all of the search products at Google

Key Ideas


Looking up an item in an array by the index is \( \mathcal{O}(1) \)


If I only need to work with keys in the set \( \{ 0, 1, \ldots, m \} \) then I can use a length \(m+1\) array.


In reality I need to work with many more keys than that.


Can I reliably map a large universe of keys to \( \{ 0, 1, \ldots, m \} \)?

Challenge Problem


If all keys are distinct and from the set \( \{ 0, 1, \ldots, m \} \), use a bit-vector to design insert, search, and delete each in \( \mathcal{O}(1)\) time. (Assume no satellite data.)


(Extra Credit: Implement some bit twiddling on your own or play with std::bitset)

Hash Functions


To win, we need to map an arbitrary string to an integer \( \leq m \) and \( \geq 0\), deterministically.


I can map any integer to \( 0, \ldots, m \) using \( \mod{m} \) (in code i % m )


I can map any character to an integer using char(c) (ord() in python)


A good hash function might be treating each character like the digit in its representation.

Hash Functions cont.


$$ H(S) = \sum_{i} \alpha^i \cdot \textrm{char}(S[i]) \mod{m} $$


Collisions


Pigeon-hole principle tells me that there must be some values which hash to the same thing.


We deal with this by making a linked list chain out from each value.


Average vs Worst


In the worst case, every single key maps to the same bucket!


In general use each bucket has \( \mathcal{O}(n/m) \) keys with \( n \) input values.


If you design a BAD hashing function then you will see non-uniform distributions.

Avg vs Worst cont.


Insert and Delete are \( \mathcal{O}(1) \) in the worst-case


Search takes \( \mathcal{O}( n /m ) \) on average (practically \( \mathcal{O}(1) \) but \( \mathcal{O}(n) \) in the worst case (overloaded bucket).


The other guys (Min, Successor, etc) take \( \mathcal{O}(n + m) \)

Canonical Data Structures


Array


Linked Lists


Stack


Queue


Priority Queue


Dictionary


Binary Search Tree (Balanced)


Hash Table

What do you choose?


If you were writing a dictionary data structure, what would you pick?

What did Python choose?


Go look at my code!

Domain Specific Structures


Bit Vectors (shifting is cheap)


Compression/Decompression (knowledge is power)


Reversible Hashing (Crypto vs obfuscation)


Sparse Representations (Polys, Mats, Integers)

Understanding Problem 1


You are using the API for a Balanced Binary Search Tree. It is limited to Min(), Successor(p), and Insert(x), each of which runs in \( \mathcal{O}(\log{n})\) time. Design a \( \mathcal{O}(n \log{n})\) sorting algorithm.

Understanding Problem 2


Let your hash function be \( h(k) = k \mod{9}\). Using our chained hash table, what is the result after inserting the sequence of keys: \(5, 28, 19, 15, 20, 33, 12, 17, 10 \)?

Challenge Problem 3


3-5 Find the overhead fraction (the ratio of data space over total space) for each of the following binary tree implementations on \( n \) nodes.


a) All nodes store data, two child pointers, and a parent pointer. The data field requires four bytes, each pointer requires four bytes.


b) Only leaf nodes store data; internal nodes store two child pointers. The data field requires four bytes and each pointer requires four bytes.