Huffman codes are of interest to us as

- an example of binary tree usage, where the tree is deliberately unbalanced.
- an example of a greedy algorithm.
- an example of priority queue use.

There are 3 paragraphs on this subject in Skiena, pg 639.

Huffman coding is a much used method for lossless data compression.

The Lempel-Ziv method is even more widely used (in gzip, for example). There are interesting comparisons and contrasts to be made between Huffman and Lempel-Ziv.

The basic idea of Huffman coding is to use variable length bit strings to represent characters, with shorter bit strings accorded to the most frequently used characters.

For example, if a text contains (for simplicity) just the 6 letters a,b,c,d,e,f, with frequencies

a:45, b:13, c:12, d:16, e:9, f:5

then we want a very short code for 'a' and will accept a longer code for 'f'.

The frequencies add up to 100. If the text has 200 characters, 90 are a's, 24 are c's, etc.

Note that uniform 3 bit codes could be used. For a 200 character text that would use 600 bits to encode. Can we do better?

How about 'a' gets one bit, 0. Every other character gets a code starting with a 1...

Huffman codes have the

(b)Encoding can be table lookup: table indexed by the characters containing variable length bit strings.

(c)**BUILDING** the most effective tree is Huffman's algorithm.

__ (100) __ 0/ \1 / \ (86) (14) 0/ \1 0/ / \ / (58) (28) (14) 0/ \1 0/ \1 0/ \1 / \ / \ / \ [a:45][b:13][c:12][d:16][e:9][f:5]

(100) 0/ \1 / \ [a:45] (55) 0/ \1 / \ (25) (30) 0/ \1 0/ \1 / \ / \ [c:12][b:13] (14) [d:16] 0/ \1 / \ [e:9][f:5]

An optimum tree is necessarily a full tree (no internal node has only one child).

struct node { node *left, *right; char c; int f; /* c is used only in the leaves. * In a leaf, f is the frequency with which character c occurs. */ In an interior node, f is the sum of the frequencies of all chars in the subtree of the node. node(c, f) {left = right = NULL, c = cc, f = ff}; // constructs a leaf merge(node *a, node *b) { // c.merge(a,b) makes c the parent of a and b with correct freq. left = a; right = b; /* c = don't care, */ f = a.f + b.f; } } node *Huffman(vector<pair<char, int> > & V)[ PriorityQueue<node*> Q(less_by_frequency); for (int i = 0; i < V.size(); ++i) Q.insert(new Node(V[i].first, V[i].second)); while (not Q.isEmpty()){ Node a, b, c; a = Q.extractMin(); b = Q.extractMin(); c.merge(a,b), Q.insert(c); }The tree built this way is optimum among all prefix-property encodings of the text. Here optimum means that the number of bits used to represent the text is minimal. This minimum is called the

In-class team problem

(a) what is an optimal Huffman code for the following frequencies?

a:1, b:1, c:2, d:3, e:5, f:8, g:13, h:21.

(b) Generalize your answer to explain an optimal Huffman code for the first n Fibonacci numbers.

Closest pair of points problem: Given a set of n points in the plane, p_{i} = (x_{i},y_{i}), find the minimal distance between any two points.

Brute force: try every pair --> O(n^{2}) time.

Divide and conquer algorithm: --> O(n log(n)) time. [See Wikipedia, but note that it's algorithm details and analysis is flawed.]

Daily problem for next Tue:

Design an efficient algorithm to find the closest pair of n points in three dimensional space,
p_{i} = (x_{i}, y_{i}, z_{i}).