### Notes on Huffman codes for data compression

#### ----[ Thurs, 2014Sep18 ]----

Huffman codes are of interest to us as

1. an example of binary tree usage, where the tree is deliberately unbalanced.
2. an example of a greedy algorithm.
3. an example of priority queue use.

There are 3 paragraphs on this subject in Skiena, pg 639.

Huffman coding is a much used method for lossless data compression.

The Lempel-Ziv method is even more widely used (in gzip, for example). There are interesting comparisons and contrasts to be made between Huffman and Lempel-Ziv.

The basic idea of Huffman coding is to use variable length bit strings to represent characters, with shorter bit strings accorded to the most frequently used characters.

For example, if a text contains (for simplicity) just the 6 letters a,b,c,d,e,f, with frequencies

a:45, b:13, c:12, d:16, e:9, f:5

then we want a very short code for 'a' and will accept a longer code for 'f'.

The frequencies add up to 100. If the text has 200 characters, 90 are a's, 24 are c's, etc.

Note that uniform 3 bit codes could be used. For a 200 character text that would use 600 bits to encode. Can we do better?

How about 'a' gets one bit, 0. Every other character gets a code starting with a 1...

Huffman codes have the prefix property: no character code is a prefix of another. (a)This implies that a binary search tree can be used for decoding.

(b)Encoding can be table lookup: table indexed by the characters containing variable length bit strings.

(c)BUILDING the most effective tree is Huffman's algorithm.

```               __ (100) __
0/           \1
/             \
(86)               (14)
0/     \1            0/
/       \            /
(58)        (28)      (14)
0/  \1      0/  \1    0/  \1
/    \      /    \    /    \
[a:45][b:13][c:12][d:16][e:9][f:5]
```

```               (100)
0/     \1
/       \
[a:45]      (55)
0/     \1
/       \
(25)       (30)
0/  \1     0/  \1
/    \     /    \
[c:12][b:13] (14) [d:16]
0/  \1
/    \
[e:9][f:5]
```

An optimum tree is necessarily a full tree (no internal node has only one child).

```
struct node {
node *left, *right;
char c; int f;
/* c is used only in the leaves.
* In a leaf, f is the frequency with which character c occurs.
*/ In an interior node, f is the sum of the frequencies of all chars in the subtree of the node.

node(c, f) {left = right = NULL, c = cc, f = ff}; // constructs a leaf

merge(node *a, node *b) { // c.merge(a,b) makes c the parent of a and b with correct freq.
left = a; right = b;
/* c = don't care, */
f = a.f + b.f;
}
}

node *Huffman(vector<pair<char, int> > & V)[
PriorityQueue<node*> Q(less_by_frequency);
for (int i = 0; i < V.size(); ++i)
Q.insert(new Node(V[i].first, V[i].second));
while (not Q.isEmpty()){
Node a, b, c;
a = Q.extractMin();
b = Q.extractMin();
c.merge(a,b),
Q.insert(c);

}
```
The tree built this way is optimum among all prefix-property encodings of the text. Here optimum means that the number of bits used to represent the text is minimal. This minimum is called the entropy of the text.

In-class team problem
(a) what is an optimal Huffman code for the following frequencies?
a:1, b:1, c:2, d:3, e:5, f:8, g:13, h:21.

(b) Generalize your answer to explain an optimal Huffman code for the first n Fibonacci numbers.

Closest pair of points problem: Given a set of n points in the plane, pi = (xi,yi), find the minimal distance between any two points.

Brute force: try every pair --> O(n2) time.
Divide and conquer algorithm: --> O(n log(n)) time. [See Wikipedia, but note that it's algorithm details and analysis is flawed.]

Daily problem for next Tue:
Design an efficient algorithm to find the closest pair of n points in three dimensional space, pi = (xi, yi, zi).