Notes on Huffman codes for data compression

----[ Thurs, 2014Sep18 ]----

Huffman codes are of interest to us as

  1. an example of binary tree usage, where the tree is deliberately unbalanced.
  2. an example of a greedy algorithm.
  3. an example of priority queue use.

There are 3 paragraphs on this subject in Skiena, pg 639.

Huffman coding is a much used method for lossless data compression.

The Lempel-Ziv method is even more widely used (in gzip, for example). There are interesting comparisons and contrasts to be made between Huffman and Lempel-Ziv.

The basic idea of Huffman coding is to use variable length bit strings to represent characters, with shorter bit strings accorded to the most frequently used characters.

For example, if a text contains (for simplicity) just the 6 letters a,b,c,d,e,f, with frequencies

a:45, b:13, c:12, d:16, e:9, f:5

then we want a very short code for 'a' and will accept a longer code for 'f'.

The frequencies add up to 100. If the text has 200 characters, 90 are a's, 24 are c's, etc.

Note that uniform 3 bit codes could be used. For a 200 character text that would use 600 bits to encode. Can we do better?

How about 'a' gets one bit, 0. Every other character gets a code starting with a 1...

Huffman codes have the prefix property: no character code is a prefix of another. (a)This implies that a binary search tree can be used for decoding.

(b)Encoding can be table lookup: table indexed by the characters containing variable length bit strings.

(c)BUILDING the most effective tree is Huffman's algorithm.

               __ (100) __
             0/           \1
             /             \
         (86)               (14)
       0/     \1            0/
       /       \            /
    (58)        (28)      (14)
   0/  \1      0/  \1    0/  \1
   /    \      /    \    /    \

             0/     \1
             /       \
        [a:45]      (55)
                  0/     \1
                  /       \
                (25)       (30)
               0/  \1     0/  \1
               /    \     /    \
            [c:12][b:13] (14) [d:16]
                        0/  \1
                        /    \

An optimum tree is necessarily a full tree (no internal node has only one child).

struct node { 
  node *left, *right; 
  char c; int f;  
  /* c is used only in the leaves.
   * In a leaf, f is the frequency with which character c occurs.
   */ In an interior node, f is the sum of the frequencies of all chars in the subtree of the node.

  node(c, f) {left = right = NULL, c = cc, f = ff}; // constructs a leaf

  merge(node *a, node *b) { // c.merge(a,b) makes c the parent of a and b with correct freq.
    left = a; right = b; 
    /* c = don't care, */ 
    f = a.f + b.f;

node *Huffman(vector<pair<char, int> > & V)[
  PriorityQueue<node*> Q(less_by_frequency); 
  for (int i = 0; i < V.size(); ++i) 
    Q.insert(new Node(V[i].first, V[i].second));
  while (not Q.isEmpty()){
    Node a, b, c;
    a = Q.extractMin();
    b = Q.extractMin();

The tree built this way is optimum among all prefix-property encodings of the text. Here optimum means that the number of bits used to represent the text is minimal. This minimum is called the entropy of the text.

In-class team problem
(a) what is an optimal Huffman code for the following frequencies?
a:1, b:1, c:2, d:3, e:5, f:8, g:13, h:21.

(b) Generalize your answer to explain an optimal Huffman code for the first n Fibonacci numbers.

Closest pair of points problem: Given a set of n points in the plane, pi = (xi,yi), find the minimal distance between any two points.

Brute force: try every pair --> O(n2) time.
Divide and conquer algorithm: --> O(n log(n)) time. [See Wikipedia, but note that it's algorithm details and analysis is flawed.]

Daily problem for next Tue:
Design an efficient algorithm to find the closest pair of n points in three dimensional space, pi = (xi, yi, zi).