Huffman codes are of interest to us as
There are 3 paragraphs on this subject in Skiena, pg 639.
Huffman coding is a much used method for lossless data compression.
The Lempel-Ziv method is even more widely used (in gzip, for example). There are interesting comparisons and contrasts to be made between Huffman and Lempel-Ziv.
The basic idea of Huffman coding is to use variable length bit strings to represent characters, with shorter bit strings accorded to the most frequently used characters.
For example, if a text contains (for simplicity) just the 6 letters a,b,c,d,e,f, with frequencies
a:45, b:13, c:12, d:16, e:9, f:5
then we want a very short code for 'a' and will accept a longer code for 'f'.
The frequencies add up to 100. If the text has 200 characters, 90 are a's, 24 are c's, etc.
Note that uniform 3 bit codes could be used. For a 200 character text that would use 600 bits to encode. Can we do better?
How about 'a' gets one bit, 0. Every other character gets a code starting with a 1...
Huffman codes have the
(b)Encoding can be table lookup: table indexed by the characters containing variable length bit strings.
(c)BUILDING the most effective tree is Huffman's algorithm.
__ (100) __
0/ \1
/ \
(86) (14)
0/ \1 0/
/ \ /
(58) (28) (14)
0/ \1 0/ \1 0/ \1
/ \ / \ / \
[a:45][b:13][c:12][d:16][e:9][f:5]
(100)
0/ \1
/ \
[a:45] (55)
0/ \1
/ \
(25) (30)
0/ \1 0/ \1
/ \ / \
[c:12][b:13] (14) [d:16]
0/ \1
/ \
[e:9][f:5]
An optimum tree is necessarily a full tree (no internal node has only one child).
struct node {
node *left, *right;
char c; int f;
/* c is used only in the leaves.
* In a leaf, f is the frequency with which character c occurs.
*/ In an interior node, f is the sum of the frequencies of all chars in the subtree of the node.
node(c, f) {left = right = NULL, c = cc, f = ff}; // constructs a leaf
merge(node *a, node *b) { // c.merge(a,b) makes c the parent of a and b with correct freq.
left = a; right = b;
/* c = don't care, */
f = a.f + b.f;
}
}
node *Huffman(vector<pair<char, int> > & V)[
PriorityQueue<node*> Q(less_by_frequency);
for (int i = 0; i < V.size(); ++i)
Q.insert(new Node(V[i].first, V[i].second));
while (not Q.isEmpty()){
Node a, b, c;
a = Q.extractMin();
b = Q.extractMin();
c.merge(a,b),
Q.insert(c);
}
The tree built this way is optimum among all prefix-property encodings of the text.
Here optimum means that the number of bits used to represent the text is minimal. This minimum is called the Closest pair of points problem: Given a set of n points in the plane, pi = (xi,yi), find the minimal distance between any two points.
Brute force: try every pair --> O(n2) time.
Divide and conquer algorithm: --> O(n log(n)) time. [See Wikipedia, but note that it's algorithm details and analysis is flawed.]
Daily problem for next Tue:
Design an efficient algorithm to find the closest pair of n points in three dimensional space,
pi = (xi, yi, zi).