Lecture 22: Data Compression & Huffman Codes

Announcements

Coming Soon:

  • Gradescope link for Assignment 08
  • Assignment 09 Description (Optional)
  • Assignment 10 Description

Overview

  1. Shakespeare, Revisited
  2. Prefix Codes
  3. Huffman Codes

Last Time

Considered shakespeare.txt

  • basic stats:
    • file size: 5.8 MB
    • 170,592 lines
    • 961,443 words
    • 5,756,698 characters
  • ASCII encoding
    • 8 bits per char
    • 256 possible char values
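
Sanity check: at one byte (8 bits) per character, 5,756,698 characters come to roughly 5.8 MB, matching the file size above.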

Question. Could we represent shakespeare.txt using a smaller file?

Observation 1

Consider the number of distinct characters used by Shakespeare!

Data Compression Attempt 1

  • ASCII encoding allows for 256 possible characters
    • 8 bits per character
  • Shakespeare only uses 114 distinct characters

Can we use this fact to compress shakespeare.txt into a smaller file?

Idea

Re-encode characters actually used by Shakespeare

  • store a table mapping each used (ASCII) character to its new encoding
  • store the actual text using the new encoding
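
To make Observation 1 concrete, here is a minimal Java sketch (assuming shakespeare.txt sits in the working directory; the class name is just illustrative) that counts the distinct byte values in the file and the number of bits needed to tell them apart. On shakespeare.txt it should report the 114 characters noted above.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class DistinctChars {
    public static void main(String[] args) throws Exception {
        // read the whole file as raw bytes (ASCII: one byte per character)
        byte[] bytes = Files.readAllBytes(Paths.get("shakespeare.txt"));
        Set<Byte> distinct = new HashSet<>();
        for (byte b : bytes) distinct.add(b);
        int k = distinct.size();                              // 114 for shakespeare.txt
        int bits = 32 - Integer.numberOfLeadingZeros(k - 1);  // = ceil(log2(k))
        System.out.println(k + " distinct chars; " + bits + " bits per char suffice");
    }
}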

Question 1.1

How much space (number of bits) does the new encoding require?

Question 1.2

How would we decode the newly encoded file?

Digging Deeper

Consider character frequencies of Shakespeare

Observation

8 distinct characters account for more than half of the characters of shakespeare.txt!

char   ASCII    count
' '       32  1055175
'e'      101   445988
't'      116   315647
'o'      111   305115
'a'       97   265561
'h'      104   238932
's'      115   236293
'n'      110   235774
---------------------
total        3098485  (≈ 54% of all characters)
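
These counts come from a single pass over the file. A minimal Java sketch, under the same shakespeare.txt assumption as before, that tallies byte frequencies and prints the eight most common:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.IntStream;

public class CharFrequencies {
    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get("shakespeare.txt"));
        long[] count = new long[256];            // one counter per possible byte value
        for (byte b : bytes) count[b & 0xFF]++;  // mask to get an unsigned index
        IntStream.range(0, 256).boxed()
            .sorted((i, j) -> Long.compare(count[j], count[i]))  // highest count first
            .limit(8)
            .forEach(i -> System.out.printf("%4d  %8d%n", i, count[i]));
    }
}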

Question 2

How could we exploit frequency counts to further compress shakespeare.txt?

Data Compression Attempt 2

Have two tables: one for frequent characters, another for infrequent characters

  • only 8 frequent chars, so need only 3 bits to encode each
  • still < 128 infrequent chars, so use 7 bits to encode each
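
(These lengths work because $2^3 = 8$ exactly covers the frequent characters, and the remaining $114 - 8 = 106$ characters fit within the $2^7 = 128$ available 7-bit codewords.)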

Question 2.1

If we have two character tables, how can we decode an encoded string?

Frequent: 
' ' -> 000
'e' -> 001
...

Infrequent:
'r'  -> 0000000
'i'  -> 0000001
'\n' -> 0000010
...

How to decode 000000101001100...?

An Issue!

When decoding, how do we distinguish a frequent character from an infrequent one just by reading the encoded bits?
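
For example, with the two tables above, the bits 0000001 could be the single infrequent character 'i' (0000001), or two frequent spaces (000 000) followed by the start of another codeword; nothing in the bit stream itself says which reading is correct.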

A Solution?

Use an extra bit to indicate if the following (encoded) character is frequent (3 bits) or infrequent (7 bits).

Frequent: 
' ' -> 0000
'e' -> 0001
...

Infrequent:
'r'  -> 10000000
'i'  -> 10000001
'\n' -> 10000010
...

Now frequent characters use 4 bits (always starting with 0), infrequent characters use 8 bits (always starting with 1)

Example

Decode the string 010010000000100000010100

' ' -> 0000
'e' -> 0001
't' -> 0010
'o' -> 0011
'a' -> 0100

'r' -> 10000000
'i' -> 10000001
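
Reading left to right: the first bit is 0, so 0100 is the frequent character 'a'; the next bit is 1, so the next 8 bits 10000000 give the infrequent 'r'; then 10000001 gives 'i'; and the final 0100 is 'a' again. The string decodes to "aria".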

Structure of Encoded Document

Start scanning bits from the first bit:

  • if first bit is 0, first four bits encode a frequent character
  • if first bit is 1, first eight bits encode an infrequent character
  • once a character is decoded, the next bit tells us if following character is frequent or infrequent
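
One way this scanning loop might look in code, as a Java sketch with the partial tables from the example hard-coded (a real implementation would build them from the frequency counts):

import java.util.HashMap;
import java.util.Map;

public class TwoTableDecoder {
    static final Map<String, Character> FREQUENT = new HashMap<>();
    static final Map<String, Character> INFREQUENT = new HashMap<>();
    static {
        FREQUENT.put("0000", ' ');
        FREQUENT.put("0001", 'e');
        FREQUENT.put("0010", 't');
        FREQUENT.put("0011", 'o');
        FREQUENT.put("0100", 'a');
        INFREQUENT.put("10000000", 'r');
        INFREQUENT.put("10000001", 'i');
    }

    static String decode(String bits) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < bits.length()) {
            if (bits.charAt(i) == '0') {   // leading 0: frequent, 4 bits total
                out.append(FREQUENT.get(bits.substring(i, i + 4)));
                i += 4;
            } else {                       // leading 1: infrequent, 8 bits total
                out.append(INFREQUENT.get(bits.substring(i, i + 8)));
                i += 8;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("010010000000100000010100"));  // prints "aria"
    }
}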

Question

If we use frequent/infrequent character encoding, what is the resulting size of shakespeare.txt?

  • 5.8 M characters
  • 1/2 are frequent -> 4 bits each
  • < 1/2 are infrequent -> 8 bits each

Strategies So Far

  1. ASCII: every possible character gets 8 bits
    • 5.8 M characters => 5.8 MB
  2. Re-encode only used characters
    • all (used) characters encoded w/ 7 bits
    • resulting encoding uses 5.8 M × 7/8 ≈ 5.1 MB
    • compression ratio $7 / 8 = 87.5\%$
  3. Re-encode frequent and infrequent characters separately
    • 8 frequent chars account for more than 1/2 of chars
    • encode remaining chars with 8 bits
    • Size: ~ 5.8 M × (0.5 × 4 + 0.5 × 8)/8 ≈ 4.3 MB
    • compression ratio $3/4 = 75\%$

Can We Do Better?

Why limit ourselves to just two types of characters (frequent/infrequent)?

General Situation:

  • every character gets assigned a codeword
    • codeword is a binary string (some number of 0s and 1s)
  • lengths of codewords may be different

Example:

' ' -> 00
'e' -> 010
't' -> 011
'o' -> 101
...
'X' -> 10011011

Two Questions

  1. What properties of codewords are required to enable us to decode an encoded text?

  2. What properties of codewords are desired to enable us to compress the original text?

Question 1

What properties of codewords are required to enable us to decode an encoded text?

Prefix Codes

Unique decodability:

  • When reading individual bits, we must know when we've reached the end of a character's codeword

  • Cannot have: one codeword is 1001 and another codeword starts 1001...

  • We say 1001 is a prefix of 1001011

Definition. A set of codewords is a prefix code if no codeword is a prefix of any other.

Examples.

  1. ASCII
  2. $7$-bit Shakespeare encoding
  3. $4,8$-bit Shakespeare encoding

Prefix Codes and Trees

Any prefix code can be represented as a binary tree!

Start at root:

  1. label children of each node 0 and 1
  2. label leaves with characters
  3. codeword associated with a leaf is sequence of 0s and 1s along the path from root to the leaf

Example

Construct the binary tree for

'a' -> 00
'b' -> 01
'c' -> 101
'd' -> 111
'e' -> 1101
'f' -> 1100

Example

Use previous tree to decode 1100001011101111
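
Tracing the tree: 1100 → 'f', 00 → 'a', 101 → 'c', 1101 → 'e', 111 → 'd', so the bits decode to "faced". The same walk is easy to mechanize. A minimal Java sketch (names illustrative) that builds the tree from the codeword table and decodes by following one bit per edge:

public class PrefixTreeDecoder {
    // a leaf holds a character; internal nodes hold 0 ('\0')
    static class Node {
        char c;
        Node left, right;   // left edge = 0, right edge = 1
    }

    // add one codeword to the tree, creating internal nodes along the path
    static void insert(Node root, String code, char c) {
        Node cur = root;
        for (char bit : code.toCharArray()) {
            if (bit == '0') {
                if (cur.left == null) cur.left = new Node();
                cur = cur.left;
            } else {
                if (cur.right == null) cur.right = new Node();
                cur = cur.right;
            }
        }
        cur.c = c;
    }

    // walk one bit at a time; emit the character at each leaf, then restart
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.left : cur.right;
            if (cur.left == null && cur.right == null) {  // reached a leaf
                out.append(cur.c);
                cur = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = new Node();
        insert(root, "00", 'a');
        insert(root, "01", 'b');
        insert(root, "101", 'c');
        insert(root, "111", 'd');
        insert(root, "1101", 'e');
        insert(root, "1100", 'f');
        System.out.println(decode(root, "1100001011101111"));  // prints "faced"
    }
}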

Question 2

What properties of codewords are desired to enable us to compress the original text?

Huffman Coding

Idea. Start with all characters together with their frequency counts

  • each distinct character corresponds to a leaf in the encoding tree
  • each node gets a weight
    • weight of a leaf = frequency count
    • weight of an internal node = sum of the frequencies of its descendant leaves

Then: form a tree by “merging” nodes by adding a parent

  • pick the two lightest nodes without parents: $u$, $w$
  • create a parent $v$ for $u$ and $w$
  • continue until all nodes are connected

Huffman, Illustrated

Build Huffman tree for text ABAAABBAACCBAAADEA
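
With ties broken arbitrarily, one possible run (frequencies: A = 10, B = 4, C = 2, D = 1, E = 1; 18 characters total):

  • merge D and E: new parent of weight 2
  • merge that node with C: weight 4
  • merge that node with B: weight 8
  • merge that node with A: weight 18 (the root)

One resulting code: A → 0, B → 10, C → 110, D → 1110, E → 1111.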

Huffman More Formally

Node stores:

  • a char c (0 if internal Node)
  • an int weight
  • left (0) and right (1) child (both null if leaf)

Huffman Procedure

  1. Create a Node for each distinct character in text, weight is character frequency
  2. Add Nodes to a collection c
  3. While c.size() > 1:
    • remove 2 nodes from c with smallest weights: u, w
    • create new node v
      • v’s children are u and w
      • v.weight = u.weight + w.weight
    • add v to c
  4. Set tree root to unique Node in c
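
A direct Java translation of this procedure, as a sketch: java.util.PriorityQueue serves as the collection c (a min-heap ordered by weight), and the class layout is illustrative rather than the assignment's required design.

import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanBuilder {
    static class Node {
        char c;             // 0 ('\0') for internal Nodes
        int weight;
        Node left, right;   // left = 0, right = 1; both null for leaves

        Node(char c, int weight, Node left, Node right) {
            this.c = c; this.weight = weight; this.left = left; this.right = right;
        }
    }

    static Node buildTree(Map<Character, Integer> freq) {
        PriorityQueue<Node> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        freq.forEach((ch, w) -> pq.add(new Node(ch, w, null, null)));
        while (pq.size() > 1) {
            Node u = pq.poll();   // two lightest parentless nodes
            Node w = pq.poll();
            pq.add(new Node('\0', u.weight + w.weight, u, w));  // their new parent
        }
        return pq.poll();         // the unique remaining Node is the root
    }

    public static void main(String[] args) {
        Map<Character, Integer> freq = Map.of('A', 10, 'B', 4, 'C', 2, 'D', 1, 'E', 1);
        Node root = buildTree(freq);
        System.out.println("root weight: " + root.weight);  // 18
    }
}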

Question

Given a Huffman tree, how do we compute the resulting file size?
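
One way to answer: each occurrence of a character $c$ costs $d(c)$ bits, where $d(c)$ is the depth of $c$'s leaf (equivalently, the length of its codeword), so the encoded text occupies $\sum_c \mathrm{freq}(c) \cdot d(c)$ bits. For the ABAAABBAACCBAAADEA tree above, that is $10 \cdot 1 + 4 \cdot 2 + 2 \cdot 3 + 1 \cdot 4 + 1 \cdot 4 = 32$ bits, versus $18 \cdot 8 = 144$ bits in ASCII.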

Remarkable Fact

Theorem. Among all possible prefix codes for a given text, Huffman codes give the smallest possible encoded text.

Homework 10

Implement Huffman coding

  • Think about ADTs and data structures you’ll need
    • don’t need to implement new containers from scratch
  • Measure the size & compression ratio of encoding for different texts