Huffman Coding

31/05/2009

This is the second of probably three posts (the first was on run-length encoding in Haskell) inspired by Thomas Guest’s interesting article on the Deflate algorithm. This is also my first post in literate haskell. Please post any improvements to the code if you have them!

For a refreshingly readable introduction to Huffman Coding and the Deflate algorithm, have a look at this short explanation by Antaeus Feldspar.


First some boilerplate...

> module Huffman
>     where
>
> import Data.List (sort, insert)
> import qualified Data.Map as M
> import Data.Function (on)
> import Control.Arrow (second)

We make an abstract datatype for binary digits. I wonder if this would
be a nice (but slow?) way of working with binary IO. There doesn't seem
to be a package on Hackage for this

> data Bit = O | I
>            deriving (ShowOrdEq)

we define a simple binary tree useful for decoding the encoded binary
stream. the simple algorithm for assigning codes to our symbols
produces this tree. A completed tree for some text containing only
three letter might look like this:

     0    /\    1
         /\ A  
        C  B

> data HTree a = Branch {zer :: (HTree a), one :: (HTree a),
>                        wt  :: Int }
>              | Leaf {symb :: a,  wt :: Int }
>               deriving (Show)

maps from <Symbol> to <Binary Code>, we create this from the HTree
built from the list of weighted symbols. the HTree can't be used for
encoding.

> type CodeDict a = M.Map a [Bit]

Now some instance declarations so that we know how to order (by weight):

> instance Ord (HTree a) where
>     compare = compare `on` wt
>
> instance Eq (HTree a) where
>     (=== (==`on` wt


our function for merging two branches which we use to build the HTree
from the list of symbols and their corresponding weights. no point in
defining a Monoid instance, but we'll tip our hat to it:

> mappend t1 t2 = Branch t1 t2 (wt t1 + wt t2)


BULDING THE CODING TREES


    
We assign codes to our weighted symbols using a simple algorithm
which takes the trees (initially Leaves) with the lowest weights
and combines them (and their weights) and inserts them back into
the list until they have been combined into a single tree:

> buildDecTree :: [(a,Int)] -> HTree a
> buildDecTree = build . sort . map (uncurry Leaf)
>     where build (t:[])     = t
>           build (t1:t2:ts) = build $ insert (t1 `mappend` t2) ts


now convert the binary tree representation to a dictionary form for
encoding. we decompose the tree into a list from the top down, mapping
either a 1 or 0 over the flattened children. a little confusing:

> buildEncDict :: (Ord a) => HTree a -> CodeDict a
> buildEncDict = M.fromList . build 
>     where build (Leaf t _)     = (t,[]) : []
>           build (Branch a b _) = mapBit O a ++ mapBit I b
>            -- build up the codes in the snd of each Leaf's tuple:
>           mapBit b = map (second (b:)) . build


(EN/DE)CODING FUNCTIONS



To encode a list of symbols for which we've generated an HTree, we just
map over it, looking up it's code in our Map dictionary:

> encode :: (Ord a) => CodeDict a -> [a] -> [Bit]
> encode d = concatMap (d M.!)


To decode we simply read a Bit at a time from the input, at the same time
traversing the HTree (going left when we encounter a zero, and vice versa).
When we hit a Leaf (the end of a code) we return the symbol and go onto
the next bit from the top of the HTree once again.

> decode :: HTree a -> [Bit] -> [a]
> decode t [] = []
> decode t bs = dec bs t
>     where dec bs' (Leaf x _) = x : decode t bs'
>           dec (O:bs') (Branch l _ _) = dec bs' l 
>           dec (I:bs') (Branch _ r _) = dec bs' r 



UPDATE: I scrapped and re-did this section. Should be a little better now.:


USAGE EXAMPLES



I realize I need a more compelling example and some explanation. First,
imagine we want to encode into binary the following phrase:

> phrase = "twenty bytes of text"

we could represent it in ASCII but that would be wasteful of space, using
a whole byte per character when we are using only ten of the 256 possible
codes in the ASCII character set.

So we generate a list of the characters to encode along with their
frequencies (symbols with higher frequencies will be given shorter
prefix codes, saving space!):

> huffmanTree' = buildDecTree [('t',5),('e',3),('y',2),('w',1),('n',1),
>                              ('b',1),('s',1),('o',1),('f',1),('x',1),
>                              (' ',3)]

...and encode it using the dictionary built from the code tree we just
built:

> encodedPhrase = let dict = buildEncDict huffmanTree'
>                  in encode dict phrase

This yields the following binary stream of 8 bits (vs. 20 if we had used
ASCII encoding):

[I,O,O,O,I,O,I,I, O,O,O,I,I,I,O,O, I,O,I,I,I,O,O,O, O,O,I,O,I,O,I,I,
 O,O,O,O,I,I,I,I, O,I,I,I,O,O,I,I, I,I,I,I,I,I,O,I, I,O,O,I,I,O,I,O]

Of course we have to encode instructions to build our tree along with the
above message.

There are 7 comments in this article:

  1. 31/05/2009Ralph Hodgson says:

    Thanks for posting this. Thought you might like to see another example:

    > myString = “the quick brown fox jumps over the lazy dog’s back”

    Assigning the same number to every letter (todo – I need to check if this is the right)

    > myHuffmanTree = buildDecTree [('a',1),('b',1),('c',1),
    > ('d',1),('e',1),('f',1),
    > ('g',1),('h',1),('i',1),
    > ('j',1),('k',1),('l',1),
    > ('m',1),('n',1),('o',1),
    > ('p',1),('q',1),('r',1),
    > ('s',1),('t',1),('u',1),
    > ('v',1),('w',1),('x',1),
    > ('y',1),('z',1),('\'',1),
    > (' ',1)]
    >
    > myEncodedString = let dict = buildEncDict myHuffmanTree
    > in encode dict myString

    … ‘myString’ compresses to 238 bits (30 bytes). myString is 50 bytes:

    [O,I,O,O,I,I,I,I,O,I,I,I,I,I,O,O,O,O,I,O,I,O,I,O,O,I,I,I,O,I,O,O,I,O,I,I,O,O,O,I,O,O,O,O,O,O,O,I,I,I,O,I,I,O,I,O,I,I,I,O,I,O,O,O,I,I,O,O,I,O,I,I,I,O,O,O,I,I,I,I,I,I,I,O,I,O,O,O,I,I,O,I,O,O,O,I,I,O,O,I,I,O,I,I,I,O,I,O,I,I,O,I,O,I,O,I,O,I,O,O,O,O,O,O,I,I,O,I,O,O,O,I,I,I,I,I,I,I,I,O,O,I,O,I,I,O,O,O,I,O,I,O,O,I,I,I,I,O,I,I,I,I,I,O,O,O,O,I,I,O,O,O,I,I,I,O,I,O,O,O,I,I,O,O,I,O,O,O,O,I,I,I,O,O,I,I,O,I,O,O,I,I,I,O,O,O,O,O,O,O,I,O,O,O,O,O,O,I,I,I,O,I,I,I,I,O,I,O,I,I,O,O,O,I,O,O,O,O]

    …and it correctly decompresses:

    > myDecodedString = decode myHuffmanTree myEncodedString

  2. 31/05/2009jberryman says:

    @Ralph Hodgson:

    Thanks for commenting. I see I should have provided a more compelling example, but got lazy towards the end. Will mend that.

    In the one you provided, you could assign a weight of two to ‘a’, ’s’, and ‘c’, because they appear more frequently (I think you just got the “quick brown fox” phrase a little wrong though)

  3. 7/06/2009Dan Rosén says:

    Frequency distribution:

    freq = map (head &&& length) . group . sort

    > freq “the quick brown fox jumps over the lazy dog’s back”
    [(' ',9),('\'',1),('a',2),('b',2),('c',2),('d',1),('e',3),('f',1),('g',1),('h',2),('i',1),('j',1),('k',2),('l',1),('m',1),('n',1),('o',4),('p',1),('q',1),('r',2),('s',2),('t',2),('u',2),('v',1),('w',1),('x',1),('y',1),('z',1)]

    Nice tutorial :)

  4. 7/06/2009Bart Massey says:

    See http://wiki.cs.pdx.edu/bartforge/rle and http://wiki.cs.pdx.edu/bartforge/huffman for my sample implementations of these functions. Your literate Haskell is much cleaner.

    My frequency counter maintains a map from symbol to count. I haven’t benchmarked it, but I suspect that it’s faster than group.sort .

    Thanks for an interesting post!

  5. 11/06/2009jberryman says:

    @Bart Massey:

    Thanks, I’m looking forward to checking out your code!

  6. 18/06/2009Lazy Arrays and the LZ77 Algorithm in Haskell | LAMBDAPHONE says:

    [...] related to the DEFLATE algorithm: the first on run-length encoding, and the second on simple Huffman Coding. This post models the LZ77 algorithm, the second of the two compression strategies used by DEFLATE, [...]

  7. 7/09/2009Boss Resurfacing says:

    Off topic – Help with PM?
    lost password
    Boss Resurfacing
    Boss Resurfacing

Write a comment: