Fun with Lazy Arrays: the LZ77 Algorithm

18/06/2009

This is my third post investigating compression techniques related to the DEFLATE algorithm: the first on run-length encoding, and the second on simple Huffman Coding. This post models the LZ77 algorithm, the second of the two compression strategies used by DEFLATE, and in the process explores some interesting properties of Haskell’s basic Arrays.

IMPLEMENTATION:


> module LZ77
>     where

we will use GHC's "basic non-strict arrays" for this
experiment:

> import Data.Array


and use Ints to store the length of the entire decoded
message (needed to create our array):

> type Length = Int
> type Offset = Int

in place of the length in the standard length-offset pair
I've decided to use an Int representing the index of the
last element in the span relative to the element at
offset. Thus the pair encoding a span of only one element,
two elements back would be (0,2), rather than the more
traditional (1,2).

This simply makes more sense to me, especially since I want
to be able to encoded reversed sequences, as you will see.

> type Index = Int

and a few simple dataypes for our uncompressed and
compressed data respectively:

> type Decoded a = Array Index a
>
> data Encoded a = 
>     Enc Length [ Either a (Index,Offset) ] 


The decompress function works by traversing the encoded message,
keeping track of our array index position (since offsets are
relative to the current position), and building an Array lazily
from a list which we generate, in part by referencing elements
from the partially generated array itself.

So when we see a Right value we look up in the Array the elements
referenced by the length-offset, concat-ing that list with the
result of processing the rest of the encoded message.

If we hit [] in 'dec' we call an error because the stored value
for the length of the uncompressed message in the Encoded type
was longer than what the 'decompress' function could produce.

It's diffficult to describe, but I hope the code is clear:

> decompress :: Encoded a -> Decoded a
> decompress (Enc el es) = decoded
>     where decoded = listArray (0,el - 1$ dec 0 es
>          
>           dec  _     [] = 
>               error "message is shorter than it should be" 
>
>           dec n (Left x : xs) = 
>               x : dec (n+1) xs
>          
>           dec n (Right (iRel,off) : xs) =
>               let i1 = n  - off
>                   i2 = (if iN > i1 then succ else pred) i1 
>                   iN = i1 + iRel
>                
>                in [ decoded!| i <- [i1, i2 .. iN] ] ++ 
>                   dec (n + 1 + abs iRel) xs


Some interesting things about the code above:

    1) we create an array from a list, which we build,
       in part, by looking up elements from the array
       we are in the process of building.

    2) we can compress a sequence of symbols which were
       seen previously but in reverse order, simply by
       storing a negative relative index in the
       (relative_index,offset) tuple. So, the string...
        
            "her racecar returns to race"
      
       might compress to:
            
            {her race(-5,2)turns to(4,19)}

       I'm not sure if this is useful in real compression,
       especially when it comes down to the binary encoding.

    3) even more interesting, we can use this same decoder
       function to decompress data that matches a sequence
       in parts of the array we haven't built yet! We simply
       use a negative offset in our tuple.

EXAMPLES AND CONCLUSION:


point (3) may or may not be something like the LZ78 algorithm,
which apparently works by encoding future data, but it is
defintely a cool thing to be able to do with arrays:

> coolArray = listArray (0,4
>        [0,  coolArray!4 - 3,  2,  coolArray!1 + 2,  4]


Here's an example with great compression where the relative
index exceeds the offset:

> exceedsOffset = elems $ decompress $
>       Enc 25 [Left 'B'Left 'l'Left 'a'Left 'h'Left ' ',
>               Left 'b'Right (17,5), Left '!']


...and a code example for (2) above:

> reverseReference = elems $ decompress $
>       Enc 27 [Left 'h',Left 'e',Left 'r',Left ' ',Left 'r',
>               Left 'a',Left 'c',Left 'e',Right(-5,2),Left 't',
>               Left 'u',Left 'r',Left 'n',Left 's',Left ' ',
>               Left 't',Left 'o',Right(4,19)]

...and finally an example combining (2) and (3):

> reverseLookAhead = elems $ decompress $ 
>       Enc 5 [Left 1Right (-1,-3), Left 3Left 4]


I was surprised to discover these properties in Haskell's lazy
Arrays. hope they came as a surprise to a few others.

2 Comments

Run-length Encoding

26/05/2009

Just a quick implementation of a RLE algorithm for lists in haskell. We compress a list by converting “runs” of consecutive elements into a tuple of the form: (run_length, element).


import Data.List (group)
import Control.Arrow


runLengthEnc :: Eq a => [a] -> [(Int,a)]
runLengthEnc = map (length &&& head. group

decode :: [(Int, a)] -> [a]
decode = concatMap (uncurry replicate)

If the &&& combinator looks foreign to you, check out David R. Maciver’s very enlightening blog about Arrow functions.

I’m always curious to see how naive-looking functions like the above compare in performance to a from-scratch implementation with explicit recursion, so I came up with the following:


runLengthEnc' :: Eq a => [a] -> [(Int,a)]
runLengthEnc' [] = []
runLengthEnc' (a:as) = run 1 a as
    where run n x [] = [(n,x)]
          run n x (x2:xss) | x == x2   = run (n+1) x xss
                           | otherwise = (n,x) : run 1 x2 xss

I tested both functions on a list of 100,000 random 1s and 0s and found the explicit version to be only marginally better performing, completing the list in 49 ticks & 130Mb, compared to 54 ticks & 139 Mb for the one-liner!

2 Comments

Gnome Sort

11/05/2009

In my boredom, I implemented the toy sorting algorithm called Gnome Sort. It accomplishes backtracking by using a zipper-like system of two lists treated as stacks, and is about 100X slower than GHC’s merge sort on the test I used. It’s also not especially small or simple-looking. It’s basically terrible:


gnomeSort :: (Ord a)=> [a] -> [a]
gnomeSort = sort [] 
      where sort rs     []     = rs 
            sort []     (x:xs) = sort [x] xs
            sort (r:rs) (x:xs) 
                   | x > r     = sort rs (x:r:xs)          
                   | otherwise = sort (x:r:rs) xs

The above doesn’t keep track of where it left off before it began to backtrack, an obvious optimization which turned the algorithm into an insertion sort:


import Data.List (insert)

insertSort :: (Ord a)=> [a] -> [a]
insertSort = sort [] 
       where sort rs []     = rs 
             sort rs (x:xs) = sort (insert x rs) xs

about 60X slower than Data.List.sort
.

No Comments