Polishing a Functional Pearl: The Burrows-Wheeler Transform

7/11/2009

Here is a quick post to get me back into the swing of blogging:

I was looking through an old post on StackOverflow about clever functional code, and the best answer, given by “yairchu” was a nice version of the Burrows-Wheeler Transform, which is an algorithm for permuting a string such that it can be compressed more effectively by other algorithms. The code posted was (import Data.List assumed):


bwp :: (Ord a)=> [a] -> [a]
bwp xs = map snd $ sort $ zip (rots xs) (rrot xs)
rots xs = take (length xs) (iterate lrot xs)
lrot xs = tail xs ++ [head xs]
rrot xs = last xs : init xs

I saw I could improve/shorten this in a couple of obvious ways and came up with this:


bwp :: (Ord a) => [a] -> [a]
bwp = map snd . sort . rots 
rots xs = zip (tail $ iterate lrot xs) xs
lrot (x:xs) = xs ++ [x]

Still unsatisfied and even more obsessed I came up with this final, prettiest version, before forcing myself to give it up already:


bwp :: (Ord a)=> [a] -> [a]
bwp = map snd . sort . rots 
rots xs =  zip (lrot xs) xs
lrot = tail . tails . cycle

Unfortunately, this last version will croak if your string happens to look like “111111″ or “cAbcAb” because sort will keep trying to compare infinites lists.

Update: I did a short post on the Move To Front transform as a follow-up to this post.

1 Comment

Enumerating all n-integer Sets Which Sum to an Integer x

30/07/2009

I came up with what I think is an elegant use of monadic code, while working on a program to compute solutions to the Postage Stamp Problem. The problem asks:

Consider that an envelope is able to hold a limited number of postage stamps. Then, consider the set of the values of the stamps — positive integers only. Then, what is the smallest total postage which *cannot* be put onto the envelope?

The algorithm below takes a number of stamps that can fit and a total postage to apply and returns a list of all the sets of different stamp values that can be combined to form the target postage sum:


> sumLists :: Int -> Int -> [[Int]]
> sumLists = f 0 where
>     f _ 1 n = return [n]  
>     f i c n = do a  <- [i.. div n c] 
>                  as <- f (i+a) (c-1) (n-a)      
>                  return (a : as) 

Thus the potential stamp combinations for a letter which can hold three stamps and costing 5 cents to mail would be:

*Main> sumLists 3 5
[[0,0,5],[0,1,4],[0,2,3],[1,1,3],[1,2,2]]

As you can see, the algorithm also produces ordered lists naturally!

But perhaps all we care about are unique stamp values that. If we want to adjust our goal such that the algorithm returns a list of all sets of unique stamp values sufficient to create the target postage, we need only adjust the return line:


>                  return (a : dropWhile (==a) as) 

Similarly we could drop the zeros (which mean “no stamp” in our case) from the beginnings of each set if we desired. The above code have applications to many similar problems including Subset Sum, and Knapsack as well as other partitioning problems.

No Comments

Run-length Encoding

26/05/2009

Just a quick implementation of a RLE algorithm for lists in haskell. We compress a list by converting “runs” of consecutive elements into a tuple of the form: (run_length, element).


import Data.List (group)
import Control.Arrow


runLengthEnc :: Eq a => [a] -> [(Int,a)]
runLengthEnc = map (length &&& head. group

decode :: [(Int, a)] -> [a]
decode = concatMap (uncurry replicate)

If the &&& combinator looks foreign to you, check out David R. Maciver’s very enlightening blog about Arrow functions.

I’m always curious to see how naive-looking functions like the above compare in performance to a from-scratch implementation with explicit recursion, so I came up with the following:


runLengthEnc' :: Eq a => [a] -> [(Int,a)]
runLengthEnc' [] = []
runLengthEnc' (a:as) = run 1 a as
    where run n x [] = [(n,x)]
          run n x (x2:xss) | x == x2   = run (n+1) x xss
                           | otherwise = (n,x) : run 1 x2 xss

I tested both functions on a list of 100,000 random 1s and 0s and found the explicit version to be only marginally better performing, completing the list in 49 ticks & 130Mb, compared to 54 ticks & 139 Mb for the one-liner!

2 Comments