Archive of published articles on April, 2009

Back home

Some initial tests of Tries

16/04/2009

I have been interested in writing a Trie implementation to see how its performance compares to using Data.Map for storing dictionary-like data. I wrote a minimal implementation of Tries which exports insert,insertWith, and lookup definitions and can be used in place of Data.Map for those functions. I tested the implementation using a program frequency.hs which reads a source file and uses insertWith on each word to count the frequency of the words in a file; we then use lookup to print out the frequencies of a list of arbitrary words.

The Implementation:

The Trie uses a Data.Map at each node to provide quick access to each branch; I performed a test with a simple unbalanced binary tree for storing branches, but the performance was slightly worse.

We also store the unused tail of a list-key in a special ValBucket constructor so that we don’t need to store a bunch of singleton Maps. I was curious if laziness would provide the same benefit automatically (as the remainder of the list should be stored as a thunk until another key with the same prefix comes along, forcing evaluation), but my version without the buckets was pretty significantly slower.

Here is the Trie type:


data Trie a v = Node    { branches :: M.Map a (Trie a v) }
              | ValNode { branches :: M.Map a (Trie a v),
                          val :: v }    
              | ValBucket { bucket :: [a], 
                            val :: v  }
              | Val { val :: v }
              | Empty
               deriving (Show)

Initial Testing Results

It is quite likely that there are obvious performance problems with my Trie implementation and I know that my test script doesn’t provide a very good look at the data structure (it doesn’t provide much of a benchmark for lookup, spends an inordinate amount of time preparing the text for insertion, etc.)

I was disappointed with the results which showed no improvement with Trie when compared with the same test using Data.Map. Here is a quick graph comparing execution times of frequency.hs using Map vs Trie as the size of the input text increases:

(note: tests compiled in GHC using -O1 optimization, and timed with Benchpress)

Based on the improvement I saw using the buckets for the string tails I want to see if extending the idea and turning the module into a radix trie will improve performance.

I’m hoping people will be interested in looking at my code and providing some feedback. You can use:

$ darcs get http://coder.bsimmons.name/code/Trie/

If you want to download just the Trie module you can get it here.

UPDATE:

Here are the results of a quick test in GHCi comparing the effects of using a Data.Map.Map vs. an Unbalanced Binary Tree vs. simple Lists to store the branches at each node:

test run-time (ms) for:
Input File Size (kB) | Data.Map Simple Tree List
96.7 618 1015 1998
596.5 10225 25255 71236

Data.Map, as expected, is the winner. It’s possible that with very small Tries that either lists or the simple trees would have slightly better performance.

Next on my agenda is to perform the following test: build up a Trie from a reduced english dictionary, then parse a random block of text, printing the definition of each word. I hope the Trie will fare better in this test.

UPDATE 2:
I just noticed one property of the Data.Map library that probably makes it less optimal than it could be for this application: the Map type is strict in all its constructors except the in the value of the key/value pair (which is where we are storing the recursive Trie paths from a node. I suppose this also means that -funbox-strict-fields has little effect on the structure.

In Trie.hs I replaced the M.insertWith function call (from Data.Map) with the strict insertWith' function and got some improvement in CPU usage.

UPDATE 3:

You can read some of my conclusions in this post.

10 Comments

Cycle Detection

12/04/2009

The Algorithm

We can use Brent’s Algorithm to detect infinitely-repeating sequences in a list of values generated by some iterated function: that is, any list in which the next value in the sequence is generated from the previous value alone; if we find duplicate values in the list, we know we have a cycle.

The implementation below isn’t particularly elegant, and since I want to use it as a stand-alone tool I’m having it output strings:


module Main 
    where

cycling :: (Show a, Eq a) => Int -> [a] -> String
cycling k []     = "Empty list"
cycling k (a:as) = find 0 a 1 2 as
    where find _ _ c _  [] = "reached end at " ++ show c ++": no cycles"
          find i x c p (x':xs) 
            | c >  k    =  "no cycles after " ++ show k
            | x == x'   =  "cycle at "++ show c ++": "++(show$take (c-i) xs)
            | c == p    = find c x' (c+1) (p*2) xs 
            | otherwise = find i x  (c+1) p xs

  --- SOME RANDOM NUMBER GENERATORS TO TEST ---
-- bad generators?:
g1 = 0 :      [ (g*7 + 1)   `mod`   32       | g <- g1]
g2 = 17 :     [ (g*22 + 221)   `mod`  2^32   | g <- g2]
g3 = 3249 :   [ (g*22695477 + 1)  `mod` 300  | g <- g3]
g4 = 234587 : [ (g*22695476 + 1`mod` 2^32  | g <- g4]

-- good generators:
g5 = 294587 : [ (g*22695477 + 1`mod` 2^32  | g <- g5]
g6 = 0 :      [ (g*22695477 + 1`mod` 2^32  | g <- g6]

Randomness is hard…

Linear Congruential Generator’s are notoriously easy to screw up with the wrong parameters. Let’s use our function to test just how finicky this RNG algorithm can be. random stream g4 is almost identical to g5, a known-good generator. Let’s compare the two with a cut off limit of 999,999. First the good:

*Main> cycling 999999 g5
"no cycles after 999999"

Great, now let’s see how our typo-ed generator fares:

*Main> cycling 999999 g4
"cycle at 17: [180790021]"

… yikes.

No Comments

Parallel List Comprehensions with a Monte Carlo example

8/04/2009

GHC has an extension to the list comprehensions syntax that replicates the functionality of zipWith, by allowing you to have two generators running in parallel. Turn on the extension in GHCi like so:

Prelude> :set -XParallelListComp

I use the extension below to generate an infinite list of random coordinates (scaled to a  1000×1000 grid) using two different  Linear Congruential Generators running in parallel. It should be simple to modify the code below to actually  run the generators concurrently using Data Parallel Haskell (although I haven’t had a chance to play with that yet).


randPoints :: (Integral i) => i -> [(i,i)]
randPoints s = drop 1 [ (scale x, scale y) | x <- g1 | y <- g2 ]
    where -- two Linear Congruential Generators (from Borland C, glibc):
          g1 = s : [ (g*22695477 + 1)   `mod`   2^32   | g <- g1]
          g2 = s : [ (g*1103515245 + 12345`mod` 2^32 | g <- g2]
          -- adjust the random output to fit a 1000x1000 plane:
          scale v = (v `mod` 1000- 500

Then we can use another list comprehension, this time with boolean guards, to generate random points that lie within an outer circle (and outside an inner circle):


monteDonut = [p | p@(x,y) <- randPoints 42let b= x^2+y^2, b< 500^2, b> 10000]

Finally from GHCi we can use the Gnuplot library to easily plot the points we generated:

*Main> :m + Graphics.Gnuplot.Simple
*Main> plotDots [Aspect(Ratio 1), XTicks Nothing, YTicks Nothing] (take 10000 monteDonut)

…and we get a nice Monte Carlo donut with sprinkles!:

random dot plot defining a circle
2 Comments