Archive of published articles on April, 2009

Back home

a DirectoryTree module and some examples

30/04/2009

UPDATE: I’ve just released this as my first package on hackage. You can read more and write any comments here.

I just put together a library that provides a simple tree data structure to represent the structure of files and directories in the OS. It provides some simple IO functions like readDirectory (analogous to readFile) which “opens” a directory, returning a DirTree of Strings in the IO monad.

The nice thing is that by defining a simple instance for Traversable (and the default instances for Foldable and Functor that we get for free) we get a whole array of nice functions which we can apply to directory structures! For example, we can combine all the text in a directory of files with:


combineFiles :: FilePath -> IO [Char]
combineFiles d = do (_ :/ t) <- readDirectory d
                    return $ F.foldr1 (++) t

(the (_:/t) portion ignores the base directory returned). The DirTree type also includes a constructor for handling failures. Here is the type definition:


data DirTree a = Dir { name     :: FileName,
                       contents :: [DirTree a]  } --files + directories
               | File { name :: FileName,
                        file :: a }
               | Failed { name :: FileName,
                          err  :: Exception }
                 deriving (ShowEq)

I have created a file with three examples:

  1. in the first we simulate the command darcs initialize to illustrate creating a directory of files by hand, and writing it to disk.
  2. second, we show combining several different directories from around the filesystem and assembling the into a new tree structure.
  3. lastly, we use our readDirectoryWith (with Data.ByteString.readFile), and our Foldable instance to hash all the files in a directory structure (with Thomas DuBuisson’s pureMD5 package) and compare it to the hash of a different directory to see if the contents match exactly

There are probably bugs, and there are many useful functions I can think of to add if people think this is useful. I would be interested in hearing your thoughts on the interface, on any functionality you would like added, and anything else!

You can get the module and examples.hs with:

$ darcs get http://coder.bsimmons.name/code/DirectoryTree/

1 Comment

Why I love Hoogle

25/04/2009

I’m working on a module called DirectoryTree, a simple tree structure mirroring the structure of a directory tree in the outside world, with functions to hopefully make it easy to do useful things (I’m not sure if something like this already exists). Here is the data-type:


data DirTree a = Dir { name  :: String,
                       files :: [a],
                       directories :: [DirTree a]  }
               | Fail String    --file/directory name

I would like to do things like map over a DirTree of FilePaths, returning a tree (in the IO monad) of Handles… so the code involves a lot of slogging through the IO monad, which I’m not totally comfortable with yet.

I’m feeling out how to proceed with the above and begin to define this function:


-- an abstraction for this? monad?:
ioMap :: (a -> IO b) -> DirTree a -> IO (DirTree b)

…there’s something happening here but I don’t know what it is; let’s see what Hoogle can tell me about this abstraction. I take the type definition and plug it into firefox’s hoogle search plugin and voila!; I’m presented with the following matches:

Data.Traversable:
traverse :: (Traversable t, Applicative f) => (a -> f b) -> t a -> f (t b)
for :: (Traversable t, Applicative f) => t a -> (a -> f b) -> f (t b)
mapM :: (Traversable t, Monad m) => (a -> m b) -> t a -> m (t b)
forM :: (Traversable t, Monad m) => t a -> (a -> m b) -> m (t b)

I see that mapM from Data.Traversable looks like what I want, and traverse also looks useful.

I learned something, and have an idea of how to proceed, and that’s why I love hoogle :)

No Comments

Initial tests of Tries: Follow Up

18/04/2009

This is to wrap up my previous post about Tries and my attempts at an implementation, and to summarize some of the things I think I learned about implementing a Trie structure in Haskell.

Complexity is Slow

My initial idea of using a container type at each node to store the collection of branches was a good idea for testing (I could easily swap in Data.Map, or my own simple binary tree) and convenience (I could use Data.Map’s built-in functions to insert the element of the list key into the the container); but it turned out to be bad for performance.

One thing I noticed was that Data.Map is not strict in its ‘value’ field, which is where I was storing more Trie types with more Maps to hold more Trie types… this seems to be simply too complicated to be fast.

Simplicity wins over Balanced Trees

Magnus Jonsson replied with a complete re-working of my implementation which uses different constructors in the Trie type to handle both the linear node-to-node movement through the key, and the searching through the set of branches. This gives a good illustration of the concept. His is the winner in terms of performance.

He doesn’t use any techniques to keep the tree balanced but his implementation out-performs my Trie with its cumbersome (but balanced) Maps at each node. I also found that when I swapped in a module MyMap (a simple non-balanced tree) in place of Data.Map that the code was about twice as fast in both tests I performed.

I would like to play with approaches to keeping the Branch trees balanced (or almost-balanced) in a lightweight way. Perhaps a splay tree or an adaptation would make sense .

Radix Trees don’t make Sense

EDIT: Jake McArthur has pointed out that my characterization of traversing data structures in FP and conclusion in this paragraph are probably incorrect.

I chose to use a special “bucket” constructor to hold the tails of keys in their original list form when the tail was unique; e.g. when inserting the key “washing” into a Trie containing only the key “waste”, the “was” would overlap and each element 'w'--> 'a' --> 's' would reside in its own Trie constructor, but both the “te” and “hing” would be stored as intact lists in separate ValBucket constructors, since there is no need to touch them yet.I thought extending this idea and implementing a radix tree might be even more efficient.

But this doesn’t make much sense in FP. Imagine the list type [] was defined in a more verbose way as follows:


data List a = L a (List a)
            | X

In our Trie structure, a sequence of key elements which have no branches (which would be reduced to a bucket in the radix tree) might look something like this:

Key ‘w’ (Key ‘a’ (Key ’s’ (Branches … )))

…whereas the same situation in an attempted radix tree might look like:

KeyBucket (L ‘w’ (L ‘a’ (L ’s’ X))) (Branches …)

…which is essentially identical. And since in functional programming when we “traverse” a data structure (e.g. to compare two keys) we are actual destroying it and rebuilding it as we go, it stands to reason that it should be just as easy to rebuild our structure with a constructor of a different name (Key vs. List).

Storing the tail ends of keys when we can makes sense because we avoid destroying the structure until we have to inspect it.

Summary of Tests:

These are not scientific, but I include them to give you an idea of the performance differences I was seeing. Tests were run several times and averaged, and compiled with: ghc --make -prof -auto-all -O2 -funbox-strict-fields.

The first test frequency.hs is described in my last post. The second test dictLookup.hs reads an alphabetized list of words and builds a dictionary Trie from the word to the word’s line number. It then reads a random list of words from a file and prints (word, line_number) tuples.

Tests of frequency.hs:

MagnusTrie Trie (with MyMap) Data.Map Trie (with Data.Map)
ticks: 21 29 39 58
total alloc. (MB): 100 133 101 235

Tests of dictLookup.hs:

MagnusTrie Trie (with MyMap) Data.Map Trie (with Data.Map)
ticks: 17 20 18 46
total alloc. (MB): 99 133 103 215

The whole set of tests and files can be downloaded with:

$ darcs get http://coder.bsimmons.name/code/Trie/

Unpack the texts.tar.gz file before running any of the code. Thanks for the patience with the long post. Let me know your thoughts.

UPDATE:
I finally was able to test wren ng thornton’s bytestring-trie library with one of my tests. I couldn’t profile the code (GHC complained the libs were missing) but I ran dictLookup.hs with BenchPress and the bytestring-trie modified version looked about twice as fast.

15 Comments