<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>LAMBDAPHONE &#187; Map</title>
	<atom:link href="http://coder.bsimmons.name/blog/tag/map/feed/" rel="self" type="application/rss+xml" />
	<link>http://coder.bsimmons.name/blog</link>
	<description>fragmentary ideas  ䷿  intellectual what-nots  ䷷  and haskell programming  ䷴</description>
	<lastBuildDate>Sun, 29 Jan 2012 17:24:54 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Initial tests of Tries: Follow Up</title>
		<link>http://coder.bsimmons.name/blog/2009/04/initial-tests-of-tries-follow-up/</link>
		<comments>http://coder.bsimmons.name/blog/2009/04/initial-tests-of-tries-follow-up/#comments</comments>
		<pubDate>Sat, 18 Apr 2009 20:23:07 +0000</pubDate>
		<dc:creator>jberryman</dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[Map]]></category>
		<category><![CDATA[Tree]]></category>

		<guid isPermaLink="false">http://coder.bsimmons.name/blog/?p=112</guid>
		<description><![CDATA[<p><em>This is to wrap up my <a href="http://coder.bsimmons.name/blog/2009/04/some-initial-tests-of-tries/">previous post about Tries</a> and my attempts at an implementation, and to summarize some of the things I think I learned about implementing a Trie structure in Haskell.</em></p>
<h3>Complexity is Slow</h3>
<p>My initial&#8230; <a href="http://coder.bsimmons.name/blog/2009/04/initial-tests-of-tries-follow-up/" class="read_more">   [ R E A D &#124; M O R E ]</a></p>]]></description>
			<content:encoded><![CDATA[<p><em>This is to wrap up my <a href="http://coder.bsimmons.name/blog/2009/04/some-initial-tests-of-tries/">previous post about Tries</a> and my attempts at an implementation, and to summarize some of the things I think I learned about implementing a Trie structure in Haskell.</em></p>
<h3>Complexity is Slow</h3>
<p>My initial idea of using a container type at each node to store the collection of branches was good for testing (I could easily swap in Data.Map, or my own simple binary tree) and for convenience (I could use Data.Map&#8217;s built-in functions to insert each element of the list key into the container), but it turned out to be bad for performance.</p>
<p>One thing I noticed was that Data.Map is not <a href="http://www.haskell.org/haskellwiki/Performance/Data_types#Strict_fields">strict</a> in its &#8216;value&#8217; field, which is where I was storing more Trie types with more Maps to hold more Trie types&#8230; this seems to be simply too complicated to be fast. </p>
<h3>Simplicity wins over Balanced Trees</h3>
<p>Magnus Jonsson replied with <a href="http://magnus.hcoop.net/MagnusTrie3.hs">a complete re-working of my implementation</a> which uses different constructors in the <code>Trie</code> type to handle both the linear node-to-node movement through the key, and the searching through the set of branches. This gives a good <a href="http://en.wikipedia.org/wiki/Binary_tree#Encoding_n-ary_trees_as_binary_trees">illustration of the concept</a>. His is the winner in terms of performance.</p>
<p>He doesn&#8217;t use any techniques to <a href="http://en.wikipedia.org/wiki/Balanced_binary_tree">keep the tree balanced</a>, but his implementation out-performs my Trie with its cumbersome (but balanced) <code>Map</code>s at each node. I also found that when I swapped in a module <code>MyMap</code> (a simple non-balanced tree) in place of <code>Data.Map</code>, the code was about twice as fast in both tests I performed.</p>
<p>I would like to play with approaches to keeping the Branch trees balanced (or almost-balanced) in a lightweight way. Perhaps a <a href="http://en.wikipedia.org/wiki/Splay_tree">splay tree</a> or an adaptation would make sense.</p>
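<p>As a rough illustration of the binary-tree encoding linked above, here is a minimal sketch. It is hypothetical and is not the code from <code>MagnusTrie3.hs</code>; the names <code>Node</code>, <code>Nil</code>, <code>insertT</code> and <code>lookupT</code> are made up for this example. Each node carries one key element, a &#8220;child&#8221; link that moves forward through the key, and a &#8220;sibling&#8221; link that searches the alternatives at the current position. Siblings form a plain unbalanced list here.</p>
<pre><code>-- Hypothetical child/sibling trie; an illustration only, not MagnusTrie3.hs.
module Main where

-- Node: key element, value ending here (if any), child, sibling
data Trie k v = Nil
              | Node k (Maybe v) (Trie k v) (Trie k v)

-- Insert a value under a list key. Empty keys are ignored in this sketch.
insertT :: Eq k =&gt; [k] -&gt; v -&gt; Trie k v -&gt; Trie k v
insertT []     _   t   = t
insertT (x:xs) val Nil
  | null xs   = Node x (Just val) Nil Nil
  | otherwise = Node x Nothing (insertT xs val Nil) Nil
insertT key@(x:xs) val (Node k v child sibling)
  | x /= k    = Node k v child (insertT key val sibling)  -- search the alternatives
  | null xs   = Node k (Just val) child sibling           -- the key ends at this node
  | otherwise = Node k v (insertT xs val child) sibling   -- move along the key

lookupT :: Eq k =&gt; [k] -&gt; Trie k v -&gt; Maybe v
lookupT []  _   = Nothing
lookupT _   Nil = Nothing
lookupT key@(x:xs) (Node k v child sibling)
  | x /= k    = lookupT key sibling
  | null xs   = v
  | otherwise = lookupT xs child

main :: IO ()
main = do
  let t = insertT "washing" 2 (insertT "waste" (1 :: Int) Nil)
  print (lookupT "waste" t)    -- Just 1
  print (lookupT "washing" t)  -- Just 2
  print (lookupT "was" t)      -- Nothing
</code></pre>
<p>Only an <code>Eq</code> constraint is needed because siblings are scanned linearly. Ordering them as a search tree would need <code>Ord</code> and some balancing policy.</p>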
<h3>Radix Trees don&#8217;t make Sense</h3>
<p><em>EDIT: Jake McArthur has pointed out that my characterization of traversing data structures in FP and conclusion in this paragraph are probably incorrect.</em></p>
<p>I chose to use a special &#8220;bucket&#8221; constructor to hold the tails of keys in their original list form when the tail was unique. For example, when inserting the key &#8220;washing&#8221; into a Trie containing only the key &#8220;waste&#8221;, the &#8220;was&#8221; would overlap and each element <code>'w' --> 'a' --> 's'</code> would reside in its own <code>Trie</code> constructor, but both the &#8220;te&#8221; and &#8220;hing&#8221; would be stored as intact lists in separate <code>ValBucket</code> constructors, since there is no need to touch them yet. I thought extending this idea and implementing a <a href="http://en.wikipedia.org/wiki/Radix_tree">radix tree</a> might be even more efficient.</p>
<p>But this doesn&#8217;t make much sense in FP. Imagine the list type <code>[]</code> was defined in a more verbose way as follows:</p>
<p><blockquote class="vimblock"><br>
<span class="Type">data</span>&nbsp;List&nbsp;a&nbsp;<span class="Statement">=</span>&nbsp;L&nbsp;a&nbsp;(List&nbsp;a)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="Statement">|</span>&nbsp;X<br>
<br></blockquote></p>
<p>In our Trie structure, a sequence of key elements which have no branches (which would be reduced to a bucket in the radix tree) might look something like this:</p>
<blockquote><p>Key &#8216;w&#8217; (Key &#8216;a&#8217; (Key &#8216;s&#8217; (Branches &#8230; )))</p></blockquote>
<p>&#8230;whereas the same situation in an attempted radix tree might look like:</p>
<blockquote><p>KeyBucket (L &#8216;w&#8217; (L &#8216;a&#8217; (L &#8216;s&#8217; X))) (Branches &#8230;)</p></blockquote>
<p>&#8230;which is essentially identical. And since in functional programming when we &#8220;traverse&#8221; a data structure (e.g. to compare two keys) we are actually destroying it and rebuilding it as we go, it stands to reason that it should be just as easy to rebuild our structure with a constructor of a different name (<code>Key</code> vs. <code>L</code>).</p>
<p>Storing the tail ends of keys when we can makes sense because we avoid destroying the structure until we have to inspect it.</p>
<h4>Summary of Tests:</h4>
<p>These are not scientific, but I include them to give you an idea of the performance differences I was seeing. Each test was compiled with <code>ghc --make -prof -auto-all -O2 -funbox-strict-fields</code> and run several times, and the results were averaged.</p>
<p>The first test <code>frequency.hs</code> is described in my last post. The second test <code>dictLookup.hs</code> reads an alphabetized list of words and builds a dictionary Trie from the word to the word&#8217;s line number. It then reads a random list of words from a file and prints <code>(word, line_number)</code> tuples.</p>
<p>Tests of <code>frequency.hs</code>:</p>
<p>
<blockquote>
<TABLE cellpadding="2">
  <TR>
    <th></th>
    <TH>MagnusTrie</TH>
    <TH>Trie (with MyMap)</TH>
    <th>Data.Map</th>
    <TH>Trie (with Data.Map)</TH>
  </TR>
  <TR>
    <TD>ticks:</TD>
    <td>21</td>
    <TD>29</TD>
    <TD>39</TD>
    <TD>58</TD>
  </TR>
    <TR>
    <TD>total alloc. (MB):  </TD>
    <td>100</td>
    <TD>133</TD>
    <TD>101</TD>
    <TD>235</TD>
  </TR>
</TABLE>
</blockquote>
</p>
<p>Tests of <code>dictLookup.hs</code>:<br />

<blockquote>
<TABLE cellpadding="2">
  <TR>
    <th></th>
    <TH>MagnusTrie</TH>
    <TH>Trie (with MyMap)</TH>
    <th>Data.Map</th>
    <TH>Trie (with Data.Map)</TH>
  </TR>
  <TR>
    <TD>ticks:</TD>
    <td>17</td>
    <TD>20</TD>
    <TD>18</TD>
    <TD>46</TD>
  </TR>
    <TR>
    <TD>total alloc. (MB):  </TD>
    <td>99</td>
    <TD>133</TD>
    <TD>103</TD>
    <TD>215</TD>
  </TR>
</TABLE>
</blockquote>
</p>
<p>The whole set of tests and files can be downloaded with:</p>
<blockquote><p>$ darcs get http://coder.bsimmons.name/code/Trie/</p></blockquote>
<p>Unpack the <code>texts.tar.gz</code> file before running any of the code. Thanks for the patience with the long post. Let me know your thoughts.</p>
<p><strong>UPDATE:</strong><br />
I was finally able to test wren ng thornton&#8217;s <a href="http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-trie">bytestring-trie</a> library with one of my tests. I couldn&#8217;t profile the code (GHC complained the libs were missing) but I ran <code>dictLookup.hs</code> with <a href="http://hackage.haskell.org/cgi-bin/hackage-scripts/package/benchpress">BenchPress</a>, and the version modified to use bytestring-trie looked about twice as fast.</p>
]]></content:encoded>
			<wfw:commentRss>http://coder.bsimmons.name/blog/2009/04/initial-tests-of-tries-follow-up/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Some initial tests of Tries</title>
		<link>http://coder.bsimmons.name/blog/2009/04/some-initial-tests-of-tries/</link>
		<comments>http://coder.bsimmons.name/blog/2009/04/some-initial-tests-of-tries/#comments</comments>
		<pubDate>Thu, 16 Apr 2009 00:19:58 +0000</pubDate>
		<dc:creator>jberryman</dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[Map]]></category>
		<category><![CDATA[Tree]]></category>

		<guid isPermaLink="false">http://coder.bsimmons.name/blog/?p=96</guid>
		<description><![CDATA[<p>I have been interested in writing a <a href="http://en.wikipedia.org/wiki/Trie">Trie</a> implementation to see how its performance compares to using Data.Map for storing dictionary-like data. I wrote a minimal implementation of Tries which exports <code>insert</code>, <code>insertWith</code>, and <code>lookup</code> definitions and can be used&#8230; <a href="http://coder.bsimmons.name/blog/2009/04/some-initial-tests-of-tries/" class="read_more">   [ R E A D &#124; M O R E ]</a></p>]]></description>
			<content:encoded><![CDATA[<p>I have been interested in writing a <a href="http://en.wikipedia.org/wiki/Trie">Trie</a> implementation to see how its performance compares to using Data.Map for storing dictionary-like data. I wrote a minimal implementation of Tries which exports <code>insert</code>, <code>insertWith</code>, and <code>lookup</code> definitions and can be used in place of <code>Data.Map</code> for those functions. I tested the implementation using a program <code>frequency.hs</code> which reads a source file and uses <code>insertWith</code> on each word to count the frequency of the words in the file; it then uses <code>lookup</code> to print out the frequencies of a list of arbitrary words.</p>
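<p>The counting workflow described above can be sketched against <code>Data.Map</code> as follows. This is not the actual <code>frequency.hs</code>; the function name <code>wordFreqs</code> and the sample text are assumptions made for illustration, and the real program reads its input from a file.</p>
<pre><code>-- Illustrative sketch of the insertWith / lookup workflow; not frequency.hs itself.
module Main where

import Data.List (foldl')
import qualified Data.Map as M

-- Count how often each word occurs, calling insertWith once per word.
wordFreqs :: String -&gt; M.Map String Int
wordFreqs = foldl' (\m w -&gt; M.insertWith (+) w 1 m) M.empty . words

main :: IO ()
main = do
  let freqs = wordFreqs "the cat saw the other cat and the dog"
  -- Look up a list of arbitrary words; a word that never occurred gives Nothing.
  mapM_ (\w -&gt; print (w, M.lookup w freqs)) ["the", "cat", "bird"]
</code></pre>
<p>A Trie module exporting <code>insertWith</code> and <code>lookup</code> with compatible types could be swapped in by changing the qualified import.</p>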
<h3>The Implementation:</h3>
<p>The Trie uses a <code>Data.Map</code> at each node to provide quick access to each branch; I performed a test with a simple unbalanced binary tree for storing branches, but the performance was slightly worse. </p>
<p>We also store the unused tail of a list-key in a special <code>ValBucket</code> constructor so that we don&#8217;t need to store a bunch of singleton Maps. I was curious if laziness would provide the same benefit automatically (as the remainder of the list should be stored as a thunk until another key with the same prefix comes along, forcing evaluation), but my version without the buckets was pretty significantly slower.</p>
<p>Here is the Trie type:</p>
<p><blockquote class="vimblock"><br>
<span class="Type">data</span>&nbsp;Trie&nbsp;a&nbsp;v&nbsp;<span class="Statement">=</span>&nbsp;Node&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;branches&nbsp;<span class="Statement">::</span>&nbsp;M.Map&nbsp;a&nbsp;(Trie&nbsp;a&nbsp;v)&nbsp;}<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="Statement">|</span>&nbsp;ValNode&nbsp;{&nbsp;branches&nbsp;<span class="Statement">::</span>&nbsp;M.Map&nbsp;a&nbsp;(Trie&nbsp;a&nbsp;v),<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;val&nbsp;<span class="Statement">::</span>&nbsp;v&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp; <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="Statement">|</span>&nbsp;ValBucket&nbsp;{&nbsp;bucket&nbsp;<span class="Statement">::</span>&nbsp;[a],&nbsp;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;val&nbsp;<span class="Statement">::</span>&nbsp;v&nbsp;&nbsp;}<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="Statement">|</span>&nbsp;Val&nbsp;{&nbsp;val&nbsp;<span class="Statement">::</span>&nbsp;v&nbsp;}<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="Statement">|</span>&nbsp;Empty<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span class="Type">deriving</span>&nbsp;(<span class="Type">Show</span>)<br>
<br></blockquote></p>
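<p>To make the roles of the constructors more concrete, here is a hedged sketch of how a lookup over this type might work. It is not the <code>lookup</code> from <code>Trie.hs</code>. The semantics are assumptions: a <code>ValBucket</code> is taken to hold the remaining tail of exactly one key, and <code>ValNode</code> a value for a key that ends at a branching point. The names <code>lookupT</code> and <code>example</code> are made up.</p>
<pre><code>-- Sketch under assumed semantics; the real Trie.hs may differ.
module Main where

import qualified Data.Map as M

data Trie a v = Node      { branches :: M.Map a (Trie a v) }
              | ValNode   { branches :: M.Map a (Trie a v), val :: v }
              | ValBucket { bucket :: [a], val :: v }
              | Val       { val :: v }
              | Empty
              deriving (Show)

lookupT :: Ord a =&gt; [a] -&gt; Trie a v -&gt; Maybe v
lookupT []     (ValNode _ v)  = Just v          -- key ends at a branching node
lookupT []     (Val v)        = Just v          -- key ends at a leaf
lookupT ks     (ValBucket b v)
  | ks == b                   = Just v          -- rest of the key matches the stored tail
lookupT (k:ks) (Node bs)      = M.lookup k bs &gt;&gt;= lookupT ks
lookupT (k:ks) (ValNode bs _) = M.lookup k bs &gt;&gt;= lookupT ks
lookupT _      _              = Nothing

-- A hand-built trie holding "waste" and "washing" (one possible layout).
example :: Trie Char Int
example = path "was" (Node (M.fromList [('t', ValBucket "e" 1), ('h', ValBucket "ing" 2)]))
  where path ks end = foldr (\k t -&gt; Node (M.singleton k t)) end ks

main :: IO ()
main = do
  print (lookupT "waste" example)    -- Just 1
  print (lookupT "washing" example)  -- Just 2
  print (lookupT "wash" example)     -- Nothing
</code></pre>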
<h3>Initial Testing Results</h3>
<p>It is quite likely that there are obvious performance problems with my Trie implementation, and I know that my test script doesn&#8217;t provide a very good look at the data structure (it doesn&#8217;t provide much of a benchmark for <code>lookup</code>, spends an inordinate amount of time preparing the text for insertion, etc.).</p>
<p>I was disappointed with the results which showed no improvement with <code>Trie</code> when compared with the same test using <code>Data.Map</code>. Here is a quick graph comparing execution times of <code>frequency.hs</code> using Map vs Trie as the size of the input text increases:</p>
<p><img alt="" src="http://coder.bsimmons.name/blog/wp-content/uploads/graph_trans.png" title="trie vs. map" class="aligncenter" width="387" height="310" /></p>
<p>(note: tests compiled in GHC using -O1 optimization, and timed with <a href="http://hackage.haskell.org/cgi-bin/hackage-scripts/package/benchpress">Benchpress</a>)</p>
<p>Based on the improvement I saw using the buckets for the string tails I want to see if extending the idea and turning the module into a <a href="http://en.wikipedia.org/wiki/Radix_tree">radix trie</a> will improve performance. </p>
<p>I&#8217;m hoping people will be interested in looking at my code and providing some feedback. You can use:</p>
<blockquote><p> $ darcs get http://coder.bsimmons.name/code/Trie/</p></blockquote>
<p>If you want to download just the Trie module you can <a href="http://coder.bsimmons.name/code/Trie/Trie.hs">get it here</a>.</p>
<p><strong>UPDATE:</strong></p>
<p>Here are the results of a quick test in GHCi comparing the effects of using a Data.Map.Map vs. an Unbalanced Binary Tree vs. simple Lists to store the branches at each node:</p>
<p>
<blockquote>
<TABLE cellpadding="5">
  <TR>
     <TH></th>
     <th></th>
     <th align="left" colspan="3">test run-time (ms) for:</th>
  </tr>
  <TR>
    <TH>Input File Size (kB)</TH>
    <th> | </th>
    <TH>Data.Map</TH>
    <TH>Simple Tree</TH>
    <TH>List</TH>
  </TR>
  <TR>
    <TD align="right">96.7</TD>
    <td></td>
    <TD>618</TD>
    <TD>1015</TD>
    <TD>1998</TD>
  </TR>
    <TR>
    <TD align="right">596.5</TD>
    <td></td>
    <TD>10225</TD>
    <TD>25255</TD>
    <TD>71236</TD>
  </TR>
</TABLE>
</blockquote>
</p>
<p>Data.Map, as expected, is the winner. It&#8217;s possible that with very small Tries either lists or the simple trees would have slightly better performance.</p>
<p>Next on my agenda is to perform the following test: build up a Trie from a reduced English dictionary, then parse a random block of text, printing the definition of each word. I hope the Trie will fare better in this test.</p>
<p><strong>UPDATE 2:</strong><br />
I just noticed one property of the Data.Map library that probably makes it less than optimal for this application: <a href="file:///usr/share/doc/ghc6-doc/libraries/containers/src/Data-Map.html#Map">the Map type</a> is strict in all its constructor fields <em>except</em> the value of the key/value pair (which is where we are storing the recursive Trie paths from a node). I suppose this also means that <code>-funbox-strict-fields</code> has little effect on the structure.</p>
<p>In <code>Trie.hs</code> I replaced the <code>M.insertWith</code> function call (from Data.Map) with the strict <code>insertWith'</code> function and got some improvement in CPU usage.</p>
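<p>The difference between a lazy and a strict <code>insertWith</code> can be sketched as below. This assumes a release of the containers package that ships <code>Data.Map.Strict</code>, which is not what the post used (it used <code>insertWith'</code>). The function names <code>countLazy</code> and <code>countStrict</code> are made up for the example.</p>
<pre><code>-- Sketch assuming a containers release that provides Data.Map.Strict.
module Main where

import Data.List (foldl')
import qualified Data.Map as Lazy
import qualified Data.Map.Strict as Strict

-- Lazy values: each update stores an unevaluated (1 + old) thunk,
-- so repeated keys build a chain that is only collapsed when forced.
countLazy :: String -&gt; Lazy.Map Char Int
countLazy = foldl' (\m c -&gt; Lazy.insertWith (+) c 1 m) Lazy.empty

-- Strict values: the combined value is evaluated before it is stored.
countStrict :: String -&gt; Strict.Map Char Int
countStrict = foldl' (\m c -&gt; Strict.insertWith (+) c 1 m) Strict.empty

main :: IO ()
main = do
  -- Both give the same result; they differ in when the additions happen.
  print (Lazy.toList (countLazy "mississippi"))
  print (Strict.toList (countStrict "mississippi"))
</code></pre>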
<p><strong>UPDATE 3:</strong></p>
<p>You can read some of my conclusions <a href="http://coder.bsimmons.name/blog/2009/04/initial-tests-of-tries-follow-up/">in this post</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://coder.bsimmons.name/blog/2009/04/some-initial-tests-of-tries/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Data.Map Conversion Functionality</title>
		<link>http://coder.bsimmons.name/blog/2009/03/datamap-functions/</link>
		<comments>http://coder.bsimmons.name/blog/2009/03/datamap-functions/#comments</comments>
		<pubDate>Sun, 29 Mar 2009 18:20:55 +0000</pubDate>
		<dc:creator>jberryman</dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[library]]></category>
		<category><![CDATA[Map]]></category>
		<category><![CDATA[short]]></category>

		<guid isPermaLink="false">http://coder.bsimmons.name/blog/?p=10</guid>
		<description><![CDATA[<p>There are three identical functions in <code>Data.Map</code> for flattening a Map to a list of tuples in ascending order:</p>
<blockquote><p><code>assocs = toList = toAscList</code></p></blockquote>
<p>&#8230;but nothing to convert to a descending list.</p>
]]></description>
			<content:encoded><![CDATA[<p>There are three identical functions in <code>Data.Map</code> for flattening a Map to a list of tuples in ascending order:</p>
<blockquote><p><code>assocs = toList = toAscList</code></p></blockquote>
<p>&#8230;but nothing to convert to a descending list.</p>
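<p>A few ways to get a descending list are sketched below. The first two use only functions that sit alongside <code>toAscList</code>. The last assumes a more recent release of the containers package, where <code>toDescList</code> is available.</p>
<pre><code>module Main where

import qualified Data.Map as M

main :: IO ()
main = do
  let m = M.fromList [(1 :: Int, "one"), (2, "two"), (3, "three")]
  -- Reverse the ascending list.
  print (reverse (M.toAscList m))
  -- Left fold in ascending key order, consing as we go, so the result is descending.
  print (M.foldlWithKey (\acc k v -&gt; (k, v) : acc) [] m)
  -- Assumes a containers release that exports toDescList.
  print (M.toDescList m)
</code></pre>
<p>All three lines print <code>[(3,"three"),(2,"two"),(1,"one")]</code>.</p>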
]]></content:encoded>
			<wfw:commentRss>http://coder.bsimmons.name/blog/2009/03/datamap-functions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

