<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ethan Fast &#187; Wordy</title>
	<atom:link href="http://blog.ethanjfast.com/tag/wordy/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.ethanjfast.com</link>
	<description>Lambdas, Hacks, and Fiction</description>
	<lastBuildDate>Fri, 27 Aug 2010 12:50:02 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>Analyzing Word Frequencies with Clojure, Enlive and Incanter</title>
		<link>http://blog.ethanjfast.com/2010/03/analyzing-word-frequencies-with-clojure-enlive-and-incanter/</link>
		<comments>http://blog.ethanjfast.com/2010/03/analyzing-word-frequencies-with-clojure-enlive-and-incanter/#comments</comments>
		<pubDate>Mon, 08 Mar 2010 18:39:48 +0000</pubDate>
		<dc:creator>Ethan</dc:creator>
				<category><![CDATA[Clojure]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Enlive]]></category>
		<category><![CDATA[Incanter]]></category>
		<category><![CDATA[Wordy]]></category>

		<guid isPermaLink="false">http://blog.ethanjfast.com/?p=381</guid>
		<description><![CDATA[I&#8217;ve long been interested in getting a better feel for Incanter, a statistical computing and graphical environment for Clojure. So gifted with the fleeting favors of my muse (otherwise known as free time), I thought I&#8217;d put together a small library &#8212; although it&#8217;s not quite a library, yet &#8212; for analyzing word-use patterns on blogs and webpages. To [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve long been interested in getting a better feel for <a href="http://incanter.org/">Incanter</a>, a statistical computing and graphical environment for Clojure. So gifted with the fleeting favors of my muse (otherwise known as <em>free time</em>), I thought I&#8217;d put together a small library &#8212; although it&#8217;s not quite a library, yet &#8212; for analyzing word-use patterns on blogs and webpages.</p>
<p>To do this, I drew a bit of help from <a href="http://github.com/cgrand/enlive">Enlive</a>, which functions primarily as a templating library but has a few features useful for screen-scraping. This was perhaps a bit of overkill, as I only ended up using one of its functions, <em>html-resource</em>, which takes a URL as input and outputs a nested map/list structure that nicely represents a web page&#8217;s structure.</p>
<p>What I ended up with is <a href="http://github.com/Ejhfast/wordy">wordy</a>, which at the moment can do a simple word-frequency analysis on a given page. That is, it counts how often words are used, filtering (if desired) on word length. In just a bit, I&#8217;ll get into some of the more interesting aspects of coding it up, but first, here is a simple use case.</p>
<p>Running the following in SLIME&#8230;<br />
<code>(graph-words "http://ethanjfast.com" 5 5 1)</code></p>
<p style="text-align: center;"><img class="aligncenter" title="As applied to this blog." src="/images/ethanjfast.com.png" alt="" width="500" /></p>
<p style="text-align: left;">Where the parameters correspond to:</p>
<ul>
<li>ethanjfast.com -&gt; web page to look at</li>
<li>5 -&gt; minimum length (letter count) of word for the first analysis</li>
<li>5 -&gt; minimum length of word for the last analysis</li>
<li>1 -&gt; the amount to increase the minimum word length by between the first and last analysis</li>
</ul>
<p>To make this a bit clearer, consider a different run:<br />
<code>(graph-words "http://ycombinator.posterous.com" 3 10 3)</code></p>
<p style="text-align: center;"><img class="aligncenter" title="Ycombinator Run" src="/images/ycom2.png" alt="" width="500" /></p>
<p style="text-align: left;">Here wordy does three analyses, with minimum word lengths of 3, 6, and 9 respectively. Clearly, I have some work to do, as these graphs look rather pathetic, but it was nice to get Incanter working.</p>
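<p style="text-align: left;">To illustrate how those three numbers turn into analyses, here is a rough sketch in plain Clojure. It is not the code from wordy: <em>min-lengths</em> and <em>word-counts</em> are hypothetical names, the regular expression is a crude stand-in for real tokenizing, and I am assuming the last length is inclusive, which matches the 3, 6, 9 run above.</p>
<pre><code>(require '[clojure.string :as str])

(defn min-lengths
  "Expands first, last, and step into the minimum lengths to analyze.
  Assumes the last length is inclusive."
  [first-len last-len step]
  (range first-len (inc last-len) step))

(defn word-counts
  "Counts how often each word of at least min-len letters occurs in text.
  Words are lower-cased runs of letters and apostrophes (a simplification)."
  [text min-len]
  (frequencies
    (filter #(>= (count %) min-len)
            (re-seq #"[a-z']+" (str/lower-case text)))))

(min-lengths 3 10 3) ; (3 6 9)
(min-lengths 5 5 1)  ; (5)

(word-counts "Lambdas, hacks, and fiction; more lambdas" 5)
;; {"lambdas" 2, "hacks" 1, "fiction" 1}
</code></pre>
<p style="text-align: left;">Running <em>word-counts</em> once per value of <em>min-lengths</em> would give one data set per analysis, ready to be charted.</p>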
<p style="text-align: left;">Now, onto some implementation details. Most of the code is quite simple, so I&#8217;ll just go through a few functions that may have some value to someone learning Clojure. For instance, here is <em>rec-map</em>, a function which recursively traverses the map/list structure returned by <em>html-resource</em>.</p>
<script src="http://gist.github.com/325414.js"></script>
<p>Basically, this function filters out all page content that doesn&#8217;t match specific tags (getting rid of links, CSS, JavaScript, etc.). At first glance, you might wonder why I used <em>trampoline</em> rather than <em>recur</em>. After all, <em>trampoline</em> is meant for mutual recursion between different functions, and it looks very much like <em>rec-map</em> is simply calling itself. Well, the trick is that I am calling <em>trampoline</em> inside the function passed to <em>map</em>, where <em>recur</em> would target that anonymous function instead of <em>rec-map</em>, so <em>recur</em> will fail spectacularly (and in a very confusing manner). So watch out for recursion within anonymous functions!</p>
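<p>To make the shape of such a traversal concrete, here is a minimal sketch. It is not the <em>rec-map</em> from the gist: <em>text-in</em>, <em>keep-tags</em>, and the sample <em>page</em> are hypothetical names, and the only assumption about Enlive is that nodes are maps with <em>:tag</em> and <em>:content</em> keys. The sketch recurses by calling itself by name inside the function passed to <em>mapcat</em>, using neither <em>recur</em> nor <em>trampoline</em>, which should be fine for the shallow nesting of a typical page.</p>
<pre><code>(defn text-in
  "Returns the strings that sit directly inside nodes whose :tag is in
  keep-tags, searching the whole nested structure."
  [keep-tags node]
  (cond
    ;; An element: keep its own strings if its tag is wanted,
    ;; then descend into its child elements.
    (map? node)
    (let [children (:content node)
          own      (if (contains? keep-tags (:tag node))
                     (filter string? children)
                     [])]
      (concat own
              ;; Inside this anonymous fn, recur would refer to the fn
              ;; itself, so the recursive call uses the name text-in.
              (mapcat (fn [child] (text-in keep-tags child))
                      (remove string? children))))

    ;; A list of nodes: search each one.
    (sequential? node)
    (mapcat (fn [child] (text-in keep-tags child)) node)

    ;; Anything else (nil, a bare top-level string) contributes nothing.
    :else []))

;; A made-up page in the assumed node shape.
(def page
  [{:tag :html
    :content [{:tag :head
               :content [{:tag :script :content ["var x;"]}]}
              {:tag :body
               :content [{:tag :p
                          :content ["Hello " {:tag :em :content ["world"]}]}
                         {:tag :a :content ["link"]}]}]}])

(text-in #{:p :em} page) ; ("Hello " "world")
</code></pre>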
<p>Here is another bit of code, where I create the graph with Incanter.</p>
<script src="http://gist.github.com/325432.js"></script>
<p>The :group-by parameter is slightly unintuitive. To use it, you make a new vector of labels, each label corresponding to the element at the same position in the data vector. All data with the same label are then put into the same group (e.g., for data ["You" "Me" "I"] [3 2 4] one might use the label vector [0 1 1] to group &#8220;Me&#8221; and &#8220;I&#8221; together). The rest is fairly self-explanatory, but I&#8217;ll mention one thing that I didn&#8217;t know until this morning: you can&#8217;t nest the #() function shortcut. For instance, the following would not work:</p>
<p><code>(map #(map #(first %1) %1) lst)</code></p>
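<p>One way around it, as a sketch: write the outer function as an explicit <em>fn</em> with a named argument, so that the inner #() is the only shortcut left in the expression. The names <em>firsts-of-each</em>, <em>row</em>, and the sample data are made up for this example.</p>
<pre><code>(defn firsts-of-each
  "For each row (a list of lists), returns the first element of every
  inner list."
  [lst]
  (map (fn [row] (map #(first %) row)) lst))

(firsts-of-each [[[1 2] [3 4]] [[5 6]]]) ; ((1 3) (5))

;; Since #(first %) only wraps first, this shorter form does the same:
(map (fn [row] (map first row)) [[[1 2] [3 4]] [[5 6]]]) ; ((1 3) (5))
</code></pre>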
<p>It&#8217;s rather obvious in retrospect, I know. But I was dumb enough to try it. That&#8217;s all for now; the code is available on <a href="http://github.com/Ejhfast/wordy">GitHub</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ethanjfast.com/2010/03/analyzing-word-frequencies-with-clojure-enlive-and-incanter/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
