Analyzing Word Frequencies with Clojure, Enlive and Incanter

I’ve long been interested in getting a better feel for Incanter, a statistical computing and graphical environment for Clojure. So gifted with the fleeting favors of my muse (otherwise known as free time), I thought I’d put together a small library — although it’s not quite a library, yet — for analyzing word-use patterns on blogs and webpages.

To do this, I drew a bit of help from Enlive, which functions primarily as a templating library, but has a few features useful for screen-scraping. This was perhaps a bit of overkill, as I only ended up using one of its functions, html-resource, which takes a URL as input, and outputs a nested map/list structure that nicely represents a web page’s structure.

What I ended up with is wordy, which at the moment can do a simple word-count frequency analysis on a given page. That is, it counts how often words are used, filtering (if desired) on word length. In just a bit, I’ll get into some of the more interesting aspects of coding it up, but first, here is a simple use case.

Running the following in slime…
(graph-words "http://ethanjfast.com" 5 5 1)

Where the parameters correspond to:

  • ethanjfast.com -> web page to look at
  • 5 -> minimum length (letter count) of word for first analysis
  • 5 -> minimum length of word for last analysis
  • 1 -> the amount of word length to increment by between the first and last analysis

To make this a bit clearer, consider a different run:
(graph-words "http://ycombinator.posterous.com" 3 10 3)

Here wordy does three analyses, with minimum word lengths of 3, 6, and 9 respectively. Clearly, I have some work to do insofar as these graphs look rather pathetic, but it was nice to get Incanter working.
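For a feel of what each analysis involves, here is a minimal sketch of a length-filtered word count and of the minimum-length sweep. It is not wordy’s actual code: the names word-frequencies and min-lengths, the crude regex tokenizer, and the assumption that the last length is inclusive are all mine.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers for illustration; not taken from wordy.
(defn word-frequencies
  "Counts how often each word of at least min-length letters occurs in text."
  [text min-length]
  (->> (re-seq #"[a-z]+" (str/lower-case text)) ; crude tokenizer: runs of ASCII letters
       (filter #(>= (count %) min-length))       ; drop words that are too short
       frequencies))                             ; map of word -> count

(defn min-lengths
  "Minimum word lengths to analyze, assuming last-len is inclusive."
  [first-len last-len step]
  (range first-len (inc last-len) step))

(min-lengths 3 10 3)                             ; => (3 6 9)
(word-frequencies "The cat and the other cat" 4) ; => {"other" 1}
```

Mapping word-frequencies over (min-lengths 3 10 3) would give one frequency map per graph, which is roughly the shape of the second run above.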

Now, onto some implementation details. Most of the code is quite simple, so I’ll just go through a few functions that may have some value to someone learning Clojure. For instance, here is rec-map, a function which recursively traverses the map/list structure returned by html-resource.

Basically, this function filters out all page content that doesn’t match specific tags (getting rid of links, CSS, JavaScript, etc.). But at first glance, you might wonder why I used trampoline rather than recur. After all, trampoline is meant for mutual recursion between different functions, and it looks very much like rec-map is calling itself. Well, the trick is that I am calling trampoline inside the anonymous function passed to map. A recur in that position targets the anonymous function, not rec-map, so it will fail spectacularly (and in a very confusing manner). So watch out for recursion within anonymous functions!
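To make the shape of the problem concrete, here is a minimal sketch of a recursive walk over the kind of nested map/list structure html-resource returns (maps with :tag and :content keys). It is a stand-in, not the real rec-map: the name collect-text, the allowed-tags whitelist, and the sample data are assumptions for illustration.

```clojure
;; Hypothetical whitelist of tags whose text we want to keep.
(def allowed-tags #{:p :h1 :h2 :li})

(defn collect-text
  "Returns a flat seq of the strings found beneath any allowed tag."
  ([node] (collect-text node false))
  ([node keep?]
   (cond
     (string? node)     (if keep? [node] [])
     (map? node)        (let [keep? (or keep? (contains? allowed-tags (:tag node)))]
                          ;; Recursion happens inside an anonymous fn, so recur
                          ;; is not an option here; call the function by name.
                          (mapcat #(collect-text % keep?) (:content node)))
     (sequential? node) (mapcat #(collect-text % keep?) node)
     :else              [])))

;; Hand-written sample in the same shape as a parsed page.
(def sample-page
  [{:tag :html
    :content [{:tag :head :content [{:tag :style :content ["body {}"]}]}
              {:tag :body
               :content [{:tag :p :content ["Hello " {:tag :b :content ["world"]}]}
                         {:tag :script :content ["alert(1)"]}]}]}])

(collect-text sample-page) ; => ("Hello " "world")
```

Writing (trampoline collect-text % keep?) in place of the direct call behaves the same here, since trampoline simply returns the first result that is not a function. Note that neither form saves stack: the depth of the recursion still tracks the depth of the page’s nesting.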

Here is another bit of code, where I create the graph with Incanter.

The :group-by parameter is slightly unintuitive. To use it, you make a new vector of labels, each label mapping to a counterpart in the data vector. All data with the same label are then put into the same group (e.g., for data ["You" "Me" "I"] [3 2 4] one might use the label vector [0 1 1] to group “Me” and “I” together). The rest is fairly self-explanatory, but I’ll mention one thing that I didn’t know until this morning. You can’t nest the #() function shortcut. For instance, the following would not work:

(map #(map #(first %1) %1) lst)
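One way around this, sketched below with made-up sample data, is to spell out the outer function with fn (or to pass first directly so that only one #() is needed). The names lst, firsts-a, and firsts-b are just for illustration.

```clojure
;; Sample data: a list of lists of pairs.
(def lst [[[1 2] [3 4]] [[5 6]]])

;; Outer function written with fn, inner one with the #() shortcut.
(def firsts-a (map (fn [inner] (map #(first %) inner)) lst))

;; Equivalent: pass first itself, leaving a single #().
(def firsts-b (map #(map first %) lst))

firsts-a ; => ((1 3) (5))
firsts-b ; => ((1 3) (5))
```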

It’s rather obvious in retrospect, I know. But I was dumb enough to try it. That’s all for now, and the code is available on github.
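As a footnote to the :group-by labels described earlier, the pairing of labels with data can be pictured in plain Clojure. This sketch only illustrates the idea of a parallel label vector; it says nothing about how Incanter implements grouping, and the var names are mine.

```clojure
(def words  ["You" "Me" "I"])
(def counts [3 2 4])
(def labels [0 1 1])   ; one label per data point, matched by position

;; Zip the three vectors together, then bucket the rows by label.
(def grouped
  (->> (map vector labels words counts)
       (group-by first)))

grouped
;; => {0 [[0 "You" 3]], 1 [[1 "Me" 2] [1 "I" 4]]}
```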


