I am @Unignorant
Slowly Programming in R
Recently, I coded up a cross validation function in R, and things were moving rather less quickly than I would have liked. (The purpose of c.v. is to assess how well one’s statistical analysis will generalize to an independent data set.) Anyhow, I was implementing 10-fold cross validation, and with a dataset containing around 100,000 observations, my code was taking hours to run. This was, of course, ridiculous.
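For readers who have not met it before, here is a minimal sketch of what 10-fold cross validation looks like in R. It is not the code from this post: the toy data frame `d`, the linear model fitted with `lm`, and the mean-squared-error score are all assumptions chosen only to keep the example small.

```r
# Hypothetical toy data: y depends linearly on x plus noise.
set.seed(42)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n)

k <- 10
# Randomly assign each row to one of k equally sized folds.
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold.
fold_mse <- sapply(1:k, function(f) {
  train <- d[fold != f, ]
  test  <- d[fold == f, ]
  fit   <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

# Average held-out error: the cross-validated estimate.
cv_mse <- mean(fold_mse)
```

Each observation is held out exactly once, so `cv_mse` estimates how the model would do on data it has not seen.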
Now, I doubt that it will come as a surprise, but I am rather a newbie at this whole R thing, and as I later found out, loops in R should be avoided at all costs. After hacking around with my code, I found that its critical path looked something like this:
total <- 0
for (i in 1:nrow(dataset)) {
  # dot product of row i's first 25 columns with the coefficient vector
  total <- total + sum(dataset[i, 1:25] * coef)
}
Now this is a very simple loop, and it seemed to me somewhat less than obvious that it would beget a significant performance bottleneck. Ever so naturally, then, it did.
Ironically, the solution here is to use code more along the lines of the map-reduce paradigm, something I would have loved to do in the first place, had I not been overcome by the cryptic nature of R’s documentation. After all, my favorite languages are all variants of Lisp, and I am no stranger to functional programming. After some digging, I stumbled across apply, which works more or less like map in Scheme or Clojure. So I tried:
my_sum <- function(x) { sum(x[1:25] * coef) }
# the second argument (MARGIN = 1) applies my_sum to each row of dataset
sum(apply(dataset, 1, my_sum))
In addition to being more elegant, this is much, much faster. What was taking hours now takes tens of seconds. Apparently, R has a fast backend implementation for this sort of thing. So, this post is dedicated as a warning to my fellow inexperienced users: avoid iterative loops in R!
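As a rough sketch for anyone who wants to try this at home, the three approaches can be compared on synthetic data. The data here is made up (1,000 rows of random numbers standing in for the real dataset), and the matrix-product version at the end is an additional alternative, not something from the original code. One plausible reason apply helps is that it converts a data frame to a matrix once up front, rather than indexing the data frame row by row.

```r
# Synthetic stand-in for the real data (assumption: 25 numeric columns).
set.seed(1)
dataset <- as.data.frame(matrix(rnorm(1000 * 25), ncol = 25))
coef <- rnorm(25)

# Explicit loop: indexes the data frame one row at a time.
loop_total <- 0
for (i in 1:nrow(dataset)) {
  loop_total <- loop_total + sum(dataset[i, 1:25] * coef)
}

# apply: MARGIN = 1 calls my_sum once per row.
my_sum <- function(x) sum(x[1:25] * coef)
apply_total <- sum(apply(dataset, 1, my_sum))

# Fully vectorized: a single matrix-vector product, then one sum.
matrix_total <- sum(as.matrix(dataset[, 1:25]) %*% coef)

# All three should agree up to floating-point tolerance.
stopifnot(isTRUE(all.equal(loop_total, apply_total)),
          isTRUE(all.equal(loop_total, matrix_total)))
```

Wrapping each version in system.time( ... ) gives a feel for the difference; the exact numbers will depend on the machine and the size of the data.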