SpiderLabs Blog

Machine Learning Update 1

Written by Ryan Merritt | May 20, 2013 1:27:00 PM

It has been almost exactly a month since my last post regarding the new project I am working on, so I figure it is time for an update. First off, I was excited and encouraged by the responses I received via Twitter after my initial post. One response in particular mentioned the related work that @silviocesare is doing with the SimSeer project, as well as a book he co-authored, "Software Similarity and Classification". Both appear to be excellent resources, and I plan to check them out in more detail as time allows.

It seems as if the stars were in alignment, because just after I announced the project, a little birdy (@spookerlabs) let me know that a free Machine Learning course from Stanford University was being presented through Coursera. Did I mention that it is free? I signed up for it, and we are about four weeks into the 10-week course. I have to say that I am pretty impressed with how the course is laid out and presented. We wasted no time jumping right into the math, but that shouldn't come as a surprise to anyone. The course mainly applies linear algebra, but an understanding of at least first-year calculus is a definite bonus. For example, here is a slide from the first week of the class covering the application of a Linear Regression Model and the Gradient Descent Algorithm, which would be used to help predict something like house pricing based on known square footage.
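
For anyone who would rather see the idea in code than in notation, here is a minimal sketch in Python. It is not the course's Octave code, and the function name, learning rate, iteration count, and data points are all made up for illustration. It fits a straight line, price = theta0 + theta1 * size, by repeatedly nudging both parameters against the gradient of the mean squared error:

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    """Fit y ~ theta0 + theta1 * x with batch gradient descent.

    alpha (learning rate) and iterations are illustrative defaults,
    chosen by hand for the toy data below, not values from the course.
    """
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction error for every training example
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost (1 / 2m) * sum(error^2)
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Made-up data: size in thousands of square feet, price in thousands of dollars.
# The prices lie exactly on the line 60 + 140 * size, so the fit is easy to check.
sizes = [1.0, 1.5, 2.0, 2.5, 3.0]
prices = [200.0, 270.0, 340.0, 410.0, 480.0]

theta0, theta1 = gradient_descent(sizes, prices)
predicted = theta0 + theta1 * 1.8  # estimate for a 1,800 sq ft house
print(f"theta0={theta0:.2f}, theta1={theta1:.2f}, prediction={predicted:.1f}")
```

On this toy data the parameters settle near 60 and 140, so the estimate for the 1,800 square foot house comes out at about 312 (thousand dollars). I scaled the sizes to thousands of square feet on purpose: with raw square footage the same learning rate would overshoot and diverge, which is why feature scaling tends to come up alongside gradient descent.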
Admittedly, it has been a while since I've applied math concepts like this. In my head I was secretly hoping for something more along the lines of this:


All joking aside though, if you are a self-paced learner this is a great resource that is being made available for free. It is most definitely worth checking out what they have to offer.

The course uses the software package Octave (similar to MATLAB) to program solutions to the exercises. Octave gives you command-line input and some pretty impressive graphics capabilities for modeling your data.

Additionally, I picked up the book "Machine Learning for Hackers". I haven't gotten too deep into it yet, but the authors use the language R to solve their problems. R is a free, open-source language similar to S. I am looking forward to comparing what I learn in the online course with what I am able to extract from the book. I think it is typically a good idea not to get all of your knowledge from a single source.

In general, tools and languages such as R and Octave would likely be used to rapidly prototype your machine learning theories against your data sets. They are great for visualizing and manipulating your data and for quickly testing your hypotheses. However, once you are satisfied with the output of your learning algorithm, you will likely want to implement the solution in a more efficient language such as C or Java for use in your production environments. I don't know at this point where to draw that particular line in the sand, but it is something to keep in mind as you work towards your goals.

I am trying to balance this bootstrapping type of learning with my normal daily duties here at work, and there have already been times when I've had to put this stuff down while dealing with the influx of "real work". Still, I'm quite excited about the things I'm picking up already, and I'm itching to get my hands dirty. My hope is that by the end of the course I will know enough to be dangerous and can start publishing some of my initial results right here. Stay tuned...