SpiderLabs Blog

Machine Learning Update 1

It has been almost exactly a month since my last post regarding the new project I am working on, so I figure it is time for an update. First off, I was excited and encouraged by the responses I received via Twitter after my initial posting. One response in particular mentioned the related work that @silviocesare is doing with the SimSeer project, as well as a book he co-authored, "Software Similarity and Classification". Both appear to be excellent resources, and I plan to check them out in more detail as time allows.

It seems as if the stars were in alignment, because just after I announced the project, a little birdy (@spookerlabs) let me know that a free Machine Learning course from Stanford University was being presented through Coursera. Did I mention that it is free? I signed up for it, and we are about four weeks into the 10-week course. I have to say that I am pretty impressed with how the course is laid out and presented. We wasted no time jumping right into the math, but that shouldn't really come as a surprise to anyone. The course mainly applies Linear Algebra, but an understanding of at least first-year Calculus is a definite bonus. For example, here is a slide from the first week of the class covering the application of a Linear Regression Model and the Gradient Descent Algorithm, which would be used to help predict something like house pricing based on known square footage:

[Slide: Linear Regression Model and Gradient Descent Algorithm]
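
For readers who prefer code to formulas, here is a minimal sketch of the same idea: fitting a straight line h(x) = theta0 + theta1 * x with batch gradient descent. This is not taken from the course materials. It is written in Python rather than the Octave the course uses, the function names are my own, and the house data is made up purely for illustration (square footage in thousands, price in thousands of dollars).

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent.

    Minimizes the squared-error cost J = (1 / 2m) * sum((h(x) - y)^2).
    alpha (learning rate) and iterations are illustrative defaults,
    chosen to suit the toy data below, not general recommendations.
    """
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction error for every training example.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to each parameter.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed before either step.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


def predict(theta0, theta1, x):
    """Predicted price for a house of size x."""
    return theta0 + theta1 * x


if __name__ == "__main__":
    # Hypothetical, noise-free data: price = 50 + 100 * size.
    sizes = [1.0, 1.5, 2.0, 2.5, 3.0]             # thousands of square feet
    prices = [150.0, 200.0, 250.0, 300.0, 350.0]  # thousands of dollars
    t0, t1 = gradient_descent(sizes, prices)
    print("intercept:", round(t0, 3), "slope:", round(t1, 3))
    print("predicted price for 2,200 sq ft:", round(predict(t0, t1, 2.2), 3))
```

Because the toy data lies exactly on a line, the fit should recover an intercept near 50 and a slope near 100. Real data would be noisy, and unscaled features (raw square footage in the thousands, say) would typically need feature scaling or a much smaller learning rate to converge.
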
Admittedly, it has been a while since I've applied math concepts like this. In my head, I was secretly hoping for something more along the lines of this:


All joking aside though, if you are a self-paced learner, this is a great resource that is being made available for free. It is most definitely worth checking out what they have to offer.

The course uses the software package Octave (similar to Matlab) to program solutions to exercises. The Octave language gives you command line input and some pretty impressive graphics manipulation capabilities to model your data with.

Additionally, I picked up the book "Machine Learning for Hackers". I haven't gotten too deep into it yet, but the authors are using the language R to solve their problems. R is a free, open-source tool similar to S. I am looking forward to comparing what I learn in the online course with what I am able to extract from the book. I think it is typically a good idea not to get all of your knowledge from a single source.

In general, tools and languages such as R and Octave would likely be used to rapidly prototype your machine learning theories against your data sets. They are great for visualizing and manipulating your data sets, and for quickly testing your hypotheses. However, once you are satisfied with the output of your learning algorithm, you will likely want to implement the solution in a more efficient language such as C or Java for use in your production environments. I don't know at this point where to draw that particular line in the sand, but it is something to keep in mind as you work towards your goals.

I am trying to balance this bootstrapping type of learning with my normal daily duties here at work, and there have already been times when I've had to put this stuff down while dealing with the influx of "real work". Still, I'm quite excited about the things I'm picking up already, and I'm itching to get my hands dirty. My hope is that by the end of the course I will know enough to be dangerous and can start publishing some of my initial results right here. Stay tuned...