Posts

Weeks 7 & 8

Moving on: Stage 2 with TensorFlow

I've spent a lot of time working on stage 1, the legal/illegal instruction classifier, and unfortunately haven't made much progress. I wrote a lot of code, but ran into so many problems (with software as well as hardware) that I wasn't able to get a working classifier. I've decided to cut my losses and move on to stage 2. I've also decided to ditch PyTorch for the more mature TensorFlow. This has proven to be a very good decision: after a few minor hiccups, I can now train networks on a new NVIDIA GPU that I bought. I find TensorFlow much easier to use and more complete than PyTorch, although I'm sure the latter will continue to get better as it develops. In any case, I've been hard at work writing code for stage 2 and have found far more success with TensorFlow. It's going to be tough to get enough done in the five weeks remaining to pull off a decent investigative report, but I'm fairly confident I can manage it if I...

Weeks 5 & 6

Technical Issues Continued

I spent the end of week 5 and the beginning of week 6 fighting a few more problems using PyTorch, Skorch and scikit-learn together, partly due to a lack of clear documentation about the data certain classes and functions expect. After a lot of tweaking I managed to get a model running, but it wasn't learning anything -- the loss function was returning either infinity or zero. After further tweaking, which only produced more errors, I decided to scrap Skorch and scikit-learn and write the model selection code myself. After two weeks spent debugging, I managed to get a concurrent grid search/k-fold cross-validation implementation up and running in an evening. (I wasn't sure how to feel about that -- annoyed that I had wasted so much time when the solution was staring me in the face, or glad that I finally had something that worked and could stop worrying about problems in other people's code.) I'm still making improvements -- at the moment there is a...
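For readers unfamiliar with the idea, here is a minimal sketch of a concurrent grid search with k-fold cross-validation, using only the Python standard library. It is not the code from the project: the function names, the thread pool, and the `toy_score` stand-in for training a network are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, validation) index pairs."""
    # The first (n_samples % k) folds get one extra sample each.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return [
        ([i for j, fold in enumerate(folds) if j != held_out for i in fold],
         folds[held_out])
        for held_out in range(k)
    ]


def cross_validate(train_and_score, params, data, k):
    """Mean validation score of one hyperparameter setting over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        train = [data[i] for i in train_idx]
        val = [data[i] for i in val_idx]
        scores.append(train_and_score(params, train, val))
    return mean(scores)


def grid_search(train_and_score, grid, data, k=5, max_workers=4):
    """Score every combination in `grid` concurrently; return the best one."""
    names = sorted(grid)
    candidates = [dict(zip(names, values))
                  for values in product(*(grid[name] for name in names))]
    # One task per hyperparameter combination (threads are an assumption;
    # a process pool may suit CPU-bound training better).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(
            lambda params: cross_validate(train_and_score, params, data, k),
            candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_score(params, train, val):
    # Hypothetical stand-in for training a network: nothing is fitted here,
    # it only scores a fixed slope on the validation fold (higher is better).
    return -mean((params["slope"] * x - y) ** 2 for x, y in val)


data = [(x, 3 * x) for x in range(10)]
best_params, best_score = grid_search(toy_score, {"slope": [1, 2, 3, 4]}, data)
```

On this toy data the search picks `{"slope": 3}`, since that slope reproduces every validation point exactly. In a real run, `toy_score` would be replaced by a function that trains a model on `train` and returns a validation metric.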

Weeks 2-4

Technical Issues!

So, first, I've completely deviated from my planned posting schedule due to a series of somewhat embarrassing technical problems. These ranged from difficulties building frameworks due to dependency hell, to realising that they either won't do what I want or lack documentation and provide examples which don't compile, to Windows kernel bugs and hosing my Linux distro by switching from the Stable branch to Testing.

Programming Languages Continued

Near the end of the second week, I still hadn't fully settled on the language. After speaking with Martin about my difficulties with F# and JavaScript, and my general dislike of Python, I decided, with Martin's encouragement, to try Haskell. I learned Haskell during my undergraduate degree and found it a very expressive language with a reasonable amount of library support at a very high (i.e. abstract) level. Enough to write my undergraduate dissertation project, an ...

Week 1

Programming Languages

This is just a short update about what I've done so far. In my plan the first week was allocated to general planning and research. I looked into machine learning frameworks for the various languages I considered, focusing on Python, JavaScript, F# and Haskell. I've decided against F# and Haskell because they would add unnecessary complexity: I want to keep the number of new concepts I have to learn in this project to the bare essentials. LSTM networks are new to me, as is malware analysis in general, and I also need to learn a deep learning framework. F# is a new language, and I can't write Haskell as intuitively as I can the imperative languages I know. So I'm now deciding between JavaScript and Python. I'm currently learning TensorFlow.js, which provides a native code LSTM implementation that should be very fast (it can run on GPU). There is also TensorFlow for Python, but I would rather write JS, all else being equal. The TensorFlow website says ...

Plan

Time, Time, Time

With the deadline falling on the 4th September 2018, there are roughly 12 weeks to complete the project from the time of writing. Treating this as a full-time job gives 40 * 12 = 480 hours. The dissertation is 12,000 words; if it takes an hour to write and edit 100 words, then I will spend 120 hours (3 weeks) writing the dissertation, leaving 360 hours (9 weeks) to design & implement the software. I also won't count writing this blog against my 480 hours, as I can do it on the weekends. I'm going to spend the first week planning, reading, learning and prototyping. After that I will have 8 weeks of development time and three weeks of QA and dissertation-writing.

Project Goal and Breakdown

Goal of the project

Predict whether a given executable file contains malware.

Steps

1. Predict whether a binary contains a valid program
2. Disassemble a binary and produce valid disassembly (this is the minimum viable product)
3. Predict whether the disassembled prog...
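As a quick sanity check, the time budget from this post can be written out as a few lines of Python. The constant names are mine, and the 40-hour week is the "full-time job" assumption stated in the post.

```python
HOURS_PER_WEEK = 40          # full-time working week
WEEKS_AVAILABLE = 12         # roughly, until the 4th September 2018 deadline
DISSERTATION_WORDS = 12_000
WORDS_PER_HOUR = 100         # writing and editing rate assumed in the post

total_hours = HOURS_PER_WEEK * WEEKS_AVAILABLE          # 480
writing_hours = DISSERTATION_WORDS // WORDS_PER_HOUR    # 120
software_hours = total_hours - writing_hours            # 360

writing_weeks = writing_hours // HOURS_PER_WEEK         # 3
software_weeks = software_hours // HOURS_PER_WEEK       # 9

# Schedule: 1 week of planning + 8 weeks of development fill the software
# budget, and the final 3 weeks cover QA and dissertation-writing.
PLANNING_WEEKS, DEVELOPMENT_WEEKS, QA_AND_WRITING_WEEKS = 1, 8, 3
schedule_weeks = PLANNING_WEEKS + DEVELOPMENT_WEEKS + QA_AND_WRITING_WEEKS

print(total_hours, writing_hours, software_hours, schedule_weeks)
```

The schedule adds up: the planning and development weeks together match the 9 weeks budgeted for software, and the three phases total the 12 weeks available.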

Introduction

Hi, I'm Chris, a postgrad student of artificial intelligence (officially Intelligent and Adaptive Systems) at the University of Sussex. This blog is going to serve as my notes for my dissertation project, which explores automatic classification of files as malware or non-malware using machine learning models. I'm interested in many computer science topics, including systems programming, security and, of course, AI, so this project is my attempt to kill two (or three) birds with one stone. This post will give an overview of what I plan to do with this project. Just a quick note on terminology: malware is what the average person would call a (computer) virus. Technically a virus is just one type of malware (one which is capable of spreading by itself), so I will generally use the word "malware" instead, which is any software designed to cause harm to a computer. If I do use the word "virus", assume it's used interchangeably with "malware...