Plan

Time, Time, Time

With the deadline falling on 4th September 2018, there are roughly 12 weeks to complete the project from the time of writing. Treating this as a full-time job gives 40 * 12 = 480 hours. The dissertation is 12,000 words; if it takes an hour to write and edit 100 words, then I will spend 120 hours (3 weeks) writing the dissertation, leaving 360 hours (9 weeks) to design & implement the software. I won't count writing this blog against my 480 hours, as I can do it on the weekends.

I'm going to spend the first week planning, reading, learning and prototyping. After that I will have 8 weeks of development time and three weeks of QA and dissertation-writing.
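The arithmetic behind this schedule can be checked with a few lines of Python. This is only a sketch of the budget described above; the constant names are mine, and the figures are the estimates given in the text (40-hour weeks, 100 words per hour).

```python
# Rough time budget, using the estimates from the text above.
HOURS_PER_WEEK = 40
TOTAL_WEEKS = 12
DISSERTATION_WORDS = 12_000
WORDS_PER_HOUR = 100      # assumed writing + editing speed
PLANNING_WEEKS = 1        # the initial "reading week"

total_hours = HOURS_PER_WEEK * TOTAL_WEEKS                 # whole project
writing_hours = DISSERTATION_WORDS // WORDS_PER_HOUR       # dissertation
writing_weeks = writing_hours // HOURS_PER_WEEK
software_hours = total_hours - writing_hours               # design & implementation
software_weeks = software_hours // HOURS_PER_WEEK
development_weeks = software_weeks - PLANNING_WEEKS        # after the reading week

print(f"total: {total_hours} h, writing: {writing_hours} h ({writing_weeks} weeks)")
print(f"software: {software_hours} h ({software_weeks} weeks), "
      f"of which development: {development_weeks} weeks")
```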

Project Goal and Breakdown

Goal of the project

Predict whether a given executable file contains malware.

Steps

1. Predict whether a binary contains a valid program
2. Disassemble a binary and produce valid disassembly (this is the minimum viable product)
3. Predict whether the disassembled program is malware

Methodology

I will use a loose interpretation of agile methodology to design & implement the project. I say loose because I won't have strictly-timed or structured sprints. There are three steps to achieving the project goal above, and each one will probably require a different amount of time. The first step is quite simple in theory, and could probably be done in a week, except that I will be getting to grips with whatever language, libraries and development environment I choose to use. (This will be somewhat mitigated by the "reading week".) Steps 2 & 3 will definitely need longer, either two or three weeks each. There are 8 weeks of development time, so that roughly works out to two weeks for step 1 and three weeks each for steps 2 and 3.

In each sprint there will be time for research, development and testing, although much of the testing will be automated.

Choice of Language

Choosing the right programming language from the start is vital, because it is difficult to port a program to a different language, and the difficulty will increase the further into the project I get. So this needs to be done right.

Here are some relevant criteria for a good choice of programming language:

Performance: Training machine learning models can be very slow, so a language which compiles to native code or achieves similar performance would be handy. Parallelisation is especially useful, since a grid search can train a different model on each processor. (humblebrag: My development machine has eight cores.)
Library support: With only 9 weeks development time, I don't really want to design, implement and (most importantly) debug implementations of preprocessing algorithms, model selection algorithms and machine learning algorithms myself.
Development time: Some languages are easier/quicker to code in than others. C++ and Java for example tend to involve a lot of boilerplate. Python has tons of built-in features and you can hack together a prototype of basically anything in a day or two.
Testing: I need to be able to unit test the project without too much hassle. This shouldn't be a big problem as most languages have good frameworks.
GUI framework: I'd like to provide a basic graphical interface, so the language should have GUI toolkits available. If worst comes to worst though, it's possible to do the interface in another language and either use interop to call functions directly or pipe & parse its output stream.
IDE/tool support: I'm used to using the POSIX command-line as an IDE, but for certain tasks (debugging in particular) a real IDE such as Visual Studio can be a big plus.
Portability: Not a primary concern but the model should definitely run on computers other than mine.
Other: There are some languages I just don't like, for no good reason. This is illogical but needs to be taken into consideration.
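To make the parallel grid search mentioned under Performance concrete, here is a minimal sketch in Python using only the standard library. It is illustrative, not a decision about the project language: `score` is a hypothetical stand-in for "train a model with these parameters and return its validation score", and the parameter names are made up.

```python
from itertools import product


def score(params):
    """Hypothetical stand-in for training a model and scoring it.

    This toy objective peaks at learning_rate=0.1, depth=4.
    """
    return -((params["learning_rate"] - 0.1) ** 2) - ((params["depth"] - 4) ** 2) / 100


def grid_search(grid, evaluate, executor):
    """Evaluate every parameter combination, one executor task per combination.

    `executor` is any concurrent.futures executor; returns (best_params, best_score).
    """
    names = list(grid)
    combos = [dict(zip(names, values)) for values in product(*grid.values())]
    scores = list(executor.map(evaluate, combos))  # tasks run concurrently
    best_index = max(range(len(combos)), key=scores.__getitem__)
    return combos[best_index], scores[best_index]
```

Passing a `concurrent.futures.ProcessPoolExecutor(max_workers=8)` would spread the combinations over eight cores, provided `evaluate` is a module-level function so it can be pickled. A `ThreadPoolExecutor` is enough for a quick check of the logic.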

Here are the programming languages up for consideration. It probably goes without saying that I'm only considering languages I already know well, given the time constraints.

C++: My "mother tongue". Legendary performance (on single-core), decent built-in libraries since 2011 (plus Boost) and access to 80% of all the software libraries and tools ever written. Cool metaprogramming & other modern features. Downsides include a very clunky module system and tendency to produce Lovecraftian error messages.
Java: A large part of me dislikes Java simply for being Java. It feels like programming with training wheels. However it does have excellent library support, it's quite easy to write decent code in, and it will run on basically anything with no extra work. Downsides are performance and that general icky feeling. Performance might not be that much of an issue as the JVM tends to optimise well, but won't reach C++ levels. Again, this is looking at single-core.
C#: Like Java, but it lets you use pointers and references directly and has some cool features that make programming a bit more comfortable. Library support is not as good as Java's, and it is not as easily ported since there are quirks with Mono and .NET Core. Performance is apparently slightly worse than Java.
Python: I already know what libraries I would use with Python and how to use them, which is a big plus. It's also ported to every system worth considering. But, like Java, there's something about Python I just don't like. It's great for prototyping and writing short programs, but anything bigger tends to become messy fast. Plus, Python can be excruciatingly slow. If I can't get access to a cloud platform I will cross Python off the list.
JavaScript: Despite having some weird quirks I quite like programming in JS with Node. The event-based model is a bit weird but using a promise-based framework can be quite neat. I'm aware that machine learning frameworks for Node.js exist but I suspect the performance will be even worse than Python, so again, unless I can get a cloud server, I won't consider JS.
Haskell: Here's Haskell to break up the imperative monotony. A beautiful language which gets decent performance on a single core (comparable to Java), but its real advantage here is its scalability. Because Haskell is purely functional, programs can generally be parallelised with very few changes (typically a handful of annotations), and the runtime schedules the work across cores for you. This saves tons of time, both in development/testing and at runtime, because pure functional code doesn't need explicit synchronisation. Great metaprogramming facilities. Haskell has a fantastic property-based testing framework (QuickCheck) but is somewhat less popular than the other languages and thus lacks framework and tooling support in general. It can also be quite unintuitive and some of the more mathematically abstract features are truly mind-bending. (I wrote my undergraduate dissertation project in Haskell and I'm still not 100% sure what a monad is. I'm only partly joking.) I have a major soft spot for this language and if I can find a decent ML framework for it, I will probably choose it, despite the other problems.
I said I was only listing languages that I already know, but F# and Scala seem quite easy to pick up (I've read a couple of F# examples and I already feel I know the language) and would unite many of the advantages of Haskell and C#/Java. F# is apparently the simpler language and benefits from features the CLR has built in that the JVM lacks (tail call optimisation for one, though this information dates from 2015 and may be outdated) [1]. It is also supported by Visual Studio, so if it comes down to F# or Scala I will choose F#.

Tooling & Automation

Some tooling and automation will be useful, although I'll need to think about how it fits a project of this kind. At the very least, I will use Git hooks to automate unit tests.
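As a rough idea of what such a hook could look like, here is a sketch of a pre-commit hook written in Python (a Git hook can be any executable). It assumes the tests can be found by the standard-library `unittest` runner; the real test command will depend on whichever language and framework I end up choosing.

```python
#!/usr/bin/env python3
import subprocess
import sys

# Assumption: the project's tests are discoverable by the unittest runner.
TEST_COMMAND = [sys.executable, "-m", "unittest", "discover"]


def run_tests(command=TEST_COMMAND):
    """Run the test command and return its exit code (0 means success)."""
    return subprocess.run(command).returncode


# Installed as an executable .git/hooks/pre-commit, the script would end with:
#     sys.exit(run_tests())
# Git aborts the commit whenever the hook exits with a non-zero status.
```

The final `sys.exit` call is left as a comment so the snippet can be imported and tried out without exiting the interpreter.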

Architecture and Platform

The vast majority of malware targets the Microsoft Windows family of operating systems on the x86 or x86_64 architectures, and Practical Malware Analysis focuses on the same platforms. Therefore I will also focus on these platforms, although if I find myself with extra time I could train models for different architectures (ARM or even Java bytecode would be good ones, as they would cover the mobile market, which is the second largest target of malware authors).
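Restricting the project to Windows on x86/x86_64 means the inputs will be PE files, and the target architecture can be read straight from the header. The sketch below is a plain header check, not the predictive model described under Steps, and the function name is my own. It relies on three facts about the PE format: the file starts with "MZ", the 32-bit offset of the "PE\0\0" signature is stored at 0x3C, and the 16-bit Machine field follows the signature.

```python
import struct

# Machine field values from the PE/COFF header.
MACHINE_NAMES = {0x014C: "x86", 0x8664: "x86_64"}


def pe_machine(data):
    """Return "x86", "x86_64" or "other" for a PE image, or None if not PE."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew: little-endian offset of the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        return None
    if len(data) < pe_offset + 6:
        return None  # truncated before the Machine field
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, "other")
```

Called on the bytes of a file, for example `pe_machine(open(path, "rb").read())`, this would let the pipeline skip anything outside the chosen platforms before disassembly.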

Machine Learning Algorithms

Machine code is sequential data, which points to using either convolutional or recurrent networks. Long short-term memory (LSTM) is a type of recurrent neural network designed to mitigate the vanishing gradient problem. I will either use this or a convolutional net, as in Davis & Wolff, Deep Learning on Disassembly Data.
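To show what an LSTM does at each step of a sequence, here is a toy sketch with a scalar input, hidden state and cell state. It is not the model I will train: real networks use vectors and learned weight matrices from an ML framework, and the weight layout and the byte scaling here are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step on scalars.

    `w` maps each gate name to a hypothetical
    (input weight, recurrent weight, bias) triple.
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value for the cell state
    c = f * c_prev + i * g            # additive cell-state update
    h = o * math.tanh(c)
    return h, c


def run_sequence(byte_values, w):
    """Feed a sequence of byte values (0-255) through the cell."""
    h, c = 0.0, 0.0
    for b in byte_values:
        h, c = lstm_step(b / 255.0, h, c, w)  # scale bytes to [0, 1]
    return h, c
```

The additive update `c = f * c_prev + i * g` is the important part: when the forget gate stays close to 1, the cell state (and its gradient) passes through many steps largely unchanged.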

I have also found a technique for training under adversarial circumstances: generative adversarial networks (GANs) are trained by pitting a generator against a classifier (the discriminator) in a zero-sum competition. The generator produces data designed to trick the classifier into a particular classification, typically "real". Over the course of the game, the generator produces increasingly convincing fakes, and the classifier gets better at distinguishing the fakes from real inputs.

Other

Since training models can be slow, it would be good to use a cloud platform like Azure or Google Cloud. These tend to be "freemium" but I might be able to survive on free trials for the three months I have to complete the project.

If cloud platforms are unavailable/unfeasible it would be cool to use GPGPU technology such as CUDA to get extra performance gains. F# seems to support this and I'm sure Java does as well.

References

[1] Quora: Is F# (F-Sharp) better than Scala? If so, why?

Malware Learning Machine: AI-based Malware Classifier Project