Introduction

Hi, I'm Chris, a postgrad student of artificial intelligence (officially Intelligent and Adaptive Systems) at the University of Sussex. This blog is going to serve as my notes for my dissertation project, which is exploring automatic classification of files as malware or non-malware based on machine learning models. I'm interested in many computer science topics, including systems programming, security and, of course, AI, so this project is my attempt to kill two (or three) birds with one stone.

This post will give an overview of what I plan to do with this project. Just a quick note on terminology: malware is what the average person would call a (computer) virus. Technically a virus is just one type of malware (one which is capable of spreading by itself), so I will generally use the word "malware" instead, meaning any software designed to cause harm to a computer. If I do use the word "virus", assume it's used interchangeably with "malware" unless otherwise noted. I will usually refer to software that attempts to detect and remove malware as "antivirus" software, because "antimalware" is less commonly used; treat the two terms as interchangeable.

Purpose of the Project

The basic goal of the project is to see whether machine learning techniques can improve malware detection rates. (Note: I'm being deliberately vague when I refer to ML "techniques"; a later post will go over which techniques I plan to employ.) "Old school" malware detection was based on digital fingerprinting. The antivirus software's developers assign each known piece of malware a digital signature: a string of data found in every file infected by that particular malware, but rarely found in other files (including other malware). Antivirus software shipped with a database containing the signature of every piece of malware it could recognise, and detected malware by searching files for those signatures.

One problem with this is probably obvious: only malware whose signature is in the database can be detected this way, so this kind of antivirus software is inherently unable to detect malware more recent than its latest database version. Signatures are also so specific that a slight change to the malware's code can change its signature, allowing the malware to evade detection until the signature database is updated. Since malware authors can churn out new or updated malware faster than antimalware developers can analyse it [1], create signatures and release database updates (and since users are notoriously bad at updating software), defeating old-school antivirus software is fairly simple.
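To make the signature idea concrete, here is a minimal sketch of a signature scanner in Python. It is a toy under stated assumptions, not how any real antivirus product works: the `Signature` class, the `scan` function, the family names and the byte patterns are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    name: str       # label for the malware (hypothetical)
    pattern: bytes  # byte string expected in every infected file


# Toy signature database; every entry here is made up for illustration.
SIGNATURE_DB = [
    Signature("Example.A", b"\xde\xad\xbe\xef\x13\x37"),
    Signature("Example.B", b"EVIL_PAYLOAD_V1"),
]


def scan(data: bytes, db=SIGNATURE_DB) -> list[str]:
    """Return the names of all signatures whose pattern occurs in data."""
    return [sig.name for sig in db if sig.pattern in data]


clean = b"hello world"
infected = b"\x00\x01" + b"EVIL_PAYLOAD_V1" + b"\x02"
# A variant that differs from the known sample by a single byte.
variant = infected.replace(b"V1", b"V2")

print(scan(clean))     # []
print(scan(infected))  # ['Example.B']
print(scan(variant))   # [] (the one-byte change evades the signature)
```

The last line shows the weakness described above: the variant is functionally the same sample, but because its bytes no longer contain the stored pattern, the scanner reports nothing until the database gains a new entry.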

Modern antivirus software throws out signatures and uses "definitions" instead. Malware definition databases contain features extracted from known malware using a combination of data mining techniques and hand-selection by malware researchers [2]. Each feature is correlated with maliciousness, so detecting it in a file raises the estimated probability that the file is malware. This is much more robust than using signatures, since it isn't defeated by changing small parts of the malware binary. However, the data mining operations involved rely on access to (1) massive computational resources and (2) massive volumes of data, neither of which is really accessible to me. I have found a few dozen malware samples in public GitHub repos (see the Resources section), but this clearly won't compete with the resources available to antivirus companies.
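As a rough illustration of feature-based scoring, here is a minimal Python sketch. Everything in it is an assumption made for the example: the three features, their weights, the bias and the logistic combination are invented, and real definition databases are far larger and built from real data rather than guessed by hand.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


# Hypothetical features as (name, detector, weight); weights are made up.
FEATURES = [
    ("high_entropy", lambda d: byte_entropy(d) > 7.0, 1.5),
    ("debugger_check", lambda d: b"IsDebuggerPresent" in d, 1.0),
    ("toy_marker", lambda d: b"EVIL_PAYLOAD" in d, 2.0),
]
BIAS = -2.0  # assumed prior: most files are not malware


def malware_probability(data: bytes) -> float:
    """Sum the weights of the detected features and squash to (0, 1)."""
    score = BIAS + sum(w for _, detect, w in FEATURES if detect(data))
    return 1.0 / (1.0 + math.exp(-score))


plain = b"hello world"
packed = bytes(range(256)) * 4 + b"IsDebuggerPresent"
tweaked = b"\xff" + packed[1:]  # change one byte of the sample

print(malware_probability(plain) < malware_probability(packed))     # True
print(malware_probability(packed) == malware_probability(tweaked))  # True
```

Unlike an exact signature match, the one-byte change in `tweaked` leaves both detected features intact, so the score does not move. That is the robustness argument in miniature, although choosing good features and weights is exactly the part that needs the data and compute mentioned above.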

Signature matching and definition-based detection are both forms of static analysis [1], which inspects a file without running it. There is also dynamic analysis, which observes a program as it runs and looks for suspicious behaviours such as disabling security features or attempting to detect a sandboxed environment [1].

Because of these difficulties and disadvantages, I don't expect to revolutionise the world of malware detection with a classifier that beats all the existing antivirus programs. Instead, I will look to fill a niche by finding a scenario where current techniques are ineffective for one or more reasons. For example, I might find a technique that works on embedded systems with limited space for definition databases. This is something I will have to discover as the project develops, because this is a new domain for me. I will start by researching modern malware techniques in more detail and figuring out what the "cutting edge" is. It might be good to replicate an existing paper and extend it. More on this in a later post!

References

Resources

This is a list of resources I've found so far which I will use in the project.







