Weeks 2-4

Technical Issues!

So, first, I've completely deviated from my planned posting schedule due to a series of somewhat embarrassing technical problems I've experienced. Everything from difficulties building frameworks due to dependency hell, to realising that they either won't do what I want to do with them or lack documentation and provide examples which don't compile, to Windows kernel bugs and hosing my Linux distro by switching from the Stable branch to Testing.

Programming Languages Continued

Near the end of the second week, I still hadn't fully settled on the language. After speaking with Martin about my difficulties with F# and JavaScript, and my general dislike of Python, I decided, with Martin's encouragement, to try Haskell. I learned Haskell during my undergraduate degree and found it a very expressive language with a reasonable amount of library support at a very high (i.e. abstract) level. Enough to write my undergraduate dissertation project, an Ext2 demo driver, in it. That project was pretty tough and suffered from major performance issues (so I wrote the dissertation in the vein of "Is Haskell a Good Choice for Systems Programming?" -- the answer: "Probably not") but I did learn a great deal about the power of Haskell's type system. Because of these difficulties and because I still don't feel I understand Haskell 100%, I was reluctant to use it at first. But there's usually no point doing things the easy way. I probably wouldn't benefit from most of Haskell's features in a deep learning project because I'm not going to implement much of the guts of the program myself -- a DL framework is hard to do right, especially since I ideally want it running on the GPU, something I have no experience with short of a few basic OpenGL shaders. I'm trying to keep the code I have to write and test to a minimum so I can focus on getting results.

Unfortunately I've had to put aside the possibility of using Haskell for the time being due to problems with the package management system and a lack of documentation for the framework I intended to use, which is called Grenade. I had a go with Scala using the Apache MXNet framework which seems to be the only big one to provide generative adversarial networks [1].

I've finally settled on using Torch with Python (PyTorch). I had hoped to avoid Python because there are a few things about it which irritate me, but at least I was actually able to get something up and running with it. Documentation is great and there is a lot of support due to its popularity. I'm also using SciKit-Learn for its implementation of gridsearch, and a wrapper library called Skorch which allows the two to work together.

Current Progress

As of the 9th July, I've written a script which generates x86 instructions semi-randomly, and a classifier with PyTorch which I'm currently putting the finishing touches on. I say "semi-randomly" for the instruction generator because x86 instructions can be up to 15 bytes long, which is an insanely large number of different possible values (if I'm right in thinking it's 256^(15!), then it's ~10^(10^12)), nearly all of which will be illegal. I want to have a balanced training set and I'd like to have it within the lifespan of this universe, so to narrow the search space, my generator follows some rules:

An instruction can have up to four prefixes. Instruction prefixes can be divided into four groups (lock, rep/repne, 6 segment override prefixes, operand size override and address size override), so each instruction can have at most one prefix from each group. The generator uses exponentially decreasing probabilities to choose the number of prefixes, with a base probability of 1%. I got this number by disassembling gcc 7.3 and counting how many times each prefix occurred (except for the operand/address size prefixes). The total number of prefixes was about 1% of the number of lines of assembly code.
An opcode can be 1, 2 or (occasionally) 3 bytes long. There are 256 single byte opcodes and 233 two or three byte ones, so the generator generates single byte opcodes about 53% of the time.
Opcodes can be followed by optional ModRM and SIB bytes. ModRM can encode opcode extensions, register operands and memory operands; SIB encodes effective address computations. We just generate 0, 1 or 2 random bytes with equal probability here.
Most x86 instructions take one or two operands, a few take three and a few take none (some take more but we ignore those). I haven't attempted to find the correct proportion; instead I've simply guessed that about 10% take 0 operands, 20% take one, 65% take two and the remaining 5% take three. This is probably wrong but "good enough".

Once an instruction is generated it is then tested for validity. The only reasonable way to do this is, of course, to execute it directly on the running CPU. To this end there's a small C program which reads binary data from standard input and executes it. Yep, a computer security related project includes a program which blindly executes user-input binary data. Who said there's no room for irony in a dissertation project? Anyway, the program executes the instruction and if the CPU generates an Illegal Instruction exception (or segmentation fault), the program returns a failure code. Otherwise it returns success (0).

The Python script runs this program in a subprocess. It tests each generated instruction and writes them to standard output such that 50% of the written instructions are legal and 50% are illegal. This still takes a few minutes because there are far more illegal instructions but it's a lot faster than I expected.There are a few bugs waiting to be ironed out; for example, sometimes the opcode executer* just doesn't return, which is presumably because it jumped somewhere random in memory. There are a couple of safeguards in the program, such as a one-second alarm which allows it to take back control of the CPU from the running code and jump past the buffer using setjmp/longjmp. I haven't taken the time to debug and figure this out yet because I want to focus on getting the classifier off the ground, so my temporary solution is to kill the process from within the Python script if it takes more than one second, and record an illegal instruction.

Search This Blog

Malware Learning Machine: AI-based Malware Classifier Project

Weeks 2-4

Technical Issues!

Programming Languages Continued

Current Progress

References

Comments

Post a Comment

Popular posts from this blog

Plan