Weeks 5 & 6

Technical Issues Continued

I spent the end of week 5 and the beginning of week 6 with a few more problems using PyTorch, Skorch and SciKit-Learn together, partly due to a lack of clear documentation about the data certain classes and functions expect. After a lot of tweaking I managed to get a model running, but it wasn't learning anything -- the loss function was returning either infinity or zero. Further tweaking only produced more errors, so I decided to scrap Skorch and SK-Learn and write the model selection code myself. After spending two weeks debugging other people's code, I got a concurrent grid search/k-folds cross-validation implementation of my own up and running in an evening. (I wasn't sure how to feel about that -- annoyed that I had wasted so much time when the solution was staring me in the face, or glad that I finally had something that worked and could stop worrying about problems in other people's code.) I'm still making improvements -- at the moment there is a problem with memory usage and possibly a deadlock (where two or more processes each hold a resource the other needs, so both end up waiting forever), but at least these are bugs in code that I wrote, so I should be able to fix them pretty easily. (Definitely more glad than annoyed.)
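For the curious, below is a minimal sketch of what hand-rolled grid search with k-folds cross-validation can look like. The structure is roughly what I described above, but the parameter handling, the train_and_score helper and the use of ProcessPoolExecutor are illustrative assumptions rather than my actual implementation.

    # Minimal sketch: grid search + k-folds cross-validation, one process per combination.
    from concurrent.futures import ProcessPoolExecutor
    from itertools import product

    import numpy as np

    def train_and_score(params, train_x, train_y, val_x, val_y):
        # Hypothetical placeholder: train a model with `params` on the training
        # folds and return its score on the validation fold.
        raise NotImplementedError

    def k_fold_indices(n_samples, k):
        # Shuffle the sample indices and split them into k roughly equal folds.
        return np.array_split(np.random.permutation(n_samples), k)

    def cross_validate(params, data, labels, k=3):
        # Average validation score of one hyperparameter combination over k folds.
        folds = k_fold_indices(len(data), k)
        scores = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(train_and_score(params,
                                          data[train_idx], labels[train_idx],
                                          data[val_idx], labels[val_idx]))
        return params, float(np.mean(scores))

    def grid_search(param_grid, data, labels):
        # Evaluate every combination in the grid concurrently and keep the best.
        combos = [dict(zip(param_grid, values))
                  for values in product(*param_grid.values())]
        with ProcessPoolExecutor() as pool:
            results = pool.map(cross_validate, combos,
                               [data] * len(combos), [labels] * len(combos))
        return max(results, key=lambda r: r[1])

Shipping the whole dataset to every worker like this is simple, but it copies the data once per process, which is probably related to the memory problems I mentioned above.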

CPU, GPU or Cloud?

Training deep learning models can be really slow due to the sheer volume of maths involved. CPUs are generally a lot slower at processing real numbers (numbers with a decimal point such as 3.14, 2.72, etc.) than integers (whole numbers). Deep learning involves repeatedly (tens of thousands of times) performing operations on tensors, which can contain millions of numbers, and the model selection stage (where we find "hyperparameters" such as the number of units and layers in the network, or the learning rate) involves doing this dozens of times with different combinations of hyperparameters to find the best one. This can be very time consuming -- a typical home computer could take days to find a model. Although my CPU is fairly powerful, running on a GPU would be far faster. Graphics processors (GPUs) are optimised for performing operations on large matrices of real numbers in parallel, and can be many times faster than a CPU for deep learning (the strongest GPU was almost 15x as fast as two CPUs in a test with TensorFlow [1]).
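If you want to see the difference for yourself, a rough timing comparison along these lines will do it (assuming a CUDA-capable GPU is visible to PyTorch; the matrix size and number of repeats are arbitrary):

    # Rough sketch: time a large matrix multiplication on the CPU and the GPU.
    import time
    import torch

    def time_matmul(device, size=4096, repeats=10):
        # Multiply two size x size matrices `repeats` times and return the seconds taken.
        a = torch.randn(size, size, device=device)
        b = torch.randn(size, size, device=device)
        if device.type == "cuda":
            torch.cuda.synchronize()      # make sure the setup work has finished
        start = time.time()
        for _ in range(repeats):
            a @ b
        if device.type == "cuda":
            torch.cuda.synchronize()      # wait for the GPU to actually finish
        return time.time() - start

    print("CPU:", time_matmul(torch.device("cpu")))
    if torch.cuda.is_available():
        print("GPU:", time_matmul(torch.device("cuda")))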

Unfortunately I have an AMD GPU, and PyTorch only supports NVIDIA GPUs (via CUDA). I have, however, managed to get my hands on an older NVIDIA GPU (a GT 730 with 4 GB of memory and 384 CUDA cores) which should work with PyTorch, and I will update my code to use it once it arrives and I have it set up. The results should find their way into the next update, and I'm really hoping they will be positive, otherwise I will have wasted £50.
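Updating the code should mostly be a matter of moving the model and the data onto whichever device is available, since PyTorch lets you write device-agnostic code. A minimal sketch (the linear layer and random batch are just stand-ins for the real network and data):

    # Minimal sketch: run the same code on the GPU if one is available, else the CPU.
    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = nn.Linear(15, 1).to(device)        # stand-in for the real network
    batch = torch.rand(32, 15, device=device)  # stand-in for a batch of 15-byte inputs
    output = model(batch)                      # runs on whichever device was chosen
    print(output.device)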

I also investigated a number of cloud platforms, such as Amazon EC2, Google Cloud Platform and Microsoft Azure. They're all free for the first month or so if you use them carefully, but only Google seems to offer GPU access, and the way the prices are structured means it would apparently cost about $700 (including the $300 credit they give you) to run for a single month. I decided to go with the GPU.

Preprocessing

If you recall, stage 1 of the project was to produce a classifier which could decide whether a stream of up to 15 bytes constitutes a valid x86 instruction. PyTorch doesn't seem to like variable-length input sequences, so I pad the inputs to a fixed length, using infinity as the padding value (since 0 is a valid x86 instruction byte). Also, since my inputs are bytes (unsigned 8-bit integers) but the network expects real numbers, I've converted the input stream into reals by dividing by 256 (one more than the maximum value of a byte, so every byte lands between 0 and just under 1). That's all the preprocessing I'm doing for now. I'm not going to do any dimensionality reduction because the input has only one dimension, and every part of the input sequence is relevant to whether the instruction is valid. I'm not sure whether things like correlation between input features (which whitening would address) apply to 1D sequential inputs.
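Here is a minimal sketch of that preprocessing, assuming the raw data arrives as lists of byte values; the function name and the use of NumPy are my own illustration:

    # Minimal sketch: pad variable-length byte sequences and scale the bytes to reals.
    import math
    import numpy as np

    MAX_LEN = 15          # an x86 instruction is at most 15 bytes long
    PAD_VALUE = math.inf  # 0x00 is a legitimate instruction byte, so pad with infinity

    def preprocess(byte_sequences):
        # Turn a list of variable-length byte sequences into a fixed-size float array.
        batch = np.full((len(byte_sequences), MAX_LEN), PAD_VALUE, dtype=np.float32)
        for i, seq in enumerate(byte_sequences):
            batch[i, :len(seq)] = [b / 256.0 for b in seq]  # bytes -> [0, 1)
        return batch

    # Example: a one-byte instruction (0x90, NOP) and a two-byte one (0x31 0xC0, xor eax, eax).
    print(preprocess([[0x90], [0x31, 0xC0]]))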

Small Changes to the Binary Generator

I've updated the binary generator (which generates random instruction sequences) to run one task per CPU core. The result has been amazing -- generating 2,000 instructions on one core takes about 40 minutes; on 8 cores it takes less than 4. That's a roughly 90% reduction in running time -- concurrency is magic. I also updated the binary generator to generate larger datasets -- 20,000 instructions now -- because I'm using k-folds cross-validation with three folds, and with 2,000 instructions each fold contains only about 666 instructions; probably not enough to decide which model is the best.
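The parallelisation itself doesn't need much code. A rough sketch with Python's multiprocessing module is below; generate_instruction is a hypothetical stand-in for whatever actually produces one random instruction:

    # Minimal sketch: spread instruction generation across all available CPU cores.
    import multiprocessing as mp
    import random

    def generate_instruction(_):
        # Hypothetical stand-in: produce one random instruction as a byte sequence.
        return bytes(random.randrange(256) for _ in range(random.randint(1, 15)))

    def generate_dataset(count):
        # Generate `count` instructions, with one worker process per CPU core.
        with mp.Pool(processes=mp.cpu_count()) as pool:
            return pool.map(generate_instruction, range(count))

    if __name__ == "__main__":
        dataset = generate_dataset(20_000)
        print(len(dataset), "instructions generated")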

References

