Low Dimensional Embedding of Malicious Software Images

Computer vision is playing an increasingly important role in automated malware detection due to the rise of image-based binary representations. These binary images are fast to generate, require no feature engineering, and are resilient to popular obfuscation methods.

In this lab, we walk through how to use GPU-accelerated dimensionality reduction (PCA) to embed different types of malicious software represented as binary images. We obtain these malicious binary images from the MalNet database. Specifically, we will look at how to download and process the data, reduce its dimensionality with PCA, and visualize the resulting low-dimensional image embeddings.

Part 1. Downloading and processing the data

We download the image data from the MalNet database and store it locally in the folder notebooks/.

Now that the data has been downloaded, let's gather the image file paths and labels. We load each image as a 1D array, allowing us to form a 2D matrix where each row represents an image and each column a pixel intensity.
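As a minimal sketch of the flattening step, assuming the images have already been decoded into equally sized NumPy arrays (for example with an image library such as Pillow), each array can be reshaped into one row of the data matrix. The helper name `build_data_matrix`, the tiny random images, and the label names are hypothetical stand-ins, not the lab's actual data or code.

```python
import numpy as np

def build_data_matrix(images):
    """Flatten equally sized image arrays into the rows of a 2D matrix."""
    # An image of shape (H, W) or (H, W, C) becomes one row of length H*W(*C).
    rows = [np.asarray(img, dtype=np.uint8).reshape(-1) for img in images]
    return np.stack(rows)

# Hypothetical stand-ins for decoded malware images: three 4x4 RGB arrays.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8) for _ in range(3)]
labels = np.array(["type_a", "type_b", "type_c"])  # hypothetical label names

X = build_data_matrix(images)
print(X.shape)  # (3, 48): one row per image, one column per pixel channel
```

Keeping `labels` in the same order as the rows of `X` is what lets later steps color or group the embedded points by malware type.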

To get a feeling for the data we are going to be working with, let's visualize one of the images.

Each image is a visual representation of an executable's bytecode, where each pixel value lies in the range [0, 255]. Each pixel is colored according to the section that the bytecode originates from (e.g., header, data).

Part 2. Dimensionality reduction using PCA

One of the most popular forms of dimensionality reduction is Principal Component Analysis (PCA). Given a set of data, PCA finds a low-dimensional representation of the data while preserving as much of the data's variation as possible. By using PCA to find a low-dimensional representation of the data, we can perform a variety of important downstream tasks, e.g., data visualization and clustering.
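To make the idea concrete, here is a CPU-only sketch of PCA using NumPy's singular value decomposition. It is an illustration of the technique on synthetic data, not the GPU implementation used in this lab; the function name `pca_embed` is hypothetical.

```python
import numpy as np

def pca_embed(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=np.float64)
    # PCA operates on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # The embedding is the centered data expressed in the new basis.
    return X_centered @ components.T, components

# Synthetic data standing in for the image matrix: 100 samples, 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

Z, components = pca_embed(X, n_components=2)
print(Z.shape)  # (100, 2)
```

Each row of `Z` is the low-dimensional representation of the corresponding row of `X`, and the first column always carries at least as much variance as the second.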

Let's begin by running PCA with cuML, a GPU-accelerated machine learning library that includes a PCA implementation, and plotting the amount of information (data variance) retained in the first 30 principal components.
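The quantity being plotted is the cumulative explained variance ratio. The sketch below computes it on the CPU with NumPy from the singular values of the centered data; scikit-learn-style PCA estimators, including cuML's, report the per-component ratios through an `explained_variance_ratio_` attribute. The function name and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def cumulative_explained_variance(X, n_components):
    """Fraction of total variance retained by the first k components, k = 1..n."""
    X = np.asarray(X, dtype=np.float64)
    X_centered = X - X.mean(axis=0)
    # Squared singular values are proportional to the variance per component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variance = singular_values ** 2
    ratio = variance / variance.sum()
    return np.cumsum(ratio[:n_components])

# Synthetic data whose feature scales decay, so early components dominate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50)) * np.linspace(10.0, 0.1, 50)

curve = cumulative_explained_variance(X, n_components=30)
print(curve[[0, 9, 29]])  # variance retained by the first 1, 10, and 30 components
```

Plotting `curve` against the component index gives the kind of figure described here: a non-decreasing curve that flattens once additional components stop adding much variance.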

In the figure above, we see that 95% of the image information can be represented using just the first 30 principal components. This allows us to reduce our original data matrix to less than 1% of its original size while keeping nearly all of the information!
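The size claim follows from simple arithmetic: keeping 30 values per image out of `n_features` pixel values retains a fraction of 30 / `n_features`, which is below 1% whenever each image has more than 3,000 pixel values. The image size used below is a hypothetical example, since the actual dimensions depend on how the images were loaded.

```python
def retained_fraction(n_components, n_features):
    """Fraction of the original columns kept after the projection."""
    return n_components / n_features

# Hypothetical image size: 256 x 256 grayscale pixels per image.
n_features = 256 * 256
fraction = retained_fraction(30, n_features)

print(f"{fraction:.4%} of the original columns are kept")
```

This counts only the embedded data matrix. Reconstructing approximate images would also require storing the 30 principal components themselves.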

Part 3. Visualizing low dimensional image embeddings

Now let's visualize these malware images in 2D. While the first 2 principal components contain only ~50% of the image information, that is enough to create some interesting visualizations. In fact, looking at the figure below, we see that the 3 malware types form distinct clusters.

If we were to classify these 3 types of malware using dimensionality reduction, would we want to use 2 components, or more? Explain why.