Submitted by CAE Community on

In recent years, the number of malicious binary programs has increased significantly. One way to analyze such programs is to decompile them into source code so that more scalable analyses can be performed using tools that require source code. However, most decompilers produce code with undefined types and other errors that prevent the programs from being recompiled correctly. We have developed a closed-loop GNN-based system to generate recompilable source code. Given a binary program, we use Ghidra to generate the initial source code and then use a trained GNN to identify potential corrections to the errors. A novel component of the system is its use of Ghidra’s emulation to automatically identify and fix compilation errors. We utilize Ghidra’s API to decompile the methods present in a binary program to C code. Each decompiled method is then passed into Joern to generate a code property graph representation of that method. This type of graph is a union of the Abstract Syntax Tree, Control Flow Graph, and Program Dependence Graph of each method. We then remove features of this graph that are either redundant or unnecessary for the overall goal of the framework. To apply deep learning to the decompiled program, we tokenize the entire program and then encode the generated tokens to produce their vector representations. For this purpose, we use Transformer-based models, since they are expected to produce higher-quality embeddings that carry contextual information. Specifically, we use CodeBERT, which is based on the RoBERTa architecture and is trained on bimodal samples pairing natural language with programming language. To leverage the graph representation of the decompiled program, we then use a Graph Convolutional Network (GCN), in which a GCN layer takes as input the output embeddings from CodeBERT along with the edges between the nodes.
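
The last two steps, merging the three graph views and passing node embeddings through a GCN layer, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the common symmetric-normalization form of a GCN layer, ReLU(D^-1/2 (A + I) D^-1/2 X W), treats edges as undirected and untyped, and uses tiny hand-written vectors in place of real CodeBERT embeddings. The helper names (`merge_edges`, `normalized_adjacency`, `gcn_layer`) and all of the toy data are hypothetical.

```python
import math

def merge_edges(*edge_lists):
    """Union of several edge lists over the same node ids (e.g. AST, CFG, PDG)."""
    merged = set()
    for edges in edge_lists:
        merged.update(edges)
    return sorted(merged)

def normalized_adjacency(num_nodes, edges):
    """Build D^-1/2 (A + I) D^-1/2, treating every edge as undirected (assumption)."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loop so each node keeps its own features
    for src, dst in edges:
        a[src][dst] = 1.0
        a[dst][src] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]

def matmul(x, y):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def gcn_layer(a_hat, features, weight):
    """One GCN layer: ReLU(A_hat @ X @ W)."""
    hidden = matmul(matmul(a_hat, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

# Toy graph with three nodes; all edge lists below are made up for illustration.
ast_edges = [(0, 1), (0, 2)]
cfg_edges = [(1, 2)]
pdg_edges = [(0, 2)]  # duplicates an AST edge; the union keeps a single copy
edges = merge_edges(ast_edges, cfg_edges, pdg_edges)

# Stand-ins for per-node CodeBERT embeddings (real ones have hundreds of dimensions).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, -1.0], [0.5, 0.5]]  # hypothetical layer weights

a_hat = normalized_adjacency(3, edges)
out = gcn_layer(a_hat, features, weight)
```

In a real pipeline the edge types would likely be kept, since a code property graph distinguishes syntax, control-flow, and dependence edges, and a graph learning library would replace the hand-written matrix arithmetic.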

Timothy Barao
Thursday Block I