CN107358099B - Useless variable detection method based on LLVM intermediate representation program slicing technology

Info

Publication number: CN107358099B
Application number: CN201710431448.2A
Authority: CN
Inventors: 张迎周; 王星; 陈星昊; 尹秀; 赵莲
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2017-06-09
Filing date: 2017-06-09
Publication date: 2020-05-05
Anticipated expiration: 2037-06-09
Also published as: CN107358099A

Abstract

The invention discloses a useless variable detection method based on LLVM intermediate representation program slicing. Starting from program source code into which useless variables may have been inserted, the method first converts the source code into LLVM intermediate representation, then analyzes the LLVM intermediate representation with a program slicing technique to obtain a program dependency graph, and then extracts and simplifies the program dependency graph to obtain a variable distance graph. Finally, a distance threshold is set, the distance between the output variable and each other variable is calculated on the variable distance graph, and the result is used to judge whether useless variables exist in the source code. The method can effectively detect useless variables inserted into source code and applies generally to source code written in different languages.

Description

Useless variable detection method based on LLVM intermediate representation program slicing technology

Technical Field

The invention relates to the technical field of malicious code analysis, in particular to a useless variable detection method based on LLVM intermediate representation program slicing technology.

Background

With the rapid development of internet technology in the information age, people's lives have become more convenient and efficient, but network users have also become easier targets for malicious code. Network information security is receiving increasing attention, and malicious code analysis methods continue to be proposed. To make malicious code harder to analyze, its authors often protect it by various means, and code obfuscation is one of the most commonly used. Code obfuscation increases the effort reverse engineers must spend analyzing the code and also allows malware to evade detection by security tools.

Control flow obfuscation and data flow obfuscation are the most widely used code obfuscation methods. The former changes the control flow structure of a program by various means, making the control flow complicated and difficult for people to analyze and understand without changing the execution result of the program. The latter converts data or data structures in the program into a form that is hard to understand without affecting the execution result, making it difficult for a deobfuscator to analyze the data in the program. Inserting useless variables is one control flow obfuscation method: variables unrelated to the program execution result are inserted into the source program, thereby hindering the deobfuscator from analyzing the code.

Researchers at home and abroad have proposed various code deobfuscation methods. One article proposes a logic-oriented opaque predicate detection method, which represents the intrinsic characteristics of opaque predicates by constructing a general logic formula, judges whether a predicate is opaque by symbolic execution and constraint solving, and then restores the program control flow structure. Another paper proposes a method combining static analysis and dynamic analysis, which supplements the result of dynamic analysis with static analysis and adds possible control flow edges to the control flow graph obtained by dynamic analysis in order to recover the control flow graph of obfuscated code. A further paper proposes a semantics-preserving program transformation method which, combined with a taint recognition technique, recovers the internal logic of a program from code protected by different obfuscation techniques.

Although these code deobfuscation methods recover the control flow structure by various technical means, they neither specifically target inserted useless variables nor analyze obfuscated code written in different programming languages in a uniform way. Therefore, a general and targeted detection method for useless variables still requires further research.

Disclosure of Invention

The invention provides a useless variable detection method based on LLVM intermediate representation program slicing. The method starts from source code into which useless variables may have been inserted, analyzes the source code using a program slicing technique, detects the useless variables inserted in the source code, and restores the original control flow structure of the program. The method can analyze source code written in different languages in a uniform way, reduces the overhead of manual analysis, and improves detection efficiency.

The invention uses a program slicing technique to analyze source code into which useless variables may have been inserted, obtaining a program dependency graph. A variable distance graph is constructed by extracting and simplifying the program dependency graph, and the distances between variables are calculated on this graph. Finally, the variables that were inserted into the source code and are irrelevant to the program execution result are detected.

The method comprises the following steps:

S1, acquiring source code into which useless variables may have been inserted;

S2, converting the source code of S1 into LLVM intermediate representation within the LLVM framework;

S3, slicing the LLVM intermediate representation obtained in S2 using a program slicing technique to obtain a program dependency graph;

S4, extracting and simplifying the program dependency graph to construct a variable distance graph;

S5, setting the number n of variables in the source code as the variable distance threshold r, calculating the distance d between each other variable and the output variable on the variable distance graph, and, if d > r, considering the variable to be a useless variable irrelevant to the program execution result.
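The decision rule of S5 can be written compactly. The formulation below is a sketch under two assumptions that the step itself does not spell out: every edge has unit weight, and the variable distance graph has at most n nodes. The set name U is introduced here only for illustration.

```latex
r = n, \qquad
d(v) =
\begin{cases}
\text{length of the shortest directed path linking } v \text{ and the output variable}, & \text{if such a path exists},\\
\infty, & \text{otherwise},
\end{cases}
\qquad
U = \{\, v \;:\; d(v) > r \,\}
```

Under these assumptions a shortest path visits each node at most once and therefore has at most n - 1 edges, so every finite d(v) is smaller than r. The rule d > r then flags exactly the variables that have no directed path to the output variable.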

The conversion of the source code into LLVM intermediate representation described in S2 is performed by the Clang compiler.

The process of constructing the variable distance graph in S4 is as follows:

S4-1, traversing the nodes in the program dependency graph and adding all variables as nodes to the variable distance graph, keeping only one copy of any repeated node;

S4-2, traversing the edges in the program dependency graph; for each directed edge in the program dependency graph, letting B be the set of variables in the starting node of the edge and E be the set of variables in the ending node of the edge, and adding to the variable distance graph a directed edge from each variable in set B to each variable in set E; keeping only one copy of any repeated edge.

As a method for detecting useless variables in source code, the method makes up for the shortcomings of traditional detection methods for control flow obfuscated code. It analyzes a source program into which useless variables have been inserted using a program slicing technique, extracts and simplifies the resulting program dependency graph, and constructs a variable distance graph. The distances between variables are then calculated on the variable distance graph, and the useless variables added to the source program are detected.

The present invention brings the following advantageous effects:

(1) The source program into which useless variables have been inserted is analyzed using a program slicing technique, a variable distance graph is constructed, and the distances between variables are calculated on the graph, which gives higher accuracy when detecting useless variables;

(2) The source program is converted into LLVM intermediate representation, and slicing analysis is then performed on the LLVM intermediate representation. Through this conversion, source programs written in different languages can be analyzed and processed in a uniform way, so the method is broadly applicable to the detection of useless variables.

Drawings

Fig. 1 is a flowchart of a useless variable detection method based on LLVM intermediate representation program slicing technology according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

A useless variable detection method based on LLVM intermediate representation program slicing starts from a source program into which useless variables have been inserted, obtains a variable distance graph of the source program using LLVM intermediate representation and a program slicing technique, calculates the distance between the output variable and the other variables on the variable distance graph, and detects the useless variables inserted in the source program. Fig. 1 shows the overall process of the method of the present invention, which comprises the following steps:

Step 1): Source code that may contain inserted useless variables is obtained. Useless variables are variables in a program that have no effect on the output result, but may have control or data dependencies with other variables in the program. Adding useless variables makes the source code complex and difficult for people to analyze and understand. Source code with inserted useless variables can be downloaded from a malicious code website.

Step 2): The source code is converted into LLVM intermediate representation. LLVM is short for Low Level Virtual Machine and is a compiler framework written in C++. The LLVM intermediate representation can uniformly represent source code written in different languages. The LLVM intermediate representation file can be compiled from the source code by the Clang compiler.
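As an illustration of Step 2, the following Python sketch writes a small C file and builds a Clang command line that would emit textual LLVM intermediate representation. The C program, the variable name `junk`, the file names, and the helper `clang_ir_command` are all hypothetical and chosen for this example; the flags `-S -emit-llvm` are standard Clang options, while the choice of `-O0` is an assumption made here so that optimization does not remove the variable before it can be analyzed.

```python
import tempfile
from pathlib import Path

# Hypothetical input: `junk` depends on `a` but never reaches the output.
C_SOURCE = """\
#include <stdio.h>
int main(void) {
    int a = 2;
    int junk = a * 7;   /* useless: does not affect the printed value */
    int out = a + 1;
    printf("%d\\n", out);
    return 0;
}
"""


def clang_ir_command(src: Path, out: Path) -> list[str]:
    """Build a Clang invocation that emits textual LLVM IR (.ll)."""
    # -S -emit-llvm requests human-readable IR instead of an object file.
    # -O0 (an assumption of this sketch) keeps the code unoptimized.
    return ["clang", "-S", "-emit-llvm", "-O0", str(src), "-o", str(out)]


workdir = Path(tempfile.mkdtemp())
src_path = workdir / "sample.c"
src_path.write_text(C_SOURCE)

cmd = clang_ir_command(src_path, workdir / "sample.ll")
print(" ".join(cmd))

# To actually produce sample.ll (requires Clang on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The sketch only assembles the command, so it runs without Clang installed; uncommenting the last two lines would perform the conversion.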

Step 3): Slicing analysis is performed on the LLVM intermediate representation obtained in Step 2 using a program slicing technique to obtain a Program Dependency Graph (PDG). The program dependency graph is composed of a control flow graph, a control dependency graph and a data dependency graph. The control flow graph contains the control flow information of the program, the control dependency graph contains the control dependency information of the program, and the data dependency graph contains the data dependency information of the program. The program slicing technique can analyze the various dependency relations that may exist in the program and generate the program dependency graph.

Step 4): The program dependency graph obtained in Step 3 is extracted and simplified, and a Variable Distance Graph (VDG) is constructed.

Step 4.1): The nodes in the PDG are traversed, and all variables in the PDG nodes are added as nodes to the VDG. Only one copy of any repeated node is kept;

Step 4.2): The edges in the PDG are traversed. For each directed edge in the PDG, let B be the set of variables in the starting node of the edge and E be the set of variables in the ending node of the edge. A directed edge from each variable in set B to each variable in set E is added to the VDG. Only one copy of any repeated edge is kept.
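Steps 4.1 and 4.2 can be sketched in Python as follows. The input encoding (a dictionary from PDG node identifiers to variable sets, plus a list of directed edges), the function name `build_vdg`, and the toy PDG are assumptions made for this example, not a format defined by the method. Skipping self-loops is a further assumption; a self-loop never shortens a path, so it does not change any distance.

```python
from itertools import product


def build_vdg(pdg_nodes, pdg_edges):
    """Derive a variable distance graph from a PDG.

    pdg_nodes: dict mapping a PDG node id to the set of variables in it.
    pdg_edges: iterable of (start_id, end_id) directed PDG edges.
    Returns (vdg_nodes, vdg_edges) as sets, so duplicates collapse.
    """
    # Step 4.1: every variable becomes one VDG node.
    vdg_nodes = set()
    for variables in pdg_nodes.values():
        vdg_nodes |= set(variables)

    # Step 4.2: for each PDG edge, connect each variable of the start
    # node (set B) to each variable of the end node (set E).
    vdg_edges = set()
    for start, end in pdg_edges:
        B, E = pdg_nodes[start], pdg_nodes[end]
        for b, e in product(B, E):
            if b != e:  # assumption: self-loops are dropped
                vdg_edges.add((b, e))
    return vdg_nodes, vdg_edges


# Hypothetical PDG for: a = 2; junk = a * 7; out = a + 1; print(out)
pdg_nodes = {
    "n1": {"a"},
    "n2": {"junk", "a"},
    "n3": {"out", "a"},
    "n4": {"out"},
}
pdg_edges = [("n1", "n2"), ("n1", "n3"), ("n3", "n4")]

vdg_nodes, vdg_edges = build_vdg(pdg_nodes, pdg_edges)
print(sorted(vdg_nodes))
print(sorted(vdg_edges))
```

Using sets for both results implements the "only one copy is kept" rule for repeated nodes and repeated edges without extra bookkeeping.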

Step 5): A variable distance threshold r is set, and the output variable of the source code is analyzed to detect useless variables in the program.

Step 5.1): The number n of variables in the source code is set as the variable distance threshold r;

Step 5.2): The weight of every edge in the VDG is set to 1, and the distance d between the output variable and each other variable on the VDG is calculated. If no directed path exists between a variable and the output variable, d is infinite;

Step 5.3): d is compared with r, and variables satisfying d > r are considered to be useless variables irrelevant to the program execution result.
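A minimal Python sketch of Steps 5.1 to 5.3 follows. It assumes that d is the shortest directed path length from a variable toward the output variable, computed by breadth-first search over reversed edges, and that n equals the number of VDG nodes. The function names and the example graph are hypothetical.

```python
import math
from collections import deque


def distances_to_output(nodes, edges, output):
    """Unit-weight shortest path length from each variable to `output`."""
    # Index predecessors so the search can walk edges backwards.
    preds = {v: set() for v in nodes}
    for src, dst in edges:
        preds[dst].add(src)

    dist = {v: math.inf for v in nodes}  # Step 5.2: no path means infinity
    dist[output] = 0
    queue = deque([output])
    while queue:
        v = queue.popleft()
        for p in preds[v]:
            if dist[p] == math.inf:      # first visit gives shortest length
                dist[p] = dist[v] + 1
                queue.append(p)
    return dist


def useless_variables(nodes, edges, output):
    """Return the variables whose distance d exceeds the threshold r."""
    r = len(nodes)                        # Step 5.1: r = n (assumed = |VDG|)
    dist = distances_to_output(nodes, edges, output)
    return {v for v, d in dist.items() if d > r}  # Step 5.3: d > r


# Hypothetical VDG: a -> b -> out, and a -> junk (junk never reaches out).
nodes = {"a", "b", "junk", "out"}
edges = {("a", "b"), ("b", "out"), ("a", "junk")}

dist = distances_to_output(nodes, edges, "out")
flagged = useless_variables(nodes, edges, "out")
print(dist)
print(flagged)
```

Breadth-first search suffices here because all edge weights are 1; with non-uniform weights a general shortest-path algorithm would be needed instead.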

Claims

1. The useless variable detection method based on the LLVM intermediate representation program slicing technology is characterized by comprising the following steps of:

S1, acquiring source code into which useless variables may have been inserted;

S4, extracting and simplifying the program dependency graph, and constructing a variable distance graph, wherein the process comprises the following steps:

S4-2, traversing the edges in the program dependency graph; for each directed edge in the program dependency graph, letting B be the set of variables in the starting node of the edge and E be the set of variables in the ending node of the edge, and adding to the variable distance graph a directed edge from each variable in set B to each variable in set E; only one copy of any repeated edge is kept;

2. The useless variable detection method according to claim 1, wherein the conversion of the source code into LLVM intermediate representation in S2 is performed by the Clang compiler.