CN116663019B

CN116663019B - Source code vulnerability detection method, device and system

Info

Publication number: CN116663019B
Application number: CN202310823880.1A
Authority: CN
Inventors: 索雯琪; 胡雨涛; 吴月明; 李珍; 邹德清
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2023-07-06
Filing date: 2023-07-06
Publication date: 2023-10-24
Anticipated expiration: 2043-07-06
Also published as: CN116663019A

Abstract

The application discloses a source code vulnerability detection method, device and system, belonging to the technical field of information security. The method comprises the following steps: performing static analysis on the code segments in a training set to obtain the corresponding enhanced ASTs, and converting each enhanced AST into a grayscale image of its state probability matrix; training an original CNN model with the grayscale images corresponding to the code segments in the training set to obtain a target CNN model; converting the source code to be detected into a grayscale image of the state probability matrix of its enhanced AST; and inputting that grayscale image into the target CNN model to obtain a vulnerability detection result. By statically analyzing the code and extending the AST, the application retains the syntactic and semantic information of the program more completely; by representing the AST as an image while preserving program structure information, and then detecting vulnerabilities with the trained CNN model, it improves detection efficiency and supports multiple programming languages.

Description

Source code vulnerability detection method, device and system

Technical Field

The application belongs to the technical field of information security, and particularly relates to a method, a device and a system for detecting source code vulnerabilities.

Background

In recent years, network security incidents such as hacker intrusions, botnet attacks, and user information leaks have occurred frequently. Software systems are an important component of cyberspace, and their vulnerabilities pose a serious security threat to it. According to National Vulnerability Database (NVD) statistics, the number of vulnerabilities worldwide keeps increasing: the number of security vulnerabilities disclosed in 2021 reached 20,137, and the growth rate is itself rising. Automated attack and defense has gradually become a research trend, and within it the discovery and mining of vulnerabilities is the most basic stage. Proactively discovering the security vulnerabilities of a system is therefore of great significance for attack and defense.

Common vulnerability detection methods convert the code into an intermediate representation in order to learn a representation of the code. According to how the source code is converted, existing research falls into four types: text-based detection, token-based detection, syntax-tree-based detection, and graph-based detection. Text-based deep learning vulnerability detection takes the code text directly as input, but cannot accurately capture the semantic information of the program. Token-based deep learning vulnerability detection splits each line of code into a token sequence according to lexical rules, but still treats the source code as plain text and lacks program semantics and context information. Syntax-tree-based deep learning vulnerability detection represents code by its syntactic structure, such as a parse tree or an abstract syntax tree (AST), which provides more accurate syntax information, but tree analysis is complex and costly. Graph-based deep learning vulnerability detection describes source code with graphs such as the program dependence graph (PDG) and the control flow graph (CFG), in which nodes represent statements or tokens and edges represent control or data dependence; this retains the syntactic and semantic information of the program completely and comprehensively, but graph analysis is time-consuming and scales poorly. Moreover, generating some graphs (such as the PDG) requires compilation and supports only C/C++, so it does not apply to other languages.

Therefore, existing intelligent vulnerability detection methods cannot be applied to large-scale real software, mainly because of the following two defects: 1) efficiency and accuracy are difficult to achieve at the same time; 2) generally only one programming language is supported, so the method cannot be used to detect vulnerabilities in other languages.

Disclosure of Invention

In view of the above defects of, or needs for improvement in, the prior art, the application provides a source code vulnerability detection method, device and system. The aim is to extend the AST through static analysis of the code, so that the syntactic and semantic information of the program is retained more completely, and to convert the AST into an image that preserves program structure information, so that a trained CNN model can perform vulnerability detection with improved efficiency and with support for multiple programming languages. This solves the technical problems that existing vulnerability detection methods, when applied to large-scale real software, struggle to achieve both efficiency and accuracy and have poor compatibility.

To achieve the above object, according to one aspect of the present application, there is provided a source code vulnerability detection method, including:

Training phase:

S1: for the code segments in the training set, obtaining the corresponding enhanced abstract syntax tree (AST) through static analysis;

S2: converting the enhanced AST of each code segment in the training set into a grayscale image of the state probability matrix of that enhanced AST;

S3: training an original CNN model with the grayscale images corresponding to the code segments in the training set to obtain a target CNN model;

Detection phase:

S4: converting the source code to be detected into a grayscale image of the state probability matrix of its enhanced AST, following the steps in S1 and S2;

S5: inputting the grayscale image corresponding to the source code to be detected into the target CNN model to obtain a vulnerability detection result.

In one embodiment, S1 includes:

generating the AST of each code segment in the training set through static analysis;

adding control flow and data flow to the AST of each code segment in the training set to obtain the enhanced AST of that code segment.

In one embodiment, the enhanced AST defines the following types of edges to represent data flow and control flow:

parent-child relationship: connecting a non-terminal node to all of its child nodes according to the AST rules;

sibling relationship: connecting a node to its sibling node;

next token: connecting a terminal node to the next terminal node;

data flow: connecting a node where a variable is used to the node of that variable's next occurrence;

control flow: edges representing the control flow of if, for, and while statements, and edges representing statement order.
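The edge types listed above can be sketched as follows. This is only an illustrative sketch: it uses Python's standard `ast` module as a stand-in parser, and the function name `build_enhanced_edges`, the edge-type keys, and the simplified data-flow and control-flow rules are assumptions made for the example rather than details taken from the application.

```python
import ast
from collections import defaultdict

def build_enhanced_edges(source):
    """Return {edge_type: [(src_label, dst_label), ...]} for a Python snippet."""
    tree = ast.parse(source)
    edges = defaultdict(list)
    terminals = []   # leaf nodes in visiting order
    last_use = {}    # variable name -> node of its previous occurrence

    def label(node):
        return type(node).__name__

    def children_of(node):
        # Skip Load/Store context markers, which carry no structure.
        return [c for c in ast.iter_child_nodes(node)
                if not isinstance(c, ast.expr_context)]

    def visit(node):
        kids = children_of(node)
        for kid in kids:                              # parent-child edges
            edges["parent_child"].append((label(node), label(kid)))
        for left, right in zip(kids, kids[1:]):       # sibling edges
            edges["sibling"].append((label(left), label(right)))
        if not kids:
            terminals.append(node)
        if isinstance(node, ast.Name):                # data-flow edges
            if node.id in last_use:
                edges["data_flow"].append((label(last_use[node.id]), label(node)))
            last_use[node.id] = node
        if isinstance(node, (ast.If, ast.While)):     # condition -> first body statement
            edges["control_flow"].append((label(node.test), label(node.body[0])))
        if isinstance(node, ast.For):
            edges["control_flow"].append((label(node.iter), label(node.body[0])))
        body = getattr(node, "body", None)
        if isinstance(body, list):                    # statement-order edges
            for first, second in zip(body, body[1:]):
                edges["control_flow"].append((label(first), label(second)))
        for kid in kids:
            visit(kid)

    visit(tree)
    for left, right in zip(terminals, terminals[1:]):  # next-token edges
        edges["next_token"].append((label(left), label(right)))
    return dict(edges)

edges = build_enhanced_edges("x = 1\nif x > 0:\n    y = x + 1\n")
for edge_type, pairs in edges.items():
    print(edge_type, pairs)
```

Each edge is recorded as a pair of node-type labels so that the pairs can later be counted as state transitions.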

In one embodiment, S2 includes:

S21: for each edge of each subtree in the enhanced AST, counting the information of the two nodes it connects to obtain the number of times one state transitions to another; establishing an AST-based Markov chain model by counting all state transitions;

S22: generating a state transition matrix from the state transition counts recorded in the AST-based Markov chain model;

S23: converting the state transition matrix into a transition probability matrix, and converting the values in the transition probability matrix into gray values to obtain the corresponding grayscale image.
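Steps S21 and S22 can be illustrated with a small counting sketch. The state labels, the helper names `count_transitions` and `transition_matrix`, and the sample edge list are hypothetical; only the count of three transitions from the parameter-list state to the identifier state mirrors the example given later in the description.

```python
from collections import Counter

# Hypothetical labels for the four subtree states named in the text.
STATES = ["statement_expression", "call", "parameter_list", "identifier"]

def count_transitions(edges):
    """S21: count how often each (source state, target state) pair occurs."""
    return Counter(edges)

def transition_matrix(counts, states):
    """S22: arrange the counts in a square matrix indexed by state."""
    index = {state: i for i, state in enumerate(states)}
    matrix = [[0] * len(states) for _ in states]
    for (src, dst), n in counts.items():
        matrix[index[src]][index[dst]] = n
    return matrix

# Illustrative edges of an enhanced AST, already reduced to state pairs.
edges = ([("parameter_list", "identifier")] * 3
         + [("statement_expression", "call"), ("call", "parameter_list")])

counts = count_transitions(edges)
matrix = transition_matrix(counts, STATES)
for row in matrix:
    print(row)
```

Because the matrix size depends only on the number of states, code fragments of different lengths map to matrices of the same shape.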

In one embodiment, S23 includes:

normalizing all data in the state transition matrix to determine the probability of transitioning from one state to another, thereby obtaining a transition probability matrix;

converting the values in the transition probability matrix into gray values to obtain the corresponding grayscale image.
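A minimal sketch of S23 follows. The text does not specify the normalization scheme or the gray range, so two assumptions are made here: counts are normalized per row (the usual convention for a Markov transition matrix), and probabilities are scaled to 8-bit gray values in 0 to 255. The function names and the sample counts are hypothetical.

```python
def to_probabilities(matrix):
    """Row-normalize counts; rows with no outgoing transitions stay all zero."""
    probabilities = []
    for row in matrix:
        total = sum(row)
        probabilities.append([value / total if total else 0.0 for value in row])
    return probabilities

def to_grayscale(probabilities):
    """Scale each probability in [0, 1] to an integer gray value in [0, 255]."""
    return [[round(p * 255) for p in row] for row in probabilities]

# Illustrative state transition counts for four states.
counts = [[0, 1, 0, 3],
          [0, 0, 4, 0],
          [0, 0, 0, 0],
          [1, 0, 0, 4]]

gray = to_grayscale(to_probabilities(counts))
for row in gray:
    print(row)
```

The resulting integer grid can be saved or fed to a model as a single-channel image.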

In one embodiment, the state of each subtree includes: statement expressions, call statements, parameter lists, and identifiers.

In one embodiment, S5 includes:

inputting the grayscale image corresponding to the source code to be detected into the target CNN model;

if the vulnerability detection result output by the target CNN model is 1, the source code to be detected contains a vulnerability;

if the vulnerability detection result output by the target CNN model is 0, the source code to be detected contains no vulnerability.
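The mapping from model output to the 1/0 result can be sketched as below. The text only states that the model outputs 1 or 0; the assumption that the model first produces a score in [0, 1], the 0.5 threshold, and the names `vulnerability_label` and `describe` are all hypothetical.

```python
def vulnerability_label(score, threshold=0.5):
    """Map a model score in [0, 1] to the 1/0 detection result."""
    return 1 if score >= threshold else 0

def describe(label):
    """Turn the 1/0 result into the wording used in the text."""
    return "vulnerable" if label == 1 else "not vulnerable"

# Illustrative scores, not real model outputs.
results = [(score, vulnerability_label(score)) for score in (0.91, 0.07)]
for score, label in results:
    print(score, label, describe(label))
```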

According to another aspect of the present application, there is provided a source code vulnerability detection apparatus, including:

the training module is used for acquiring a corresponding enhanced abstract syntax tree AST through static analysis aiming at the code fragments in the training set; converting the enhanced AST of the code segments in the training set into a gray level image corresponding to a state probability matrix of the enhanced AST; training an original CNN model by using the gray level image corresponding to the code segment in the training set to obtain a target CNN model;

the detection module is used for converting the source code to be detected into a gray image of a state probability matrix corresponding to the enhanced AST; and inputting the gray level image corresponding to the source code to be detected into the target CNN model to obtain a vulnerability detection result.

According to another aspect of the present application there is provided a source code vulnerability detection system comprising a memory storing a computer program and a processor implementing the steps of the method described above when executing the computer program.

According to another aspect of the present application there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method described above.

In general, the above technical solutions conceived by the present application, compared with the prior art, enable the following beneficial effects to be obtained:

(1) In the source code vulnerability detection method for large-scale real software, the code is statically analyzed on the basis of the AST in order to extend the AST, so that the syntactic and semantic information of the program is retained completely and comprehensively. The AST is converted into an image that preserves program structure information, and the trained CNN model then detects vulnerabilities, which improves detection efficiency and supports multiple programming languages. By analyzing and enhancing the AST, the application addresses both detection efficiency and accuracy, and achieves fast and accurate large-scale vulnerability detection across multiple programming languages.

(2) The scheme makes full use of the code semantics and structure information in the AST nodes, and adds edges representing control flow, data flow, and statement execution order to extend the AST into the enhanced AST, so that code features comparable to those of graph-based representations are obtained in a short time. The semantic and syntactic information of the program is extracted as fully as possible while efficiency is maintained.

(3) The generated enhanced AST is expressed as a Markov chain and finally converted into a grayscale image. This represents the AST in a simpler form while preserving program structure information; the AST information is fully fused into a single image, and CNN-based classification makes vulnerability detection more efficient. The tool used in static analysis to extract the AST, tree-sitter, is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited. It supports parsing multiple programming languages, including Python, Java, and C. The method therefore requires only a small amount of modification to be applied to other languages and data sets.

Drawings

Fig. 1 is a schematic diagram of a source code vulnerability detection method for large-scale real software according to an embodiment of the present application.

Fig. 2 is a schematic diagram of a generation process of a source code corresponding enhanced AST according to an embodiment of the present application.

Fig. 3 is a schematic diagram of converting the enhanced AST into a grayscale image according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. In addition, the technical features of the embodiments of the present application described below may be combined with each other as long as they do not collide with each other.

As shown in fig. 1, a source code vulnerability detection method is provided, which mainly includes two stages: a training phase and a detection phase.

The purpose of the training phase is to train a target CNN model that analyzes whether the grayscale image generated from the AST is suspicious. It mainly comprises three steps: obtaining the enhanced AST through static analysis; converting the enhanced AST into a state probability matrix and converting the matrix into a grayscale image; and training a CNN model with the grayscale images generated from the enhanced ASTs.

The purpose of the detection phase is to classify whether the code to be detected is vulnerable: an output of 1 indicates a vulnerability, and an output of 0 indicates no vulnerability. First, the information of the two nodes connected by each edge in the AST is counted to obtain the number of times one state transitions to another, and an AST-based Markov chain model and its corresponding state transition matrix are built. The matrix is then processed into a transition probability matrix, whose values are converted into gray values to obtain the corresponding grayscale image. Finally, the trained CNN model examines the generated grayscale image and determines whether the code is vulnerable.

Here, a convolutional neural network (CNN) is a type of neural network designed to process data with a grid-like structure, such as image data, which can be regarded as a two-dimensional grid of pixels. Unlike in a fully connected layer, the neurons of adjacent layers in a CNN are not all directly connected; they are connected through a convolution kernel, and sharing the kernel's weights greatly reduces the number of parameters in the hidden layers. A simple CNN is a sequence of layers, each of which transforms one volume of activations into another through a differentiable function. These layers mainly include convolutional layers, pooling layers, and fully connected layers.
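The weight sharing described above can be made concrete with a tiny pure-Python sketch of one convolution step followed by max pooling. The 4x4 input, the diagonal 3x3 kernel, and the function names are illustrative assumptions; the application does not specify the CNN architecture.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation: the same kernel slides over every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

# A 4x4 grayscale image such as one produced from a 4-state probability matrix.
image = [[0, 64, 0, 191],
         [0, 0, 255, 0],
         [0, 0, 0, 0],
         [51, 0, 0, 204]]
kernel = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]

features = conv2d(image, kernel)   # 2x2 feature map
pooled = max_pool(features)        # 1x1 after 2x2 pooling

shared_weights = len(kernel) * len(kernel[0])   # 9 weights reused at every position
dense_weights = (4 * 4) * (2 * 2)               # 64 weights for a dense 16 -> 4 mapping
print(features, pooled, shared_weights, dense_weights)
```

Producing the same 2x2 output with a fully connected layer would need a separate weight for every input-output pair, which is the parameter saving the paragraph refers to.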

An abstract syntax tree (AST) is an abstract representation of the syntactic structure of source code. It represents the syntax structure of a program in the form of a tree, with each node denoting a construct that occurs in the source code. An abstract syntax tree is an ordered tree whose internal nodes are operators (e.g., "+" and "=") and whose leaf nodes are operands (e.g., constants and identifiers). The abstract syntax tree shows in detail how operands and operators make up the expressions and statements of the program, and thus shows the overall form of the program.
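As a small illustration of this structure, the sketch below prints the node types of the AST that Python's standard `ast` module builds for one assignment. The helper name `outline_of` is hypothetical, and Python's AST is used only as a convenient example: it stores the "+" operator as a child (`Add`) of a `BinOp` node rather than as the internal node itself.

```python
import ast

def outline_of(node, depth=0):
    """Return an indented outline of the node types in an AST."""
    lines = ["  " * depth + type(node).__name__]
    for child in ast.iter_child_nodes(node):
        # Load/Store context markers are omitted to keep the outline short.
        if not isinstance(child, ast.expr_context):
            lines.extend(outline_of(child, depth + 1))
    return lines

outline = outline_of(ast.parse("total = price + 1"))
print("\n".join(outline))
```

The assignment and the addition appear as internal nodes, while the identifiers and the constant appear as leaves.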

In one embodiment, S1 comprises: generating the AST of each code segment in the training set through static analysis; and adding control flow and data flow to the AST of each code segment in the training set to obtain the enhanced AST of that code segment.

The enhanced AST is constructed by adding to the AST several types of edges that represent different kinds of control flow and data flow. This addresses the problem that a plain AST cannot fully exploit the structural information of a code segment, in particular semantic information such as control flow and data flow. Control flow represents all paths that may be traversed during program execution and reflects how a process executes at run time. Data flow gathers information about the properties of a particular data item by tracking its possible definitions and uses. The enhanced AST of a program takes the form of a directed multigraph, in which statements, code blocks, or values are nodes, and direct relationships between two nodes (e.g., parent-child relationships and other relationships) are recorded as edges. Since there may be several relationships between a pair of nodes, each type of relationship (nine relationships in total) is recorded in its own relationship graph. The node connectivity of each relationship graph is encoded as an adjacency matrix. The graph representation of the enhanced AST is based purely on the AST and can easily be extended to other programming languages.
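The per-relationship adjacency matrices can be sketched as follows. The node numbering, the three relationship names, and the helper `adjacency_matrices` are illustrative assumptions; the text mentions nine relationship types without listing them all.

```python
def adjacency_matrices(num_nodes, typed_edges):
    """Build one directed adjacency matrix per relationship type.

    typed_edges maps a relationship name to a list of (src, dst) node indices.
    """
    matrices = {}
    for relation, pairs in typed_edges.items():
        matrix = [[0] * num_nodes for _ in range(num_nodes)]
        for src, dst in pairs:
            matrix[src][dst] = 1
        matrices[relation] = matrix
    return matrices

# Four illustrative nodes; the same node pair may appear under several relations.
typed_edges = {
    "parent_child": [(0, 1), (0, 2), (2, 3)],
    "sibling": [(1, 2)],
    "data_flow": [(1, 3)],
}

matrices = adjacency_matrices(4, typed_edges)
for relation, matrix in matrices.items():
    print(relation, matrix)
```

Keeping one matrix per relationship preserves the multigraph: two nodes linked by several kinds of edge are simply marked in several matrices.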

In one embodiment, the enhanced AST defines the following types of edges to represent data flow and control flow:

sibling relationship: connecting a node to its sibling node;

data flow: connecting a node where a variable is used to the node of that variable's next occurrence;

Fig. 2 illustrates the generation of an enhanced AST, taking code with a buffer overflow vulnerability as an example. As shown in Fig. 2, the enhanced AST adds edges of the above types to represent data flow, together with several other edges that represent control flow: edges for the control flow of if, for, and while statements, and edges for statement order. The enhanced AST is then converted into a state probability matrix, and the matrix into a grayscale image.

Fig. 3 is a schematic diagram of converting the enhanced AST into a grayscale image. In one embodiment, the grayscale image generation process is divided into three parts: generating a Markov chain, generating a state transition matrix, and generating a transition probability matrix, from which the corresponding grayscale image is finally produced.

In the process of generating the Markov chain, the information of the two nodes connected by each edge in the AST is first counted to obtain the number of times one state transitions to another. As shown in Fig. 3, the subtree has four states in total: statement expressions, call statements, parameter lists, and identifiers (Assignment, Operator, Member Reference, and Identifier). From the direction of the edges in the AST, the number of transitions from the parameter-list state to the identifier state is 3. By counting all state transitions, an AST-based Markov chain model is established.

Here, a Markov chain (MC) is a random process over a state space that moves from one state to another and is required to be "memoryless": the probability distribution of the next state is determined only by the current state, and the events that precede it in the sequence have no influence on it. This particular kind of memorylessness is known as the Markov property. Modeling the states of events as a Markov chain allows them to be converted into a probability matrix. After the state transition matrix has been applied a sufficient number of times, a stable probability distribution can finally be obtained that is independent of the initial state distribution.
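The convergence to a stable distribution can be checked numerically with a short sketch. The two-state matrix, the iteration count, and the function names are illustrative assumptions; such convergence holds for well-behaved (irreducible, aperiodic) chains, of which this example is one.

```python
def step(distribution, P):
    """One Markov step: new_j = sum_i distribution_i * P[i][j]."""
    n = len(P)
    return [sum(distribution[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, start, iterations=200):
    """Apply the transition matrix repeatedly, starting from `start`."""
    distribution = list(start)
    for _ in range(iterations):
        distribution = step(distribution, P)
    return distribution

# Illustrative transition probability matrix; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

from_first = stationary(P, [1.0, 0.0])
from_second = stationary(P, [0.0, 1.0])
print(from_first, from_second)
```

Both starting points end at approximately (5/6, 1/6), which satisfies the balance condition 0.1 * 5/6 = 0.5 * 1/6 for this matrix.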

In the process of generating the state transition matrix, the matrix is built from the state transition counts recorded in the previously generated Markov chain model. As shown in Fig. 3, according to the records in the Markov chain model, the number of transitions from the parameter-list state to the identifier state is 3. In the state transition matrix (Matrix), for convenience, the letter A denotes the parameter-list state and the letter I denotes the identifier state, so the entry Matrix[A][I] is 3. The remaining entries are filled in the same way to generate the complete state transition matrix.

In one embodiment, in the process of generating the transition probability matrix, all data in the state transition matrix are normalized to obtain the probability of transitioning from one state to another, which yields the corresponding transition probability matrix. These values are then converted into gray values to obtain the corresponding grayscale image. Finally, the grayscale images obtained from the whole training set are input into the CNN model to obtain the trained CNN model.

the training module is used for acquiring a corresponding enhanced abstract syntax tree AST through static analysis aiming at the code fragments in the training set; converting the enhanced AST of the code segments in the training set into a gray level image corresponding to the state probability matrix of the enhanced AST; training an original CNN model by using gray images corresponding to code segments in a training set to obtain a target CNN model;

the detection module is used for converting the source code to be detected into a gray image of a state probability matrix corresponding to the enhanced AST; and inputting the gray level image corresponding to the source code to be detected into a target CNN model to obtain a vulnerability detection result.

According to another aspect of the present application there is provided a source code vulnerability detection system comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the method described above when executing the computer program.

It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the application and is not intended to limit the application, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the application are intended to be included within the scope of the application.

Claims

1. A method for detecting source code vulnerabilities, comprising:

Training phase:

S2: converting a state probability matrix corresponding to an enhanced abstract syntax tree AST of the code segments in the training set into a grayscale image;

Detection phase:

S4: converting the source code to be detected into a grayscale image of a state probability matrix corresponding to the enhanced abstract syntax tree AST according to the steps in S1 and S2;

S5: inputting the grayscale image corresponding to the source code to be detected into the target CNN model to obtain a vulnerability detection result;

S1 comprises: generating the AST of the code segments in the training set through static analysis; adding control flow and data flow to the AST of the code segments in the training set to obtain the enhanced abstract syntax tree AST of the code segments in the training set;

S2 comprises: S21: for each edge of each subtree in the enhanced abstract syntax tree AST, counting the information of the two nodes it connects to obtain the number of times one state transitions to another; establishing a Markov chain model based on the enhanced abstract syntax tree AST by counting all state transitions; S22: generating a state transition matrix from the state transition counts recorded in the Markov chain model based on the enhanced abstract syntax tree AST; S23: converting the state transition matrix into a transition probability matrix, and converting the values in the transition probability matrix into gray values to obtain a corresponding grayscale image;

S23 comprises: normalizing all data in the state transition matrix to determine the probability of transitioning from one state to another, thereby obtaining a transition probability matrix; and converting the values in the transition probability matrix into gray values to obtain a corresponding grayscale image.

2. The source code vulnerability detection method of claim 1, wherein the enhanced abstract syntax tree AST defines the following types of edges to represent data flow and control flow:

sibling relationship: connecting a node to its sibling node;

data flow: connecting a node where a variable is used to the node of that variable's next occurrence;

3. The source code vulnerability detection method of claim 1, wherein the state of each subtree comprises: statement expressions, call statements, parameter lists, and identifiers.

4. A source code vulnerability detection method as claimed in any one of claims 1-3, wherein S5 comprises:

5. A source code vulnerability detection apparatus for performing the source code vulnerability detection method of any one of claims 1-4, comprising:

the training module is used for acquiring a corresponding enhanced abstract syntax tree AST through static analysis aiming at the code fragments in the training set; converting the enhanced abstract syntax tree AST of the code segments in the training set into a gray level image corresponding to a state probability matrix of the AST; training an original CNN model by using the gray level image corresponding to the code segment in the training set to obtain a target CNN model;

the detection module is used for converting the source code to be detected into a gray image of a state probability matrix corresponding to the enhanced abstract syntax tree AST; and inputting the gray level image corresponding to the source code to be detected into the target CNN model to obtain a vulnerability detection result.

6. A source code vulnerability detection system comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1-4.

7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 4.