CN116702160A - Source code vulnerability detection method based on data dependency enhancement program slice - Google Patents

Info

Publication number
CN116702160A
Authority
CN
China
Prior art keywords
program
node
code
graph
source code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310982855.8A
Other languages
Chinese (zh)
Other versions
CN116702160B (en)
Inventor
胡勇
陈晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202310982855.8A priority Critical patent/CN116702160B/en
Publication of CN116702160A publication Critical patent/CN116702160A/en
Application granted granted Critical
Publication of CN116702160B publication Critical patent/CN116702160B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a source code vulnerability detection method based on data-dependency-enhanced program slicing. The method analyzes the source code to obtain its data dependency information and control dependency information, constructs a program dependency graph, and then enhances the program dependency graph. Program slicing is carried out with program slicing points of interest as slicing points to obtain sub-graphs of the program dependency graph, and the vulnerability classification label of each sub-graph is determined by whether it contains vulnerable code statements. User-defined identifiers in the source code are anonymized, and each semantic unit in the code is then converted into a vector using Word2Vec to form a dictionary. The code statements of all nodes in the sub-graphs obtained by program slicing are converted into vector sequences according to the dictionary. By examining the code to be detected, the method reports the location of each vulnerability, which helps repair personnel locate it quickly, and reports the vulnerability type, which helps them repair it quickly.

Description

Source code vulnerability detection method based on data dependency enhancement program slice
Technical Field
The invention relates to the technical field of source code vulnerability detection, in particular to a source code vulnerability detection method based on data dependency enhancement program slices.
Background
As people rely more and more on the Internet, software programs, which act as the bridge connecting people, are needed by ever more users. As user demands grow more complex, the code structure of software programs naturally becomes more complex as well, so vulnerabilities are introduced more easily and the security of software can no longer be neglected. Relevant surveys show that, across millions of software programs, every 1000 lines of code contain one vulnerability on average. Incidents of serious loss caused by software vulnerabilities occur again and again. Since the advent of software programs, people have tried to detect vulnerabilities by various methods, such as rule-matching scanning, taint analysis, symbolic execution, fuzz testing, and code similarity measurement. Vulnerability detection on source code can eliminate vulnerabilities during the development phase. The earlier a vulnerability is discovered, the smaller its impact and the lower the cost of repairing it.
With the popularization and successful application of artificial intelligence in recent years, some researchers have tried to detect vulnerabilities with artificial intelligence algorithms, and their results show excellent detection performance. At present, vulnerability detection based on artificial intelligence is generally divided into three steps: first, the source code is preprocessed by static analysis, extracting and constructing a representation that contains the syntax and semantic information of the source code; then, the data in character form is converted into vectors and features are extracted with a neural network; finally, a classifier is trained on the extracted feature vectors to perform classification. The preprocessing stage currently relies mostly on data-flow and control-flow analysis, abstract syntax tree (Abstract Syntax Tree, AST) construction, program slicing, and similar techniques. Graph neural networks and recurrent neural networks are generally used in the feature extraction step. A recurrent neural network mainly extracts features from preprocessed character sequences (such as an abstract syntax tree traversal sequence or the character sequence of a program fragment), whereas a graph neural network mainly extracts features from preprocessed graph-structured data (such as an abstract syntax tree, a control flow graph, a data dependency graph, a program dependency graph, or a code property graph). In the classification step, the classifier is usually trained on the extracted feature vectors to improve its ability to classify new data correctly.
However, as vulnerability patterns in real projects become more and more complex, current state-of-the-art methods generally use only fairly basic static analysis techniques in the preprocessing stage and fail to extract the syntax and semantic information of complex vulnerabilities. Li et al. (Li Z, Zou D, Xu S, et al. SySeVR: A framework for using deep learning to detect software vulnerabilities [J]. IEEE Transactions on Dependable and Secure Computing, 2021, 19(4): 2244-2258.), who were early users of deep learning algorithms for vulnerability detection, propose obtaining program slices on a program dependency graph, converting them into strings, extracting features with a BiLSTM (Bi-directional Long Short-Term Memory) network, and performing vulnerability detection with a multi-layer perceptron. However, because a conventional recurrent neural network accepts only sequence information, the code is simply arranged into a sequence, so that code fragments with strongly related syntax and semantics can end up far apart and semantic information cannot be passed effectively between them, which hinders recognition by the model. Some researchers have therefore tried graph neural networks for vulnerability detection. For example, Zhuang Rongfei (Zhuang Rongfei. Research on key technologies of vulnerability mining based on graph networks [D]. Harbin Institute of Technology, 2020.) converts code into a graph-structured representation and uses a graph network for feature extraction, achieving a detection effect significantly better than that of traditional machine learning methods. However, when embedding code statements into graph node vectors, such approaches adopt simple static techniques such as Word2Vec or Doc2Vec, so the whole model is constrained by the pre-trained model and cannot generalize well.
Disclosure of Invention
The invention aims to provide a source code vulnerability detection method based on data-dependency-enhanced program slicing which, by examining the code to be detected, reports the location of each vulnerability to help repair personnel locate it quickly, and reports the type of each vulnerability to help them repair it quickly.
The invention is realized by the following technical scheme: a source code vulnerability detection method based on data dependency enhancement program slices comprises the following steps:
1) Generating a program dependency graph and enhancing its data dependencies: the data dependency information and the control dependency information of the source code are obtained by analyzing the source code, a program dependency graph is constructed, and an enhancement operation is then carried out on the program dependency graph;
2) Program slicing is carried out with program slicing points of interest as slicing points to obtain sub-graphs of the program dependency graph, and the vulnerability classification label of each sub-graph is determined by whether it contains vulnerable code statements. Specifically: if the sub-graph contains a vulnerable code statement, the sub-graph is regarded as vulnerable, and its vulnerability type is the same as the label of the program dependency graph from which the sub-graph was generated; if the sub-graph does not contain a vulnerable code statement, the sub-graph is regarded as non-vulnerable;
3) Anonymizing user-defined identifiers in the source code, and then converting each semantic unit in the code into a vector using Word2Vec to form a dictionary;
4) Converting the code statements of the nodes in the sub-graphs obtained by program slicing into vector sequences according to the dictionary generated in step 3);
5) Because the original code of each node differs in length, the vector sequences of the nodes also differ in length; so that a graph neural network can be used in the subsequent steps, a gated recurrent neural network is used to embed each initial node vector sequence into a vector of uniform length;
6) Feeding the sub-graphs with embedded node vectors into a graph neural network model for training and testing, to obtain a multi-class vulnerability detection model for software source code;
7) After the source code to be detected has been processed by steps 1) to 4), inference is performed on it with the multi-class vulnerability detection model trained in step 6), completing the detection of the vulnerability type.
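The labeling rule in step 2) can be sketched in a few lines of Python. This is a minimal illustration, assuming that a sub-graph is represented by the set of source line numbers of its nodes and that the vulnerable lines of the parent program are known; the function name `label_subgraph`, the constant `SAFE_LABEL`, and the sample label `"CWE-119"` are hypothetical and do not come from the patent.

```python
from typing import Set

SAFE_LABEL = "non-vulnerable"  # hypothetical name for the non-vulnerable class


def label_subgraph(slice_lines: Set[int],
                   vulnerable_lines: Set[int],
                   parent_label: str) -> str:
    """Label a slice sub-graph.

    The sub-graph inherits the label of the program dependency graph it was
    cut from if it contains at least one vulnerable statement; otherwise it
    is labeled as non-vulnerable.
    """
    if slice_lines & vulnerable_lines:
        return parent_label
    return SAFE_LABEL


# A slice that covers vulnerable line 7 inherits the parent's label.
print(label_subgraph({3, 7, 9}, {7}, "CWE-119"))
# A slice that misses every vulnerable line is labeled non-vulnerable.
print(label_subgraph({1, 2}, {7}, "CWE-119"))
```

One consequence of this rule is that a single vulnerable program can yield both vulnerable and non-vulnerable training samples, depending on which slices happen to include the flawed statement.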
Further, in order to better realize the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted. Because of the way a traditional program dependency graph treats function call statements, data pollution behavior that occurs inside a function call statement cannot be recorded. The method therefore performs a data dependency enhancement operation on the program dependency graph: through special handling of function call statements, it corrects the inaccurate data dependencies of the traditional program dependency graph and strengthens the data dependency relationships between code statements and function call statements. The enhancement operation on the program dependency graph comprises the following specific steps:
1.1) After the program dependency graph is constructed, scanning all nodes to find the function call nodes that take a reference type or a pointer type as a parameter;
1.2) Further processing each function call node found: finding the data dependency node of the parameter, and carrying out backward slicing on the program dependency graph with that node as the starting node;
1.3) Among the nodes in the backward slice result obtained in step 1.2), selecting those whose node index (namely the corresponding code line number) is larger than the index of the function call node, establishing a data dependency relationship between each such node and the function call node, and adding the relationship to the original program dependency graph.
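Steps 1.1) to 1.3) can be sketched as follows. This is only an illustration under several assumptions that the text does not fix: nodes are identified by their line numbers; a data dependency edge `(a, b)` means that statement `b` uses data defined at `a`; the pointer or reference call nodes and the nodes their arguments depend on are supplied by the caller (`pointer_calls`); the slice from the argument's dependency node is computed by following data dependency edges toward dependent statements, since only nodes located after the call are kept; and the new edge is directed from the call to the later statement. All names are hypothetical.

```python
from collections import deque
from typing import Dict, Set, Tuple

Edge = Tuple[int, int]  # (defining line, dependent line)


def reachable(start: int, edges: Set[Edge]) -> Set[int]:
    """Return every node reachable from `start` along the given edges."""
    successors: Dict[int, Set[int]] = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def enhance_data_dependencies(data_edges: Set[Edge],
                              pointer_calls: Dict[int, Set[int]]) -> Set[Edge]:
    """Add call -> statement edges for statements located after the call.

    `pointer_calls` maps each call node that takes a pointer or reference
    argument (step 1.1) to the nodes that argument depends on (step 1.2).
    """
    enhanced = set(data_edges)
    for call, arg_nodes in pointer_calls.items():
        for arg_node in arg_nodes:
            for node in reachable(arg_node, data_edges):
                if node > call:  # step 1.3: keep only nodes after the call
                    enhanced.add((call, node))
    return enhanced


# Line 1 declares a buffer, line 2 passes it by pointer to a call,
# line 3 reads the buffer. The plain graph links 1 -> 2 and 1 -> 3 only.
edges = {(1, 2), (1, 3)}
print(sorted(enhance_data_dependencies(edges, {2: {1}})))
```

In the usage example the enhanced graph gains the edge from line 2 to line 3, so a later slice that reaches the buffer read also pulls in the call that may have modified the buffer.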
Further, in order to better realize the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted. The specific steps of the program slicing are as follows:
2.1) Carrying out ordinary forward slicing and backward slicing from the slicing point, and incorporating the slice results into the final slice result;
2.2) Identifying the conditional statement nodes in the final slice result, and carrying out forward slicing with each conditional statement node as the slicing point to search for data-dependent nodes;
2.3) Carrying out backward slicing again with the nodes in the forward slice result of step 2.2) as starting points, and incorporating into the final slice result those nodes in the slice result whose node index is larger than the index of the conditional statement node.
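The three slicing steps can be sketched as a pair of graph traversals. The sketch below is a simplified reading, assuming nodes are line numbers, `edges` holds all dependency edges as `(source, dependent)` pairs, `data_edges` is the data dependency subset, and the set of conditional statement nodes is supplied by the caller. The function names are hypothetical, and the extension is applied in a single pass over the conditionals found by step 2.1).

```python
from collections import deque
from typing import Set, Tuple

Edge = Tuple[int, int]  # (source node, dependent node)


def traverse(start: int, edges: Set[Edge], forward: bool) -> Set[int]:
    """Forward slice (follow edges) or backward slice (follow them in reverse)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for src, dst in edges:
            if not forward:
                src, dst = dst, src
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


def slice_with_conditions(point: int, edges: Set[Edge],
                          data_edges: Set[Edge],
                          conditionals: Set[int]) -> Set[int]:
    # Step 2.1: ordinary forward and backward slice from the slicing point.
    result = traverse(point, edges, True) | traverse(point, edges, False)
    # Step 2.2: forward slice along data dependencies from each conditional.
    for cond in sorted(result & conditionals):
        for dep in traverse(cond, data_edges, True) - {cond}:
            # Step 2.3: backward slice from each dependent node, keeping
            # only the nodes located after the conditional statement.
            result |= {n for n in traverse(dep, edges, False) if n > cond}
    return result


all_edges = {(1, 3), (2, 3), (3, 4), (2, 4), (3, 6), (5, 6)}
data_only = {(1, 3), (2, 3), (2, 4), (3, 6), (5, 6)}
print(sorted(slice_with_conditions(4, all_edges, data_only, {3})))
```

In the example, the ordinary slice from node 4 contains nodes 1 to 4; the extra pass through conditional node 3 adds nodes 5 and 6, which the ordinary slice would have missed.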
Further, in order to better realize the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted. A program slicing point of interest is a code statement containing a code structure that is prone to causing program vulnerabilities; specifically, it is a code statement that uses one or more of the following code structures: arithmetic expressions, pointers, arrays, and sensitive library function calls.
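As a rough illustration of the four categories, the sketch below flags statements with regular expressions. This is a deliberately crude heuristic written for illustration only: a real implementation would more plausibly inspect the parsed syntax tree, and both the patterns and the sample list of sensitive functions in `SENSITIVE_CALLS` are assumptions, not taken from the patent.

```python
import re
from typing import Set

# Hypothetical sample of sensitive C library functions.
SENSITIVE_CALLS = {"strcpy", "memcpy", "gets", "sprintf"}

# Crude textual patterns; they can misfire on complex statements.
PATTERNS = {
    "array": re.compile(r"\w+\s*\["),
    "pointer": re.compile(r"->|\*\s*\w+\s*=|&\w+"),
    "arithmetic": re.compile(r"\w\s*[+\-*/%]\s*\w"),
}

CALL_NAME = re.compile(r"\b(\w+)\s*\(")


def interest_kinds(statement: str) -> Set[str]:
    """Return the point-of-interest categories that a statement matches."""
    kinds = {name for name, pattern in PATTERNS.items()
             if pattern.search(statement)}
    if set(CALL_NAME.findall(statement)) & SENSITIVE_CALLS:
        kinds.add("sensitive_call")
    return kinds


for stmt in ("strcpy(dst, src);", "buf[i] = 0;", "len = a + b;", "int x = 0;"):
    print(stmt, "->", sorted(interest_kinds(stmt)))
```

A statement that matches no category returns an empty set and would not be used as a slicing point.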
Further, in order to better realize the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted. Embedding the initial node vector sequence into a vector of uniform length with the gated recurrent neural network comprises the following specific steps:
5.1) Manually padding or truncating the vector sequence of each node so that the vector sequences of all nodes have the same length, the length being set to 20 sequence elements;
5.2) Feeding the fixed-length vector sequence of the node into the gated recurrent neural network for feature extraction. In the gated recurrent neural network, each unit processes one sequence element and passes its information on to the next unit, so that the last unit receives the information of all preceding units. The hidden state of the last unit is taken as the embedding vector of the node, and each node is finally represented as a 256-dimensional vector;
5.3) The parameters of the gated recurrent neural network are updated when the whole network model is back-propagated.
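Step 5.1) amounts to a simple length normalization, sketched below in plain Python. Padding with zero vectors at the end of the sequence is an assumption, since the text does not say what value is used for padding or on which side; the function name `fix_length` is hypothetical. The recurrent network of step 5.2) is not shown.

```python
from typing import List

Vector = List[float]


def fix_length(vectors: List[Vector], length: int = 20) -> List[Vector]:
    """Truncate or zero-pad a node's vector sequence to `length` elements."""
    if not vectors:
        raise ValueError("a node needs at least one token vector")
    dim = len(vectors[0])
    clipped = vectors[:length]  # truncation of over-long sequences
    padding = [[0.0] * dim for _ in range(length - len(clipped))]
    return clipped + padding    # zero padding (assumed) for short sequences


short = fix_length([[1.0, 2.0]] * 3)
long_ = fix_length([[1.0, 2.0]] * 25)
print(len(short), len(long_))
```

Each node then contributes a sequence of exactly 20 vectors, which is the fixed shape that the recurrent network in step 5.2) consumes.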
Further, in order to better realize the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted. The overall architecture of the graph neural network model comprises four convolution-pooling blocks, each consisting of a graph convolution and a graph pooling, and a multi-layer perceptron. The loss function of the whole graph neural network is a cross-entropy loss function with a penalty factor, where the penalty factor is used to mitigate the effect of sample imbalance in multi-class classification.
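A cross-entropy loss with a per-class penalty factor can be sketched as below. The text does not state how the penalty factor is computed; the inverse-frequency weighting in `class_weights` is one common choice and is an assumption here, as are all function names.

```python
import math
from typing import List, Sequence


def class_weights(labels: Sequence[int], num_classes: int) -> List[float]:
    """Inverse-frequency weights (assumed scheme): rare classes weigh more."""
    counts = [0] * num_classes
    for label in labels:
        counts[label] += 1
    total = len(labels)
    return [total / (num_classes * c) if c else 0.0 for c in counts]


def weighted_cross_entropy(logits: Sequence[float], label: int,
                           weights: Sequence[float]) -> float:
    """Cross entropy of one sample, scaled by the weight of its true class."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return weights[label] * (log_sum - logits[label])


weights = class_weights([0, 0, 0, 1], num_classes=2)
print(weights)
print(weighted_cross_entropy([0.0, 0.0], 1, weights))
```

With three samples of class 0 and one of class 1, a misclassified class 1 sample is penalized three times as heavily as a class 0 sample, which counteracts the imbalance.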
Further, in order to better realize the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted. When the sub-graphs with embedded nodes are fed into the graph neural network model for training and testing, the data set is divided into a training set, a validation set, and a test set in the ratio 8:1:1. In the graph neural network, the parameters of each layer are updated with the Adam gradient descent algorithm, and the optimal hyperparameter setting is selected by ten-fold cross-validation: the learning rate is set to 0.001, the batch_size is 64, and the hidden dimension of the convolution layers is 256.
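The 8:1:1 division can be sketched with the standard library alone. Shuffling with a fixed seed and assigning any remainder to the test set are assumptions made for reproducibility of the illustration; the text only gives the ratio. The function name `split_dataset` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T],
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the samples and split them 8:1:1 into train, validation, test."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed: an assumption
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```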
Further, in order to better realize the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted. Anonymizing the user-defined identifiers in the source code specifically means: user-defined variables are uniformly normalized to 'VAR_i', where i is the order in which the corresponding variable name appears in the code, i ∈ {1, 2, ..., N}; user-defined functions are uniformly normalized to 'FUNC_i', where i is the order in which the corresponding function name appears in the code, i ∈ {1, 2, ..., M}; user-defined structures are uniformly normalized to 'TYPE_i', where i is the order in which the corresponding structure name appears in the code, i ∈ {1, 2, ..., P}.
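The renaming scheme can be sketched with a single regular-expression substitution. The sketch assumes that the lists of user-defined variable, function, and structure names have already been extracted (for example by a parser) and are given in their order of appearance; the function name `anonymize` is hypothetical.

```python
import re
from typing import Dict, Sequence


def anonymize(code: str, variables: Sequence[str],
              functions: Sequence[str], types: Sequence[str]) -> str:
    """Rename user-defined identifiers to VAR_i, FUNC_i and TYPE_i."""
    mapping: Dict[str, str] = {}
    for prefix, names in (("VAR", variables),
                          ("FUNC", functions),
                          ("TYPE", types)):
        for i, name in enumerate(names, start=1):
            mapping[name] = f"{prefix}_{i}"
    if not mapping:
        return code
    # Word boundaries keep "n" from matching inside "node" or "struct".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in mapping) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], code)


print(anonymize("struct node *head = make_list(n);",
                variables=["head", "n"],
                functions=["make_list"],
                types=["node"]))
# prints: struct TYPE_1 *VAR_1 = FUNC_1(VAR_2);
```

Keywords and library function names are left untouched because they never enter the mapping, so the vocabulary passed to Word2Vec stays small and independent of project-specific naming.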
Compared with the prior art, the invention has the following advantages:
The invention improves the program dependency graph by enhancing the data dependency relationships of the original graph, so that the program dependency graph can model the behavior of function calls on their actual arguments, identify data pollution behavior, and express more information. Because of the way a traditional program dependency graph treats function call statements, data pollution behavior that occurs inside a function call statement cannot be recorded. The invention therefore performs a data dependency enhancement operation on the program dependency graph: through special handling of function call statements, it corrects the inaccurate data dependencies of the traditional program dependency graph and strengthens the data dependency relationships between code statements and function call statements.
The invention provides a new slicing method which, by carrying out an additional program slicing operation on conditional statements, makes the final program slice sub-graph contain more information and allows the model to recognize complex condition-judgment structures. In real scenarios, many vulnerabilities arise from complex loop structures driven by dynamic factors, and their direct cause is an improperly set loop termination condition. Conditional statements are therefore highly significant for vulnerability detection, and the additional slices supplement them with more relevant information.
The invention uses a gated recurrent neural network to embed the nodes, which extracts the information of the code statements better, and it updates this network dynamically during training, so that the embedding result approaches the optimal embedding result.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a diagram of the neural network according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but embodiments of the present invention are not limited thereto.
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. The following detailed description of the embodiments, as presented in the figures, is therefore not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the invention.
Explanation of terms:
Word2Vec: a pre-trained embedding model for natural language and programming languages;
GCN: graph convolutional neural network;
Joern: the name of a specific open source tool;
PDG: program dependency graph;
joern-parse: a sub-function of the open source tool Joern;
Program Slicing Points of Interest: the points of interest used as slicing points in program slicing.
Example 1:
A source code vulnerability detection method based on data-dependency-enhanced program slicing which, by examining the code to be detected, reports the location of each vulnerability to help repair personnel locate it quickly and reports the type of each vulnerability to help them repair it quickly, comprises the following steps:
1) Generating a program dependency graph and enhancing its data dependencies: the data dependency information and the control dependency information of the source code are obtained by analyzing the source code, a program dependency graph is constructed, and an enhancement operation is then carried out on the program dependency graph;
2) Program slicing is carried out with program slicing points of interest as slicing points to obtain sub-graphs of the program dependency graph, and the vulnerability classification label of each sub-graph is determined by whether it contains vulnerable code statements. Specifically: if the sub-graph contains a vulnerable code statement, the sub-graph is regarded as vulnerable, and its vulnerability type is the same as the label of the program dependency graph from which the sub-graph was generated; if the sub-graph does not contain a vulnerable code statement, the sub-graph is regarded as non-vulnerable;
3) Anonymizing user-defined identifiers in the source code, and then converting each semantic unit in the code into a vector using Word2Vec to form a dictionary;
4) Converting the code statements of the nodes in the sub-graphs obtained by program slicing into vector sequences according to the dictionary generated in step 3);
5) Because the original code of each node differs in length, the vector sequences of the nodes also differ in length; so that a graph neural network can be used in the subsequent steps, a gated recurrent neural network is used to embed each initial node vector sequence into a vector of uniform length;
6) Feeding the sub-graphs with embedded nodes into a graph neural network model for training and testing, to obtain a multi-class vulnerability detection model for software source code;
7) After the source code to be detected has been processed by steps 1) to 4), inference is performed on it with the trained multi-class vulnerability detection model for software source code, completing the detection of the vulnerability type.
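Step 4) is essentially a dictionary lookup per token, sketched below. The tokenizer is a crude regular expression written for illustration, the tiny two-dimensional `dictionary` stands in for a trained Word2Vec vocabulary, and mapping unknown tokens to a zero vector is an assumption; none of these details are specified in the text.

```python
import re
from typing import Dict, List

# Identifiers, integer literals, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(statement: str) -> List[str]:
    """Split a code statement into semantic units (crude approximation)."""
    return TOKEN.findall(statement)


def to_vector_sequence(statement: str,
                       dictionary: Dict[str, List[float]],
                       dim: int) -> List[List[float]]:
    """Replace every token by its dictionary vector (zero vector if unknown)."""
    unknown = [0.0] * dim
    return [dictionary.get(token, unknown) for token in tokenize(statement)]


# Hypothetical 2-dimensional dictionary standing in for trained embeddings.
dictionary = {"VAR_1": [0.1, 0.2], "=": [0.3, 0.4], "FUNC_1": [0.5, 0.6]}
print(tokenize("VAR_1 = FUNC_1(VAR_2);"))
print(to_vector_sequence("VAR_1 = FUNC_1(VAR_2);", dictionary, dim=2))
```

Because identifiers were anonymized in step 3), tokens such as `VAR_1` and `FUNC_1` recur across programs, which keeps the dictionary compact.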
Example 2:
This embodiment is a further optimization of the above embodiment; features identical to the foregoing technical solution are not repeated here. To better implement the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted in particular. Because of the way a traditional program dependency graph treats function call statements, data pollution behavior that occurs inside a function call statement cannot be recorded. The method therefore performs a data dependency enhancement operation on the program dependency graph: through special handling of function call statements, it corrects the inaccurate data dependencies of the traditional program dependency graph and strengthens the data dependency relationships between code statements and function call statements. The enhancement operation on the program dependency graph comprises the following specific steps:
1.1) After the program dependency graph is constructed, scanning all nodes to find the function call nodes that take a reference type or a pointer type as a parameter;
1.2) Further processing each function call node found: finding the data dependency node of the parameter, and carrying out backward slicing on the program dependency graph with that node as the starting node;
1.3) Among the nodes in the backward slice result obtained in step 1.2), selecting those whose node index (namely the corresponding code line number) is larger than the index of the function call node, establishing a data dependency relationship between each such node and the function call node, and adding the relationship to the original program dependency graph.
Example 3:
This embodiment is a further optimization of any one of the above embodiments; features identical to the foregoing technical solution are not repeated here. To better implement the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted in particular. The specific steps of the program slicing are as follows:
2.1) Carrying out ordinary forward slicing and backward slicing from the slicing point, and incorporating the slice results into the final slice result;
2.2) Identifying the conditional statement nodes in the final slice result, and carrying out forward slicing with each conditional statement node as the slicing point to search for data-dependent nodes;
2.3) Carrying out backward slicing again with the nodes in the forward slice result of step 2.2) as starting points, and incorporating into the final slice result those nodes in the slice result whose node index is larger than the index of the conditional statement node.
Example 4:
This embodiment is a further optimization of any one of the above embodiments; features identical to the foregoing technical solution are not repeated here. To better implement the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted in particular. A program slicing point of interest is a code statement containing a code structure that is prone to causing program vulnerabilities; specifically, it is a code statement that uses one or more of the following code structures: arithmetic expressions, pointers, arrays, and sensitive library function calls.
Example 5:
This embodiment is a further refinement of any of the foregoing embodiments; features shared with the foregoing technical solutions are not repeated here. To better implement the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted: embedding the initial node vectors into vectors of uniform length with a gated recurrent neural network comprises the following specific steps:
5.1 The vector sequence of each node is manually padded or truncated so that all node vector sequences have the same length, which is set to 20 sequence elements;
5.2 The fixed-length vector sequence of each node is fed into the gated recurrent neural network for feature extraction. In this network, each unit processes one sequence element and passes its information on to the next unit, so that the last unit receives the information of all preceding units. The hidden state of the last unit is taken as the embedding vector of the node, so that each node is finally represented as a 256-dimensional vector;
5.3 The parameters of the gated recurrent neural network are updated by back-propagation together with the rest of the network model.
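Step 5.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions: padding is appended at the end of the sequence and uses zero vectors, neither of which the text specifies, and the names `pad_or_truncate`, `dim`, and `pad_value` are hypothetical.

```python
SEQ_LEN = 20  # sequence length stated in step 5.1


def pad_or_truncate(vectors, dim, seq_len=SEQ_LEN, pad_value=0.0):
    """Return exactly seq_len vectors of size dim.

    Longer sequences are cut off after seq_len elements; shorter ones are
    padded at the end with constant vectors (an assumption, see lead-in).
    """
    fixed = [list(v) for v in vectors[:seq_len]]
    for v in fixed:
        if len(v) != dim:
            raise ValueError("every token vector must have length dim")
    fixed.extend([pad_value] * dim for _ in range(seq_len - len(fixed)))
    return fixed
```

With every node reduced to a 20-element sequence, the sequences of all nodes in a batch can be stacked into one tensor before being passed to the recurrent network of step 5.2.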
Example 6:
This embodiment is a further refinement of any of the foregoing embodiments; features shared with the foregoing technical solutions are not repeated here. To better implement the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted: the overall architecture of the graph neural network model comprises four convolution-pooling blocks, each combining graph convolution and graph pooling, and a multi-layer perceptron. The loss function of the whole graph neural network is a cross-entropy loss with a penalty factor, where the penalty factor mitigates the effect of class imbalance in multi-class classification.
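The text does not give the form of the penalty factor. One common choice, assumed here purely for illustration, is a per-class weight inversely proportional to class frequency; the sketch below shows how such weights enter a cross-entropy loss. The function names and the weighting formula are assumptions, not the patent's definition.

```python
import math


def class_weights(labels, num_classes):
    """Inverse-frequency weights: rare classes get a larger penalty factor."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    total = len(labels)
    return [total / (num_classes * c) if c else 0.0 for c in counts]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-sample cross-entropy computed from raw logits."""
    total_loss = 0.0
    total_weight = 0.0
    for z, y in zip(logits, labels):
        m = max(z)  # subtract the maximum for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in z))
        total_loss += weights[y] * (log_sum - z[y])
        total_weight += weights[y]
    return total_loss / total_weight
```

For labels `[0, 0, 0, 1]` the weights come out as roughly `[0.67, 2.0]`, so a misclassified sample of the rare class contributes three times as much to the loss as one of the frequent class.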
Example 7:
This embodiment is a further refinement of any of the foregoing embodiments; features shared with the foregoing technical solutions are not repeated here. To better implement the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted: when the sub-graphs with embedded nodes are fed into the graph neural network model for training and testing, the data set is divided into a training set, a validation set, and a test set in the ratio 8:1:1. In the graph neural network, the parameters of every layer are updated with the Adam optimization algorithm, and the optimal hyperparameter setting is selected by ten-fold cross-validation: the learning rate is set to 0.001, the batch_size to 64, and the hidden dimension of the convolution layers to 256.
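The 8:1:1 division and the stated hyperparameters can be written down as follows. This is a sketch only: shuffling with a fixed seed is an assumption (the text does not say how samples are assigned), and `split_dataset` and `HYPERPARAMS` are illustrative names.

```python
import random

# Values taken from the text; the dictionary itself is just an illustrative container.
HYPERPARAMS = {
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "batch_size": 64,
    "hidden_size": 256,
}


def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle the samples and split them into train/validation/test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test
```

For multi-class data with strong imbalance, a stratified split per vulnerability class may be preferable; the text does not say whether one was used.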
Example 8:
This embodiment is a further refinement of any of the foregoing embodiments; features shared with the foregoing technical solutions are not repeated here. To better implement the source code vulnerability detection method based on data-dependency-enhanced program slicing, the following arrangement is adopted: anonymizing the user-defined identifiers in the source code specifically means: user-defined variables are uniformly normalized to "VAR_i", where i is the order in which the corresponding variable name appears in the code and i ∈ {1, 2, ..., n}; user-defined functions are uniformly normalized to "FUNC_i", where i is the order in which the corresponding function name appears in the code and i ∈ {1, 2, ..., m}; user-defined structures are uniformly normalized to "TYPE_i", where i is the order in which the corresponding structure name appears in the code and i ∈ {1, 2, ..., p}.
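The normalization rule above can be sketched as a single left-to-right pass over the identifiers. The sketch assumes that the sets of user-defined variable, function, and structure names have already been obtained from a parser; the function name `anonymize` and its parameters are hypothetical. It works on raw text, so identifiers inside string literals or comments would also be renamed.

```python
import re

IDENT_RE = re.compile(r"\b[A-Za-z_]\w*\b")


def anonymize(code, variables, functions, types):
    """Rename user-defined identifiers to VAR_i / FUNC_i / TYPE_i.

    The index i follows the order of first appearance in the code.
    Returns the rewritten code and the name mapping.
    """
    kinds = {}
    for name in variables:
        kinds[name] = "VAR"
    for name in functions:
        kinds[name] = "FUNC"
    for name in types:
        kinds[name] = "TYPE"

    counters = {"VAR": 0, "FUNC": 0, "TYPE": 0}
    mapping = {}

    def rename(match):
        name = match.group(0)
        kind = kinds.get(name)
        if kind is None:          # keywords and library names stay unchanged
            return name
        if name not in mapping:   # first appearance fixes the index
            counters[kind] += 1
            mapping[name] = f"{kind}_{counters[kind]}"
        return mapping[name]

    return IDENT_RE.sub(rename, code), mapping
```

Library calls and language keywords are left untouched because they carry the semantics that the later embedding step is meant to learn.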
Example 9:
A source code vulnerability detection method based on data-dependency-enhanced program slicing analyzes previously unseen code to obtain vulnerability-related information, helps vulnerability repair personnel locate vulnerabilities quickly, and reports the vulnerability type, thereby helping them repair vulnerabilities quickly. With reference to Figs. 1-2, the method comprises the following steps:
Training phase:
1) Generating a program dependency graph and enhancing its data dependencies: the source code file is parsed to extract data-flow and control-flow information, yielding the data dependency information and control dependency information from which a program dependency graph is constructed; data dependency enhancement is then performed on the program dependency graph. The specific steps of the enhancement operation are as follows:
1.1 After the program dependency graph is constructed, scan all nodes to find function call nodes that take a reference type or a pointer type as a parameter;
1.2 For each function call node found in step 1.1), find the data dependency node of the parameter, and perform backward slicing on the program dependency graph with that node as the starting node;
1.3 From the nodes in the backward slicing result obtained in step 1.2), select those whose node index (that is, the corresponding code line number) is larger than the index of the function call node, and establish a data dependency relationship between each of them and the function call node;
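A literal reading of steps 1.1 to 1.3 can be sketched on a toy graph. Several things here are assumptions made for illustration: data dependency edges are stored as `(definition, use)` pairs of line numbers, the pointer-taking call nodes of step 1.1 are supplied by the caller, every incoming data edge of a call is treated as belonging to its pointer parameter, and each new edge is directed from the call to the later node. The function names are hypothetical.

```python
from collections import deque


def backward_slice(data_edges, starts):
    """Start nodes plus every node they transitively depend on."""
    preds = {}
    for src, dst in data_edges:
        preds.setdefault(dst, set()).add(src)
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for p in preds.get(node, ()):
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen


def enhance_data_dependencies(data_edges, pointer_call_nodes):
    """Return the edge set extended as described in steps 1.2 and 1.3."""
    enhanced = set(data_edges)
    for call in pointer_call_nodes:
        # Step 1.2: data dependency nodes of the call's parameter.
        param_defs = {src for src, dst in data_edges if dst == call}
        sliced = backward_slice(data_edges, param_defs)
        # Step 1.3: keep nodes located after the call and link them to it.
        for node in sliced:
            if node > call:
                enhanced.add((call, node))
    return enhanced
```

As a usage example, take a loop in which line 1 is `char *p = buf;`, line 3 is `read_into(p);`, and line 4 is `p = p + 1;`, with data edges `{(1, 3), (4, 3), (1, 4)}`. Calling `enhance_data_dependencies(edges, [3])` adds the edge `(3, 4)`, recording that the call may have modified the data later used on line 4.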
2) Program slicing is performed with the program slicing interest points as slicing points (cut points) to obtain sub-graphs of the program dependency graph, and the vulnerability classification label of each sub-graph is determined by whether it contains vulnerable code. A program slicing interest point is a code statement containing a code structure that is prone to causing program vulnerabilities, specifically a code statement that uses one or more of the following code structures: arithmetic expressions, pointers, arrays, and sensitive library function calls. The label of a sub-graph is determined as follows: if the sub-graph contains a vulnerable code statement, the sub-graph is regarded as vulnerable, and its vulnerability type is the same as the label of the program dependency graph from which it was generated; if the sub-graph contains no vulnerable code statement, it is regarded as non-vulnerable. The program slicing procedure is as follows:
2.1 Perform ordinary forward slicing and backward slicing from the slicing point, and incorporate the results into the final slicing result;
2.2 Identify the conditional statement nodes in the final slicing result, and perform forward slicing with each conditional statement node as the slicing point to find its data-dependent nodes;
2.3 Taking the nodes in the forward slicing result of step 2.2) as starting points, perform backward slicing again, and incorporate into the final slicing result those nodes whose node index is larger than the index of the conditional node.
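Steps 2.1 to 2.3 can be sketched as graph reachability. The choice of edges used in each step is an assumption, since the text does not spell it out: here step 2.1 and the forward slice of step 2.2 follow both data and control dependency edges, while the backward slice of step 2.3 follows data dependency edges only. Edges are `(source, dependent)` pairs of line numbers, and all names are hypothetical.

```python
def reach(edges, starts, forward=True):
    """Nodes reachable from starts along edges (or against them if not forward)."""
    adj = {}
    for src, dst in edges:
        a, b = (src, dst) if forward else (dst, src)
        adj.setdefault(a, set()).add(b)
    seen = set(starts)
    stack = list(starts)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def enhanced_slice(data_edges, control_edges, cut_point, conditional_nodes):
    """Slice around cut_point, then extend it via conditional statements."""
    all_edges = set(data_edges) | set(control_edges)
    # Step 2.1: ordinary forward and backward slices from the slicing point.
    final = reach(all_edges, [cut_point], True) | reach(all_edges, [cut_point], False)
    # Step 2.2: forward slice from each conditional node found so far.
    for cond in sorted(final & set(conditional_nodes)):
        fwd_nodes = reach(all_edges, [cond], True)
        # Step 2.3: backward slice again, keep only nodes after the condition.
        back_nodes = reach(data_edges, fwd_nodes, False)
        final |= {n for n in back_nodes if n > cond}
    return final
```

For example, with data edges `{(1, 2), (1, 5), (3, 4)}`, control edges `{(2, 3), (2, 4), (2, 5)}`, slicing point 5, and conditional node 2, the ordinary slice is `{1, 2, 5}`, and the extension adds nodes 3 and 4, which are guarded by the same condition.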
3) Anonymize the user-defined identifiers in the source code, and then convert each semantic unit in the code into a vector using Word2Vec to form a dictionary. The specific steps are as follows:
3.1 User-defined variables are uniformly normalized to "VAR_i", where i is the order in which the corresponding variable name appears in the code and i ∈ {1, 2, ..., n}; user-defined functions are uniformly normalized to "FUNC_i", where i is the order in which the corresponding function name appears in the code and i ∈ {1, 2, ..., m}; user-defined structures are uniformly normalized to "TYPE_i", where i is the order in which the corresponding structure name appears in the code and i ∈ {1, 2, ..., p}.
3.2 After the code is tokenized, a Word2Vec model is trained on it, and the resulting pre-trained Word2Vec model serves as the dictionary.
4) Vector representation of program slices: according to the dictionary generated in step 3), convert the code statement of each node in the sub-graphs obtained by program slicing into a vector sequence;
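The conversion in step 4) amounts to tokenizing each statement and looking every token up in the dictionary. In the sketch below a plain Python `dict` stands in for the trained Word2Vec model, the tokenizer is a simple regular expression, and unknown tokens map to a zero vector; all three are assumptions made for illustration, as are the function names.

```python
import re

# Identifiers, integer literals, a few two-character operators, then any other symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|->|==|!=|<=|>=|&&|\|\||[^\sA-Za-z_0-9]")


def tokenize(statement):
    """Split a code statement into semantic units (tokens)."""
    return TOKEN_RE.findall(statement)


def statement_to_vectors(statement, dictionary, dim):
    """Map each token of a statement to its vector in the dictionary."""
    unknown = [0.0] * dim  # fallback for tokens missing from the dictionary
    return [list(dictionary.get(tok, unknown)) for tok in tokenize(statement)]
```

The resulting sequences have one vector per token, so their lengths differ from node to node, which is what motivates the fixed-length embedding of step 5).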
5) Because the code statements of the nodes differ in length, the initial node vector sequences also differ in length. So that the graph neural network can be used in the subsequent steps, the initial node vectors are embedded into vectors of uniform length with a gated recurrent neural network. The specific steps are as follows:
5.1 The vector sequence of each node is manually padded or truncated so that all node vector sequences have the same length, which is set to 20 sequence elements;
5.2 The fixed-length vector sequence of each node is fed into the gated recurrent neural network for feature extraction. In this network, each unit processes one sequence element and passes its information on to the next unit, so that the last unit receives the information of all preceding units. The hidden state of the last unit is taken as the embedding vector of the node, so that each node is finally represented as a 256-dimensional vector;
5.3 The parameters of the gated recurrent neural network are updated by back-propagation together with the rest of the network model.
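The patent does not write out the recurrence used in step 5.2. Assuming the gated recurrent neural network is a standard gated recurrent unit (GRU), the update for the t-th sequence element $x_t$ with previous hidden state $h_{t-1}$ would take the following commonly used form, where $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and the weight matrices $W$, $U$ and biases $b$ are the trainable parameters mentioned in step 5.3. The symbols are introduced here for illustration and do not appear in the original.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\qquad t = 1, \dots, 20 \\
e_{\text{node}} &= h_{20} \in \mathbb{R}^{256}
\end{aligned}
```

Under this reading, the node embedding $e_{\text{node}}$ is simply the hidden state after the last of the 20 sequence elements, which matches the statement that the hidden state of the last unit is taken as the 256-dimensional node vector.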
6) The sub-graphs with embedded nodes are fed into a graph neural network model for training and testing to obtain a multi-class vulnerability detection model for software source code. The overall architecture of the graph neural network model comprises four convolution-pooling blocks, each combining graph convolution and graph pooling, and a multi-layer perceptron. The loss function of the whole graph neural network is a cross-entropy loss with a penalty factor, where the penalty factor mitigates the effect of class imbalance in multi-class classification. During training and testing, the data set is divided into a training set, a validation set, and a test set in the ratio 8:1:1. The optimal hyperparameter setting of the network is selected by ten-fold cross-validation.
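The ten-fold cross-validation used for hyperparameter selection can be sketched as follows. The text does not say how cross-validation is combined with the 8:1:1 division, so this sketch only shows the generic procedure; `evaluate` is a hypothetical callback that trains a model with the given hyperparameters on the training indices and returns a validation score where higher is better.

```python
def k_fold_indices(n_samples, k=10):
    """Yield k (train_indices, validation_indices) pairs covering all samples."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, val))
        start += size
    return folds


def select_hyperparameters(candidates, evaluate, n_samples, k=10):
    """Return the candidate with the best mean validation score over k folds."""
    best, best_score = None, float("-inf")
    for params in candidates:
        scores = [evaluate(params, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best, best_score = params, mean_score
    return best, best_score
```

In practice the sample order would be shuffled once before the folds are formed, so that each fold contains a mix of vulnerability classes.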
Detection phase:
7) After the source code to be detected has been processed through steps 1) to 4), the trained multi-class vulnerability detection model for software source code performs inference and prediction on it, thereby completing the detection of the vulnerability type.
The foregoing is only a preferred embodiment of the present invention and does not limit the present invention in any way; any simple modification or equivalent variation of the above embodiment made according to the technical essence of the present invention falls within the scope of protection of the present invention.

Claims (8)

1. A source code vulnerability detection method based on data-dependency-enhanced program slicing, characterized by comprising the following steps:
1) Generating a program dependency graph and enhancing its data dependencies: the source code is parsed to obtain its data dependency information and control dependency information, a program dependency graph is constructed, and an enhancement operation is then performed on the program dependency graph;
2) Performing program slicing with program slicing interest points as slicing points (cut points) to obtain sub-graphs of the program dependency graph, the vulnerability classification label of each sub-graph being determined by whether it contains vulnerable code statements;
3) Anonymizing the user-defined identifiers in the source code, and then converting each semantic unit in the code into a vector using Word2Vec to form a dictionary;
4) Converting the code statements of the nodes in the sub-graphs obtained by program slicing into vector sequences according to the dictionary generated in step 3);
5) Embedding the initial node vectors into vectors of uniform length with a gated recurrent neural network;
6) Feeding the sub-graphs with embedded node vectors into a graph neural network model for training and testing to obtain a multi-class vulnerability detection model for software source code;
7) After the source code to be detected has been processed through steps 1) to 4), performing inference and prediction on it with the multi-class vulnerability detection model trained in step 6), thereby completing the detection of the vulnerability type.
2. The source code vulnerability detection method based on data-dependency-enhanced program slicing according to claim 1, characterized in that the specific steps of the enhancement operation on the program dependency graph comprise:
1.1 After the program dependency graph is constructed, scanning all nodes to find function call nodes that take a reference type or a pointer type as a parameter;
1.2 For each function call node found, finding the data dependency node of the parameter, and performing backward slicing on the program dependency graph with that node as the starting node;
1.3 From the nodes in the backward slicing result obtained in step 1.2), selecting those whose node index is larger than the index of the function call node, establishing a data dependency relationship between each of them and the function call node, and adding the data dependency relationship to the original program dependency graph.
3. The source code vulnerability detection method based on data-dependency-enhanced program slicing according to claim 1, characterized in that the specific steps of the program slicing are:
2.1 Performing ordinary forward slicing and backward slicing from the slicing point, and incorporating the results into the final slicing result;
2.2 Identifying the conditional statement nodes in the final slicing result, and performing forward slicing with each conditional statement node as the slicing point to find its data-dependent nodes;
2.3 Taking the nodes in the forward slicing result of step 2.2) as starting points, performing backward slicing again, and incorporating into the final slicing result those nodes whose node index is larger than the index of the conditional node.
4. The source code vulnerability detection method based on data-dependency-enhanced program slicing according to claim 1, characterized in that a program slicing interest point is a code statement that uses one or more of the following code structures: arithmetic expressions, pointers, arrays, and sensitive library function calls.
5. The source code vulnerability detection method based on data-dependency-enhanced program slicing according to claim 1, characterized in that embedding the initial node vectors into vectors of uniform length with a gated recurrent neural network comprises the following specific steps:
5.1 The vector sequence of each node is manually padded or truncated so that all node vector sequences have the same length, which is set to 20 sequence elements;
5.2 The fixed-length vector sequence of each node is fed into the gated recurrent neural network for feature extraction; in this network, each unit processes one sequence element and passes its information on to the next unit, so that the last unit receives the information of all preceding units; the hidden state of the last unit is taken as the embedding vector of the node, so that each node is finally represented as a 256-dimensional vector;
5.3 The parameters of the gated recurrent neural network are updated by back-propagation together with the rest of the network model.
6. The source code vulnerability detection method based on data-dependency-enhanced program slicing according to claim 1, characterized in that the overall architecture of the graph neural network model comprises four convolution-pooling blocks, each combining graph convolution and graph pooling, and a multi-layer perceptron; the loss function of the whole graph neural network is a cross-entropy loss with a penalty factor, where the penalty factor mitigates the effect of class imbalance in multi-class classification.
7. The source code vulnerability detection method based on data-dependency-enhanced program slicing according to claim 1, characterized in that, when the sub-graphs with embedded nodes are fed into the graph neural network model for training and testing, the data set is divided into a training set, a validation set, and a test set in the ratio 8:1:1; the optimal hyperparameter setting of the graph neural network is selected by ten-fold cross-validation.
8. The source code vulnerability detection method based on data-dependency-enhanced program slicing according to claim 1, characterized in that anonymizing the user-defined identifiers in the source code specifically means: user-defined variables are uniformly normalized to "VAR_i", where i is the order in which the corresponding variable name appears in the code and i ∈ {1, 2, ..., n}; user-defined functions are uniformly normalized to "FUNC_i", where i is the order in which the corresponding function name appears in the code and i ∈ {1, 2, ..., m}; user-defined structures are uniformly normalized to "TYPE_i", where i is the order in which the corresponding structure name appears in the code and i ∈ {1, 2, ..., p}.
CN202310982855.8A 2023-08-07 2023-08-07 Source code vulnerability detection method based on data dependency enhancement program slice Active CN116702160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310982855.8A CN116702160B (en) 2023-08-07 2023-08-07 Source code vulnerability detection method based on data dependency enhancement program slice

Publications (2)

Publication Number Publication Date
CN116702160A true CN116702160A (en) 2023-09-05
CN116702160B CN116702160B (en) 2023-11-10

Family

ID=87841859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310982855.8A Active CN116702160B (en) 2023-08-07 2023-08-07 Source code vulnerability detection method based on data dependency enhancement program slice

Country Status (1)

Country Link
CN (1) CN116702160B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117742769A (en) * 2024-02-19 2024-03-22 浙江金网信息产业股份有限公司 Source code intelligent analysis engine based on information creation rule base

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6029002A (en) * 1995-10-31 2000-02-22 Peritus Software Services, Inc. Method and apparatus for analyzing computer code using weakest precondition
US20070089075A1 (en) * 2005-10-14 2007-04-19 David Ward Method for optimizing integrated circuit device design and service
US20080201629A1 (en) * 2007-02-20 2008-08-21 International Business Machines Corporation Method and system for detecting synchronization errors in programs
US20090249307A1 (en) * 2008-03-26 2009-10-01 Kabushiki Kaisha Toshiba Program analysis apparatus, program analysis method, and program storage medium
US20120254827A1 (en) * 2009-09-14 2012-10-04 The Mathworks, Inc. Verification of computer-executable code generated from a model
US9378377B1 (en) * 2013-03-13 2016-06-28 Hrl Laboratories, Llc System for information flow security inference through program slicing
CN106844218A (en) * 2017-02-13 2017-06-13 南通大学 A kind of evolution influence collection Forecasting Methodology based on section of developing
CN109726120A (en) * 2018-12-05 2019-05-07 北京计算机技术及应用研究所 A kind of software defect confirmation method based on machine learning
CN111783100A (en) * 2020-06-22 2020-10-16 哈尔滨工业大学 Source code vulnerability detection method for code graph representation learning based on graph convolution network
CN112699377A (en) * 2020-12-30 2021-04-23 哈尔滨工业大学 Function-level code vulnerability detection method based on slice attribute graph representation learning
CN114861194A (en) * 2022-05-13 2022-08-05 兰州交通大学 Multi-type vulnerability detection method based on BGRU and CNN fusion model
US20220300615A1 (en) * 2021-02-12 2022-09-22 Tata Consultancy Services Limited Method and system for identifying security vulnerabilities
CN115357904A (en) * 2022-07-29 2022-11-18 南京航空航天大学 Multi-class vulnerability detection method based on program slice and graph neural network
CN115495755A (en) * 2022-11-15 2022-12-20 四川大学 Codebert and R-GCN-based source code vulnerability multi-classification detection method
CN115982053A (en) * 2023-01-17 2023-04-18 城云科技(中国)有限公司 Method, device and application for detecting software source code defects

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
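夏之阳;易平;杨涛;: "Static vulnerability detection based on neural networks and code similarity", 计算机工程, no. 12, pages 141 - 146 *
宋子韬; 胡勇: "Research on source code vulnerability detection methods based on graph neural networks", 通信技术, vol. 55, no. 5, pages 640 - 645 *
梁树彬; 郑力; 钟杰; 胡勇: "Source code vulnerability detection model based on convolutional neural networks", 通信技术, vol. 55, no. 4, pages 493 - 499 *
王正;胡勇;杨浩天;: "Research on JPEG image steganalysis methods based on convolutional neural networks", 现代计算机, no. 15, pages 117 - 120 *
郝学姣;汤小春;: "Dynamic slicing method for concurrent programs based on dependency identification", 微电子学与计算机, no. 07, pages 206 - 209 *
郭婧;吴军华;: "Clone detection based on program dependency graphs and its improvement", 计算机工程与设计, vol. 33, no. 12, pages 595 - 600 *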

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117742769A (en) * 2024-02-19 2024-03-22 浙江金网信息产业股份有限公司 Source code intelligent analysis engine based on information creation rule base
CN117742769B (en) * 2024-02-19 2024-04-30 浙江金网信息产业股份有限公司 Source code intelligent analysis engine based on information creation rule base

Also Published As

Publication number Publication date
CN116702160B (en) 2023-11-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant