CN113420296B - C source code vulnerability detection method based on Bert model and BiLSTM - Google Patents


Info

Publication number
CN113420296B
CN113420296B (application CN202110770650.4A)
Authority
CN
China
Prior art keywords
code
model
vulnerability
data
bert model
Prior art date
Legal status
Active
Application number
CN202110770650.4A
Other languages
Chinese (zh)
Other versions
CN113420296A (en)
Inventor
马之力
马宏忠
李志茹
张学军
盖继扬
杨启帆
赵红
张驯
弥海峰
谭任远
李玺
朱小琴
白万荣
杨勇
魏峰
龚波
杨凡
高丽娜
Current Assignee
STATE GRID GANSU ELECTRIC POWER RESEARCH INSTITUTE
State Grid Gansu Electric Power Co Ltd
Lanzhou Jiaotong University
Original Assignee
STATE GRID GANSU ELECTRIC POWER RESEARCH INSTITUTE
State Grid Gansu Electric Power Co Ltd
Lanzhou Jiaotong University
Priority date
Filing date
Publication date
Application filed by STATE GRID GANSU ELECTRIC POWER RESEARCH INSTITUTE, State Grid Gansu Electric Power Co Ltd, Lanzhou Jiaotong University
Priority to CN202110770650.4A priority Critical patent/CN113420296B/en
Publication of CN113420296A publication Critical patent/CN113420296A/en
Application granted granted Critical
Publication of CN113420296B publication Critical patent/CN113420296B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F21/563 Static detection by source code analysis
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G06F8/42 Compilation: syntactic analysis
    • G06F8/436 Compilation: semantic checking
    • G06N3/044 Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N3/08 Neural networks: learning methods
    • G06F2221/033 Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Virology (AREA)
  • Debugging And Monitoring (AREA)
  • Stored Programmes (AREA)

Abstract

A C source code vulnerability detection method based on a Bert model and a BiLSTM constructs a control dependency graph and a data dependency graph by analyzing the software source code, slices the code according to the control and data dependencies among statements to generate slice-level code blocks, then cleans and preprocesses the generated code blocks, and labels each code block to indicate whether it contains vulnerability information. Next, the processed code blocks are used as a training set to fine-tune the standard Bert pre-training model, yielding a new Bert model. The code blocks are input into the new Bert model, which learns the semantic information and context of the code in an unsupervised manner and performs word-embedding encoding on the code blocks to obtain word vectors that preserve as much of the code's semantics and context as possible. Finally, the obtained word vectors are input into a BiLSTM to train the detection model, producing a source code vulnerability detection model. The invention can improve the accuracy of vulnerability detection and reduce the false alarm rate.

Description

C source code vulnerability detection method based on Bert model and BiLSTM
Technical Field
The invention relates to software source code vulnerability detection methods, in particular to a C source code vulnerability detection method based on a Bert model and a BiLSTM.
Background
Most network attack security incidents today exploit the various software vulnerabilities present in device software. A software vulnerability is a software defect caused by factors such as technical problems and insufficient developer experience during the development stage, and the defect persists throughout software deployment and operation. An attacker can therefore exploit such a defect at any time and from anywhere, using vulnerability exploitation tools to attack the target system, escalate to administrator privileges, and obtain system data and command-and-control authority, thereby disrupting the normal operation of the system or obtaining economic benefits.
The existing relatively mature vulnerability mining technologies mainly include binary vulnerability detection, pattern-matching-based vulnerability detection, and code-similarity-based vulnerability detection. Binary-based detection converts the source code into a binary stream and analyzes that stream, so most semantic and logical information is lost, which hinders vulnerability detection. Pattern-matching-based detection first requires an expert to manually define a vulnerability pattern and then searches the source code for code segments matching the defined pattern to determine whether a vulnerability exists. Code-similarity-based detection judges whether a program under test contains a vulnerability by analyzing the similarity between code fragments; it can effectively detect vulnerabilities introduced by code copying and code cloning, but has clear limitations for other vulnerability types. In all three methods, vulnerability characteristics are defined by experienced experts who set detection rules for checking the code. Their biggest drawbacks are low detection efficiency, strong dependence on subjective judgment, and the inability to scale to batch detection. Automated vulnerability detection has therefore become the current trend. Deep-learning-based methods can learn vulnerability features from source code through multilayer neural networks without manually defined vulnerability patterns, overcoming the traditional methods' inability to batch-process and modularize, and improving detection efficiency. Li et al. (Z. Li, D. Zou, S. Xu, et al., SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities, 2018) first proposed building a vulnerability detection model on software source code with deep learning; their model converts processed vulnerability code blocks into word vectors using word2vec and then feeds those vectors into a deep-learning network for training. Patent CN201910447971 proposes learning vulnerability features with deep learning to avoid manual feature definition and improve detection efficiency, but it uses the Skip-Gram model of word2vec in the word-vector generation stage, which can only encode individual words and cannot encode their context. Semantic information is therefore inevitably lost when generating word vectors, so the trained model's detection accuracy is low.
The word-vector conversion process of the source code vulnerability detection method proposed in patent CN202011046243 is the same as that of patent CN201910447971: source code containing vulnerabilities is likewise converted to vector representations with word2vec, so semantic information is again lost during word-vector generation and the model's detection accuracy suffers. Patent CN201911363149 uses semi-supervised learning, taking labeled and unlabeled data together as training data and feeding code elements directly into an ELMo model to predict whether the source code contains vulnerability information. Although this saves code-processing time, the ELMo model requires per-layer parameters to be set downstream of training, can only encode individual words, and lacks a negative-sampling process, so high detection accuracy cannot be guaranteed. Patent CN202010576421 proposes a source code vulnerability detection method based on a graph convolutional neural network: it obtains the code property graph of the source code, builds a code slice graph structure from vulnerability features, learns a vector representation for each graph node with a graph convolutional network, and trains a detection model. However, when the code structure is complex, the generated code property graph is highly complex and the graph convolutional network cannot effectively learn complex graph nodes, so detection accuracy remains low.
Disclosure of Invention
The invention provides a C source code vulnerability detection method based on a Bert model and a BiLSTM, which improves vulnerability detection accuracy and reduces the false alarm rate.
The technical scheme adopted by the invention is as follows:
A C source code vulnerability detection method based on a Bert model and a BiLSTM mainly comprises the following steps:
Step A: generate program slices. Based on the software source code, use the Joern tool to generate the corresponding program dependency graph (PDG) and abstract syntax tree (AST), where the PDG comprises the control dependency graph (CDG) and data dependency graph (DDG) between code statements, and the AST contains the syntax information of the program statements; slice the code according to the vulnerability rules, using the CDG and the data dependency information in the DDG, to finally obtain the desired program slices;
Step B: generate the symbolic representation of the code. Clean and format the original program slices obtained in step A, i.e., rename the identifiers other than keywords and operators, the user-defined function names, and the variable names appearing in the code block according to a custom naming rule, and convert each line of code in the processed block into an ordered subset of single characters, obtaining the symbolic representation of the code;
Step C: fine-tune the standard Bert model. Use the code symbol representations obtained in step B as the training set of the Bert model, convert the training data one by one into the input format required by the Bert model, and feed it into the standard Bert model for fine-tuning, obtaining through fine-tuning a new Bert model better suited to the local data set;
Step D: generate the vector representations corresponding to the code representations. Pre-train on the code representation set obtained in step B with the new Bert model fine-tuned in step C, converting it into a set of vector representations rich in semantic information and context, to be input into the neural network for model training;
Step E: train the software source code vulnerability detection model. Design a double-layer BiLSTM network and use the word vectors produced by the Bert model pre-training in step D as its training data, obtaining a vulnerability detection model based on software source code.
Wherein step A further comprises:
A1: construct the program dependency graph and abstract syntax tree with the public Joern tool, search the abstract syntax tree for code statements matching the vulnerability syntax rules, and construct an ordered code statement set;
A2: based on the control dependency and data dependency information in the program dependency graph, slice the ordered code statement set to obtain slice-level code blocks, match the code blocks against the vulnerability code blocks in an existing vulnerability library, and label code blocks containing vulnerability features with 1 and code blocks not containing them with 0.
Wherein step B further comprises:
b1: the code block cleaning process is represented as: the obtained code block with the label consists of ordered multi-line code statements; deleting irrelevant character string information, code comments and non-ASCII characters which do not cause the vulnerability from each line of code statements, simultaneously reserving symbols with grammar information such as quotation marks and brackets, and reserving semicolons to distinguish each line of codes;
b2: the code block formatting process is expressed as: user-defined variable names in the code block are renamed as VAR1 and VAR2 in sequence, and user-defined function names are renamed as FUN1 and FUN2 in sequence; where VAR and FUN are used to distinguish between a function and a variable, 1 and 2 represent the order of the variable or function in a code block; these mappings are performed in a one-to-one manner; after the named replacement of each code block is completed, the functions and variable indices are re-counted so that multiple variables and functions may be mapped to the same symbolic name when they appear in different code blocks.
Wherein step C further comprises:
c1: the format for converting the training set into the Bert model input is represented as follows: firstly, reading an obtained code block, and forming a single-line list by a plurality of lines of codes by taking a semicolon as a boundary to obtain a data set with a format of Text and Label, wherein the Text represents a code set, and the Label represents a Label corresponding to the code set; secondly, word embedding is carried out on the Text by adopting a self-contained cache of the Bert, a code block is expressed into a single character set, and word embedding expression of the Text, namely a Tokens array is generated, wherein the Tokens array corresponding to each piece of data takes [ CLS ] as a Text beginning mark and takes [ SEP ] as an ending expression;
c2: defining the maximum length max _ seq _ length of a sequence after characters in a Tokens array are marked, mapping each character in the obtained Tokens array to a corresponding id, and generating an input _ ids array; each character in input _ ids is replaced by a unique id, e.g., [ CLS ] is replaced by 101; and meanwhile, obtaining a corresponding segment _ ids array with the length of max _ seq _ length based on the Tokens array. And finally, the generated input _ ids and segment _ ids are used as the input of the Bert model and used for fine adjustment of the Bert model.
Wherein step D further comprises:
d1: and C, the vector representation set is the modified Bert model obtained in the step C, and the Bert model encodes input information to obtain word vector data corresponding to the original code representation and is used as the input of the BilSTM deep learning network.
Wherein step E further comprises: hold out 20% of the training data to test the vulnerability detection model, and evaluate the model's detection performance using detection accuracy and training loss as evaluation indices.
After the code slices are obtained, the slices are first cleaned and formatted; next, the code slices are character-encoded and position-encoded using the Bert model's vocabulary and word-embedding techniques to obtain a character-mapping array and a position-mapping array; the fine-tuned Bert model then learns the contextual, semantic, and positional relationships within the code slices in an unsupervised manner and generates vectors carrying semantic information, which are used as the input of the BiLSTM model for training. Because the semantic information and context of the code statements are well preserved during word-vector conversion and position encoding is introduced, the proposed method effectively improves the accuracy of the source code detection model and reduces the false alarm rate.
Drawings
FIG. 1 is a flow chart of the training of a source code vulnerability detection model provided by the present invention;
FIG. 2 is a system architecture diagram of a deep learning-based source code vulnerability detection model according to the present invention;
FIG. 3 is a schematic diagram of the source code program slice extraction and processing process of the present invention;
FIG. 4 is a diagram illustrating the Bert model's word-vector conversion process for code slices according to the present invention.
Detailed Description
The present invention will be further described with reference to the following embodiments.
A C source code vulnerability detection method based on a Bert model (Bidirectional Encoder Representations from Transformers) and a BiLSTM (Bi-directional Long Short-Term Memory network) mainly comprises the following steps:
Step A: generate program slices. Based on the software source code, use the Joern tool to generate the corresponding program dependency graph (PDG) and abstract syntax tree (AST), where the PDG comprises the control dependency graph (CDG) and data dependency graph (DDG) between code statements, and the AST contains the syntax information of the program statements; slice the code according to the vulnerability rules, using the CDG and the data dependency information in the DDG, to finally obtain the required program slices.
Step B: generate the symbolic representation of the code. Clean and format the original program slices obtained in step A, i.e., rename the identifiers other than keywords (such as int and char) and operators (such as + and %), together with the user-defined function names and variable names appearing in the code block, according to a custom naming rule, and convert each line of code in the processed block into an ordered subset of single characters, obtaining the symbolic representation of the code.
Step C: fine-tune the standard Bert model. Use the code symbol representations obtained in step B as the training set of the Bert model, convert the training data one by one into the input format required by the Bert model, and feed it into the standard Bert model for fine-tuning, obtaining through fine-tuning a new Bert model better suited to the local data set;
Step D: generate the vector representations corresponding to the code representations. Pre-train on the code representation set obtained in step B with the new Bert model fine-tuned in step C, converting it into a set of vector representations rich in semantic information and context, to be input into the neural network for model training;
Step E: train the software source code vulnerability detection model. Design a double-layer BiLSTM network and use the word vectors produced by the Bert model pre-training in step D as its training data, obtaining a vulnerability detection model based on software source code.
The implementation of the present invention will be described in detail using CWE-399 (resource management error vulnerabilities) and CWE-119 (buffer overflow vulnerabilities), covering code block generation, symbolic mapping of the code data, fine-tuning of the Bert model, word-vector conversion, and model training. FIG. 1 presents the construction process of the proposed source code vulnerability detection model, and FIG. 2 presents the framework and operating principle of the proposed detection model.
1. Obtain the training data set (labeled code blocks).
Step 1-1: generate the program dependency graph and abstract syntax tree of the source code. Use the Joern tool to obtain the program dependency graph (PDG) and abstract syntax tree (AST) corresponding to the software source code, where the PDG comprises the data dependency graph (DDG) and control dependency graph (CDG) of the code statements, and the AST contains the syntax information of the program statements.
Step 1-2: generate code blocks. Based on the vulnerability syntax rules (API/function calls; definition and use of arrays, pointers, and expressions), find the code statements in the abstract syntax tree that match those rules. As shown in FIG. 3 (a) and (b), the API call strncpy(b, a, n) matching a vulnerability syntax rule is found in the source code and, using the control and data dependencies in the program dependency graph, the forward and backward code statements that may carry vulnerability semantics are cut out to form the code block shown in FIG. 3 (c). Each code block is then matched against the vulnerability code blocks in an existing vulnerability library: blocks containing vulnerability features are labeled 1 and blocks without them are labeled 0, so that they can later be fed into the neural network for supervised classification and training of the vulnerability detection model.
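To make the slicing step concrete, the following minimal Python sketch extracts a combined backward and forward slice around a criterion statement, assuming the Joern-exported program dependency graph has been loaded as a networkx directed graph. The graph layout, the "line" node attribute, and the function names are illustrative assumptions, not the patent's actual implementation.

```python
import networkx as nx

def extract_slice(pdg: nx.DiGraph, criterion: str) -> list:
    """Collect the backward and forward slice around a criterion statement,
    e.g. a strncpy call matched by a vulnerability syntax rule."""
    backward = nx.ancestors(pdg, criterion)    # statements the criterion depends on
    forward = nx.descendants(pdg, criterion)   # statements that depend on the criterion
    sliced = backward | forward | {criterion}
    # Keep the original statement order so the code block stays readable.
    return sorted(sliced, key=lambda n: pdg.nodes[n].get("line", 0))

# Hypothetical usage: nodes carry a "line" attribute with the source line number.
# block = extract_slice(pdg, "stmt_strncpy_17")
```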
2. Perform data cleaning and formatting on the obtained code blocks.
Step 2-1: code block cleaning: the resulting labeled code block consists of ordered, multi-line code statements. From each statement, delete string contents, code comments, non-ASCII characters, and other information irrelevant to the vulnerability; keep symbols that carry syntactic information, such as quotation marks and brackets; and keep the semicolon to delimit each line of code (in C and C++, every statement ends with a semicolon).
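A rough sketch of this per-line cleaning is given below; the exact regular expressions, and what counts as irrelevant string content, are assumptions.

```python
import re

def clean_statement(stmt: str) -> str:
    """Per-line cleaning as described in step 2-1 (details are assumptions):
    strip comments and non-ASCII characters, keep quotes, brackets, and the
    trailing semicolon that delimits each C statement."""
    stmt = re.sub(r"/\*.*?\*/", "", stmt, flags=re.S)  # remove block comments
    stmt = re.sub(r"//.*", "", stmt)                   # remove line comments
    stmt = re.sub(r'"[^"]*"', '""', stmt)              # empty string literals, keep the quotes
    stmt = stmt.encode("ascii", "ignore").decode()     # drop non-ASCII characters
    return stmt.strip()

# clean_statement('strncpy(b, a, n); // copy user input') -> 'strncpy(b, a, n);'
```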
Step 2-2: code block formatting: user-defined variable names in the code block are renamed "VAR1", "VAR2", etc. in order, and user-defined function names are renamed "FUN1", "FUN2", etc. in order, where "VAR" and "FUN" distinguish functions from variables and "1" and "2" give the order of the variable or function within the code block. As shown in FIG. 3 (c), a[10], b[10], and n are renamed VAR1[10], VAR2[10], and VAR3 in turn, and strncpy(b, a, n) becomes FUN1(VAR1, VAR2, VAR3); these mappings are one-to-one. After each code block has been renamed, the function and variable indices are reset, so variables and functions appearing in different code blocks may map to the same symbolic name.
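A minimal sketch of this renaming follows. The reserved-word list is abridged and the regular expression is an assumption; note that, matching the FIG. 3 example, library calls such as strncpy are renamed here as well.

```python
import re

# Abridged reserved list; a full implementation would cover every C keyword.
RESERVED = {"int", "char", "void", "float", "double", "if", "else", "for",
            "while", "return", "sizeof", "struct", "break", "continue"}

def symbolize(block_lines):
    """Step 2-2 sketch: map variables to VAR1, VAR2, ... and called functions
    to FUN1, FUN2, ..., one-to-one within a single code block."""
    var_map, fun_map = {}, {}

    def rename(match):
        name, paren = match.group(1), match.group(2)
        if name in RESERVED:
            return match.group(0)
        if paren:  # identifier followed by '(' -> treat as a function call
            fun_map.setdefault(name, "FUN%d" % (len(fun_map) + 1))
            return fun_map[name] + paren
        var_map.setdefault(name, "VAR%d" % (len(var_map) + 1))
        return var_map[name]

    return [re.sub(r"\b([A-Za-z_]\w*)\b(\s*\()?", rename, line)
            for line in block_lines]

# symbolize(["char a[10];", "char b[10];", "strncpy(b, a, n);"])
# -> ["char VAR1[10];", "char VAR2[10];", "FUN1(VAR2, VAR1, VAR3);"]
```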
3. Pre-train on the training data set with the Bert model to generate vector representations.
Step 3-1: fine-tune the Bert model on the data set so that the model better fits the data. First, convert the training set into the Bert input format: read the code blocks and, taking semicolons as boundaries, arrange the multi-line code into a single-line list, producing a data set in (Text, Label) format, where Text is the code and Label is its corresponding label. Second, word-embed the Text with Bert's built-in vocabulary, expressing each code block as a set of single characters and generating the word-embedding representation of the Text, i.e., a Tokens array. As shown in FIG. 4, the Tokens array of each record begins with [CLS] and ends with [SEP]. Define the maximum sequence length max_seq_length for the tokenized sequence, map each character in the Tokens array to its corresponding id, and generate the input_ids array; if the sequence is longer than max_seq_length, keep only the first max_seq_length entries, otherwise pad the sequence with max_seq_length - len(input_ids) zeros. Each character in input_ids is replaced by a unique id, e.g., [CLS] is replaced by 101. Meanwhile, derive from the Tokens array the corresponding segment_ids array of length max_seq_length; segment_ids uses embedding information to distinguish the position of each sentence in the text, and the invention uses it to mark the positional information of each character within each line of code. The generated input_ids and segment_ids serve as the Bert model's input for fine-tuning, yielding the new Bert model.
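The encoding of one symbolized code block into input_ids and segment_ids might look as follows. The patent does not name a tokenization library; the HuggingFace transformers tokenizer and the bert-base-uncased checkpoint used here are assumptions.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
max_seq_length = 128  # illustrative value for max_seq_length

text = "char VAR1 [ 10 ] ; char VAR2 [ 10 ] ; FUN1 ( VAR2 , VAR1 , VAR3 ) ;"
tokens = ["[CLS]"] + tokenizer.tokenize(text)[:max_seq_length - 2] + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)  # e.g. [CLS] -> 101

# Alternate the segment id at each semicolon so adjacent statements are
# distinguishable (standard Bert supports two segment types; the patent's
# per-statement positional use of segment_ids is approximated here).
segment_ids, seg = [], 0
for tok in tokens:
    segment_ids.append(seg)
    if tok == ";":
        seg = 1 - seg

# Zero-pad both arrays to max_seq_length, as described above.
pad = max_seq_length - len(input_ids)
input_ids += [0] * pad
segment_ids += [0] * pad
```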
Step 3-2: as shown in FIG. 4, and similarly to step 3-1, the training set is first converted into the (input_ids, segment_ids) format required by the Bert model and fed into the fine-tuned Bert model, which encodes it into the word-vector data corresponding to the original code representations. The Bert model learns the context and semantic information of the text through the Transformer feature-extraction architecture and its self-attention mechanism. Moreover, the input segment_ids carry the positional information of individual statements: since the same character may express different semantic and syntactic information in different statements, characters appearing in different positions receive different vector representations. This preserves the context and syntactic information of the code statements to the greatest extent, avoids the loss of code semantics and code structure during training of the vulnerability detection model, and effectively improves the model's accuracy.
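A sketch of this encoding step with PyTorch and the transformers BertModel API (both assumptions, as is the local checkpoint path) is shown below; it continues from the arrays built in the previous sketch.

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("./bert-finetuned")  # hypothetical local path
model.eval()

ids = torch.tensor([input_ids])    # arrays from the previous sketch
segs = torch.tensor([segment_ids])
mask = (ids != 0).long()           # 1 for real tokens, 0 for zero padding

with torch.no_grad():
    outputs = model(input_ids=ids, token_type_ids=segs, attention_mask=mask)

# One 768-dimensional vector per token; this sequence is the BiLSTM's input.
word_vectors = outputs.last_hidden_state  # shape (1, max_seq_length, 768)
```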
4. Use the improved BiLSTM model to realize the software source code vulnerability detection model.
Step 4-1: use the word vectors produced by the Bert pre-training in step 3-2 as the training data of the improved BiLSTM deep-learning network, and train it to obtain the vulnerability detection model for software source code. Because the Bert word-vector training network already has some classification capability, no overly complicated downstream network is needed; the invention designs a single-layer BiLSTM model. The loss function is binary cross-entropy, the optimizer is Adamax, the learning rate is 0.0001, and the number of input neurons is 128 (since the Bert output vectors are 768-dimensional, the input vectors must be reshaped before being fed in). After training, 20% of the training data is held out to test the vulnerability detection model, and its detection performance is evaluated using detection accuracy and training loss as indices.
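A Keras realization of this classifier might look as follows. The loss, optimizer, learning rate, unit count, input dimension, and 80/20 split follow the text above; the framework choice and exact layer layout are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

max_seq_length = 128  # must match the Bert encoding step

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_seq_length, 768)),  # one 768-dim Bert vector per token
    layers.Bidirectional(layers.LSTM(128)),       # single BiLSTM layer, 128 units
    layers.Dense(1, activation="sigmoid"),        # label 1 = vulnerable, 0 = safe
])
model.compile(
    loss="binary_crossentropy",
    optimizer=tf.keras.optimizers.Adamax(learning_rate=1e-4),
    metrics=["accuracy"],
)

# Hold out 20% of the labelled vectors for testing, e.g.:
# model.fit(train_vectors, train_labels, validation_split=0.2, epochs=10)
```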
Referring to Table 1, experiments on the CWE-399 and CWE-119 vulnerability data sets show that the proposed method, which pre-trains the obtained function-level code blocks with the Bert model, effectively learns and retains the semantic information and context of the source code, and that, compared with word2vec word-vector conversion, it yields a detection model with higher detection accuracy.
TABLE 1 model classification accuracy and model training loss on CWE-399 and CWE-119 vulnerability datasets
(Table 1 is reproduced only as an image in the original publication.)

Claims (4)

1. A C source code vulnerability detection method based on a Bert model and a BiLSTM, characterized in that it mainly comprises the following steps:
Step A: generate program slices: based on the software source code, use a Joern tool to generate the corresponding program dependency graph (PDG) and abstract syntax tree (AST), where the PDG comprises the control dependency graph (CDG) and data dependency graph (DDG) between code statements, and the AST contains the syntax information of the program statements; slice the code according to the vulnerability rules, using the CDG and the data dependency information in the DDG, to finally obtain the desired program slices;
Step B: generate the symbolic representation of the code: clean and format the original program slices obtained in step A, i.e., rename the identifiers other than keywords and operators, the user-defined function names, and the variable names appearing in the code block according to a custom naming rule, and convert each line of code in the processed block into an ordered subset of single characters, obtaining the symbolic representation of the code;
Step C: fine-tune the standard Bert model: use the code symbol representations obtained in step B as the training set of the Bert model, convert the training data one by one into the input format required by the Bert model, and feed it into the standard Bert model for fine-tuning, obtaining through fine-tuning a new Bert model better suited to the local data set; specifically comprising:
C1: convert the training set into the Bert model's input format: first, read the obtained code blocks and, taking semicolons as boundaries, arrange the multi-line code into a single-line list, producing a data set in (Text, Label) format, where Text is the code and Label is its corresponding label; second, word-embed the Text with Bert's built-in vocabulary, expressing each code block as a set of single characters and generating the word-embedding representation of the Text, i.e., a Tokens array, where the Tokens array of each record begins with [CLS] and ends with [SEP];
C2: define the maximum sequence length max_seq_length for the tokenized sequence, map each character in the obtained Tokens array to its corresponding id, and generate the input_ids array; each character in input_ids is replaced by a unique id; meanwhile, derive from the Tokens array the corresponding segment_ids array of length max_seq_length; the finally generated input_ids and segment_ids are used as the Bert model's input for fine-tuning;
Step D: generate the vector representations corresponding to the code representations: pre-train on the code representation set obtained in step B with the new Bert model fine-tuned in step C, converting it into a set of vector representations rich in semantic information and context, to be input into the neural network for model training; specifically comprising:
D1: the code representation set is fed into the fine-tuned Bert model obtained in step C; the Bert model encodes the input to produce the word-vector data corresponding to the original code representations, which serves as the input of the BiLSTM deep-learning network;
Step E: train the software source code vulnerability detection model: design a double-layer BiLSTM network and use the word vectors produced by the Bert model pre-training in step D as its training data, obtaining a vulnerability detection model based on software source code.
2. The method of claim 1, wherein step A further comprises:
A1: constructing the program dependency graph and abstract syntax tree with the public Joern tool, searching the abstract syntax tree for code statements matching the vulnerability syntax rules, and constructing an ordered code statement set;
A2: based on the control dependency and data dependency information in the program dependency graph, slicing the ordered code statement set to obtain slice-level code blocks, matching the code blocks against the vulnerability code blocks in an existing vulnerability library, and labeling code blocks containing vulnerability features with 1 and code blocks not containing them with 0.
3. The method of claim 1, wherein step B further comprises:
B1: code block cleaning: the obtained labeled code block consists of ordered, multi-line code statements; from each statement, deleting string contents, code comments, and non-ASCII characters irrelevant to the vulnerability, while keeping symbols carrying syntactic information, such as quotation marks and brackets, and keeping semicolons to delimit the lines of code;
B2: code block formatting: user-defined variable names in the code block are renamed VAR1, VAR2, ... in order, and user-defined function names are renamed FUN1, FUN2, ... in order, where VAR and FUN distinguish functions from variables and 1, 2, ... give the order of the variable or function within the code block; these mappings are one-to-one; after each code block has been renamed, the function and variable indices are reset, so variables and functions appearing in different code blocks may map to the same symbolic name.
4. The method of claim 1, wherein step E further comprises: holding out 20% of the training data to test the vulnerability detection model, and evaluating the model's detection performance using detection accuracy and training loss as evaluation indices.
CN202110770650.4A 2021-07-08 2021-07-08 C source code vulnerability detection method based on Bert model and BiLSTM Active CN113420296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110770650.4A CN113420296B (en) 2021-07-08 2021-07-08 C source code vulnerability detection method based on Bert model and BiLSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110770650.4A CN113420296B (en) 2021-07-08 2021-07-08 C source code vulnerability detection method based on Bert model and BiLSTM

Publications (2)

Publication Number Publication Date
CN113420296A CN113420296A (en) 2021-09-21
CN113420296B true CN113420296B (en) 2022-05-13

Family

ID=77720560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110770650.4A Active CN113420296B (en) 2021-07-08 2021-07-08 C source code vulnerability detection method based on Bert model and BiLSTM

Country Status (1)

Country Link
CN (1) CN113420296B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806750B (en) * 2021-09-24 2024-02-23 深信服科技股份有限公司 File security risk detection method, training method, device and equipment of model
CN113986345B (en) * 2021-11-01 2024-05-07 天津大学 Pre-training enhanced code clone detection method
CN114547619B (en) * 2022-01-11 2024-04-19 扬州大学 Vulnerability restoration system and restoration method based on tree
CN114491540A (en) * 2022-02-22 2022-05-13 南通大学 Security vulnerability detection method based on GraphCodeBERT
CN114861194B (en) * 2022-05-13 2024-03-08 兰州交通大学 Multi-type vulnerability detection method based on BGRU and CNN fusion model
CN115563626B (en) * 2022-10-21 2023-08-22 四川大学 CVE-oriented vulnerability availability prediction method
CN115495755B (en) * 2022-11-15 2023-04-07 四川大学 Codebert and R-GCN-based source code vulnerability multi-classification detection method
CN116048615B (en) * 2023-01-31 2023-08-25 安徽工业大学 Distributed program slicing method, device and equipment based on natural language processing
CN116661852B (en) * 2023-04-06 2023-12-08 华中师范大学 Code searching method based on program dependency graph
CN117786705B (en) * 2024-02-28 2024-05-14 南京信息工程大学 Statement-level vulnerability detection method and system based on heterogeneous graph transformation network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510241A (en) * 2009-03-12 2009-08-19 南京大学 Binary detecting and positioning device for shaping overflow leak
CN101714118A (en) * 2009-11-20 2010-05-26 北京邮电大学 Detector for binary-code buffer-zone overflow bugs, and detection method thereof

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245496B (en) * 2019-05-27 2021-04-20 华中科技大学 Source code vulnerability detection method and detector and training method and system thereof
CN111026548B (en) * 2019-11-28 2023-05-09 国网甘肃省电力公司电力科学研究院 Power communication equipment test resource scheduling method for reverse deep reinforcement learning
CN111026799B (en) * 2019-12-06 2023-07-18 安翰科技(武汉)股份有限公司 Method, equipment and medium for structuring text of capsule endoscopy report
CN111460820B (en) * 2020-03-06 2022-06-17 中国科学院信息工程研究所 Network space security domain named entity recognition method and device based on pre-training model BERT
CN111401061A (en) * 2020-03-19 2020-07-10 昆明理工大学 Method for identifying news opinion involved in case based on BERT and Bi L STM-Attention
CN111540470B (en) * 2020-04-20 2023-08-25 北京世相科技文化有限公司 Social network depression tendency detection model based on BERT transfer learning and training method thereof
CN112131352A (en) * 2020-10-10 2020-12-25 南京工业大学 Method and system for detecting bad information of webpage text type
CN112668013B (en) * 2020-12-31 2023-04-07 西安电子科技大学 Java source code-oriented vulnerability detection method for statement-level mode exploration
CN112989811B (en) * 2021-03-01 2022-09-09 哈尔滨工业大学 History book reading auxiliary system based on BiLSTM-CRF and control method thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510241A (en) * 2009-03-12 2009-08-19 南京大学 Binary detecting and positioning device for shaping overflow leak
CN101714118A (en) * 2009-11-20 2010-05-26 北京邮电大学 Detector for binary-code buffer-zone overflow bugs, and detection method thereof

Also Published As

Publication number Publication date
CN113420296A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
CN113420296B (en) C source code vulnerability detection method based on Bert model and BiLSTM
CN112215013B (en) Clone code semantic detection method based on deep learning
CN116627708B (en) Storage fault analysis system and method thereof
CN116245513B (en) Automatic operation and maintenance system and method based on rule base
CN101751385B (en) Multilingual information extraction method adopting hierarchical pipeline filter system structure
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN112257441B (en) Named entity recognition enhancement method based on counterfactual generation
CN112394973B (en) Multi-language code plagiarism detection method based on pseudo-twin network
CN113947161A (en) Attention mechanism-based multi-label text classification method and system
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN115437952A (en) Statement level software defect detection method based on deep learning
CN116340952A (en) Intelligent contract vulnerability detection method based on operation code program dependency graph
CN115495755A (en) Codebert and R-GCN-based source code vulnerability multi-classification detection method
CN111967267A (en) XLNET-based news text region extraction method and system
CN115953123A (en) Method, device and equipment for generating robot automation flow and storage medium
CN116702160B (en) Source code vulnerability detection method based on data dependency enhancement program slice
CN116956289B (en) Method for dynamically adjusting potential blacklist and blacklist
CN110866172B (en) Data analysis method for block chain system
CN116643759A (en) Code pre-training model training method based on program dependency graph prediction
CN116882402A (en) Multi-task-based electric power marketing small sample named entity identification method
CN115757695A (en) Log language model training method and system
CN113342982B (en) Enterprise industry classification method integrating Roberta and external knowledge base
CN115840815A (en) Automatic abstract generation method based on pointer key information
CN117113359B (en) Pre-training vulnerability restoration method based on countermeasure migration learning
CN115630647A (en) Entity and relation tandem type extraction method for text data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant