CN113420296B - C source code vulnerability detection method based on Bert model and BiLSTM - Google Patents


Info

Publication number
CN113420296B
CN113420296B (application CN202110770650.4A)
Authority
CN
China
Prior art keywords
code
model
vulnerability
data
bert model
Prior art date
Legal status
Active
Application number
CN202110770650.4A
Other languages
Chinese (zh)
Other versions
CN113420296A (en)
Inventor
马之力
马宏忠
李志茹
张学军
盖继扬
杨启帆
赵红
张驯
弥海峰
谭任远
李玺
朱小琴
白万荣
杨勇
魏峰
龚波
杨凡
高丽娜
Current Assignee
STATE GRID GANSU ELECTRIC POWER RESEARCH INSTITUTE
State Grid Gansu Electric Power Co Ltd
Lanzhou Jiaotong University
Original Assignee
STATE GRID GANSU ELECTRIC POWER RESEARCH INSTITUTE
State Grid Gansu Electric Power Co Ltd
Lanzhou Jiaotong University
Priority date
Filing date
Publication date
Application filed by STATE GRID GANSU ELECTRIC POWER RESEARCH INSTITUTE, State Grid Gansu Electric Power Co Ltd, Lanzhou Jiaotong University
Priority to CN202110770650.4A priority Critical patent/CN113420296B/en
Publication of CN113420296A publication Critical patent/CN113420296A/en
Application granted granted Critical
Publication of CN113420296B publication Critical patent/CN113420296B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F21/563 Static detection by source code analysis
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G06F8/42 Compilation: syntactic analysis
    • G06F8/436 Compilation: semantic checking
    • G06N3/044 Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N3/08 Neural networks: learning methods
    • G06F2221/033 Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Virology (AREA)
  • Debugging And Monitoring (AREA)
  • Stored Programmes (AREA)

Abstract

A C source code vulnerability detection method based on a Bert model and a BiLSTM constructs a control dependency graph and a data dependency graph by analyzing the software source code, slices the code according to the control and data dependencies among statements to generate slice-level code blocks, then cleans and preprocesses the generated code blocks, and labels each code block to indicate whether it contains vulnerability information. Next, the processed code blocks are used as a training set to fine-tune the standard Bert pre-training model, yielding a new Bert model. The code blocks are input into the new Bert model, which learns the semantic information and context of the code in an unsupervised manner and performs word-embedding encoding on the code blocks to obtain word vectors that preserve as much of the code's semantics and context as possible. Finally, the obtained word vectors are input into a BiLSTM to train the detection model, producing a source code vulnerability detection model. The invention can improve the accuracy of vulnerability detection and reduce the false alarm rate.

Description

C source code vulnerability detection method based on Bert model and BiLSTM
Technical Field
The invention relates to software source code vulnerability detection methods, in particular to a C source code vulnerability detection method based on a Bert model and a BiLSTM.
Background
Most network attack security incidents today exploit the various software vulnerabilities present in device software. A software vulnerability is a software defect caused by factors such as technical problems and insufficient developer experience during the development stage, and the defect persists throughout software deployment and operation. An attacker can therefore exploit such a defect at any time and from anywhere, using vulnerability exploitation tools to attack the target system, escalate to administrator privileges, and obtain system data and command-and-control authority, thereby disrupting the normal operation of the system or obtaining economic benefits.
The existing relatively mature vulnerability mining technologies mainly include binary vulnerability detection, pattern-matching-based vulnerability detection, and code-similarity-based vulnerability detection. Binary-based detection converts the source code into a binary stream and analyzes that stream, so most semantic and logical information is lost, which hinders vulnerability detection. Pattern-matching-based detection first requires an expert to manually define a vulnerability pattern and then searches the source code for code segments matching the defined pattern to determine whether a vulnerability exists. Code-similarity-based detection judges whether a program under test contains a vulnerability by analyzing the similarity between code fragments; it can effectively detect vulnerabilities introduced by code copying and code cloning, but has clear limitations for other vulnerability types. In all three methods, vulnerability characteristics are defined by experienced experts who set detection rules for checking the code. Their biggest drawbacks are low detection efficiency, strong dependence on subjective judgment, and the inability to scale to batch detection. Automated vulnerability detection has therefore become the current trend. Deep-learning-based methods can learn vulnerability features from source code through multilayer neural networks without manually defined vulnerability patterns, overcoming the traditional methods' inability to batch-process and modularize, and improving detection efficiency. Li et al. (Z. Li, D. Zou, S. Xu, et al., SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities, 2018) first proposed building a vulnerability detection model on software source code with deep learning; their model converts processed vulnerability code blocks into word vectors using word2vec and then feeds those vectors into a deep-learning network for training. Patent CN201910447971 proposes learning vulnerability features with deep learning to avoid manual feature definition and improve detection efficiency, but it uses the Skip-Gram model of word2vec in the word-vector generation stage, which can only encode individual words and cannot encode their context. Semantic information is therefore inevitably lost when generating word vectors, so the trained model's detection accuracy is low.
The word-vector conversion process of the source code vulnerability detection method proposed in patent CN202011046243 is the same as that of patent CN201910447971: source code containing vulnerabilities is likewise converted to vector representations with word2vec, so semantic information is again lost during word-vector generation and the model's detection accuracy suffers. Patent CN201911363149 uses semi-supervised learning, taking labeled and unlabeled data together as training data and feeding code elements directly into an ELMo model to predict whether the source code contains vulnerability information. Although this saves code-processing time, the ELMo model requires per-layer parameters to be set downstream of training, can only encode individual words, and lacks a negative-sampling process, so high detection accuracy cannot be guaranteed. Patent CN202010576421 proposes a source code vulnerability detection method based on a graph convolutional neural network: it obtains the code property graph of the source code, builds a code slice graph structure from vulnerability features, learns a vector representation for each graph node with a graph convolutional network, and trains a detection model. However, when the code structure is complex, the generated code property graph is highly complex and the graph convolutional network cannot effectively learn complex graph nodes, so detection accuracy remains low.
Disclosure of Invention
The invention provides a C source code vulnerability detection method based on a Bert model and a BiLSTM, which improves vulnerability detection accuracy and reduces the false alarm rate.
The technical scheme adopted by the invention is as follows:
A C source code vulnerability detection method based on a Bert model and a BiLSTM mainly comprises the following steps:
Step A: generate program slices. Based on the software source code, use the Joern tool to generate the corresponding program dependency graph (PDG) and abstract syntax tree (AST), where the PDG comprises the control dependency graph (CDG) and data dependency graph (DDG) between code statements, and the AST contains the syntax information of the program statements; slice the code according to the vulnerability rules, using the CDG and the data dependency information in the DDG, to finally obtain the desired program slices;
Step B: generate the symbolic representation of the code. Clean and format the original program slices obtained in step A, i.e., rename the identifiers other than keywords and operators, the user-defined function names, and the variable names appearing in the code block according to a custom naming rule, and convert each line of code in the processed block into an ordered subset of single characters, obtaining the symbolic representation of the code;
Step C: fine-tune the standard Bert model. Use the code symbol representations obtained in step B as the training set of the Bert model, convert the training data one by one into the input format required by the Bert model, and feed it into the standard Bert model for fine-tuning, obtaining through fine-tuning a new Bert model better suited to the local data set;
Step D: generate the vector representations corresponding to the code representations. Pre-train on the code representation set obtained in step B with the new Bert model fine-tuned in step C, converting it into a set of vector representations rich in semantic information and context, to be input into the neural network for model training;
Step E: train the software source code vulnerability detection model. Design a double-layer BiLSTM network and use the word vectors produced by the Bert model pre-training in step D as its training data, obtaining a vulnerability detection model based on software source code.
Wherein step A further comprises:
A1: construct the program dependency graph and abstract syntax tree with the public Joern tool, search the abstract syntax tree for code statements matching the vulnerability syntax rules, and construct an ordered code statement set;
A2: based on the control dependency and data dependency information in the program dependency graph, slice the ordered code statement set to obtain slice-level code blocks, match the code blocks against the vulnerability code blocks in an existing vulnerability library, and label code blocks containing vulnerability features with 1 and code blocks not containing them with 0.
Wherein step B further comprises:
b1: the code block cleaning process is represented as: the obtained code block with the label consists of ordered multi-line code statements; deleting irrelevant character string information, code comments and non-ASCII characters which do not cause the vulnerability from each line of code statements, simultaneously reserving symbols with grammar information such as quotation marks and brackets, and reserving semicolons to distinguish each line of codes;
b2: the code block formatting process is expressed as: user-defined variable names in the code block are renamed as VAR1 and VAR2 in sequence, and user-defined function names are renamed as FUN1 and FUN2 in sequence; where VAR and FUN are used to distinguish between a function and a variable, 1 and 2 represent the order of the variable or function in a code block; these mappings are performed in a one-to-one manner; after the named replacement of each code block is completed, the functions and variable indices are re-counted so that multiple variables and functions may be mapped to the same symbolic name when they appear in different code blocks.
Wherein step C further comprises:
c1: the format for converting the training set into the Bert model input is represented as follows: firstly, reading an obtained code block, and forming a single-line list by a plurality of lines of codes by taking a semicolon as a boundary to obtain a data set with a format of Text and Label, wherein the Text represents a code set, and the Label represents a Label corresponding to the code set; secondly, word embedding is carried out on the Text by adopting a self-contained cache of the Bert, a code block is expressed into a single character set, and word embedding expression of the Text, namely a Tokens array is generated, wherein the Tokens array corresponding to each piece of data takes [ CLS ] as a Text beginning mark and takes [ SEP ] as an ending expression;
c2: defining the maximum length max _ seq _ length of a sequence after characters in a Tokens array are marked, mapping each character in the obtained Tokens array to a corresponding id, and generating an input _ ids array; each character in input _ ids is replaced by a unique id, e.g., [ CLS ] is replaced by 101; and meanwhile, obtaining a corresponding segment _ ids array with the length of max _ seq _ length based on the Tokens array. And finally, the generated input _ ids and segment _ ids are used as the input of the Bert model and used for fine adjustment of the Bert model.
Wherein step D further comprises:
d1: and C, the vector representation set is the modified Bert model obtained in the step C, and the Bert model encodes input information to obtain word vector data corresponding to the original code representation and is used as the input of the BilSTM deep learning network.
Wherein step E further comprises: hold out 20% of the training data to test the vulnerability detection model, and evaluate the model's detection performance using detection accuracy and training loss as evaluation indices.
After the code slices are obtained, the slices are first cleaned and formatted; next, the code slices are character-encoded and position-encoded using the Bert model's vocabulary and word-embedding techniques to obtain a character-mapping array and a position-mapping array; the fine-tuned Bert model then learns the contextual, semantic, and positional relationships within the code slices in an unsupervised manner and generates vectors carrying semantic information, which are used as the input of the BiLSTM model for training. Because the semantic information and context of the code statements are well preserved during word-vector conversion and position encoding is introduced, the proposed method effectively improves the accuracy of the source code detection model and reduces the false alarm rate.
Drawings
FIG. 1 is a flow chart of the training of a source code vulnerability detection model provided by the present invention;
FIG. 2 is a system architecture diagram of a deep learning-based source code vulnerability detection model according to the present invention;
FIG. 3 is a schematic diagram of the source code program slice extraction and processing process of the present invention;
FIG. 4 is a diagram illustrating the Bert model's word-vector conversion process for code slices according to the present invention.
Detailed Description
The present invention will be further described with reference to the following embodiments.
A C source code vulnerability detection method based on a Bert model (Bidirectional Encoder Representations from Transformers) and a BiLSTM (Bi-directional Long Short-Term Memory network) mainly comprises the following steps:
Step A: generate program slices. Based on the software source code, use the Joern tool to generate the corresponding program dependency graph (PDG) and abstract syntax tree (AST), where the PDG comprises the control dependency graph (CDG) and data dependency graph (DDG) between code statements, and the AST contains the syntax information of the program statements; slice the code according to the vulnerability rules, using the CDG and the data dependency information in the DDG, to finally obtain the required program slices.
Step B: generate the symbolic representation of the code. Clean and format the original program slices obtained in step A, i.e., rename the identifiers other than keywords (such as int and char) and operators (such as + and %), together with the user-defined function names and variable names appearing in the code block, according to a custom naming rule, and convert each line of code in the processed block into an ordered subset of single characters, obtaining the symbolic representation of the code.
Step C: fine-tune the standard Bert model. Use the code symbol representations obtained in step B as the training set of the Bert model, convert the training data one by one into the input format required by the Bert model, and feed it into the standard Bert model for fine-tuning, obtaining through fine-tuning a new Bert model better suited to the local data set;
Step D: generate the vector representations corresponding to the code representations. Pre-train on the code representation set obtained in step B with the new Bert model fine-tuned in step C, converting it into a set of vector representations rich in semantic information and context, to be input into the neural network for model training;
Step E: train the software source code vulnerability detection model. Design a double-layer BiLSTM network and use the word vectors produced by the Bert model pre-training in step D as its training data, obtaining a vulnerability detection model based on software source code.
The implementation of the present invention will be described in detail using CWE-399 (resource management error vulnerabilities) and CWE-119 (buffer overflow vulnerabilities), covering code block generation, symbolic mapping of the code data, fine-tuning of the Bert model, word-vector conversion, and model training. FIG. 1 presents the construction process of the proposed source code vulnerability detection model, and FIG. 2 presents the framework and operating principle of the proposed detection model.
1. Obtain the training data set (labeled code blocks).
Step 1-1: generate the program dependency graph and abstract syntax tree of the source code. Use the Joern tool to obtain the program dependency graph (PDG) and abstract syntax tree (AST) corresponding to the software source code, where the PDG comprises the data dependency graph (DDG) and control dependency graph (CDG) of the code statements, and the AST contains the syntax information of the program statements.
Step 1-2: generate code blocks. Based on the vulnerability syntax rules (API/function calls; definition and use of arrays, pointers, and expressions), find the code statements in the abstract syntax tree that match those rules. As shown in FIG. 3 (a) and (b), the API call strncpy(b, a, n) matching a vulnerability syntax rule is found in the source code and, using the control and data dependencies in the program dependency graph, the forward and backward code statements that may carry vulnerability semantics are cut out to form the code block shown in FIG. 3 (c). Each code block is then matched against the vulnerability code blocks in an existing vulnerability library: blocks containing vulnerability features are labeled 1 and blocks without them are labeled 0, so that they can later be fed into the neural network for supervised classification and training of the vulnerability detection model.
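To make the slicing step concrete, the following minimal Python sketch extracts a combined backward and forward slice around a criterion statement, assuming the Joern-exported program dependency graph has been loaded as a networkx directed graph. The graph layout, the "line" node attribute, and the function names are illustrative assumptions, not the patent's actual implementation.

```python
import networkx as nx

def extract_slice(pdg: nx.DiGraph, criterion: str) -> list:
    """Collect the backward and forward slice around a criterion statement,
    e.g. a strncpy call matched by a vulnerability syntax rule."""
    backward = nx.ancestors(pdg, criterion)    # statements the criterion depends on
    forward = nx.descendants(pdg, criterion)   # statements that depend on the criterion
    sliced = backward | forward | {criterion}
    # Keep the original statement order so the code block stays readable.
    return sorted(sliced, key=lambda n: pdg.nodes[n].get("line", 0))

# Hypothetical usage: nodes carry a "line" attribute with the source line number.
# block = extract_slice(pdg, "stmt_strncpy_17")
```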
2. Perform data cleaning and formatting on the obtained code blocks.
Step 2-1: code block cleaning: the resulting labeled code block consists of ordered, multi-line code statements. From each statement, delete string contents, code comments, non-ASCII characters, and other information irrelevant to the vulnerability; keep symbols that carry syntactic information, such as quotation marks and brackets; and keep the semicolon to delimit each line of code (in C and C++, every statement ends with a semicolon).
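A rough sketch of this per-line cleaning is given below; the exact regular expressions, and what counts as irrelevant string content, are assumptions.

```python
import re

def clean_statement(stmt: str) -> str:
    """Per-line cleaning as described in step 2-1 (details are assumptions):
    strip comments and non-ASCII characters, keep quotes, brackets, and the
    trailing semicolon that delimits each C statement."""
    stmt = re.sub(r"/\*.*?\*/", "", stmt, flags=re.S)  # remove block comments
    stmt = re.sub(r"//.*", "", stmt)                   # remove line comments
    stmt = re.sub(r'"[^"]*"', '""', stmt)              # empty string literals, keep the quotes
    stmt = stmt.encode("ascii", "ignore").decode()     # drop non-ASCII characters
    return stmt.strip()

# clean_statement('strncpy(b, a, n); // copy user input') -> 'strncpy(b, a, n);'
```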
Step 2-2: code block formatting: user-defined variable names in the code block are renamed "VAR1", "VAR2", etc. in order, and user-defined function names are renamed "FUN1", "FUN2", etc. in order, where "VAR" and "FUN" distinguish functions from variables and "1" and "2" give the order of the variable or function within the code block. As shown in FIG. 3 (c), a[10], b[10], and n are renamed VAR1[10], VAR2[10], and VAR3 in turn, and strncpy(b, a, n) becomes FUN1(VAR1, VAR2, VAR3); these mappings are one-to-one. After each code block has been renamed, the function and variable indices are reset, so variables and functions appearing in different code blocks may map to the same symbolic name.
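A minimal sketch of this renaming follows. The reserved-word list is abridged and the regular expression is an assumption; note that, matching the FIG. 3 example, library calls such as strncpy are renamed here as well.

```python
import re

# Abridged reserved list; a full implementation would cover every C keyword.
RESERVED = {"int", "char", "void", "float", "double", "if", "else", "for",
            "while", "return", "sizeof", "struct", "break", "continue"}

def symbolize(block_lines):
    """Step 2-2 sketch: map variables to VAR1, VAR2, ... and called functions
    to FUN1, FUN2, ..., one-to-one within a single code block."""
    var_map, fun_map = {}, {}

    def rename(match):
        name, paren = match.group(1), match.group(2)
        if name in RESERVED:
            return match.group(0)
        if paren:  # identifier followed by '(' -> treat as a function call
            fun_map.setdefault(name, "FUN%d" % (len(fun_map) + 1))
            return fun_map[name] + paren
        var_map.setdefault(name, "VAR%d" % (len(var_map) + 1))
        return var_map[name]

    return [re.sub(r"\b([A-Za-z_]\w*)\b(\s*\()?", rename, line)
            for line in block_lines]

# symbolize(["char a[10];", "char b[10];", "strncpy(b, a, n);"])
# -> ["char VAR1[10];", "char VAR2[10];", "FUN1(VAR2, VAR1, VAR3);"]
```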
3. Pre-train on the training data set with the Bert model to generate vector representations.
Step 3-1: fine-tune the Bert model on the data set so that the model better fits the data. First, convert the training set into the Bert input format: read the code blocks and, taking semicolons as boundaries, arrange the multi-line code into a single-line list, producing a data set in (Text, Label) format, where Text is the code and Label is its corresponding label. Second, word-embed the Text with Bert's built-in vocabulary, expressing each code block as a set of single characters and generating the word-embedding representation of the Text, i.e., a Tokens array. As shown in FIG. 4, the Tokens array of each record begins with [CLS] and ends with [SEP]. Define the maximum sequence length max_seq_length for the tokenized sequence, map each character in the Tokens array to its corresponding id, and generate the input_ids array; if the sequence is longer than max_seq_length, keep only the first max_seq_length entries, otherwise pad the sequence with max_seq_length - len(input_ids) zeros. Each character in input_ids is replaced by a unique id, e.g., [CLS] is replaced by 101. Meanwhile, derive from the Tokens array the corresponding segment_ids array of length max_seq_length; segment_ids uses embedding information to distinguish the position of each sentence in the text, and the invention uses it to mark the positional information of each character within each line of code. The generated input_ids and segment_ids serve as the Bert model's input for fine-tuning, yielding the new Bert model.
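The encoding of one symbolized code block into input_ids and segment_ids might look as follows. The patent does not name a tokenization library; the HuggingFace transformers tokenizer and the bert-base-uncased checkpoint used here are assumptions.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
max_seq_length = 128  # illustrative value for max_seq_length

text = "char VAR1 [ 10 ] ; char VAR2 [ 10 ] ; FUN1 ( VAR2 , VAR1 , VAR3 ) ;"
tokens = ["[CLS]"] + tokenizer.tokenize(text)[:max_seq_length - 2] + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)  # e.g. [CLS] -> 101

# Alternate the segment id at each semicolon so adjacent statements are
# distinguishable (standard Bert supports two segment types; the patent's
# per-statement positional use of segment_ids is approximated here).
segment_ids, seg = [], 0
for tok in tokens:
    segment_ids.append(seg)
    if tok == ";":
        seg = 1 - seg

# Zero-pad both arrays to max_seq_length, as described above.
pad = max_seq_length - len(input_ids)
input_ids += [0] * pad
segment_ids += [0] * pad
```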
Step 3-2: as shown in FIG. 4, and similarly to step 3-1, the training set is first converted into the (input_ids, segment_ids) format required by the Bert model and fed into the fine-tuned Bert model, which encodes it into the word-vector data corresponding to the original code representations. The Bert model learns the context and semantic information of the text through the Transformer feature-extraction architecture and its self-attention mechanism. Moreover, the input segment_ids carry the positional information of individual statements: since the same character may express different semantic and syntactic information in different statements, characters appearing in different positions receive different vector representations. This preserves the context and syntactic information of the code statements to the greatest extent, avoids the loss of code semantics and code structure during training of the vulnerability detection model, and effectively improves the model's accuracy.
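A sketch of this encoding step with PyTorch and the transformers BertModel API (both assumptions, as is the local checkpoint path) is shown below; it continues from the arrays built in the previous sketch.

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("./bert-finetuned")  # hypothetical local path
model.eval()

ids = torch.tensor([input_ids])    # arrays from the previous sketch
segs = torch.tensor([segment_ids])
mask = (ids != 0).long()           # 1 for real tokens, 0 for zero padding

with torch.no_grad():
    outputs = model(input_ids=ids, token_type_ids=segs, attention_mask=mask)

# One 768-dimensional vector per token; this sequence is the BiLSTM's input.
word_vectors = outputs.last_hidden_state  # shape (1, max_seq_length, 768)
```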
4. Use the improved BiLSTM model to realize the software source code vulnerability detection model.
Step 4-1: use the word vectors produced by the Bert pre-training in step 3-2 as the training data of the improved BiLSTM deep-learning network, and train it to obtain the vulnerability detection model for software source code. Because the Bert word-vector training network already has some classification capability, no overly complicated downstream network is needed; the invention designs a single-layer BiLSTM model. The loss function is binary cross-entropy, the optimizer is Adamax, the learning rate is 0.0001, and the number of input neurons is 128 (since the Bert output vectors are 768-dimensional, the input vectors must be reshaped before being fed in). After training, 20% of the training data is held out to test the vulnerability detection model, and its detection performance is evaluated using detection accuracy and training loss as indices.
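A Keras realization of this classifier might look as follows. The loss, optimizer, learning rate, unit count, input dimension, and 80/20 split follow the text above; the framework choice and exact layer layout are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

max_seq_length = 128  # must match the Bert encoding step

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_seq_length, 768)),  # one 768-dim Bert vector per token
    layers.Bidirectional(layers.LSTM(128)),       # single BiLSTM layer, 128 units
    layers.Dense(1, activation="sigmoid"),        # label 1 = vulnerable, 0 = safe
])
model.compile(
    loss="binary_crossentropy",
    optimizer=tf.keras.optimizers.Adamax(learning_rate=1e-4),
    metrics=["accuracy"],
)

# Hold out 20% of the labelled vectors for testing, e.g.:
# model.fit(train_vectors, train_labels, validation_split=0.2, epochs=10)
```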
Referring to Table 1, experiments on the CWE-399 and CWE-119 vulnerability data sets show that the proposed method, which pre-trains the obtained function-level code blocks with the Bert model, effectively learns and retains the semantic information and context of the source code, and that, compared with word2vec word-vector conversion, it yields a detection model with higher detection accuracy.
TABLE 1 model classification accuracy and model training loss on CWE-399 and CWE-119 vulnerability datasets
(Table 1 is reproduced only as an image in the original publication.)

Claims (4)

1. A C source code vulnerability detection method based on a Bert model and a BiLSTM, characterized in that it mainly comprises the following steps:
Step A: generate program slices: based on the software source code, use a Joern tool to generate the corresponding program dependency graph (PDG) and abstract syntax tree (AST), where the PDG comprises the control dependency graph (CDG) and data dependency graph (DDG) between code statements, and the AST contains the syntax information of the program statements; slice the code according to the vulnerability rules, using the CDG and the data dependency information in the DDG, to finally obtain the desired program slices;
Step B: generate the symbolic representation of the code: clean and format the original program slices obtained in step A, i.e., rename the identifiers other than keywords and operators, the user-defined function names, and the variable names appearing in the code block according to a custom naming rule, and convert each line of code in the processed block into an ordered subset of single characters, obtaining the symbolic representation of the code;
Step C: fine-tune the standard Bert model: use the code symbol representations obtained in step B as the training set of the Bert model, convert the training data one by one into the input format required by the Bert model, and feed it into the standard Bert model for fine-tuning, obtaining through fine-tuning a new Bert model better suited to the local data set; specifically comprising:
C1: convert the training set into the Bert model's input format: first, read the obtained code blocks and, taking semicolons as boundaries, arrange the multi-line code into a single-line list, producing a data set in (Text, Label) format, where Text is the code and Label is its corresponding label; second, word-embed the Text with Bert's built-in vocabulary, expressing each code block as a set of single characters and generating the word-embedding representation of the Text, i.e., a Tokens array, where the Tokens array of each record begins with [CLS] and ends with [SEP];
C2: define the maximum sequence length max_seq_length for the tokenized sequence, map each character in the obtained Tokens array to its corresponding id, and generate the input_ids array; each character in input_ids is replaced by a unique id; meanwhile, derive from the Tokens array the corresponding segment_ids array of length max_seq_length; the finally generated input_ids and segment_ids are used as the Bert model's input for fine-tuning;
Step D: generate the vector representations corresponding to the code representations: pre-train on the code representation set obtained in step B with the new Bert model fine-tuned in step C, converting it into a set of vector representations rich in semantic information and context, to be input into the neural network for model training; specifically comprising:
D1: the code representation set is fed into the fine-tuned Bert model obtained in step C; the Bert model encodes the input to produce the word-vector data corresponding to the original code representations, which serves as the input of the BiLSTM deep-learning network;
Step E: train the software source code vulnerability detection model: design a double-layer BiLSTM network and use the word vectors produced by the Bert model pre-training in step D as its training data, obtaining a vulnerability detection model based on software source code.
2. The method of claim 1, wherein step A further comprises:
A1: constructing the program dependency graph and abstract syntax tree with the public Joern tool, searching the abstract syntax tree for code statements matching the vulnerability syntax rules, and constructing an ordered code statement set;
A2: based on the control dependency and data dependency information in the program dependency graph, slicing the ordered code statement set to obtain slice-level code blocks, matching the code blocks against the vulnerability code blocks in an existing vulnerability library, and labeling code blocks containing vulnerability features with 1 and code blocks not containing them with 0.
3. The method of claim 1, wherein step B further comprises:
B1: code block cleaning: the obtained labeled code block consists of ordered, multi-line code statements; from each statement, deleting string contents, code comments, and non-ASCII characters irrelevant to the vulnerability, while keeping symbols carrying syntactic information, such as quotation marks and brackets, and keeping semicolons to delimit the lines of code;
B2: code block formatting: user-defined variable names in the code block are renamed VAR1, VAR2, ... in order, and user-defined function names are renamed FUN1, FUN2, ... in order, where VAR and FUN distinguish functions from variables and 1, 2, ... give the order of the variable or function within the code block; these mappings are one-to-one; after each code block has been renamed, the function and variable indices are reset, so variables and functions appearing in different code blocks may map to the same symbolic name.
4. The method of claim 1, wherein step E further comprises: holding out 20% of the training data to test the vulnerability detection model, and evaluating the model's detection performance using detection accuracy and training loss as evaluation indices.
CN202110770650.4A 2021-07-08 2021-07-08 C source code vulnerability detection method based on Bert model and BiLSTM Active CN113420296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110770650.4A CN113420296B (en) 2021-07-08 2021-07-08 C source code vulnerability detection method based on Bert model and BiLSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110770650.4A CN113420296B (en) 2021-07-08 2021-07-08 C source code vulnerability detection method based on Bert model and BiLSTM

Publications (2)

Publication Number Publication Date
CN113420296A CN113420296A (en) 2021-09-21
CN113420296B true CN113420296B (en) 2022-05-13

Family

ID=77720560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110770650.4A Active CN113420296B (en) 2021-07-08 2021-07-08 C source code vulnerability detection method based on Bert model and BiLSTM

Country Status (1)

Country Link
CN (1) CN113420296B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806750B (en) * 2021-09-24 2024-02-23 深信服科技股份有限公司 File security risk detection method, training method, device and equipment of model
CN113986345B (en) * 2021-11-01 2024-05-07 天津大学 Pre-training enhanced code clone detection method
CN114547619B (en) * 2022-01-11 2024-04-19 扬州大学 Vulnerability restoration system and restoration method based on tree
CN114491540A (en) * 2022-02-22 2022-05-13 南通大学 Security vulnerability detection method based on GraphCodeBERT
CN114861194B (en) * 2022-05-13 2024-03-08 兰州交通大学 Multi-type vulnerability detection method based on BGRU and CNN fusion model
CN115563626B (en) * 2022-10-21 2023-08-22 四川大学 CVE-oriented vulnerability availability prediction method
CN115495755B (en) * 2022-11-15 2023-04-07 四川大学 Codebert and R-GCN-based source code vulnerability multi-classification detection method
CN116048615B (en) * 2023-01-31 2023-08-25 安徽工业大学 Distributed program slicing method, device and equipment based on natural language processing
CN116661852B (en) * 2023-04-06 2023-12-08 华中师范大学 Code searching method based on program dependency graph
CN117786705B (en) * 2024-02-28 2024-05-14 南京信息工程大学 Statement-level vulnerability detection method and system based on heterogeneous graph transformation network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510241A (en) * 2009-03-12 2009-08-19 南京大学 Binary detecting and positioning device for shaping overflow leak
CN101714118A (en) * 2009-11-20 2010-05-26 北京邮电大学 Detector for binary-code buffer-zone overflow bugs, and detection method thereof

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245496B (en) * 2019-05-27 2021-04-20 华中科技大学 Source code vulnerability detection method and detector and training method and system thereof
CN111026548B (en) * 2019-11-28 2023-05-09 国网甘肃省电力公司电力科学研究院 Power communication equipment test resource scheduling method for reverse deep reinforcement learning
CN111026799B (en) * 2019-12-06 2023-07-18 安翰科技(武汉)股份有限公司 Method, equipment and medium for structuring text of capsule endoscopy report
CN111460820B (en) * 2020-03-06 2022-06-17 中国科学院信息工程研究所 Network space security domain named entity recognition method and device based on pre-training model BERT
CN111401061A (en) * 2020-03-19 2020-07-10 昆明理工大学 Method for identifying news opinion involved in case based on BERT and Bi L STM-Attention
CN111540470B (en) * 2020-04-20 2023-08-25 北京世相科技文化有限公司 Social network depression tendency detection model based on BERT transfer learning and training method thereof
CN112131352A (en) * 2020-10-10 2020-12-25 南京工业大学 Method and system for detecting bad information of webpage text type
CN112668013B (en) * 2020-12-31 2023-04-07 西安电子科技大学 Java source code-oriented vulnerability detection method for statement-level mode exploration
CN112989811B (en) * 2021-03-01 2022-09-09 哈尔滨工业大学 History book reading auxiliary system based on BiLSTM-CRF and control method thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510241A (en) * 2009-03-12 2009-08-19 南京大学 Binary detecting and positioning device for shaping overflow leak
CN101714118A (en) * 2009-11-20 2010-05-26 北京邮电大学 Detector for binary-code buffer-zone overflow bugs, and detection method thereof

Also Published As

Publication number Publication date
CN113420296A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
CN113420296B (en) C source code vulnerability detection method based on Bert model and BiLSTM
CN112215013B (en) Clone code semantic detection method based on deep learning
CN116627708B (en) Storage fault analysis system and method thereof
CN116245513B (en) Automatic operation and maintenance system and method based on rule base
CN101751385B (en) Multilingual information extraction method adopting hierarchical pipeline filter system structure
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN112257441B (en) Named entity recognition enhancement method based on counterfactual generation
CN112394973B (en) Multi-language code plagiarism detection method based on pseudo-twin network
CN113947161A (en) Attention mechanism-based multi-label text classification method and system
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN115437952A (en) Statement level software defect detection method based on deep learning
CN116340952A (en) Intelligent contract vulnerability detection method based on operation code program dependency graph
CN115495755A (en) Codebert and R-GCN-based source code vulnerability multi-classification detection method
CN111967267A (en) XLNET-based news text region extraction method and system
CN115953123A (en) Method, device and equipment for generating robot automation flow and storage medium
CN116702160B (en) Source code vulnerability detection method based on data dependency enhancement program slice
CN116956289B (en) Method for dynamically adjusting potential blacklist and blacklist
CN110866172B (en) Data analysis method for block chain system
CN116643759A (en) Code pre-training model training method based on program dependency graph prediction
CN116882402A (en) Multi-task-based electric power marketing small sample named entity identification method
CN115757695A (en) Log language model training method and system
CN113342982B (en) Enterprise industry classification method integrating Roberta and external knowledge base
CN115840815A (en) Automatic abstract generation method based on pointer key information
CN117113359B (en) Pre-training vulnerability restoration method based on countermeasure migration learning
CN115630647A (en) Entity and relation tandem type extraction method for text data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant