CN113297584A - Vulnerability detection method, device, equipment and storage medium - Google Patents


Publication number
CN113297584A
Authority
CN
China
Prior art keywords
code
function
program
slicing
slice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110855058.4A
Other languages
Chinese (zh)
Inventor
贾鹏
王炎
刘嘉勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202110855058.4A
Publication of CN113297584A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

Embodiments of the present application provide a vulnerability detection method, device, equipment and storage medium, relating to the technical field of network information security. The method comprises: decompiling the binary code of a program to be detected into pseudocode; detecting dangerous functions in the pseudocode; taking each dangerous function as a slicing point and extracting the slice code related to the dangerous function call; converting the slice code into a vector representation; and feeding the vectorized slice code into a detection neural network to determine whether the program to be detected contains a vulnerability. The method is applicable to cross-architecture and cross-platform binary code vulnerability identification, achieves fine-grained vulnerability detection at the binary code level, extracts features automatically, mitigates the high false alarm rate caused by different compilation options and patch code, and achieves very high accuracy with very low false alarm and false negative rates.

Description

Vulnerability detection method, device, equipment and storage medium
Technical Field
Embodiments of the present application relate to the technical field of network information security, and in particular to a vulnerability detection method, device, equipment and storage medium.
Background
Many current network attacks are carried out by exploiting vulnerabilities, so vulnerability discovery is an important research direction in the security field.
Because software systems contain a large number of reused code libraries and shared code logic (for example, similar objects are processed with similar logic in different contexts), real programs contain many recurring vulnerabilities that share similar characteristics but remain undiscovered. In addition, many developers do not perform in-depth security analysis to find potential vulnerabilities when reusing libraries. Recurring-vulnerability detection has therefore attracted wide attention, particularly as vulnerabilities become more exploitable.
Existing clone vulnerability detection for binary code falls into two main categories: pattern matching and code similarity. Pattern-matching methods rely on vulnerability patterns defined in advance by experts. Code-similarity methods build a vulnerability library in advance and compare unknown code against it; high similarity indicates that the code is vulnerable, and otherwise the code is considered normal. However, these methods suffer from high false alarm and false negative rates and low accuracy.
Those skilled in the art therefore urgently need a new detection method with a low false alarm or false negative rate and high detection accuracy.
Disclosure of Invention
Embodiments of the present application provide a vulnerability detection method, device, equipment and storage medium, which aim to solve at least one of the above technical problems.
A first aspect of the embodiments of the present application provides a vulnerability detection method, the method comprising:
decompiling a program to be detected to obtain pseudocode of the program to be detected;
detecting whether the pseudocode contains a dangerous function;
when a dangerous function is contained, extracting a forward slice code segment and a backward slice code segment of the dangerous function;
combining the forward slice code segment and the backward slice code segment to obtain the complete slice code of the dangerous function;
converting the complete slice code into a vector representation;
and inputting the vector representation into a detection neural network to detect whether the program to be detected contains a vulnerability.
Optionally, the detection neural network comprises a BiGRU layer, a self-attention layer, a flattening layer, a fully connected layer, and an activation layer, and the detection neural network is pre-trained.
Optionally, extracting the forward slice code segment and the backward slice code segment of the dangerous function comprises:
extracting a control dependency graph and a data dependency graph of each function in the pseudocode, and constructing a program dependency graph;
analyzing the parameters and return value of the dangerous function call and performing forward slicing based on the program dependency graph to obtain the forward slice code segment;
and analyzing the parameters and return value of the dangerous function call and performing backward slicing based on the program dependency graph to obtain the backward slice code segment.
Optionally, the method further comprises:
removing all non-ASCII characters and comments from the pseudocode;
and symbolizing the variable names and function names.
A second aspect of the embodiments of the present application provides a vulnerability detection apparatus, the apparatus comprising:
a decompiling module, configured to decompile a program to be detected to obtain pseudocode of the program to be detected;
a dangerous function determination module, configured to detect whether the pseudocode contains a dangerous function;
a slicing module, configured to extract a forward slice code segment and a backward slice code segment of the dangerous function when a dangerous function is contained;
a slice code combining module, configured to combine the forward slice code segment and the backward slice code segment to obtain the complete slice code of the dangerous function;
a vector representation conversion module, configured to convert the complete slice code into a vector representation;
and a detection neural network, configured to receive the vector representation and detect whether the program to be detected contains a vulnerability.
Optionally, the detection neural network comprises a BiGRU layer, a self-attention layer, a flattening layer, a fully connected layer, and an activation layer, and the detection neural network is pre-trained.
Optionally, the slicing module comprises:
a program dependency graph construction submodule, configured to extract a control dependency graph and a data dependency graph of each function in the pseudocode and to construct a program dependency graph;
a forward slice code segment acquisition submodule, configured to analyze the parameters and return value of the dangerous function call and perform forward slicing based on the program dependency graph to obtain the forward slice code segment;
and a backward slice code segment acquisition submodule, configured to analyze the parameters and return value of the dangerous function call and perform backward slicing based on the program dependency graph to obtain the backward slice code segment.
Optionally, the apparatus further comprises:
a removing module, configured to remove all non-ASCII characters and comments from the pseudocode;
and a symbolization module, configured to symbolize the variable names and function names.
A third aspect of the embodiments of the present application provides a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method according to the first aspect of the present application.
A fourth aspect of the embodiments of the present application provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to the first aspect of the present application.
With the vulnerability detection method provided by the present application, the binary code of the program to be detected is first decompiled into pseudocode. Dangerous functions are detected in the pseudocode, each dangerous function is taken as a slicing point, the slice code related to the dangerous function call is extracted and converted into a vector representation, and the vectorized slice code is fed into a detection neural network to determine whether the program to be detected contains a vulnerability. The method is applicable to cross-architecture and cross-platform binary code vulnerability identification, achieves fine-grained vulnerability detection at the binary code level, extracts features automatically, mitigates the high false alarm rate caused by different compilation options and patch code, and achieves very high accuracy with very low false alarm and false negative rates.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a vulnerability detection method according to an embodiment of the present application;
Fig. 2 is a flowchart of slicing a dangerous function according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a detection neural network according to an embodiment of the present application;
Fig. 4 is a schematic diagram of the functional modules of a vulnerability detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments that a person skilled in the art can derive from these embodiments without creative effort fall within the protection scope of the present application.
Existing clone vulnerability detection techniques for binary code still have several problems:
First, pattern matching requires experts to define vulnerability features, which is costly in labor and resources and usually yields a high false alarm or false negative rate.
Second, code-similarity detection can only detect vulnerabilities introduced by code reuse; vulnerabilities that have different code structures but similar triggering scenarios lead to a higher false alarm rate.
Third, because vulnerability patches exist and the difference between vulnerable code and patched code is very small, detecting vulnerabilities at function granularity leads to a higher false alarm rate.
Current binary code vulnerability detection methods therefore cannot meet the demand for high-precision vulnerability detection; in particular, when the binary code is compiled with different compilation options, the detection precision of existing techniques is even lower.
To address these problems in the prior art, the present application provides a new vulnerability detection method that can identify more similar vulnerabilities and resist the influence of different compilation options and patch code.
Referring to Fig. 1, Fig. 1 is a flowchart of a vulnerability detection method according to an embodiment of the present application. As shown in Fig. 1, the method comprises the following steps:
Step S110: decompile the program to be detected to obtain the pseudocode of the program to be detected.
Decompilation (reverse compilation) is a software reverse-engineering activity: by analyzing a target program (such as an executable) in reverse, one derives design elements such as the ideas, principles, structures, algorithms, processing procedures, and operating methods used by a software product, and under certain conditions may even recover source code. Decompilation is the inverse of compilation. Because application code is machine code, from which the relationships between functions are hard to understand, the program to be detected must be decompiled before subsequent detection.
In the prior art, decompilation usually means disassembling a binary program into the corresponding assembly code. However, each assembly instruction carries little semantic information and is easily affected by differences in architecture, platform, and compilation options. To avoid these shortcomings and to support cross-architecture and cross-platform binary code vulnerability identification, the present method decompiles the binary code of the program to be detected directly into pseudocode. Pseudocode has the following advantages: it resembles ordinary C code, it supports the corresponding syntactic and semantic analysis, and the recovered pseudocode is highly similar to the source code.
Step S120: detect whether the pseudocode contains a dangerous function.
The obtained pseudocode is checked for dangerous functions, where a dangerous function is a predefined high-risk function. Because not every library/API function can give rise to a vulnerability, analyzing every library/API-related function in the program is unnecessary and inefficient; only the parts that call dangerous functions need to be checked. Once the dangerous library/API functions are identified, the functions that contain calls to them are extracted and analyzed further to determine whether a vulnerability exists.
In an embodiment of the application, detection is performed by matching against a data set: the vulnerability-triggering conditions caused by improper use of functions across all common vulnerability types are analyzed, the functions involved are summarized, and a set of dangerous library/API functions is constructed. The data set summarizes 66 common dangerous library/API functions: access, alloca, alert, calloc, close, connect, execute, execlp, fclose, fgets, fopen, fprintf, fputc, fread, free, freopen, fscafcaff, fwrite, getev, gets, listen, malloc, memcpy, memmset, mkstemp, mktemp, openn, pop, printf, putc, tchpuar, putev, puts, RAND32, RAND64, realloc, remove, rename, scanf, setsockopt, snprintf, sprintf, sqrt, ssf, sstchthredfree, Loacthredstredcastquardt, Louth, strostryredtstr, stryre, str, scriptmark, script, netdown, copy, replay, copy, runtime, replay.
When the pseudocode is found to contain a dangerous function from the data set (that is, the program calls a library function through the API, and the called function is recorded in the dangerous function set), the program to be detected is at risk of containing a vulnerability and needs further analysis.
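The matching step can be illustrated with a minimal sketch. It assumes the pseudocode is available as plain text and approximates a call site with a regular expression; the function `find_dangerous_calls`, the six-entry subset of the dangerous function set, and the sample pseudocode are all hypothetical and are not taken from the application.

```python
import re

# Hypothetical subset of the dangerous library/API function set described
# in the text; the full set is larger.
DANGEROUS_FUNCTIONS = {"memcpy", "sprintf", "gets", "fgets", "malloc", "free"}

# An identifier immediately followed by "(" is treated as a call-like site.
CALL_PATTERN = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def find_dangerous_calls(pseudocode):
    """Return (line_number, function_name) pairs for dangerous calls."""
    hits = []
    for lineno, line in enumerate(pseudocode.splitlines(), start=1):
        for name in CALL_PATTERN.findall(line):
            if name in DANGEROUS_FUNCTIONS:
                hits.append((lineno, name))
    return hits


# Hypothetical decompiler output used only for illustration.
sample = """void FUN_00401000(char *param_1)
{
  char local_28[32];
  int local_c;
  local_c = strlen(param_1);
  memcpy(local_28, param_1, local_c);
  puts(local_28);
}"""

calls = find_dangerous_calls(sample)
```

Each hit marks a candidate slicing point. A regular expression cannot distinguish calls from declarations or from text inside string literals, so a real implementation would more likely work on the decompiler's syntax tree.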
Step S130: when a dangerous function is contained, extract a forward slice code segment and a backward slice code segment of the dangerous function.
When the pseudocode is found to contain a dangerous function, program slicing is applied to the dangerous function to obtain its forward slice code segment and backward slice code segment. Program slicing is a program analysis technique for decomposing a program, aiming to extract the code fragments that satisfy certain constraints. It looks for relevant features in the program, decomposes the program accordingly, and then analyzes the resulting slices in order to understand the program as a whole. In short, it keeps the relevant parts of the code and discards the irrelevant parts, which makes the program easier to debug, test, and maintain.
The present method slices starting from the statements that may cause a vulnerability, so that subsequent analysis can locate the specific statements responsible. Compared with existing methods that analyze whole functions as units, the analysis granularity is finer, which allows more accurate detection.
In one embodiment of the present application, the program slicing process is shown in Fig. 2. Extracting the forward and backward slice code segments of the dangerous function comprises:
Step A: extract a control dependency graph and a data dependency graph of each function in the pseudocode, and construct a program dependency graph.
The control dependency graph and data dependency graph of each function in the pseudocode are extracted. A control dependency means, for example, that whether statement a is executed is determined by the execution result of statement b. A data dependency means, for example, that statement a reads a variable written by statement b. The program dependency graph of the program to be detected is then constructed from the control dependency graph and data dependency graph of each function.
As shown in Fig. 2, when a dangerous function call is detected in the pseudocode, for example on line 5, that line of code is extracted.
Step B: analyze the parameters and return value of the dangerous function call and perform forward slicing based on the program dependency graph to obtain the forward slice code segment.
Forward slicing is performed on the dangerous function based on the program dependency graph constructed in step A. For a given point of interest, a forward slice consists of all statements and predicates that are affected by the value of a variable at that point. The call parameters and return value of the dangerous function are analyzed to construct the set affect(v/n), where v is a variable output by the dangerous function and n is the point of interest; this set is the forward slice code segment of the dangerous function.
Step C: analyze the parameters and return value of the dangerous function call and perform backward slicing based on the program dependency graph to obtain the backward slice code segment.
Backward slicing is the opposite of forward slicing: it constructs a set select(v/n) consisting of all statements and predicates that affect the variable v at point n, where v is a variable received by the dangerous function and n is the point of interest.
When the pseudocode contains multiple dangerous functions, each detected dangerous function is sliced separately, so that a complete slice is obtained for analyzing each dangerous function call.
As shown in Fig. 2, forward and backward slicing are performed using the extracted line of code and its position as the slicing reference, yielding the forward slice code segment and the backward slice code segment.
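Forward and backward slicing over a program dependency graph can be sketched as graph reachability. The example below assumes the graph is already built and stored as an adjacency map keyed by line number, where an edge a -> b means that line b depends on line a. The graph contents, the slicing point, and the helper names `reachable` and `reverse` are hypothetical, and the sketch slices on whole statements rather than on individual variables.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from start (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge so that dependents point back at their sources."""
    flipped = {}
    for src, dsts in graph.items():
        for dst in dsts:
            flipped.setdefault(dst, set()).add(src)
    return flipped


# Hypothetical program dependency graph (control and data edges merged).
pdg = {
    1: {5, 6},  # a parameter defined on line 1 is used on lines 5 and 6
    3: {6, 7},  # a buffer declared on line 3 is used on lines 6 and 7
    4: {5},     # a variable declared on line 4 is assigned on line 5
    5: {6},     # the value computed on line 5 is passed on line 6
    6: {7},     # data written by the call on line 6 is read on line 7
}

slice_point = 6  # line of the dangerous function call
forward_slice = reachable(pdg, slice_point)            # lines affected by the call
backward_slice = reachable(reverse(pdg), slice_point)  # lines that affect the call
```

Following edges forward gives the statements influenced by the call, and following the reversed edges gives the statements that influence it.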
Step S140: combine the forward slice code segment and the backward slice code segment to obtain the complete slice code of the dangerous function.
After the forward and backward slice code segments of the dangerous function are obtained, duplicate code statements are removed and the remaining statements are assembled in code order to form the final complete slice code for the dangerous library/API function call. As shown in Fig. 2, assembling the sliced code in code order yields the complete slice code.
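A minimal sketch of the assembly step, assuming each slice is represented as a collection of 1-based line numbers into the pseudocode; the function `assemble_slice` and the sample lines are hypothetical.

```python
def assemble_slice(source_lines, forward_lines, backward_lines):
    """Merge two slices, drop duplicate lines, and keep original code order."""
    merged = sorted(set(forward_lines) | set(backward_lines))
    return [source_lines[n - 1] for n in merged]


# Hypothetical pseudocode lines; line numbers are 1-based.
source = [
    "char buf[16];",
    "int n = strlen(src);",
    "memcpy(buf, src, n);",
    "puts(buf);",
    "return 0;",
]

# The slicing point (line 3) belongs to both slices and must appear once.
complete = assemble_slice(source, forward_lines={3, 4}, backward_lines={1, 2, 3})
```

Taking the set union removes the statements shared by both slices, and sorting by line number restores the code order.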
In an embodiment of the present application, after the complete slice code is obtained, the slice code is further symbolized as follows:
1. Remove all non-ASCII characters and comments from the pseudocode.
During decompilation, the decompilation tool automatically adds some comments, and some abnormal conditions can leave non-ASCII characters in the decompiled pseudocode. These are useless for vulnerability detection or interfere with it, and need to be removed.
2. Symbolize the variable names and function names.
Variable names in the pseudocode are assigned by the decompiler. Although the naming scheme is fixed, some names are derived from memory addresses, which inflates the vocabulary of the word vector model and hinders extension of the vulnerability detection model. Variable names are therefore symbolized uniformly.
The variable name symbolization rule is: each variable name is replaced by "VAR" followed by a number, where the number is the order in which the variable first appears in the slice code, starting from 1, as in "VAR1". For example, a variable named local_1b in the code fragment is renamed VAR1.
Function names are also symbolized, but only user-defined ones: library/API function names are strongly associated with vulnerabilities and are kept, whereas user-defined names vary widely, and that variation needs to be reduced. The rule is: each user-defined function name is replaced by "FUN" followed by a number, where the number again is the order in which the function name first appears in the slice code. Unlike variable numbering, function numbering starts from 0: "FUN0" is reserved for the function that contains the slice code, and the other user-defined functions appearing in the code segment are numbered from "FUN1". For example, if the function containing the slice code is named char_param_1, it is replaced with FUN0, and any other user-defined functions in the code segment are replaced starting from FUN1.
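The symbolization rules can be sketched as follows. The sketch assumes C-style comments, treats any unknown identifier followed by "(" as a user-defined function, and takes the name of the enclosing function as an argument. The keyword list, the library-function subset, and the function `symbolize` are hypothetical simplifications; identifiers inside string literals are not handled.

```python
import re

# Hypothetical, deliberately small sets used only for illustration.
KEYWORDS = {"void", "char", "int", "return", "if", "else", "while", "for", "sizeof"}
LIBRARY_FUNCTIONS = {"memcpy", "strlen", "puts"}

IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*\b")


def symbolize(code, enclosing_function):
    """Strip comments and non-ASCII characters, then rename identifiers."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)    # block comments
    code = re.sub(r"//[^\n]*", "", code)                 # line comments
    code = code.encode("ascii", "ignore").decode("ascii")

    var_names = {}
    fun_names = {enclosing_function: "FUN0"}  # FUN0 is the enclosing function

    def rename(match):
        name = match.group(0)
        if name in KEYWORDS or name in LIBRARY_FUNCTIONS:
            return name  # keywords and library/API names are kept as-is
        is_call = match.string[match.end():].lstrip().startswith("(")
        if name in fun_names or is_call:
            return fun_names.setdefault(name, "FUN%d" % len(fun_names))
        return var_names.setdefault(name, "VAR%d" % (len(var_names) + 1))

    return IDENTIFIER.sub(rename, code)


sample = """void char_param_1(char *param_1) {
  char local_1b[16];  // stack buffer
  helper(param_1);
  memcpy(local_1b, param_1, 16);
}"""

result = symbolize(sample, "char_param_1")
```

In this sample the enclosing function becomes FUN0, `helper` becomes FUN1, and the variables are numbered in order of first appearance, so `param_1` becomes VAR1 and `local_1b` becomes VAR2.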
Step S150: convert the complete slice code into a vector representation.
The complete slice code of the dangerous function is input into a pre-trained word vector model to obtain the vector representation of the slice code. In one embodiment of the present application, independent lexical units in the slice code, such as void and fun0, are fed to the word vector model as words, and the model converts each input word into a vector; for example, fun0 is converted into a vector (shown as an image in the original publication). The word vector model is a Word2Vec model in skip-gram mode. It is trained in advance: the present application trains a base Word2Vec model on a pseudocode corpus to obtain a word vector model that can be used to convert pseudocode.
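In skip-gram mode, Word2Vec is trained to predict the words surrounding each center word. The sketch below only shows how (center, context) training pairs could be generated from a token sequence; it does not train a model. The window size and the function `skipgram_pairs` are hypothetical, since the text does not state the training hyperparameters other than the vector dimension.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical token sequence from a symbolized slice.
tokens = ["void", "FUN0", "(", "char", "*", "VAR1", ")"]
pairs = skipgram_pairs(tokens, window=1)
```

With a window of 1, each interior token contributes two pairs and each end token contributes one.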
Converting the complete slice code into a vector representation comprises:
performing lexical analysis on the symbolized slice code, decomposing each slice code into a fixed number of tokens.
A token, also called a lexical unit, is produced by scanning the structure of a program and represents a syntactic element. The slice code is analyzed and converted into a sequence of tokens, and the number of tokens per slice code is fixed at 500: if there are more than 500 tokens, only the first 500 are kept; if there are fewer than 500, the sequence is padded with zeros.
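The tokenization and length normalization can be sketched as below. The regular-expression tokenizer is an assumption made for illustration: it splits identifiers, digit runs, and single punctuation characters, so multi-character operators such as "==" are split in two. Only the limit of 500 and the zero padding come from the text.

```python
import re

MAX_TOKENS = 500  # fixed length stated in the text
PAD = 0           # padding value stated in the text

# Identifiers, digit runs, or any single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")


def tokenize(code):
    """Split slice code into a flat list of tokens."""
    return TOKEN_PATTERN.findall(code)


def pad_or_truncate(tokens, length=MAX_TOKENS, pad=PAD):
    """Keep the first `length` tokens, or pad with zeros up to `length`."""
    clipped = tokens[:length]
    return clipped + [pad] * (length - len(clipped))


tokens = tokenize("memcpy(VAR2, VAR1, 16);")
fixed = pad_or_truncate(tokens)
```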
Step S160: input the vector representation into a detection neural network to detect whether the program to be detected contains a vulnerability.
The vector representation obtained in step S150 is input into the detection neural network, which determines from the vector representation whether the program to be detected contains a vulnerability.
In one embodiment of the present application, the detection neural network is shown in Fig. 3 and comprises a BiGRU layer, a self-attention layer, a flattening layer, a fully connected layer, and an activation layer; the detection neural network is pre-trained.
The GRU (Gated Recurrent Unit) is a type of recurrent neural network (RNN). Like the LSTM (Long Short-Term Memory), it was proposed to address long-term memory and the gradient problems in backpropagation. In a unidirectional architecture, the state is always passed from front to back. A bidirectional network instead determines the current output from both the preceding and the following time steps, which helps extract deep features of the text.
A vulnerability is often caused by several semantically related code statements that may be far apart. To capture such long-distance semantic associations, a BiGRU network, which is better at capturing the contextual semantics of long text, is used to model the context. The BiGRU (Bidirectional Gated Recurrent Unit) used in the present application is a neural network model composed of two unidirectional GRUs running in opposite directions. At each time step, the input is fed to both GRUs simultaneously, and the output is determined jointly by the two.
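For reference, the standard GRU update can be written as below. The notation is the common textbook form and is an assumption here, since the application does not give these equations; the symbols $W$, $U$, $b$ are learned parameters, $x_t$ is the input at time step $t$, and $h_t$ is the hidden state. Some references swap the roles of $z_t$ and $1 - z_t$ in the last GRU equation.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
h_t^{\mathrm{Bi}} &= \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right]
\end{aligned}
```

The last line shows concatenation of the forward and backward hidden states, which is one common way to combine the two directions; the text only states that the output is determined jointly by both GRUs.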
In one embodiment of the present application, the BiGRU part consists of 2 BiGRU layers, each with 256 nodes.
Although the BiGRU extracts long-distance contextual semantic information from slice code well, different time steps in the BiGRU differ in importance. To reflect this, so that the code statements more relevant to the vulnerability carry more weight, the detection neural network also uses a self-attention mechanism. This mechanism processes sequential data effectively and takes the context of each time step into account. The self-attention layer uses a sigmoid activation function and is computed as follows:
[Formula: self-attention computation (image in the original publication)]
The formula is defined in terms of the following quantities, whose symbols are likewise images in the original publication: the query data and the input matrix of the query data; the key data and the input matrix of the key data; the transpose operation T; the query parameter, the key parameter, and the attention parameter; two bias values; an intermediate representation of the self-attention value; the self-attention value of one time step relative to another; the final output obtained after multiplying by the attention value; and the sigmoid activation function.
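The quantities listed above (two weight matrices applied to a pair of time steps, an attention parameter, two biases, an intermediate representation, and a sigmoid) are consistent with a widely used additive form of self-attention. The following is one plausible form under that assumption; every symbol is chosen here for illustration and none is recovered from the application. $x_t$ and $x_{t'}$ are the layer inputs at two time steps, $W_t$, $W_x$, $W_a$ are weight matrices, and $b_h$, $b_a$ are biases.

```latex
\begin{aligned}
h_{t,t'} &= \tanh\left(x_t^{T} W_t + x_{t'}^{T} W_x + b_h\right) \\
e_{t,t'} &= \sigma\left(W_a h_{t,t'} + b_a\right) \\
a_{t,t'} &= \frac{\exp\left(e_{t,t'}\right)}{\sum_{k} \exp\left(e_{t,k}\right)} \\
l_t &= \sum_{t'} a_{t,t'} \, x_{t'}
\end{aligned}
```

Under this reading, $h_{t,t'}$ is the intermediate representation, $a_{t,t'}$ is the attention value of $x_t$ relative to $x_{t'}$, and $l_t$ is the output after weighting by the attention values.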
A flattening layer follows the self-attention layer and flattens the multidimensional input into one dimension. A fully connected layer follows the flattening layer, and an activation layer follows the fully connected layer, using sigmoid as the activation function to produce the final output. The flattening layer, fully connected layer, and activation layer together form a binary classifier, which receives the features extracted by the BiGRU layer and the self-attention layer to perform vulnerability detection. The loss of the neural network is computed using binary cross-entropy.
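Binary cross-entropy for a batch of sigmoid outputs can be computed as in the sketch below. The clipping constant `eps` and the function name are assumptions added to keep the logarithm finite; they are not part of the application.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# One vulnerable slice predicted at 0.9 and one clean slice predicted at 0.2.
loss = binary_cross_entropy([1, 0], [0.9, 0.2])
```

For these two samples the loss is the mean of -ln(0.9) and -ln(0.8), approximately 0.164.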
The detection neural network is obtained through training, and the training process is similar to the detection method process of the application and can be explained mutually. Specifically, a training sample can be established to train a plurality of basic models, wherein one training sample is a complete section code sample carrying a label, wherein the complete section code sample can be obtained through the following processes from the first step to the third step, and the label is used for representing whether a program corresponding to the complete section code sample contains a vulnerability. Then, a preset network is trained by using the training sample, wherein the preset network can be a self-attention neural network. The training process is shown as the fifth step in the following steps.
The training stage consists of five steps. First, a number of binary programs are randomly selected and decompiled to obtain the pseudo code corresponding to each binary. Second, dangerous functions (which may be based on the dangerous function set of the above embodiment of the present application) are located in the program pseudo code: the dangerous library/API function calls in each function of the pseudo code are extracted, and program slicing is then used to extract the forward slice segments and backward slice segments of the parameters and return values of those library/API function calls. Third, the forward slice segments and backward slice segments of each dangerous library/API function call in the program are assembled, so that each assembled code segment relates to its corresponding library/API function call. During training, once a complete code slice has been assembled, the statements that actually trigger the vulnerability in each function are analyzed to determine whether the assembled slice code segment contains a vulnerability, and each code slice is labeled accordingly: '1' if it contains a vulnerability, '0' otherwise. Fourth, the assembled slice code segments are symbolized to reduce the influence of differences in non-ASCII characters, comments, custom function names, variable names and the like. The symbolized slice code then undergoes lexical analysis, which converts each slice code into a structure consisting of a series of tokens, and a corpus is built from the tokens of all slice sequences. A basic Word2Vec word vector model with a word vector dimension of 100 is trained on this corpus.
The trained Word2Vec word vector model maps every token to a vector representation; assembling the token vectors vectorizes each slice code so that it can serve as input to the deep learning model. Once the Word2Vec model has been trained, a vector representation of each code slice can be obtained. Fifth, the vectorized data are input into a BiGRU neural network based on a self-attention mechanism for training, yielding the detection neural network model.
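The lexical analysis and vectorization in the fourth step might look roughly like the following sketch. Several points are assumptions rather than details from the application: the token regular expression, the fixed slice length with zero padding, and above all `toy_embeddings`, which assigns random 100-dimensional vectors as a stand-in for a trained Word2Vec model (the dimension of 100 is the only value taken from the text).

```python
import random
import re

# Identifiers, hex and decimal literals, a few multi-character operators,
# then any other single non-space character.
TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*|0[xX][0-9a-fA-F]+|\d+|->|\+\+|--|<<|>>|[<>=!]=|&&|\|\||\S"
)


def tokenize(code):
    """Split a symbolized slice into a list of tokens."""
    return TOKEN_RE.findall(code)


def build_corpus(slices):
    """One token list per slice; this is what a word-vector model trains on."""
    return [tokenize(s) for s in slices]


def toy_embeddings(corpus, dim=100, seed=0):
    """Stand-in for a trained Word2Vec model: one random vector per token."""
    rng = random.Random(seed)
    vocab = sorted({tok for sentence in corpus for tok in sentence})
    return {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in vocab}


def vectorize(tokens, table, max_len, dim=100):
    """Map tokens to vectors, then truncate or zero-pad to max_len rows."""
    rows = [table.get(tok, [0.0] * dim) for tok in tokens[:max_len]]
    rows += [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows


slices = ["strcpy(VAR1, VAR2);", "VAR3 = FUN1(VAR1);"]
corpus = build_corpus(slices)
table = toy_embeddings(corpus)
matrix = vectorize(corpus[0], table, max_len=10)  # 10 rows of 100 values
```

Padding every slice to a common length is a common way to feed variable-length token sequences to a recurrent network in batches; the application does not state how sequence length is handled.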
In the vulnerability detection method of the present application, the binary code of the program to be detected is first decompiled into pseudo code. Dangerous functions are detected in the pseudo code and used as slicing points; the slice code related to each dangerous function call is extracted and converted into a vector representation; and the vectorized slice code is input to the detection neural network, which judges whether the program to be detected contains a vulnerability. The method is applicable to cross-architecture and cross-platform binary code vulnerability identification. It achieves fine-grained vulnerability detection at the binary code level, extracts features automatically, and mitigates the high false-positive rate caused by different compilation options and patch code, giving high accuracy together with low false-positive and false-negative rates.
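As a rough illustration of using dangerous function calls as slicing points, the sketch below scans decompiled pseudo code line by line. The contents of `DANGEROUS_FUNCTIONS` and the sample pseudo code are hypothetical; the actual dangerous function set is the one defined in the earlier embodiment, and a real implementation would work on the decompiler's syntax tree rather than on a regular expression.

```python
import re

# Hypothetical dangerous library/API set, for illustration only.
DANGEROUS_FUNCTIONS = {"strcpy", "strcat", "sprintf", "gets", "memcpy"}

# An identifier immediately followed by "(" is treated as a call.
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def find_slice_points(pseudocode):
    """Return (line number, function name) for every dangerous call."""
    points = []
    for line_no, line in enumerate(pseudocode.splitlines(), start=1):
        for match in CALL_RE.finditer(line):
            if match.group(1) in DANGEROUS_FUNCTIONS:
                points.append((line_no, match.group(1)))
    return points


pseudocode = """\
int __fastcall FUN1(char *a1)
{
  char v2[16];
  size_t v3;
  v3 = strlen(a1);
  strcpy(v2, a1);
  return printf("%s", v2);
}
"""
points = find_slice_points(pseudocode)
```

Each returned point would then seed the forward and backward slicing of that call's parameters and return value.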
Compared with existing binary code vulnerability detection methods, the method has the following advantages: 1. the semantic features of the code related to a dangerous function are extracted automatically for judgment, without manual effort, which removes the need for experts to define vulnerability features; 2. the analysis is based on decompilation, so it effectively resists the influence of compilation differences and patch code; 3. because features are extracted automatically and the method does not depend on fixed, predefined features, it can identify cloned vulnerabilities and, since vulnerabilities share similar features, detect unknown vulnerabilities; 4. the method is highly extensible, and the accuracy and practicality of the model improve as the vulnerability sample set grows, so it can be applied effectively to cloned vulnerability identification, unknown vulnerability identification, vulnerability type identification and other scenarios.
Based on the same inventive concept, an embodiment of the present application provides a vulnerability detection apparatus. Referring to fig. 4, fig. 4 is a schematic diagram of a vulnerability detection apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes:
a decompiling module 410, configured to decompile the program to be detected to obtain pseudo code of the program to be detected;
a dangerous function determining module 420, configured to detect whether the pseudo code contains a dangerous function;
a slicing module 430, configured to extract a forward slice code segment and a backward slice code segment of the dangerous function when a dangerous function is contained;
a slice code combining module 440, configured to combine the forward slice code segment and the backward slice code segment to obtain a complete slice code of the dangerous function;
a vector representation conversion module 450, configured to convert the complete slice code into a vector representation;
a detection neural network 460, configured to receive the vector representation and detect whether the program to be detected contains a vulnerability.
Preferably, the slicing module includes:
a program dependency graph constructing submodule, configured to extract a control dependency graph and a data dependency graph of each function in the pseudo code and to construct a program dependency graph;
a forward slice code segment obtaining submodule, configured to analyze the parameters and return values of the dangerous function call and perform forward slicing based on the program dependency graph to obtain a forward slice code segment;
and a backward slice code segment obtaining submodule, configured to analyze the parameters and return values of the dangerous function call and perform backward slicing based on the program dependency graph to obtain a backward slice code segment.
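Forward and backward slicing over a program dependency graph can be pictured as graph reachability, as in the simplified sketch below. It works at statement level only (it does not track individual parameters or return values), and the graph encoding, the example statements and the function names are assumptions made for illustration. An edge `a -> b` means that statement `b` depends on statement `a` through data or control.

```python
from collections import deque


def reachable(graph, start):
    """Breadth-first search: all nodes reachable from start, start included."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge so that dependents point back to what they depend on."""
    rev = {}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, set()).add(src)
    return rev


def slice_around(pdg, point):
    """Return (backward slice, forward slice, combined slice) as sorted lists."""
    forward = reachable(pdg, point)             # statements affected by point
    backward = reachable(reverse(pdg), point)   # statements that affect point
    return sorted(backward), sorted(forward), sorted(backward | forward)


# Hypothetical function, one statement per line number:
# 1: n = read_len();   2: buf = alloc(16);   3: if (n > 0)
# 4:   memcpy(buf, src, n);   5: use(buf);   6: log("done");
pdg = {1: {3, 4}, 2: {4, 5}, 3: {4}, 4: {5}}
backward, forward, complete = slice_around(pdg, point=4)
```

Statement 6 is independent of the `memcpy` call and is therefore left out of the combined slice, which corresponds to the complete slice code produced by the slice code combining module.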
In an optional embodiment of the present application, the detection neural network includes a BiGRU layer, a self-attention layer, a flattening layer, a fully connected layer, and an activation layer, and the detection neural network is pre-trained.
Further, the apparatus further comprises:
a removing module, configured to remove all non-ASCII characters and comments in the pseudo code;
and a symbolization processing module, configured to symbolize the variable names and the function names.
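The removal and symbolization steps could be sketched as follows. The `FUNn`/`VARn` naming scheme, the keyword list and the `LIBRARY_CALLS` keep-list are hypothetical choices, not taken from the application, and the approach is deliberately simplified: a name is treated as a function if it is followed by `(`, and character literals are not handled.

```python
import re

KEYWORDS = {"int", "char", "unsigned", "long", "void", "size_t",
            "return", "if", "else", "while", "for", "sizeof"}
LIBRARY_CALLS = {"strcpy", "memcpy", "strlen", "printf", "malloc", "free"}

STRING = r'"(?:\\.|[^"\\\n])*"'  # string literals are matched first and kept
NOISE_RE = re.compile(STRING + r"|/\*.*?\*/|//[^\n]*", re.DOTALL)
NAME_RE = re.compile(STRING + r"|\b[A-Za-z_]\w*\b")


def strip_noise(code):
    """Remove comments, then drop every non-ASCII character."""
    def keep_strings(match):
        text = match.group(0)
        return text if text.startswith('"') else ""
    code = NOISE_RE.sub(keep_strings, code)
    return code.encode("ascii", "ignore").decode("ascii")


def symbolize(code):
    """Rename user-defined functions to FUN1.. and variables to VAR1.."""
    funcs, variables = {}, {}

    def rename(match):
        name = match.group(0)
        if name.startswith('"') or name in KEYWORDS or name in LIBRARY_CALLS:
            return name
        is_call = code[match.end():].lstrip().startswith("(")
        table, prefix = (funcs, "FUN") if is_call else (variables, "VAR")
        if name not in table:
            table[name] = f"{prefix}{len(table) + 1}"
        return table[name]

    return NAME_RE.sub(rename, code)


source = (
    "int check_len(char *src) { // copy input\n"
    "  char buf[16]; /* local */\n"
    "  strcpy(buf, src);\n"
    "  return helper(buf);\n"
    "}"
)
normalized = symbolize(strip_noise(source))
```

Library calls are kept under their real names in this sketch so that the dangerous call itself stays visible to the model, while program-specific names are replaced by position-based symbols.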
Based on the same inventive concept, another embodiment of the present application provides a readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the vulnerability detection method according to any of the above embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the electronic device implements the steps in the vulnerability detection method according to any of the above embodiments of the present application.
Since the device embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The vulnerability detection method, device, equipment and storage medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the embodiments is intended only to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present application, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A vulnerability detection method, the method comprising:
decompiling a program to be detected to obtain pseudo code of the program to be detected;
detecting whether the pseudo code contains a dangerous function;
when a dangerous function is contained, extracting a forward slice code segment and a backward slice code segment of the dangerous function;
combining the forward slice code segment and the backward slice code segment to obtain a complete slice code of the dangerous function;
converting the complete slice code into a vector representation;
and inputting the vector representation into a detection neural network to detect whether the program to be detected contains a vulnerability.
2. The method of claim 1, wherein the detection neural network comprises a BiGRU layer, a self-attention layer, a flattening layer, a fully connected layer, and an activation layer, and wherein the detection neural network is pre-trained.
3. The method of claim 1, wherein extracting the forward slice code segment and the backward slice code segment of the dangerous function comprises:
extracting a control dependency graph and a data dependency graph of each function in the pseudo code, and constructing a program dependency graph;
analyzing the parameters and return values of the dangerous function call and performing forward slicing based on the program dependency graph to obtain the forward slice code segment;
and analyzing the parameters and return values of the dangerous function call and performing backward slicing based on the program dependency graph to obtain the backward slice code segment.
4. The method of claim 1, further comprising:
removing all non-ASCII characters and comments in the pseudo code;
and symbolizing the variable names and the function names.
5. A vulnerability detection apparatus, the apparatus comprising:
a decompiling module, configured to decompile a program to be detected to obtain pseudo code of the program to be detected;
a dangerous function determining module, configured to detect whether the pseudo code contains a dangerous function;
a slicing module, configured to extract a forward slice code segment and a backward slice code segment of the dangerous function when a dangerous function is contained;
a slice code combining module, configured to combine the forward slice code segment and the backward slice code segment to obtain a complete slice code of the dangerous function;
a vector representation conversion module, configured to convert the complete slice code into a vector representation;
and a detection neural network, configured to receive the vector representation and detect whether the program to be detected contains a vulnerability.
6. The apparatus of claim 5, wherein the detection neural network comprises a BiGRU layer, a self-attention layer, a flattening layer, a fully connected layer, and an activation layer, and wherein the detection neural network is pre-trained.
7. The apparatus of claim 5, wherein the slicing module comprises:
a program dependency graph constructing submodule, configured to extract a control dependency graph and a data dependency graph of each function in the pseudo code and to construct a program dependency graph;
a forward slice code segment obtaining submodule, configured to analyze the parameters and return values of the dangerous function call and perform forward slicing based on the program dependency graph to obtain a forward slice code segment;
and a backward slice code segment obtaining submodule, configured to analyze the parameters and return values of the dangerous function call and perform backward slicing based on the program dependency graph to obtain a backward slice code segment.
8. The apparatus of claim 5, further comprising:
a removing module, configured to remove all non-ASCII characters and comments in the pseudo code;
and a symbolization processing module, configured to symbolize the variable names and the function names.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as claimed in any one of claims 1 to 4.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 4 when executing the computer program.
CN202110855058.4A 2021-07-28 2021-07-28 Vulnerability detection method, device, equipment and storage medium Pending CN113297584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110855058.4A CN113297584A (en) 2021-07-28 2021-07-28 Vulnerability detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110855058.4A CN113297584A (en) 2021-07-28 2021-07-28 Vulnerability detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113297584A true CN113297584A (en) 2021-08-24

Family

ID=77331188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110855058.4A Pending CN113297584A (en) 2021-07-28 2021-07-28 Vulnerability detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113297584A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806750A (en) * 2021-09-24 2021-12-17 深信服科技股份有限公司 File security risk detection method, model training method, device and equipment
WO2023168302A3 (en) * 2022-03-02 2023-11-16 Sentinel Labs Israel Ltd. Systems, methods, and devices for executable file classification

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870752A (en) * 2012-12-18 2014-06-18 百度在线网络技术(北京)有限公司 Method and device for detecting Flash XSS (Cross Site Script) vulnerabilities and equipment
CN104519007A (en) * 2013-09-26 2015-04-15 深圳市腾讯计算机系统有限公司 Vulnerability detection method and server
CN106295346A (en) * 2015-05-20 2017-01-04 深圳市腾讯计算机系统有限公司 Application vulnerability detection method and device and computing equipment
CN112699377A (en) * 2020-12-30 2021-04-23 哈尔滨工业大学 Function-level code vulnerability detection method based on slice attribute graph representation learning
CN112906004A (en) * 2021-01-26 2021-06-04 北京顶象技术有限公司 Vulnerability detection method and device based on assembly code and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
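Cui Yanpeng et al.: "Research on a Cloud Computing Software Security Model Based on Program Slicing Technology" (基于程序切片技术的云计算软件安全模型研究), 《技术研究》 *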

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806750A (en) * 2021-09-24 2021-12-17 深信服科技股份有限公司 File security risk detection method, model training method, device and equipment
CN113806750B (en) * 2021-09-24 2024-02-23 深信服科技股份有限公司 File security risk detection method, training method, device and equipment of model
WO2023168302A3 (en) * 2022-03-02 2023-11-16 Sentinel Labs Israel Ltd. Systems, methods, and devices for executable file classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210824