CN114254323A - Software vulnerability analysis method and system based on PCODE and Bert - Google Patents

Info

Publication number
CN114254323A
Authority
CN
China
Prior art keywords
vulnerability analysis
pcode
bert
vulnerability
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111333255.6A
Other languages
Chinese (zh)
Inventor
韩文杰
庞建民
单征
周鑫
岳峰
李明亮
祝迪
王其涵
刘光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force
Priority to CN202111333255.6A
Publication of CN114254323A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Virology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention belongs to the field of network security and relates to a software vulnerability analysis method and system based on PCODE and Bert. A vulnerability analysis framework is constructed; the framework decompiles the input binary program content to generate the PCODE intermediate language, and uses a Bert neural network to perform vector mapping, feature extraction and classification on the PCODE intermediate language. A corpus data set, a training data set and a test data set are collected. In sequence, the Bert neural network in the framework is pre-trained with the corpus data set, then trained with the training data set to learn vulnerability semantic features and thereby generate a vulnerability analysis model, and the performance of the vulnerability analysis model is evaluated and optimized with the test data set to generate a final vulnerability analysis model. The vulnerability categories in a target binary program file are then identified using the framework containing the final vulnerability analysis model. The invention can discover meaningful information and vulnerabilities in program code segments, improves the analysis of cross-architecture and multi-type software vulnerabilities, and improves efficiency and accuracy.

Description

Software vulnerability analysis method and system based on PCODE and Bert
Technical Field
The invention belongs to the technical field of cyberspace security, and particularly relates to a software vulnerability analysis method and system based on PCODE and Bert.
Background
The development of internet technology has rapidly increased the number and variety of software, bringing potential security hazards along with convenience. Attackers exploit more and more vulnerabilities to threaten the security of cyberspace, so the importance of software vulnerability analysis techniques is self-evident. For example, the OpenSSL Heartbleed vulnerability exposed in 2014 affected billions of internet users. However, given the huge number of vulnerabilities, analyzing them manually is very time-consuming and laborious. Research on intelligent analysis of software vulnerabilities is therefore urgent.
Traditional software vulnerability analysis methods mainly comprise static analysis, dynamic analysis, and hybrid analysis that combines the two. Static vulnerability analysis has an advantage in speed, but its false alarm rate is generally high. Dynamic vulnerability analysis can locate vulnerabilities accurately, but for large-scale software systems it often requires a large amount of resources. Combined static and dynamic analysis achieves better detection efficiency and accuracy, but is generally suited only to specific vulnerability types. At present, much research work is carried out at the source code level, but in real cyberspace most software is delivered to users as packaged binary programs. In particular, with the spread of internet of things devices, the source code of common devices is rarely available for vulnerability analysis, so closed-source software limits these methods. In addition, internet of things devices use diverse architectures; the same program can be compiled for different architectures whose assembly languages differ greatly, so a disclosed vulnerability can continue to endanger cyberspace security on other architectures.
Disclosure of Invention
Therefore, the invention provides a software vulnerability analysis method and system based on PCODE and Bert, in which PCODE is used for software vulnerability analysis, a BERT model performs feature learning on the code segments in a training set, and software vulnerabilities are predicted from the output of a classifier. This improves the analysis of cross-architecture and multi-type software vulnerabilities and effectively safeguards the security of cyberspace.
According to the design scheme provided by the invention, a software vulnerability analysis method based on PCODE and Bert is provided, which comprises the following contents:
constructing a vulnerability analysis framework, decompiling the input binary program file content with the framework to generate a pcode intermediate language, and performing vector mapping, feature extraction and classification on the pcode intermediate language with a Bert neural network;
collecting an unlabeled program corpus data set used to pre-train the vector mapping of the Bert neural network in the framework, a training data set used to train the Bert neural network in the framework to learn software vulnerability semantic features and thereby generate a vulnerability analysis model, and a test data set used to evaluate and optimize the generated vulnerability analysis model; first pre-training the Bert neural network in the framework with the corpus data set, then training the Bert neural network in the framework with the training data set to learn software vulnerability semantic features and generate the vulnerability analysis model, and finally evaluating and optimizing the performance of the vulnerability analysis model with the test data set to generate a final vulnerability analysis model used for vulnerability analysis and prediction on target software;
and identifying vulnerability categories in the target binary program file by using the vulnerability analysis framework containing the finally generated vulnerability analysis model.
In the software vulnerability analysis method based on PCODE and Bert, further, the vulnerability analysis framework uses a decompilation tool to decompile the input into the pcode intermediate language, and applies normalization and standardization preprocessing to the decompiled pcode intermediate language to obtain an intermediate language representation in a unified format, which serves as the input of the Bert neural network model.
In the software vulnerability analysis method based on PCODE and Bert, further, the preprocessing comprises: normalizing the instruction sequences and invalid instructions in the program code segment; and standardizing the function names and variables in the program code segment.
In the software vulnerability analysis method based on PCODE and Bert, further, in the Bert neural network of the vulnerability analysis framework, an embedding layer maps the pcode intermediate language representation into a vector space, a plurality of Transformer encoders connected in series extract features from the embedded vectors through continuous iteration and back propagation, and a classifier classifies according to the extracted features.
In the software vulnerability analysis method based on PCODE and Bert, further, the BERT neural network model uses cross entropy as the loss function to evaluate the difference between the model's current probability distribution and the true distribution of the data set during model training.
In the software vulnerability analysis method based on PCODE and Bert, further, each Transformer encoder extracts the context semantic features of the current node by reading fixed-length input, and uses a self-attention mechanism to fuse the context semantic features of the current node into the vector space as the output vector of the current layer, which is passed to the next layer.
In the software vulnerability analysis method based on PCODE and Bert, further, the classifier is constructed from two fully connected layers and a softmax layer, wherein the softmax activation function is expressed as:
$$S_i = \frac{e^{v_i}}{\sum_{j=1}^{k} e^{v_j}}$$
where i denotes a single class among the k classes, and $v_i$ denotes the value corresponding to class i.
Further, the present invention provides a software vulnerability analysis system based on PCODE and Bert, comprising a framework construction module, a model generation module and a target identification module, wherein,
the framework construction module is used for constructing a vulnerability analysis framework, decompiling the input binary program file content with the framework to generate a pcode intermediate language, and performing vector mapping, feature extraction and classification on the pcode intermediate language with a Bert neural network;
the model generation module is used for collecting an unlabeled program corpus data set used to pre-train the vector mapping of the Bert neural network in the framework, a training data set used to train the Bert neural network in the framework to learn software vulnerability semantic features and thereby generate a vulnerability analysis model, and a test data set used to evaluate and optimize the generated vulnerability analysis model; pre-training the Bert neural network in the framework with the corpus data set, training the Bert neural network in the framework with the training data set to learn software vulnerability semantic features and generate the vulnerability analysis model, and evaluating and optimizing the performance of the vulnerability analysis model with the test data set to generate a final vulnerability analysis model used for vulnerability analysis and prediction on target software;
and the target identification module is used for identifying the vulnerability categories in the target binary program file by using the vulnerability analysis framework containing the finally generated vulnerability analysis model.
The invention has the following beneficial effects:
the invention uses pcode, which sits in the middle ground between high-level languages and low-level assembly language, and makes full use of its cross-architecture applicability. A BERT neural network performs sequence learning, on the pcode intermediate language, over a binary file data set containing multiple kinds of vulnerabilities, and predicts software vulnerabilities by learning their characteristic patterns. Meaningful information and vulnerabilities in program code segments can be found without access to the original source code, semantic learning is achieved for programs of different architectures, the efficiency and accuracy of vulnerability identification are improved, and the security of cyberspace is effectively safeguarded. Experiments on real data sets further verify that the scheme can mine the weak points in a file efficiently and accurately, and that it has good application prospects.
Description of the drawings:
FIG. 1 is a schematic flow chart of the software vulnerability analysis method based on PCODE and Bert in the embodiment;
FIG. 2 is a schematic block diagram of the software vulnerability analysis in the embodiment;
FIG. 3 is a schematic comparison of a code segment before and after preprocessing in the embodiment;
FIG. 4 is a schematic structural diagram of the BERT model in the embodiment;
FIG. 5 is a schematic structural diagram of the Transformer encoder in the embodiment.
Detailed description of the embodiments:
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and technical solutions.
Software vulnerability analysis protects a software system by mining its vulnerabilities. The task takes source code or a binary file as input and predicts software vulnerabilities through analysis, which is important for maintaining cyberspace security. Existing methods, however, struggle with cross-architecture vulnerabilities, multiple vulnerability types, and closed-source software. The embodiment of the invention provides a software vulnerability analysis method based on PCODE and Bert, which is shown in FIG. 1 and comprises the following contents:
S101, constructing a vulnerability analysis framework, decompiling the input binary program file content with the vulnerability analysis framework to generate a pcode intermediate language, and performing vector mapping, feature extraction and classification on the pcode intermediate language with a Bert neural network;
S102, collecting an unlabeled program corpus data set used to pre-train the vector mapping of the Bert neural network in the framework, a training data set used to train the Bert neural network in the framework to learn software vulnerability semantic features and thereby generate a vulnerability analysis model, and a test data set used to evaluate and optimize the generated vulnerability analysis model; pre-training the Bert neural network in the framework with the corpus data set, training the Bert neural network in the framework with the training data set to learn software vulnerability semantic features and generate the vulnerability analysis model, and evaluating and optimizing the performance of the vulnerability analysis model with the test data set to generate a final vulnerability analysis model used for vulnerability analysis and prediction on target software;
S103, identifying vulnerability categories in the target binary program file by using the vulnerability analysis framework containing the finally generated vulnerability analysis model.
BERT is a pre-trained contextual classification model whose network architecture is based on the Transformer encoder structure. For software vulnerability analysis, the biggest problem is that it is difficult to collect enough samples to meet the training requirements of a neural network, and BERT addresses this problem well. The BERT model is first pre-trained on a large unlabeled corpus to learn linguistic structure, and is then fine-tuned for a specific downstream task to solve NLP tasks (e.g., sentiment classification), which may yield better results than training on the training set alone. Compared with the shallow memory mechanisms that traditional models rely on, the self-attention mechanism in the Transformer is less likely to lose important information in long sequences, and therefore shows a clear advantage in extracting context semantics from structured text. In addition, BERT maps an input sequence to an output sequence, encoding the context of each token into that token's word vector, which achieves better context-aware semantics.
In the embodiment of the scheme, the intermediate language PCODE is used for software vulnerability analysis, the BERT model learns the features of the code segments in the training set, and software vulnerabilities are predicted from the output of the classifier. This improves the analysis of cross-architecture and multi-type software vulnerabilities and effectively safeguards the security of cyberspace.
In the vulnerability analysis framework, a decompilation tool decompiles the input into the pcode intermediate language, and normalization and standardization preprocessing is applied to the decompiled pcode intermediate language to obtain an intermediate language representation in a uniform format, which is used as the input of the Bert neural network model.
The decompilation tool Ghidra is a software reverse engineering framework developed for network security tasks. By using it to reverse engineer binary files, network security professionals can better understand potential vulnerabilities in networks and systems. As an open-source reverse engineering tool, Ghidra supports different processor architectures, and researchers can develop plug-ins to suit different research requirements. For experts engaged in binary security research, the open-source Ghidra decompilation engine is highly extensible, so a suitable analysis tool can be customized for the software at hand. Ghidra also provides many useful tools in its own user interface, and its good interactivity makes it convenient to use. To meet the need for large-scale binary file analysis, Ghidra can also be launched and run in headless mode, so that users can perform batch analysis through an interface. Pcode is Ghidra's intermediate language; it is machine-independent and intended to emulate a general-purpose processor. By analyzing in the middle ground between low-level machine-specific assembly language and high-level programming languages, meaningful information and vulnerabilities of a given program can be discovered without access to the original source code. A basic pcode operation unit mainly consists of an operation code and node variables.
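As a rough illustration of the headless batch mode mentioned above, the sketch below only assembles the command line for Ghidra's `analyzeHeadless` launcher; it does not run Ghidra. The installation path, project names, script directory and the post-script name `dump_pcode.py` are hypothetical placeholders and are not taken from the patent.

```python
from pathlib import PurePosixPath
from typing import List


def build_headless_command(
    ghidra_home: str,
    project_dir: str,
    project_name: str,
    binary_path: str,
    script_dir: str,
    post_script: str,
) -> List[str]:
    """Build the argv list for one headless Ghidra import-and-analyze run.

    All paths and the script name are caller-supplied placeholders.
    """
    launcher = PurePosixPath(ghidra_home) / "support" / "analyzeHeadless"
    return [
        str(launcher),
        project_dir,                  # directory that holds the Ghidra project
        project_name,                 # project name inside that directory
        "-import", binary_path,       # binary to import and auto-analyze
        "-scriptPath", script_dir,    # where the post-script is looked up
        "-postScript", post_script,   # script run after analysis (e.g. a pcode dump)
        "-deleteProject",             # discard the temporary project afterwards
    ]


def build_batch(ghidra_home: str, project_dir: str, binaries: List[str],
                script_dir: str, post_script: str) -> List[List[str]]:
    """One command per binary; each run uses its own throwaway project name."""
    return [
        build_headless_command(ghidra_home, project_dir, f"proj_{i}",
                               path, script_dir, post_script)
        for i, path in enumerate(binaries)
    ]
```

Each returned list could be handed to `subprocess.run` on a machine where Ghidra is installed; the post-script itself (which would walk the listing and emit pcode) is outside the scope of this sketch.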
(1) Operation codes. The action of a pcode operation is determined by its opcode, which defines the arithmetic or logical operation performed by the general-purpose processor and simulates a machine instruction. All pcode operations follow the basic instruction specification, taking one or more node variables as input and generating one node variable as output. That is, the number of operands of a pcode opcode is not fixed; for example, the move opcode can be used as a unary or a binary opcode. Pcode defines 63 opcodes, which can be classified into data movement, arithmetic operations, logic operations and floating point comparison.
(2) Node variables. A node variable is an abstract representation of a register or a region of memory. It consists of a triple (address space, space offset, size), so that in pcode a data area can be located from the start address and size of the parameter. Node variables are untyped: an individual pcode operation forces each node variable to be interpreted as an integer, a floating point number or a Boolean, and all operations on data in the pcode language are carried out through node variables. Pcode defines all node variables in Static Single Assignment (SSA) form, meaning that there is no indirect impact between instructions containing node variables.
(3) Address spaces. A pcode address space is an abstraction of the Random Access Memory (RAM) that interacts directly with the CPU. It can be understood simply as an indexed sequence of bytes that pcode operations can read and write, abstracting both the memory accessible to the emulated processor and the space that models the processor's general-purpose registers. Pcode divides it into three address spaces: 1) Constant address space: encodes any constant value required by a pcode operation, and is named const. 2) Register address space: models the general-purpose registers of the processor, and is named register. 3) Unique address space: holds intermediate values when modeling instruction behavior; it can be regarded abstractly as an infinite array, and is named unique.
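The three items above can be made concrete with a small data model. The sketch below is illustrative only: it assumes a textual form such as `(unique, 0x3a0, 4) INT_ADD (register, 0x10, 4) , (const, 0x4, 4)`, with ` --- ` in place of a missing output, and the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed text form of a node variable: (space, hex offset, decimal size).
VARNODE_RE = re.compile(r"\(\s*(\w+)\s*,\s*(0x[0-9a-fA-F]+)\s*,\s*(\d+)\s*\)")
# The opcode is taken to be the first upper-case word on the line.
OPCODE_RE = re.compile(r"\b([A-Z][A-Z0-9_]*)\b")


@dataclass(frozen=True)
class Varnode:
    space: str    # e.g. "const", "register", "unique"
    offset: int   # offset inside the address space
    size: int     # size in bytes


@dataclass(frozen=True)
class PcodeOp:
    opcode: str
    output: Optional[Varnode]       # None when the operation writes nothing
    inputs: Tuple[Varnode, ...]


def _varnodes(text: str) -> Tuple[Varnode, ...]:
    return tuple(Varnode(s, int(o, 16), int(n)) for s, o, n in VARNODE_RE.findall(text))


def parse_op(line: str) -> PcodeOp:
    """Split one line into output (left of the opcode) and inputs (right of it)."""
    match = OPCODE_RE.search(line)
    if match is None:
        raise ValueError(f"no opcode found in: {line!r}")
    outputs = _varnodes(line[:match.start()])
    inputs = _varnodes(line[match.end():])
    return PcodeOp(match.group(1), outputs[0] if outputs else None, inputs)
```

Keeping the address space as an explicit field is what makes the later per-space rewriting rules (const, register, unique) straightforward to express.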
As an open-source decompilation engine, Ghidra is highly extensible, and pcode, as an intermediate language, has good cross-architecture applicability. Relying on the powerful Ghidra decompilation engine, architectures such as MIPS and PowerPC can also be decompiled and analyzed well (many decompilation tools perform poorly on the MIPS architecture). Pcode has a well-defined data specification and clear data flow and control flow; compared with assembly language and other intermediate languages, it conveys program semantics better and is more readable. However, native pcode hinders deep learning because of factors such as compilation noise and variable definitions. In this scheme these problems are addressed through preprocessing such as normalization and standardization, so that deep learning can learn the semantics of the program, understand its context semantics better, and classify software vulnerabilities better.
As a software vulnerability analysis method based on PCODE and Bert in the embodiment of the present invention, further, the preprocessing includes: normalizing the instruction sequences and invalid instructions in the program code segment; and standardizing the function names and variables in the program code segment. Further, in the Bert neural network of the vulnerability analysis framework, an embedding layer maps the pcode intermediate language representation into a vector space, a plurality of Transformer encoders connected in series extract features from the embedded vectors through continuous iteration and back propagation, and a classifier classifies according to the extracted features. Further, the BERT neural network model uses cross entropy as the loss function to evaluate the difference between the model's current probability distribution and the true distribution of the data set during model training. Furthermore, each Transformer encoder extracts the context semantic features of the current node by reading fixed-length input, and uses a self-attention mechanism to fuse the context semantic features of the current node into the vector space as the output vector of the current layer, which is passed to the next layer.
Referring to FIG. 2, a source file or a binary file is taken as input and decompiled into the intermediate language pcode with the Ghidra decompilation tool. The result is normalized and standardized to eliminate differences caused by factors such as compilation noise and to improve the semantic information of the pcode, which is then mapped into a vector space. The vectors are fed into the attention neural network, which learns the semantic information of the program context, and finally the system uses the output of the neural network as the basis for classifying vulnerability categories.
To eliminate the interference that factors such as constants, register names and compilation noise in the decompiled code cause when learning program semantics, a normalized and standardized representation of the decompiled code is proposed on the pcode intermediate language, so that the neural network focuses more on learning the context logic and semantics of the program. In FIG. 3, (a) is a pcode code segment obtained by decompiling the program, and (b) is the code segment after normalization and standardization.
A program code segment is mainly divided into the region that implements the function's functionality, the preparation region executed before the function runs, and the cleanup region executed when the function finishes. To help the subsequent neural network learn program semantics, the normalization performed during preprocessing mainly comprises the following work:
(1) Normalizing the general regions. On entering a function body, compiler-generated code performs a series of operations to open up stack space for the new function: it assigns the esp value to ebp, moves the esp register pointer downward, and sets the flag registers, as in lines 1-11 of (a) in FIG. 3. On exiting the function, it stores the function return value, destroys the memory space opened up for the function body, and resets the flag registers, as in lines 32-37 of (a) in FIG. 3. These instruction sequences carry no meaning for the neural network's learning of vulnerability patterns and therefore need to be deleted.
(2) Deleting invalid instructions. After the function body statements have executed, the register eax mainly stores the return value of the function, which does not help the neural network learn vulnerability patterns, so these instructions need to be deleted. This corresponds to line 31 of (a) in FIG. 3, which is deleted during normalization.
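The source does not state how the preparation and cleanup regions are located. The sketch below is therefore only an assumed heuristic on a toy operation model, not the patent's procedure: it trims leading and trailing operations that manage the stack and frame pointers, the flags or the return address, and then drops a final write to `eax`. The `Op` class, the register and flag names, and the rule itself are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed symbolic names for registers written by prologue/epilogue code.
FRAME_REGS = {"esp", "ebp", "eip"}
FLAG_REGS = {"CF", "OF", "SF", "ZF", "PF", "AF"}


@dataclass(frozen=True)
class Op:
    opcode: str
    output: Optional[str]       # symbolic output name, or None
    inputs: Tuple[str, ...]     # symbolic input names


def is_bookkeeping(op: Op) -> bool:
    """Assumed rule for 'stack frame management' operations."""
    if op.opcode == "RETURN":
        return True
    if op.output in FRAME_REGS or op.output in FLAG_REGS:
        return True
    # push/pop style memory access addressed directly by esp
    if op.opcode in ("STORE", "LOAD") and len(op.inputs) > 1 and op.inputs[1] == "esp":
        return True
    # saving a frame register into a temporary (first half of a push)
    if op.opcode == "COPY" and len(op.inputs) == 1 and op.inputs[0] in FRAME_REGS:
        return True
    return False


def normalize(ops: List[Op]) -> List[Op]:
    """Trim the leading/trailing bookkeeping runs, then the final eax write."""
    start = 0
    while start < len(ops) and is_bookkeeping(ops[start]):
        start += 1
    end = len(ops)
    while end > start and is_bookkeeping(ops[end - 1]):
        end -= 1
    body = ops[start:end]
    if body and body[-1].output == "eax":
        body = body[:-1]        # return-value write carries no pattern information
    return body
```

Only the runs at the two ends are trimmed, so accesses to local variables through `ebp` inside the function body are left untouched.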
Because debugging information is missing, the decompiled code lacks information such as variable names, function names and register names. This information is valuable for software vulnerability analysis, so the scheme standardizes the decompiled pcode code segment to recover it. The main work involved in standardization may include the following:
(1) Function name standardization. For function calls that occur inside the function body, the address offsets in the decompiled code are replaced using a uniform naming method. User-defined sub-functions are named in the order in which they are called: for example, the address offset 0x401460 appearing in line 29 of (a) in FIG. 3, corresponding to line 19 of (b) in FIG. 3, is replaced with the custom function name fun1 according to its order of appearance. For a system function call, the program's function table is searched for the system function name corresponding to the address offset, and the address offset is replaced with that name: lines 14 and 26 of (a) in FIG. 3, corresponding to lines 4 and 16 of (b) in FIG. 3, are replaced with the system function names memcpy and printLine.
(2) Variable standardization. With respect to variables, the neural network is expected to focus during learning on the types of the variables the program operates on rather than on their specific values, so specific variable values are eliminated as far as possible during normalization and standardization. The Ghidra decompiler maintains a register mapping table in which common registers are mapped to constants; for example, the esp register is mapped to register 0x10. During standardization the register names are restored using this mapping table: for example, line 18 of (a) in FIG. 3 identifies register 0x0 as the eax register, so line 8 of (b) in FIG. 3 replaces it with eax. For the const type, a specific constant value or address offset value has little influence on learning the semantic patterns of the program context, so it is replaced in a standardized way. The value is first classified as either a constant or an address offset; a constant is replaced with the field const, as in line 3 of (b) in FIG. 3, and an address offset is replaced with address-offset, as in line 15 of (b) in FIG. 3. For the unique type, variables in the pcode intermediate language are named using the SSA naming convention, which means that each variable has one and only one definition and that its value is independent of its location in the code. They are therefore renamed with sequential numbers in the order in which they appear in the code fragment; for example, line 6 of (b) in FIG. 3 renames (unique,0x3a0) to (unique, 1).
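The replacement rules in (1) and (2) can be sketched as a single rewriting pass over textual pcode lines. This is a minimal illustration under stated assumptions: the `(space, offset, size)` text form, the representation of call targets as `ram` node variables, the `is_address` test used to tell constants from address offsets, and the contents of both lookup tables are hypothetical; the patent does not specify them.

```python
import re
from typing import Callable, Dict, List

VARNODE_RE = re.compile(r"\(\s*(\w+)\s*,\s*(0x[0-9a-fA-F]+)\s*,\s*(\d+)\s*\)")
CALL_RE = re.compile(r"\bCALL\b")


def standardize(
    lines: List[str],
    register_names: Dict[int, str],      # register offset -> name (mapping table)
    function_table: Dict[int, str],      # address -> system function name
    is_address: Callable[[int], bool],   # assumed test: constant vs. address offset
) -> List[str]:
    unique_ids: Dict[int, int] = {}      # unique offset -> sequential number
    user_funcs: Dict[int, str] = {}      # call target -> fun1, fun2, ...

    def rename(match: "re.Match[str]", in_call: bool) -> str:
        space = match.group(1)
        offset = int(match.group(2), 16)
        if space == "register":
            # Register size is ignored here; unknown offsets are left as-is.
            return register_names.get(offset, match.group(0))
        if space == "const":
            return "address-offset" if is_address(offset) else "const"
        if space == "unique":
            number = unique_ids.setdefault(offset, len(unique_ids) + 1)
            return f"(unique, {number})"
        if space == "ram" and in_call:
            if offset in function_table:
                return function_table[offset]
            return user_funcs.setdefault(offset, f"fun{len(user_funcs) + 1}")
        return match.group(0)

    result = []
    for line in lines:
        in_call = CALL_RE.search(line) is not None
        result.append(VARNODE_RE.sub(lambda m: rename(m, in_call), line))
    return result
```

Because the counters are kept per code fragment, two fragments that differ only in temporary offsets or call addresses are rewritten to the same token sequence.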
In this scheme, the BERT model may connect 12 Transformer encoder feature extractors in series; the overall framework of the BERT model is shown in FIG. 4. Each Transformer encoder module consists of a self-attention layer and a feedforward neural network layer, and reads fixed-length input. The self-attention layer fuses the context semantic information of the current node into the input vector and propagates it to the next layer as the output vector of the current layer. The BERT model constructed here has 12 stacked Transformer blocks, each with a feedforward network of 768 hidden units and 12 attention heads; the structure is shown in FIG. 5.
When the Transformer computes attention, three auxiliary matrices Q, K, and V are defined. The similarity between the Q matrix and each key in K is computed and used as a weight, and the values in V are summed with these weights. For vectors of larger dimension the raw attention scores become larger, which causes the neural network to concentrate its attention on those regions, so the scores are divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors, to eliminate this effect:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
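The attention computation can be sketched in plain Python as follows. This is an illustrative single-head version under the assumption that the scaling factor is the usual $\sqrt{d_k}$ of scaled dot-product attention; the function names are hypothetical, and a real BERT layer would additionally apply learned projections and use 12 heads.

```python
import math


def row_softmax(row):
    # Subtract the maximum first so exp() cannot overflow.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q and K are lists of d_k-dimensional rows; V is a list of value rows."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = row_softmax(scores)
        # Weighted sum of the value rows.
        output.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return output
```

A query that is equally similar to all keys yields the plain average of the value rows, while a query strongly aligned with one key returns almost exactly that key's value row.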
The BERT model is trained in two stages, pre-training and fine-tuning. In the pre-training stage, the BERT model is trained to predict masked words in sentences from the dataset; in this scheme, the google code dataset may be selected as the pre-training dataset. By pre-training on a large number of pcode code segments, the model learns which words belong at a given position and thereby learns the structural characteristics of the pcode language. After pre-training, BERT is fine-tuned for a specific task, which in this model is software vulnerability analysis. First, the parameters obtained in pre-training are loaded into the model; the parameters are kept unchanged while they are loaded, and this step is called frozen. Then the preprocessed training set for the software vulnerability analysis task is fed into the model for training, during which the loaded parameters continue to change so that they become better suited to that task; this process is called fine-tuning.
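The masked-word objective of the pre-training stage can be illustrated with a small data-preparation sketch. The masking rate of 15%, the `[MASK]` token, and the function name `mask_tokens` are assumptions borrowed from common BERT practice, not values stated in this document, and the sketch omits the random-replacement variants used by the original BERT recipe.

```python
import random

MASK = "[MASK]"  # assumed mask token, as in standard BERT vocabularies


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None
    elsewhere, so a model is scored only on the positions it had to guess.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels
```

For example, `mask_tokens("(unique, 1) COPY eax".split(), seed=0)` returns a copy of the normalized pcode tokens with some positions hidden, together with the labels the model would be trained to recover.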
After the neural network has learned features, a classifier is needed to map the feature vectors to labels. In this scheme, the classifier may be built from two fully connected layers and softmax. The raw outputs of the neural network are not confined to a fixed range, so they cannot be compared directly as classification probabilities or used directly in back-propagation; a softmax activation function is therefore needed to map the outputs of the neural network to the interval (0, 1).
Given a total of k classes $S_i$, $i \in (0, k]$, softmax is formulated as follows:
$$S_i = \frac{e^{v_i}}{\sum_{j=1}^{k} e^{v_j}}$$
where i denotes one of the k classes and $v_i$ represents the value corresponding to that class. Each softmax result lies between 0 and 1, and the values of all classes sum to 1.
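As a worked illustration, the softmax mapping and the cross-entropy loss that the following paragraph uses to train the classifier can be written in a few lines of Python. The function names are hypothetical, and the subtraction of the maximum is a standard numerical-stability step that does not change the result and is not mentioned in the text.

```python
import math


def softmax(values):
    # Shift by the maximum so exp() cannot overflow; the ratio is unchanged.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_index):
    # Negative log-probability assigned to the correct class:
    # the closer that probability is to 1, the smaller the loss.
    return -math.log(probabilities[true_index])
```

With raw outputs such as `[2.0, 1.0, 0.1]`, `softmax` yields three probabilities that sum to 1 and preserve the ordering of the inputs, and `cross_entropy` is smallest when the true class is the one with the largest raw output.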
Finally, cross entropy is adopted as the loss function of the model to measure the difference between the probability distribution currently produced by the neural network and the true distribution of the data set. The smaller the loss value, the closer the prediction of the neural network is to the true value in the data set, so the accuracy of the prediction model can be improved by reducing the value of the loss function.
Further, based on the foregoing method, an embodiment of the present invention also provides a software vulnerability analysis system based on PCODE and Bert, comprising a framework construction module, a model generation module, and a target identification module, wherein
the framework construction module is used for constructing a vulnerability analysis framework that decompiles the content of an input binary program file to generate the pcode intermediate language and uses a Bert neural network to perform vector mapping, feature extraction, and classification on the pcode intermediate language;
the model generation module is used for collecting an unlabeled program corpus data set for pre-training the vector mapping of the Bert neural network in the framework, a training data set for training the Bert neural network in the framework to learn software vulnerability semantic features and thereby generate a vulnerability analysis model, and a test data set for evaluating and optimizing the generated vulnerability analysis model; pre-training the Bert neural network in the framework with the corpus data set; then training the Bert neural network with the training data set to learn software vulnerability semantic features and generate the vulnerability analysis model; and finally evaluating and optimizing the performance of the vulnerability analysis model with the test data set to generate the final vulnerability analysis model used for vulnerability analysis and prediction of target software;
and the target identification module is used for identifying the vulnerability categories in a target binary program file by using the vulnerability analysis framework containing the finally generated vulnerability analysis model.
In this scheme, a neural network performs sequence learning, on the pcode intermediate language, over a data set of binary files containing multiple kinds of vulnerabilities, and predicts software vulnerabilities by learning their characteristic patterns. Furthermore, data experiments on real data sets show that the scheme can mine the weak points in a file efficiently and accurately, so it has good application prospects.
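The data flow through the finished framework at identification time can be sketched as a small Python class. Everything named here (`VulnerabilityAnalysisFramework`, `identify`, and the three injected callables) is hypothetical and only mirrors the decompile, normalize, and classify stages described in this document; the actual decompiler and trained model would be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class VulnerabilityAnalysisFramework:
    # Each stage is injected so the sketch stays independent of any
    # particular decompiler or neural-network library.
    decompile: Callable[[bytes], List[str]]            # binary -> pcode lines
    normalize: Callable[[List[str]], List[str]]        # pcode lines -> uniform tokens
    classify: Callable[[List[str]], Sequence[float]]   # tokens -> class probabilities
    labels: Sequence[str]                              # one label per class

    def identify(self, binary: bytes) -> str:
        """Return the label of the most probable vulnerability category."""
        tokens = self.normalize(self.decompile(binary))
        probabilities = self.classify(tokens)
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        return self.labels[best]
```

Keeping the stages as plain callables means the same object can be exercised with stub functions during testing and with the real decompiler and fine-tuned model in deployment.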
Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention.
Based on the foregoing system, an embodiment of the present invention further provides a server, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
Based on the system, the embodiment of the invention further provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method.
The device provided by the embodiment of the present invention has the same implementation principle and technical effect as the system embodiment, and for the sake of brief description, reference may be made to the corresponding content in the system embodiment for the part where the device embodiment is not mentioned.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing system embodiments, and are not described herein again.
In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is only one logical division, and other divisions are possible in an actual implementation; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be electrical, mechanical, or of another form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the system according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and they shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A software vulnerability analysis method based on PCODE and Bert is characterized by comprising the following contents:
constructing a vulnerability analysis framework, performing decompiling on the input binary program file content by using the framework to generate a pcode intermediate language, and performing vector mapping, feature extraction and classification on the pcode intermediate language by using a Bert neural network;
collecting an unlabeled program corpus data set for pre-training the vector mapping of the Bert neural network in the framework, a training data set for training the Bert neural network in the framework to learn software vulnerability semantic features and thereby generate a vulnerability analysis model, and a test data set for evaluating and optimizing the generated vulnerability analysis model; pre-training the Bert neural network in the framework with the corpus data set; then training the Bert neural network with the training data set to learn software vulnerability semantic features and generate the vulnerability analysis model; and finally evaluating and optimizing the performance of the vulnerability analysis model with the test data set to generate the final vulnerability analysis model used for vulnerability analysis and prediction of target software;
and identifying vulnerability categories in the target binary program file by utilizing a vulnerability analysis framework containing the finally generated vulnerability analysis model.
2. The PCODE and Bert based software vulnerability analysis method of claim 1, wherein in the vulnerability analysis framework, a decompilation tool is used to decompile the input into the pcode intermediate language, and the decompiled pcode intermediate language is preprocessed by normalization and standardization to obtain an intermediate language representation in a unified format as the input of the Bert neural network model.
3. The PCODE and Bert based software vulnerability analysis method of claim 2, wherein the preprocessing comprises: normalizing the instruction sequence and the invalid instruction in the program code segment; and standardizing the function name and the variable in the program code segment.
4. The PCODE and Bert based software vulnerability analysis method of claim 1, wherein in the Bert neural network in the vulnerability analysis framework, the pcode intermediate language representation is mapped into a vector space by an embedding layer, features are extracted from the embedded vectors by a plurality of Transformer encoders connected in series through continuous iteration and back propagation, and a classifier classifies according to the extracted features.
5. The PCODE and Bert based software vulnerability analysis method of claim 4, wherein the Bert neural network in the vulnerability analysis framework utilizes cross entropy as a loss function to evaluate the difference between the current probability distribution of the model and the true distribution of the data set during training.
6. The PCODE and Bert based software vulnerability analysis method of claim 4, wherein each Transformer encoder reads fixed-length input, extracts the context semantic features of the current node, and uses a self-attention mechanism to fuse those features into the vector space, which is propagated to the next layer as the output vector of the current layer.
7. The PCODE and Bert based software vulnerability analysis method of claim 4, characterized in that the classifier is constructed from two fully connected layers and a softmax layer, wherein the softmax activation function is expressed as:
$$S_i = \frac{e^{v_i}}{\sum_{j=1}^{k} e^{v_j}}$$
where i represents a single class among the k classes, $v_i$ represents the value corresponding to class i, and $S_i$ is the resulting softmax output for class i.
8. A software vulnerability analysis system based on PCODE and Bert, comprising: a framework construction module, a model generation module, and a target identification module, wherein,
the framework construction module is used for constructing a vulnerability analysis framework that decompiles the content of an input binary program file to generate the pcode intermediate language and uses a Bert neural network to perform vector mapping, feature extraction, and classification on the pcode intermediate language;
the model generation module is used for collecting an unlabeled program corpus data set for pre-training the vector mapping of the Bert neural network in the framework, a training data set for training the Bert neural network in the framework to learn software vulnerability semantic features and thereby generate a vulnerability analysis model, and a test data set for evaluating and optimizing the generated vulnerability analysis model; pre-training the Bert neural network in the framework with the corpus data set; then training the Bert neural network with the training data set to learn software vulnerability semantic features and generate the vulnerability analysis model; and finally evaluating and optimizing the performance of the vulnerability analysis model with the test data set to generate the final vulnerability analysis model used for vulnerability analysis and prediction of target software;
and the target identification module is used for identifying the vulnerability category in the target binary program file by utilizing the vulnerability analysis framework containing the finally generated vulnerability analysis model.
9. A server, comprising: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to perform the method of any one of claims 1 to 7.
10. A computer-readable medium, on which a computer program for execution by a processor is stored, the computer program being adapted to perform the method of any of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111333255.6A CN114254323A (en) 2021-11-11 2021-11-11 Software vulnerability analysis method and system based on PCODE and Bert

Publications (1)

Publication Number Publication Date
CN114254323A true CN114254323A (en) 2022-03-29

Family

ID=80792405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111333255.6A Pending CN114254323A (en) 2021-11-11 2021-11-11 Software vulnerability analysis method and system based on PCODE and Bert

Country Status (1)

Country Link
CN (1) CN114254323A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676437A (en) * 2022-04-08 2022-06-28 中国人民解放军战略支援部队信息工程大学 Quantum neural network-based software vulnerability detection method and device
CN115168856A (en) * 2022-07-29 2022-10-11 山东省计算中心(国家超级计算济南中心) Binary code similarity detection method and Internet of things firmware vulnerability detection method
CN115168856B (en) * 2022-07-29 2023-04-21 山东省计算中心(国家超级计算济南中心) Binary code similarity detection method and Internet of things firmware vulnerability detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination