CN113569253A - Vulnerability detection method and device based on context semantics - Google Patents

Vulnerability detection method and device based on context semantics


Publication number
CN113569253A
Authority
CN
China
Prior art keywords
vector
vulnerability detection
symbolization
input
sentence
Prior art date
Legal status
Pending
Application number
CN202110829910.0A
Other languages
Chinese (zh)
Inventor
沈伍强
梁哲恒
裴求根
曾纪钧
龙震岳
温柏坚
张小陆
Current Assignee
Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd
Priority to CN202110829910.0A
Publication of CN113569253A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/57 - Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 21/577 - Assessing vulnerabilities and evaluating computer system security
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 - Network architectures or network communication protocols for network security
    • H04L 63/14 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1433 - Vulnerability analysis

Abstract

The invention provides a vulnerability detection method and device based on context semantics. The method symbolizes the acquired code segments, converting the vulnerability features of each code segment into specific symbolic representations, with identical features mapped to identical symbolic representations; converts the symbolic representation into vectors; and inputs the vectors into an ELM-based source-code vulnerability detection model for vulnerability detection, where the model is trained in advance on a training data set and comprises an input layer, a hidden layer, and an output layer. The invention fully exploits the ability of neural networks to extract complex features automatically: the input code is symbolized and then converted into vectors that retain the contextual semantic information of the code, which effectively improves the detection effect.

Description

Vulnerability detection method and device based on context semantics
Technical Field
The invention relates to the field of information security, and in particular to a vulnerability detection method and device based on context semantics.
Background
Power enterprises bear primary responsibility for network security: network security is incorporated into the enterprise's production-management system, and specialized capabilities for graded network-security protection in the power industry are strengthened in accordance with the national graded-protection system. New schemes for protecting power-grid information systems are being refined to match the evolving situation of safe power production and its security-assurance requirements. As power-information-system software grows more complex, software vulnerabilities caused by design defects and implementation errors have become an unavoidable engineering problem, and a vulnerable grid software information system poses serious security risks to the power grid.
Traditional vulnerability detection techniques, from manually defined features to code-similarity measurement, suffer from a series of significant shortcomings. Manually defining vulnerability features is error-prone and labor-intensive, complete feature sets are hard to obtain, and the features usually capture only part of the vulnerability-related information, easily causing high false-positive and false-negative rates. Code-similarity methods have a limited scope of application and work well only for vulnerabilities caused by code cloning. Traditional machine-learning techniques such as decision trees and support vector machines (SVMs) mainly extract vulnerability features from pre-classified vulnerabilities, so detection models built on such features generally apply only to specific vulnerability types.
With further progress in information-security research, neural-network-based approaches have alleviated problems such as incomplete feature extraction, making vulnerability detection more intelligent. Networks based on bidirectional long short-term memory (Bi-LSTM) are currently a popular research direction in software vulnerability detection; however, the complex context processing and iterative training mechanism of this architecture make training costly. How to reduce training cost and improve training efficiency while preserving detection performance is a problem worth studying.
Disclosure of Invention
Purpose of the invention: addressing the characteristics of power information systems and the shortcomings of current detection means, the invention provides a vulnerability detection method based on context semantics that improves both the efficiency and the accuracy of vulnerability detection.
Another objective of the present invention is to provide a vulnerability detection apparatus based on context semantics.
The technical scheme is as follows: according to a first aspect of the present invention, a vulnerability detection method based on context semantics is provided, which includes the following steps:
symbolizing the acquired code segments, converting the vulnerability features of each code segment into specific symbolic representations, wherein identical features are mapped to identical symbolic representations;
converting the symbolic representation into a vector;
inputting the vector into an ELM-based source-code vulnerability detection model for vulnerability detection, wherein the model is trained in advance on a training data set and comprises an input layer, a hidden layer, and an output layer.
Symbolizing the acquired code segments comprises:
function-call symbolization: defined function names are symbolized as FN;
variable symbolization: variable names, including parameters and local variables, are symbolized as VN;
data-type symbolization: the data types of variables and user-defined functions are symbolized as TN;
where the N in each symbol is a number indicating the order in which the corresponding function, variable, or type first appears.
Further, symbolizing the acquired code segments also includes setting symbolization priorities and constructing a multi-level symbolization mechanism accordingly, wherein Level 1 comprises the symbolization group F, Level 2 comprises two symbolization groups F+V and F+T, and Level 3 comprises the symbolization group F+V+T.
Further, the conversion of the symbolic representation into vectors is realized by a doc2vec model. For each symbolized statement, the doc2vec model slides a fixed-length window over the sentence and samples its words, taking one word as the predicted word and the others as input words. The word vectors of the input words and the sentence vector of the sentence serve as the input of the input layer; they are averaged or summed into a new vector X, which is then used to predict the held-out word in the current window.
The ELM-based source-code vulnerability detection model receives signals through the input layer and extracts features through the hidden layer; each hidden-layer neuron has its own input weights and bias, and the output layer produces the result from the hidden-layer outputs and the output weights. Further, a kernel function is introduced to optimize the ELM; the kernel adopts a radial basis function, the ELM combined with a kernel function is called KELM, and the output function of KELM is:
$$f(x) = h(x)H^{T}\left(\lambda I + HH^{T}\right)^{-1}T = \begin{bmatrix} K(x, x_1) & \cdots & K(x, x_N) \end{bmatrix}\left(\lambda I + \Omega\right)^{-1}T$$
where λ is a regularization factor taking a value in [0,1], I is the identity matrix, H is the hidden-layer output matrix, the non-superscript T is the expected output matrix, the superscript T denotes the matrix transpose, N is the number of training samples, L is the number of hidden-layer neurons, and Ω denotes the kernel matrix, computed as follows:
$$\Omega = HH^{T},\qquad \Omega_{i,j} = h(x_i)\cdot h(x_j) = K(x_i, x_j)$$
where h(x_i) is the hidden-layer output vector for input x_i, and K(x_i, x_j) denotes the radial basis function.
According to a second aspect of the present invention, there is provided a vulnerability detection apparatus based on context semantics, including:
the symbolic representation module is used for symbolizing the acquired code segments and converting the vulnerability features of each code segment into specific symbolic representations, wherein identical features are mapped to identical symbolic representations;
a vector representation module for converting the symbolic representation into a vector;
and the vulnerability detection module is used for inputting the vector into an ELM-based source-code vulnerability detection model for vulnerability detection, wherein the model is trained in advance on a training data set and comprises an input layer, a hidden layer, and an output layer.
Beneficial effects: compared with the prior art, the invention has the following beneficial effects:
On the one hand, addressing the feature-selection shortcomings of existing vulnerability detection techniques based on rules and code-similarity measurement, the invention fully exploits the ability of neural networks to extract complex features automatically, and introduces the ELM together with a kernel method: the ELM trains the detection model with a non-iterative training mechanism, effectively improving vulnerability detection efficiency, while the kernel method effectively improves accuracy. On the other hand, because current methods that convert the symbolic representation of source code into vector representations for deep learning tend to ignore the semantic information in the text's context, the invention introduces doc2vec, fully exploiting its strengths in source-code vector representation and effectively improving the detection effect.
Drawings
FIG. 1 is a general flowchart of a vulnerability detection method based on context semantics according to the present invention;
fig. 2 is a schematic diagram of a vulnerability detection method based on context semantics according to an embodiment of the present invention;
fig. 3 is a schematic diagram of the single-hidden-layer ELM structure according to an embodiment of the present invention.
Detailed Description
The technical solution of the invention is further explained below with reference to the drawings.
The general flow of the vulnerability detection method based on context semantics is shown in fig. 1 and mainly comprises the following steps: (1) symbolize the acquired code segments, converting the vulnerability features of each code segment into specific symbolic representations, with identical features mapped to identical symbolic representations; (2) convert the symbolic representation into vectors; (3) input the vectors into an ELM-based source-code vulnerability detection model for vulnerability detection, where the model is trained in advance on a training data set and comprises an input layer, a hidden layer, and an output layer.
In the following description, the terms power information system, grid information system, power service platform, and grid service system have the same meaning and are used interchangeably. They generally refer to the software and hardware systems that use information technologies such as communication, automatic control, computing, networking, and sensing to realize informatized management of automatic control and scheduling across the whole process of power generation, transmission, and consumption, including generation, transmission and transformation, distribution, utilization, and dispatching. A power information system is a very large, wide-area, hierarchically distributed system composed of many complex heterogeneous subsystems, and its security is multi-factor and multi-dimensional.
Referring to fig. 2, the vulnerability detection method based on context semantics according to the present invention is specifically described in conjunction with the training and verification of the model.
Step S1, data preprocessing, including symbolization and code representation of the acquired code segments.
The preprocessing for context-semantics-based vulnerability detection consists of program symbolization and vector representation. Symbolization reduces the length of the code segments and thereby improves training efficiency. During symbolization, the vulnerability features of each code segment, such as local variables, user-defined functions, and data types, are converted into short, fixed-length symbolic representations, with identical features mapped to identical symbolic representations.
1) Program symbolization
A code segment consists of several program statements (i.e., lines of code) that are semantically related to each other through data or control dependencies. It can be further translated into a symbolic representation via symbolization. The symbolic representations are then collected as a corpus for training the vector-representation tool doc2vec.
The invention uses three symbolization types, as follows:
(i) Function-call symbolization: defined function names are symbolized as FN. This symbolization type is given the highest priority because vulnerabilities mainly arise from improper use of library/API function calls; symbolizing defined functions improves the signal-to-noise ratio (SNR) of library/API functions within the vulnerability information.
(ii) Symbolizing variables: the variable names, including parameters and local variables, are symbolized as VN. In practice, variables account for a large proportion of the code.
(iii) Data-type symbolization: the data types of variables and user-defined functions are symbolized as TN. Its priority is lowest because many data types are unrelated to vulnerability information.
The N in the symbols above is a number indicating the order of first appearance; note that different functions appearing in different code segments may map to the same symbol name. Because the symbolization of V and T can affect the SNR of the vulnerability information differently across data sets, a multi-level symbolization mechanism is constructed according to the symbolization priorities: Level 1 comprises the symbolization group F; Level 2 comprises two symbolization groups, F+V and F+T; Level 3 comprises the symbolization group F+V+T, where V and T denote the variable and data-type symbolizations described above.
For example, the code segment:
    static void sysgo()
    list<char> dataList;
    Sysgo1(dataList);
    ......
becomes, after symbolization:
    static void F1()
    list<char> V1;
    F2(V1);
    ......
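The F+V symbolization step above can be sketched in a few lines. This is a minimal illustrative heuristic, not the patent's implementation; the helper name `symbolize`, the keyword list, and the regular expressions are assumptions made for the example.

```python
import re

# Rough F+V symbolization sketch: user-defined functions become F1, F2, ...
# and declared variables become V1, V2, ..., in order of first appearance.
# The keyword list and regexes are simplified heuristics for illustration.
KEYWORDS = {"static", "void", "char", "int", "list", "return", "if", "for", "while"}

def symbolize(code: str) -> str:
    func_map, var_map = {}, {}
    # Function definitions/calls: an identifier directly followed by '('
    for line in code.splitlines():
        for name in re.findall(r"\b([A-Za-z_]\w*)\s*\(", line):
            if name not in KEYWORDS and name not in func_map:
                func_map[name] = f"F{len(func_map) + 1}"
    text = code
    for name, sym in func_map.items():
        text = re.sub(rf"\b{name}\b", sym, text)
    # Variables: identifiers declared as '<type> name;' (very rough heuristic)
    for name in re.findall(r">\s*([A-Za-z_]\w*)\s*;", text):
        if name not in var_map:
            var_map[name] = f"V{len(var_map) + 1}"
    for name, sym in var_map.items():
        text = re.sub(rf"\b{name}\b", sym, text)
    return text

code = "static void sysgo()\nlist<char> dataList;\nSysgo1(dataList);"
print(symbolize(code))  # reproduces the symbolized segment shown above
```

Running this on the sample segment yields the same F1/V1/F2 mapping as in the text; a production implementation would of course use a real parser rather than regexes.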
2) Code representation
Since neural networks can only accept vectors as input, the symbolic representation of the source code must be further converted into a vector representation. The invention implements this with doc2vec.
The doc2vec method is an unsupervised algorithm that learns fixed-length feature representations from variable-length text (e.g., sentences, paragraphs, or documents). It yields vector representations of sentences, paragraphs, and documents; as an extension of word2vec, it does not require a fixed sentence length and accepts sentences of different lengths as training samples. The algorithm learns a vector to represent each document, and the structure of the model overcomes the shortcomings of the bag-of-words model.
The doc2vec model is inspired by word2vec: just as the word vectors predicted in word2vec carry word meaning, doc2vec builds the same structure for documents and thereby overcomes the bag-of-words model's lack of semantics. Treating each sentence as a training sample, doc2vec, like word2vec, has two training modes: the Distributed Memory model of Paragraph Vectors (PV-DM), analogous to word2vec's CBOW model, and the Distributed Bag-of-Words model of Paragraph Vectors (PV-DBOW), analogous to word2vec's Skip-gram model.
doc2vec can learn a fixed-length feature representation from text of arbitrary length, from sentences to documents. Moreover, its Paragraph Vector remembers the topic of the paragraph, which lets it capture global features better than word2vec. word2vec, by contrast, converts words to vectors one-to-one, so the number of resulting vectors varies with the length of the input text; to satisfy the neural network's requirement for fixed-length input, word2vec output must be further processed into a fixed-length form. doc2vec, however, outputs a fixed-length vector directly from input text of arbitrary length, and it also retains more semantic information from the context of the input text than word2vec. Thus doc2vec shows great potential for source-code vector representation.
In doc2vec, each sentence is represented by a unique vector, a column of the matrix D; each word is likewise represented by a unique vector, a column of the matrix W. Each step slides a fixed-length window over a sentence, taking one word as the predicted word and the others as input words. The word vectors of the input words and the Paragraph Vector of the sentence serve as the input of the input layer; they are averaged or summed into a new vector X, which is then used to predict the held-out word in the current window. Because the Paragraph Vector is shared across the several training windows of the same sentence, the representation it learns becomes increasingly accurate. In this process the context semantics are fully retained, which benefits the extraction of vulnerability features. After training, all word vectors and the sentence vector of each sentence in the training samples are obtained.
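The PV-DM training step described above can be sketched as a toy NumPy implementation: a shared paragraph vector is averaged with the context word vectors and used to predict the held-out word via softmax. The corpus, dimensions, and hyper-parameters below are illustrative assumptions, not the patent's settings.

```python
import numpy as np

# Toy PV-DM ("Distributed Memory") sketch of the doc2vec training above.
rng = np.random.default_rng(0)

docs = [["static", "void", "F1", "V1"], ["F2", "V1", "F1", "void"]]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}

dim, lr, epochs = 8, 0.1, 200
D = rng.normal(0, 0.1, (len(docs), dim))    # matrix D: one vector per sentence
W = rng.normal(0, 0.1, (len(vocab), dim))   # matrix W: one vector per word
U = rng.normal(0, 0.1, (dim, len(vocab)))   # softmax output weights

first_loss = loss = 0.0
for epoch in range(epochs):
    loss = 0.0
    for d_idx, doc in enumerate(docs):
        for t in range(len(doc)):           # slide the window over the sentence
            ctx = [w2i[w] for i, w in enumerate(doc) if i != t]
            x = (D[d_idx] + W[ctx].sum(axis=0)) / (1 + len(ctx))  # average
            z = x @ U
            p = np.exp(z - z.max()); p /= p.sum()                 # softmax
            target = w2i[doc[t]]
            loss -= float(np.log(p[target]))
            grad_z = p.copy(); grad_z[target] -= 1.0              # dL/dz
            grad_x = U @ grad_z
            U -= lr * np.outer(x, grad_z)
            D[d_idx] -= lr * grad_x / (1 + len(ctx))  # shared paragraph vector
            W[ctx] -= lr * grad_x / (1 + len(ctx))
    if epoch == 0:
        first_loss = loss
print(first_loss > loss)
```

Because the paragraph vector D[d_idx] participates in every window of its sentence, it gradually absorbs sentence-level (contextual) information, which is exactly the property the method relies on.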
Step S2, inputting the preprocessed vectors into an ELM (Extreme Learning Machine)-based source-code vulnerability detection model for vulnerability detection.
(1) Model structure
Fig. 3 shows an ELM with a single hidden layer, where d, L, and m are the numbers of input-layer, hidden-layer, and output-layer neurons, respectively. ω is the input weight matrix connecting the input layer and the hidden layer, b is the bias of the hidden-layer neurons, and β is the output weight matrix connecting the hidden layer and the output layer. ω and b are randomly generated from uniform distributions over (-1, 1) and (0, 1), respectively, and remain frozen throughout the training of the model.
(2) ELM model
Given a training data set
$$\left\{(x_i, t_i)\;\middle|\;x_i\in\mathbb{R}^{d},\;t_i\in\mathbb{R}^{m},\;i=1,\dots,N\right\}$$
where N is the number of training samples; d, L, and m are the numbers of input-layer, hidden-layer, and output-layer neurons, respectively; and x_i is the sample input while t_i is the corresponding sample label;
the ELM model can be expressed as:
Hβ=T
$$H = \begin{bmatrix} g(\omega_1\cdot x_1 + b_1) & \cdots & g(\omega_L\cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(\omega_1\cdot x_N + b_1) & \cdots & g(\omega_L\cdot x_N + b_L) \end{bmatrix}_{N\times L}$$
$$\beta = \begin{bmatrix} \beta_1^{T} \\ \vdots \\ \beta_L^{T} \end{bmatrix}_{L\times m}$$
$$T = \begin{bmatrix} t_1^{T} \\ \vdots \\ t_N^{T} \end{bmatrix}_{N\times m}$$
where T is the expected output matrix and H is the hidden-layer output matrix; h(x_i) is the hidden-layer output vector for input x_i; g(·) is the activation function of the ELM; and ω_j·x_i denotes the inner product of the j-th input weight vector and the i-th training sample's features.
The output weight β may be obtained by:
$$\beta = H^{+}T = H^{T}\left(\lambda I + HH^{T}\right)^{-1}T$$
where H⁺ is the Moore-Penrose generalized inverse of H, L is the number of hidden-layer neurons, I is the N×N identity matrix, and λ is a regularization factor taking a value in [0,1].
The ELM output function is:
$$f(x) = h(x)\beta = \sum_{j=1}^{L} \beta_j\, g(\omega_j\cdot x + b_j)$$
the optimization objective of the ELM model can be expressed as:
$$\min_{\beta}\;\sum_{i=1}^{N}\left\| f(x_i) - t_i \right\|$$
where f(x_i) and t_i respectively represent the predicted label and the true label of the i-th sample.
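The ELM equations above can be sketched in NumPy: the input weights ω and biases b are generated once and frozen, and only the output weights β are solved in closed form via the Moore-Penrose pseudo-inverse. The toy data, split sizes, and function names are illustrative assumptions.

```python
import numpy as np

# Minimal ELM sketch: beta = H+ T, with omega ~ U(-1,1) and b ~ U(0,1) frozen.
rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # activation g(.)

def elm_train(X, T, L=64):
    omega = rng.uniform(-1.0, 1.0, (X.shape[1], L))  # input weights (frozen)
    b = rng.uniform(0.0, 1.0, L)                     # hidden biases (frozen)
    H = sigmoid(X @ omega + b)                       # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                     # beta = H+ T (Moore-Penrose)
    return omega, b, beta

def elm_predict(X, omega, b, beta):
    return sigmoid(X @ omega + b) @ beta             # f(x) = h(x) beta

# Illustrative binary "vulnerable / clean" data with an 80/20 split
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)
omega, b, beta = elm_train(X[:160], y[:160])
pred = elm_predict(X[160:], omega, b, beta) > 0.5
acc = float((pred == (y[160:] > 0.5)).mean())
print(acc > 0.7)
```

The non-iterative solve for β is what gives the ELM its training-efficiency advantage over Bi-LSTM-style iterative training emphasized in the text.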
The source code of known vulnerabilities in a certain class of grid information systems is adopted as the data set, including buffer-error vulnerabilities, resource-management-error vulnerabilities, and samples of all library/API function calls. The data set is split into two parts, 80% and 20%, with the larger part used for training and the other for testing. Each sample in the data set is a symbolic representation with a ground-truth label; samples of the same type are labeled according to the historical "vulnerable" samples collected for the experiments, and all samples are preprocessed with the F+V symbolization group.
(3) ELM based on kernel function optimization
Due to the randomness of the ELM input weights and hidden-layer biases, the model easily falls into unstable states, so a kernel function is introduced to optimize the ELM: by mapping the data into a high-dimensional space, the nonlinear problem is converted into a linear one. The kernel approach has two benefits over the conventional ELM. On the one hand, it removes the dependence of the conventional ELM on a manually chosen number of hidden-layer nodes and shows better stability. On the other hand, the kernel maps the data into a high-dimensional space in which the data distribution is much smoother; the smoothed data makes the classification problem easier, so the model achieves better results. The radial basis function (RBF) is the preferred kernel because it has only one hyper-parameter, which simplifies model configuration and training cost. The RBF kernel can be expressed as:
$$K(x, y) = \exp\left(-\gamma\,\|x - y\|^{2}\right)$$
where x and y are sample vectors, γ is the single hyper-parameter of the Gaussian kernel, and ‖x−y‖ denotes the norm of their difference.
Denoting the kernel matrix by Ω, it is computed from the kernel function as follows:
$$\Omega = HH^{T},\qquad \Omega_{i,j} = h(x_i)\cdot h(x_j) = K(x_i, x_j)$$
the KELM output function is as follows:
$$f(x) = \begin{bmatrix} K(x, x_1) & \cdots & K(x, x_N) \end{bmatrix}\left(\lambda I + \Omega\right)^{-1}T$$
It can be seen from the above formula that the ELM combined with the kernel method avoids the dependence of the conventional ELM on a manually chosen number of hidden-layer nodes.
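The KELM formulas above can be sketched directly: build Ω from the RBF kernel and solve the regularized linear system instead of choosing a hidden-layer size. The values of λ and γ and the toy data are illustrative assumptions.

```python
import numpy as np

# KELM sketch: Omega_ij = K(x_i, x_j), f(x) = [K(x,x_1)...K(x,x_N)](lam*I + Omega)^-1 T
rng = np.random.default_rng(7)

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)  # pairwise ||x-y||^2
    return np.exp(-gamma * d2)

def kelm_fit(X, T, lam=0.1, gamma=0.5):
    Omega = rbf(X, X, gamma)                                  # kernel matrix
    alpha = np.linalg.solve(lam * np.eye(len(X)) + Omega, T)  # (lam*I + Omega)^-1 T
    return X, alpha, gamma

def kelm_predict(model, Xnew):
    Xtr, alpha, gamma = model
    return rbf(Xnew, Xtr, gamma) @ alpha                      # f(x)

# Nonlinear toy labels in 2-D, 80/20 split
X = rng.normal(size=(200, 2))
y = (np.sin(2.0 * X[:, 0]) + X[:, 1] > 0).astype(float).reshape(-1, 1)
model = kelm_fit(X[:160], y[:160])
pred = kelm_predict(model, X[160:]) > 0.5
acc = float((pred == (y[160:] > 0.5)).mean())
print(acc > 0.7)
```

Note that no hidden-layer width appears anywhere: the kernel matrix replaces the random feature map, which is exactly the stability benefit claimed above.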
To further improve performance, an ISSA-KELM classifier model is established: during training, the ISSA algorithm searches for the optimal regularization coefficient C and kernel parameter g of the KELM classifier, after which the trained KELM model is tested and evaluated on the test set and the classification result is output.
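The text names ISSA for the hyper-parameter search but does not specify the algorithm, so the sketch below stands in with a plain grid search over C and g on a validation split. The grids, data, and helper names are illustrative assumptions, not the patent's ISSA procedure.

```python
import numpy as np

# Grid-search stand-in for the unspecified ISSA optimizer: pick (C, g) that
# minimizes KELM validation error, with output f(x) = K(x,.)(I/C + Omega)^-1 y.
rng = np.random.default_rng(1)

def rbf(A, B, g):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-g * d2)

def kelm_val_error(Xtr, ytr, Xva, yva, C, g):
    alpha = np.linalg.solve(np.eye(len(Xtr)) / C + rbf(Xtr, Xtr, g), ytr)
    pred = rbf(Xva, Xtr, g) @ alpha > 0.5
    return float((pred != (yva > 0.5)).mean())

X = rng.normal(size=(150, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)  # XOR-like labels
C_grid, g_grid = (0.1, 1.0, 10.0), (0.1, 0.5, 2.0)
best_C, best_g = min(
    ((C, g) for C in C_grid for g in g_grid),
    key=lambda cg: kelm_val_error(X[:100], y[:100], X[100:], y[100:], *cg),
)
print(best_C, best_g)
```

A population-based optimizer such as ISSA would explore the same (C, g) space continuously rather than over a fixed grid; the fitness function would be unchanged.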
Semantic analysis builds effective models and systems to analyze the semantics of various languages automatically and thereby understand the meaning expressed by a whole text. The context-semantics-based vulnerability detection method uses input/output semantic analysis to collect and locate the high-risk behaviors of Web applications, processes the corresponding code segments accordingly, and uses the ELM to address the training-efficiency problem of the vulnerability detection model. In addition, a kernel method is introduced to improve the accuracy of the ELM. Experimental results show that the kernelized ELM is an effective combination of efficiency and precision. For data preprocessing in particular, doc2vec vector representation performs well on large data sets, and a suitable symbolization level effectively improves the precision of vulnerability detection.
According to another embodiment of the present invention, there is provided a vulnerability detection apparatus based on context semantics, including:
the symbolic representation module is used for symbolizing the acquired code segments and converting the vulnerability features of each code segment into specific symbolic representations, wherein identical features are mapped to identical symbolic representations;
a vector representation module for converting the symbolic representation into a vector;
and the vulnerability detection module is used for inputting the vector into an ELM-based source-code vulnerability detection model for vulnerability detection, wherein the model is trained in advance on a training data set and comprises an input layer, a hidden layer, and an output layer.
Wherein the symbolization module comprises:
function call symbolization unit: symbolizing the defined function name as FN;
variable symbolization unit: symbolizing the variable names including parameters and local variables as VN;
data type symbolization unit: symbolizing the data types of the variables and the user-defined functions as TN;
the reference N is a number indicating the index at which the function first appears.
Further, the symbolization module also includes a multi-level construction unit configured to set symbolization priorities and construct a multi-level symbolization mechanism accordingly, wherein Level 1 comprises the symbolization group F, Level 2 comprises two symbolization groups F+V and F+T, and Level 3 comprises the symbolization group F+V+T.
Further, the vector representation module converts the symbolic representation into vectors through a doc2vec model, wherein each sentence is represented by a unique vector, a column of the matrix D, and each word is likewise represented by a unique vector, a column of the matrix W; each step slides a fixed-length window over a sentence, taking one word as the predicted word and the others as input words; the word vectors of the input words and the sentence vector of the sentence serve as the input of the input layer, and are averaged or summed into a new vector X, which is then used to predict the held-out word in the current window.
The vulnerability detection module comprises a model training unit and a training data set is given
$$\left\{(x_i, t_i)\;\middle|\;x_i\in\mathbb{R}^{d},\;t_i\in\mathbb{R}^{m},\;i=1,\dots,N\right\}$$
where N is the number of training samples; d, L, and m are the numbers of input-layer, hidden-layer, and output-layer neurons, respectively; and x_i is the sample input while t_i is the corresponding sample label;
the ELM model can be expressed as:
Hβ=T
$$H = \begin{bmatrix} g(\omega_1\cdot x_1 + b_1) & \cdots & g(\omega_L\cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(\omega_1\cdot x_N + b_1) & \cdots & g(\omega_L\cdot x_N + b_L) \end{bmatrix}_{N\times L}$$
$$\beta = \begin{bmatrix} \beta_1^{T} \\ \vdots \\ \beta_L^{T} \end{bmatrix}_{L\times m}$$
$$T = \begin{bmatrix} t_1^{T} \\ \vdots \\ t_N^{T} \end{bmatrix}_{N\times m}$$
where T is the expected output matrix and H is the hidden-layer output matrix; h(x_i) is the hidden-layer output vector for input x_i; g(·) is the activation function of the ELM; and ω_j·x_i denotes the inner product of the j-th input weight vector and the i-th training sample's features.
The output weight β may be obtained by:
$$\beta = H^{+}T = H^{T}\left(\lambda I + HH^{T}\right)^{-1}T$$
where H⁺ is the Moore-Penrose generalized inverse of H, L is the number of hidden-layer neurons, I is the N×N identity matrix, and λ is a regularization factor taking a value in [0,1].
The ELM output function is:
$$f(x) = h(x)\beta = \sum_{j=1}^{L} \beta_j\, g(\omega_j\cdot x + b_j)$$
the optimization objective of the ELM model can be expressed as:
$$\min_{\beta}\;\sum_{i=1}^{N}\left\| f(x_i) - t_i \right\|$$
where f(x_i) and t_i respectively represent the predicted label and the true label of the i-th sample.
The vulnerability detection module further comprises a model optimization unit, wherein a kernel function is introduced to optimize the ELM, the kernel function adopts a radial basis function, the ELM combined with the kernel function is called KELM, and the output function is as follows:
$$f(x) = \begin{bmatrix} K(x, x_1) & \cdots & K(x, x_N) \end{bmatrix}\left(\lambda I + \Omega\right)^{-1}T$$
where λ is a regularization factor taking a value in [0,1], I is the identity matrix, H is the hidden-layer output matrix, the non-superscript T is the expected output matrix, the superscript T denotes the matrix transpose, N is the number of training samples, L is the number of hidden-layer neurons, and Ω denotes the kernel matrix, computed as follows:
$$\Omega = HH^{T},\qquad \Omega_{i,j} = h(x_i)\cdot h(x_j) = K(x_i, x_j)$$
where h(x_i) is the hidden-layer output vector for input x_i, and K(x_i, x_j) denotes the radial basis function.
It should be understood that the context-semantics-based vulnerability detection apparatus in the embodiment of the present invention can implement all the technical solutions of the above method embodiments, and the functions of its functional modules can be implemented according to the methods in those embodiments; for specific implementation processes and calculation formulas not described in detail in this apparatus embodiment, reference may be made to the relevant descriptions in the above embodiments.
Based on the same technical concept as the method embodiments, another embodiment of the present invention provides a computer apparatus comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory, are configured to be executed by the one or more processors, and, when executed by the processors, implement the steps in the method embodiments.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. A vulnerability detection method based on context semantics is characterized by comprising the following steps:
symbolizing the obtained code segments and converting the vulnerability characteristics of each code segment into specific symbolic representations, wherein identical characteristics are mapped to the same symbolic representation;
converting the symbolic representation into a vector;
inputting the vector into an ELM-based source code vulnerability detection model for vulnerability detection, wherein the ELM-based source code vulnerability detection model is trained in advance using a training data set and comprises an input layer, a hidden layer, and an output layer.
2. The method according to claim 1, wherein symbolizing the obtained code segment comprises:
symbolizing a function call: symbolizing the name of the defined function as FN;
symbolizing variables: the variable names, including parameters and local variables, are symbolized as VN;
data type symbolization: symbolizing the data types of the variables and the user-defined functions as TN;
wherein the N in the above symbols is a number indicating the index of the first occurrence of the corresponding function, variable, or data type.
3. The method according to claim 2, wherein symbolizing the obtained code segment further comprises: setting the priority of the symbolization and constructing a multi-level symbolization mechanism according to the symbolization priority, wherein Level 1 comprises one symbolization group, F; Level 2 comprises two symbolization groups, F + V and F + T; and Level 3 comprises one symbolization group, F + V + T.
4. The method for vulnerability detection based on context semantics according to claim 1, wherein the conversion of the symbolic representation into a vector is implemented by a doc2vec model; for a symbolized sentence, the doc2vec model slides over the sentence and samples a fixed-length window of words each time, taking one of the words as the predicted word and the others as input words; the word vectors corresponding to the input words and the sentence vector corresponding to the sentence are used as the input of the input layer and are averaged or accumulated to form a new vector X, and the vector X is used to predict the predicted word in the current window.
5. The method for vulnerability detection based on context semantics of claim 1, further comprising: introducing a kernel function to optimize the ELM, wherein the kernel function adopts a radial basis function, the ELM combined with the kernel function is called KELM, and the output function is as follows:
$$f(x) = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_N) \end{bmatrix}^{T}\left(\frac{I}{\lambda} + \Omega\right)^{-1}T$$

wherein λ is a regularization factor with a value in [0, 1], I is the identity matrix, H is the hidden-layer output matrix, T without a superscript is the expected output matrix while the superscript T denotes the matrix transpose, N represents the number of samples in the training data set, L represents the number of hidden-layer neurons, and Ω represents the kernel matrix, computed as follows:

$$\Omega = HH^{T},\qquad \Omega_{ij} = h(x_i)\cdot h(x_j) = K(x_i, x_j)$$

wherein h(x_i) is the output vector of the hidden layer with respect to the input x_i, and K(x_i, x_j) denotes the radial basis function.
6. A vulnerability detection device based on context semantics, comprising:
the symbolic representation module is used for symbolizing the acquired code segments and converting the vulnerability characteristics of each code segment into specific symbolic representations, wherein identical characteristics are mapped to the same symbolic representation;
a vector representation module for converting the symbolic representation into a vector;
and the vulnerability detection module is used for inputting the vector into an ELM-based source code vulnerability detection model for vulnerability detection, wherein the ELM-based source code vulnerability detection model is trained in advance by using a training data set, and comprises an input layer, a hidden layer and an output layer.
7. The apparatus according to claim 6, wherein the symbolic representation module comprises:
function call symbolization unit: symbolizing the defined function name as FN;
variable symbolization unit: symbolizing the variable names including parameters and local variables as VN;
data type symbolization unit: symbolizing the data types of the variables and the user-defined functions as TN;
wherein the N in the above symbols is a number indicating the index of the first occurrence of the corresponding function, variable, or data type.
8. The context-semantics-based vulnerability detection apparatus according to claim 7, wherein the symbolization module further comprises a multi-level construction unit for setting the priority of the symbolization and constructing a multi-level symbolization mechanism according to the symbolization priority, wherein Level 1 comprises one symbolization group, F; Level 2 comprises two symbolization groups, F + V and F + T; and Level 3 comprises one symbolization group, F + V + T.
9. The context-semantics-based vulnerability detection apparatus according to claim 6, wherein the vector representation module realizes the conversion of the symbolic representation into a vector through a doc2vec model; for a symbolized sentence, the doc2vec model slides over the sentence and samples a fixed-length window of words each time, taking one of the words as the predicted word and the others as input words; the word vectors corresponding to the input words and the sentence vector corresponding to the sentence are used as the input of the input layer and are averaged or accumulated to form a new vector X, and the vector X is used to predict the predicted word in the current window.
10. The context-semantics-based vulnerability detection apparatus according to claim 6, wherein the vulnerability detection module comprises a model optimization unit in which a kernel function is introduced to optimize the ELM, the kernel function adopts a radial basis function, the ELM combined with the kernel function is called KELM, and the output function is as follows:
$$f(x) = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_N) \end{bmatrix}^{T}\left(\frac{I}{\lambda} + \Omega\right)^{-1}T$$

wherein λ is a regularization factor with a value in [0, 1], I is the identity matrix, H is the hidden-layer output matrix, T without a superscript is the expected output matrix while the superscript T denotes the matrix transpose, N represents the number of samples in the training data set, L represents the number of hidden-layer neurons, and Ω represents the kernel matrix, computed as follows:

$$\Omega = HH^{T},\qquad \Omega_{ij} = h(x_i)\cdot h(x_j) = K(x_i, x_j)$$

wherein h(x_i) is the output vector of the hidden layer with respect to the input x_i, and K(x_i, x_j) denotes the radial basis function.
CN202110829910.0A 2021-07-22 2021-07-22 Vulnerability detection method and device based on context semantics Pending CN113569253A (en)

Publications (1)

Publication Number Publication Date
CN113569253A true CN113569253A (en) 2021-10-29


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107911346A (en) * 2017-10-31 2018-04-13 天津大学 A kind of intrusion detection method based on extreme learning machine
CN111444724A (en) * 2020-03-23 2020-07-24 腾讯科技(深圳)有限公司 Medical question-answer quality testing method and device, computer equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GAIGAI TANG ET AL.: "An Automatic Source Code Vulnerability Detection Approach Based on KELM", Security and Communication Networks, vol. 2021, pages 1-12 *
MICROSTRONG: "Doc2vec principle analysis and code practice" (Doc2vec原理解析及代码实践), pages 1-4, retrieved from the Internet: https://zhuanlan.zhihu.com/p/136096645 *
就是求关注: "Word embedding methods in natural language processing, part 2" (自然语言处理技术之词嵌入方法-2), pages 59-64, retrieved from the Internet: https://blog.csdn.net/weixin_40651515/article/details/109963179 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination