CN114266049A

CN114266049A - Code detection method and device and electronic equipment

Info

Publication number: CN114266049A
Application number: CN202111593496.4A
Authority: CN
Inventors: 闫龙川; 郭永和; 何永远; 陈智雨; 牛佳宁; 彭元龙; 袁孝宇
Original assignee: State Grid Information and Telecommunication Co Ltd
Current assignee: State Grid Information and Telecommunication Co Ltd
Priority date: 2021-12-23
Filing date: 2021-12-23
Publication date: 2022-04-01

Abstract

The invention provides a code detection method, a code detection device and electronic equipment. A data cleaning operation is performed on target data to obtain first data. Each word in the first data is compared by position with the words in a preset word bank to determine a characteristic value corresponding to each word in the first data. The characteristic values are combined according to the word arrangement sequence to obtain a characteristic matrix corresponding to the target data, and a preset code detection model is called to process the characteristic matrix to obtain a vulnerability detection result of the target data. The preset code detection model is obtained by training on training samples, which comprise the vulnerability types and characteristic matrices of vulnerability code samples. Because the preset code detection model is trained on a large number of training samples, analyzing the target data with the trained model can improve the accuracy of vulnerability detection and meet information security requirements.

Description

Code detection method and device and electronic equipment

Technical Field

The invention relates to the field of data processing, in particular to a code detection method and device and electronic equipment.

Background

In recent years, information security incidents caused by software have occurred continually. Information security problems in software are mainly software vulnerabilities, and the main cause of software vulnerabilities is the high complexity of software. As the amount of software has exploded, software vulnerabilities have appeared more and more frequently.

At present, when software vulnerability detection is carried out, detection is generally carried out manually according to experience, the detection accuracy is low, and the information security requirement cannot be met.

Disclosure of Invention

In view of this, the invention provides a code detection method, a code detection device and an electronic device, so as to solve the problem that the detection accuracy is low when software vulnerability detection is performed manually according to experience.

In order to solve the technical problems, the invention adopts the following technical scheme:

a code detection method, comprising:

acquiring target data to be subjected to code detection, and performing data cleaning operation on the target data to obtain first data;

comparing the position of each word in the first data with the position of each word in a preset word bank to determine a characteristic value corresponding to each word in the first data;

combining the characteristic values corresponding to each word in the first data according to a word arrangement sequence to obtain a characteristic matrix corresponding to the target data;

calling a preset code detection model to process the characteristic matrix to obtain a vulnerability detection result of the target data; the preset code detection model is obtained based on training of a training sample; the training sample comprises a vulnerability type and a characteristic matrix of the vulnerability code sample.

Optionally, the generating process of the preset code detection model includes:

acquiring a code sample and a code repair sample corresponding to the code sample in a preset data acquisition mode;

performing difference analysis on the code sample and the code repairing sample to obtain a vulnerability code sample which corresponds to the code sample and exists in the code sample but does not exist in the code repairing sample;

determining a patching code sample corresponding to the vulnerability code sample from the code patching sample;

determining a feature matrix and a vulnerability type corresponding to the vulnerability code sample, and determining a feature matrix and a vulnerability type corresponding to the patching code sample;

and training the preset code detection model by using the feature matrices and vulnerability types respectively corresponding to the vulnerability code sample and the patching code sample, and stopping the training when a training stop condition is met.

Optionally, determining a feature matrix and a vulnerability type corresponding to the vulnerability code sample includes:

combining each word in the vulnerability code sample to obtain a preset word bank;

comparing the position of each word in the vulnerability code sample with a word in a preset word bank to determine a characteristic value corresponding to each word in the vulnerability code sample;

combining the characteristic values corresponding to each word in the vulnerability code sample according to a word arrangement sequence to obtain an initial characteristic matrix corresponding to the vulnerability code sample;

performing clustering analysis on the initial feature matrix corresponding to the vulnerability code sample to obtain a clustering result and a feature matrix corresponding to the vulnerability code sample;

and acquiring a vulnerability type corresponding to the clustering result, and taking the vulnerability type corresponding to the clustering result as the vulnerability type corresponding to the vulnerability code sample in the clustering result.

Optionally, performing difference analysis on the code sample and the code repairing sample to obtain a vulnerability code sample corresponding to the code sample, which exists in the code sample and does not exist in the code repairing sample, includes:

performing difference analysis on the code sample and the code repairing sample, and labeling the code sample based on a difference analysis result to obtain a labeling result;

and screening, from the labeling result, the marks indicating content that exists in the code sample and does not exist in the code repairing sample, and taking the code part corresponding to those marks in the code sample as the vulnerability code sample.

Optionally, the comparing the position of each word in the first data with a word in a preset word bank to determine a feature value corresponding to each word in the first data includes:

determining the position of each word in the first data in a preset word bank;

constructing initial characteristic information corresponding to each word in the first data, setting an identifier corresponding to the position in the initial characteristic information as a first numerical value, and setting an identifier not corresponding to the position as a second numerical value;

under the condition that the number of words in the first data is smaller than a preset threshold value, performing data complementing operation on the initial characteristic information to enable the data volume of the initial characteristic information to be the preset threshold value;

and taking the initial characteristic information corresponding to each word in the first data after data complementation as the characteristic value corresponding to the word.

A code detection apparatus comprising:

the data processing module is used for acquiring target data to be subjected to code detection and performing data cleaning operation on the target data to obtain first data;

the characteristic determining module is used for comparing the position of each word in the first data with the position of each word in a preset word bank so as to determine a characteristic value corresponding to each word in the first data;

the matrix determination module is used for combining the characteristic values corresponding to each word in the first data according to a word arrangement sequence to obtain a characteristic matrix corresponding to the target data;

the vulnerability detection module is used for calling a preset code detection model to process the characteristic matrix to obtain a vulnerability detection result of the target data; the preset code detection model is obtained based on training of a training sample; the training sample comprises a vulnerability type and a characteristic matrix of the vulnerability code sample.

Optionally, the apparatus further comprises a model generation module, wherein the model generation module comprises:

the sample acquisition submodule is used for acquiring a code sample and a code repair sample corresponding to the code sample in a preset data acquisition mode;

the sample analysis submodule is used for carrying out difference analysis on the code sample and the code repairing sample to obtain a vulnerability code sample which corresponds to the code sample and exists in the code sample but does not exist in the code repairing sample;

the sample determining submodule is used for determining a patching code sample corresponding to the vulnerability code sample from the code patching sample;

the vulnerability determining submodule is used for determining a feature matrix and a vulnerability type corresponding to the vulnerability code sample and determining a feature matrix and a vulnerability type corresponding to the patch code sample;

and the model training submodule is used for training the preset code detection model by using the feature matrices and vulnerability types respectively corresponding to the vulnerability code sample and the patching code sample, and stopping the training when a training stop condition is met.

Optionally, the vulnerability determination submodule includes:

the first combination unit is used for combining each word in the vulnerability code sample to obtain a preset word bank;

the comparison unit is used for comparing the position of each word in the vulnerability code sample with the position of a word in a preset word bank so as to determine a characteristic value corresponding to each word in the vulnerability code sample;

the second combination unit is used for combining the characteristic values corresponding to each word in the vulnerability code sample according to a word arrangement sequence to obtain an initial characteristic matrix corresponding to the vulnerability code sample;

the clustering unit is used for carrying out clustering analysis on the initial characteristic matrix corresponding to the vulnerability code sample to obtain a clustering result and a characteristic matrix corresponding to the vulnerability code sample;

and the type determining unit is used for acquiring the vulnerability type corresponding to the clustering result and taking the vulnerability type corresponding to the clustering result as the vulnerability type corresponding to the vulnerability code sample in the clustering result.

Optionally, the sample analysis submodule is specifically configured to:

carrying out difference analysis on the code sample and the code repairing sample, labeling the code sample based on the difference analysis result to obtain a labeling result, screening from the labeling result the marks indicating content that exists in the code sample and does not exist in the code repairing sample, and taking the code part corresponding to those marks in the code sample as the vulnerability code sample.

An electronic device, comprising: a memory and a processor;

wherein the memory is used for storing programs;

the processor calls a program and is used to perform the code detection method described above.

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a code detection method, a code detection device and electronic equipment. Target data to be subjected to code detection is acquired, and a data cleaning operation is performed on the target data to obtain first data. Each word in the first data is compared by position with the words in a preset word bank to determine a characteristic value corresponding to each word in the first data. The characteristic values are combined according to the word arrangement sequence to obtain a characteristic matrix corresponding to the target data, and a preset code detection model is called to process the characteristic matrix to obtain a vulnerability detection result of the target data. The preset code detection model is obtained by training on training samples, which comprise the vulnerability types and characteristic matrices of vulnerability code samples. Because the preset code detection model is trained on a large number of training samples, analyzing the target data with the trained model can improve the accuracy of vulnerability detection and meet information security requirements.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1 is a flowchart of a code detection method according to an embodiment of the present invention;

Fig. 2 is a flowchart of another code detection method according to an embodiment of the present invention;

Fig. 3 is a flowchart of a code detection method according to another embodiment of the present invention;

Fig. 4 is a flowchart of yet another code detection method according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a preset code detection model according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a code detection apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In recent years, information security incidents triggered by software have occurred continually, and people are concerned about protecting information infrastructure. The main cause of software vulnerabilities is the high complexity of software. Although some methods for detecting and analyzing software vulnerabilities have been proposed, software vulnerabilities have not been eliminated, and as the amount of software increases dramatically, vulnerabilities appear more and more frequently.

The current vulnerability code detection technology mainly has the following two problems:

1. The code inspection speed is limited. Automatic and batch detection cannot be realized, and inspection can hardly keep up with the rapid growth in the amount of code in today's internet and software industries.

2. The code inspection is highly coupled. The whole process, from discovering and verifying a vulnerability to exploiting it, depends mainly on the experience of a security engineer; the coupling across the process is high, vulnerability mining cannot be modularized, and this is also one of the factors limiting the code inspection speed.

In addition, existing static detection methods and some commercial tools also have room for improvement in the accuracy and false alarm rate of vulnerability detection. Most tools that are mature in engineering practice are based on fixed pattern matching and cannot achieve intelligent vulnerability detection.

To solve the problems that automatic detection cannot be achieved and that manual, experience-based detection has low accuracy, the inventors found through research that vulnerability detection can be carried out by means of artificial intelligence. With deep learning technology, intelligent, efficient, large-batch automatic detection of code vulnerabilities can be realized, which resolves the prominent contradiction between existing vulnerability detection methods and actual requirements on detection precision and efficiency, and also frees up manpower.

The invention provides a code detection method, a code detection device and electronic equipment. Target data to be subjected to code detection is acquired, and a data cleaning operation is performed on the target data to obtain first data. Each word in the first data is compared by position with the words in a preset word bank to determine a characteristic value corresponding to each word in the first data. The characteristic values are combined according to the word arrangement sequence to obtain a characteristic matrix corresponding to the target data, and a preset code detection model is called to process the characteristic matrix to obtain a vulnerability detection result of the target data. The preset code detection model is obtained by training on training samples, which comprise the vulnerability types and characteristic matrices of vulnerability code samples. Because the preset code detection model is trained on a large number of training samples, analyzing the target data with the trained model can improve the accuracy of vulnerability detection and meet information security requirements.

On the basis of the above, another embodiment of the present invention provides a code detection method, and with reference to fig. 1, the method may include:

S11, acquiring target data to be subjected to code detection, and performing a data cleaning operation on the target data to obtain first data.

In this embodiment, the vulnerability detection mode is a static detection mode, that is, the static code is directly detected, and in this embodiment, data to be subjected to code detection is referred to as target data.

To prevent useless data in the target data from affecting vulnerability detection, for example by reducing processing efficiency, a data cleaning operation is performed on the target data in this embodiment. The data cleaning operation may be, for example, removing comments; once it is complete, the first data is obtained.
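As a rough illustration of such a cleaning step, the sketch below removes C-style comments and blank lines from source text. The function name `clean_source` and the regular expression are assumptions made for this example; the embodiment only states that comments and the like may be removed.

```python
import re

# Matches C string/char literals (kept as-is) or comments (dropped).
_C_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'      # string literal
    r"|'(?:\\.|[^'\\])*'"     # character literal
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)

def clean_source(target_data: str) -> str:
    """Remove C comments and blank lines (hypothetical cleaning step)."""
    def keep_or_drop(match: re.Match) -> str:
        text = match.group(0)
        # A comment becomes one space so neighbouring tokens do not merge.
        return " " if text.startswith("/") else text

    without_comments = _C_TOKEN.sub(keep_or_drop, target_data)
    lines = [line.rstrip() for line in without_comments.splitlines()]
    return "\n".join(line for line in lines if line.strip())
```

Matching string literals in the same pattern keeps text such as `"// not a comment"` inside a string from being mistaken for a comment.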

S12, comparing the position of each word in the first data with the words in a preset word bank to determine a characteristic value corresponding to each word in the first data.

In this embodiment, the preset word library may be a word library used in training a preset code detection model, and words added in advance are stored in the word library, and the words are words extracted from a code vulnerability sample, and may be obtained based on a jieba word segmentation technique in natural language processing.

The words in the preset word bank are arranged in a designated order, which can be set according to actual conditions. For example, the words are arranged in alphabetical order of their initial letters, or the words in the same code vulnerability sample are placed in sequence and the words in different code vulnerability samples are arranged according to the order of the code vulnerability samples.

After the words in the preset lexicon are arranged in the designated order, each word in the preset lexicon has a sequence number; for example, the sequence number of the first word is 1, the sequence number of the second word is 2, and so on.
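The numbering described above can be sketched as follows. This is a minimal example under stated assumptions: the regex tokenizer `tokenize` is a simple stand-in for the word segmentation step (it is not jieba), and `build_word_bank` is a hypothetical helper name. Words are numbered from 1 in order of first appearance, which corresponds to the second ordering option mentioned above.

```python
import re

def tokenize(code: str) -> list[str]:
    """Split code into identifier and number words (stand-in for segmentation)."""
    return re.findall(r"[A-Za-z_]\w*|\d+", code)

def build_word_bank(samples: list[str]) -> dict[str, int]:
    """Map each distinct word to a 1-based sequence number.

    Words keep the order in which they first appear: words within one sample
    in sequence, and samples in the order they are given.
    """
    word_bank: dict[str, int] = {}
    for sample in samples:
        for word in tokenize(sample):
            if word not in word_bank:
                word_bank[word] = len(word_bank) + 1
    return word_bank
```

For instance, `build_word_bank(["strcpy(buf, src);", "free(buf);"])` assigns `strcpy` the number 1 and `buf` the number 2; `buf` is not numbered again when it reappears in the second sample.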

In another implementation manner of the present invention, referring to fig. 2, step S12 may include:

S21, determining the position of each word in the first data in a preset word bank.

Specifically, the operation is the same for each word in the first data when the determination of the feature value is made.

First, the position of a word in the preset lexicon can be determined based on a one-hot algorithm. Taking the word "Injection" as an example, the word is looked up in the preset lexicon; suppose it is found at the 2nd position.

S22, constructing initial feature information corresponding to each word in the first data, setting the mark corresponding to the position in the initial feature information as a first numerical value, and setting the mark not corresponding to the position as a second numerical value.

In this embodiment, encoding is performed using the TFIDF algorithm. Specifically, the initial feature information corresponding to a word contains as many values as there are words in the first data, and each value is initially a set value, for example zero; it may also be a random value.

If the number of words in the first data is 30, a matrix [0, 0, 0, 0, …, 0, 0, 0, 0, 0] containing 30 zeros is constructed.

Then, in the initial feature information, the identifier corresponding to the word's position in the preset lexicon is set to the first numerical value, which may be 1. The identifier in this embodiment is the value at that position. Taking a word at the 2nd position as an example, the second value is set to 1, and all remaining values take the second numerical value, for example 0.

The initial feature information corresponding to the word then becomes [0, 1, 0, 0, 0, 0, 0, …, 0, 0, 0, 0, 0].
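The construction in S22 can be sketched as below. The helper name `one_hot` is hypothetical, and the handling of a word that is missing from the lexicon, or whose position exceeds the vector length, is an assumption: the embodiment does not say what happens in those cases, so the sketch simply leaves the vector all zeros.

```python
def one_hot(word: str, word_bank: dict[str, int], length: int) -> list[int]:
    """Return `length` zeros with a 1 at the word's 1-based lexicon position."""
    vector = [0] * length            # second numerical value everywhere
    position = word_bank.get(word)   # 1-based sequence number, None if absent
    # Assumption: unknown words and out-of-range positions stay all zeros.
    if position is not None and position <= length:
        vector[position - 1] = 1     # first numerical value at the position
    return vector
```

With a lexicon in which "Injection" has sequence number 2, `one_hot("Injection", word_bank, 30)` gives a 30-element vector whose second element is 1.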

S23, under the condition that the number of words in the first data is smaller than a preset threshold, performing a data complementing operation on the initial characteristic information so that the data volume of the initial characteristic information is the preset threshold.

In this embodiment, the preset threshold is a maximum value of the number of words of the vulnerability code sample used when the preset code detection model is trained, for example, 50.

To keep the data uniform, in this embodiment data whose word counts differ is unified in size: data shorter than the preset threshold undergoes a data complementing operation so that the data volume of the initial characteristic information equals the preset threshold.

Still taking the initial feature information [0, 1, 0, 0, 0, 0, 0, …, 0, 0, 0, 0, 0] above as an example: the number of words is 30, which is less than 50, so 20 values need to be complemented; the complemented values may be -1.

The initial characteristic information after data complementation is as follows:

[0, 1, 0, 0, 0, 0, …, 0, 0, 0, 0, 0, -1, -1, …, -1].

If the number of words is larger than 50, only the first 50 words may be retained, and the values after the 50th word may be deleted.
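The complementing and truncation rule can be sketched in a few lines. The function name `fit_to_threshold` and its default arguments are assumptions for illustration; the values 50 and -1 are the examples given in the text.

```python
def fit_to_threshold(vector: list[int], threshold: int = 50,
                     pad_value: int = -1) -> list[int]:
    """Pad with `pad_value` up to `threshold`, or keep only the first values."""
    if len(vector) >= threshold:
        return vector[:threshold]    # drop everything after the threshold
    return vector + [pad_value] * (threshold - len(vector))
```

A 30-element vector gains 20 trailing -1 values, while a 60-element vector is cut back to its first 50 values.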

It should be noted that, the data complementing operation may also be performed first, and then the operation of "setting the identifier corresponding to the position in the initial feature information as a first numerical value, and setting the identifier not corresponding to the position as a second numerical value" is performed.

S24, taking the initial characteristic information corresponding to each word in the first data after data complementation as the characteristic value corresponding to the word.

After the initial characteristic information subjected to data complementation is determined, the initial characteristic information subjected to data complementation is the characteristic value corresponding to the word.

S13, combining the characteristic values corresponding to each word in the first data according to the word arrangement sequence to obtain a characteristic matrix corresponding to the target data.

In this embodiment, after determining the feature value corresponding to each word in the first data, the feature values corresponding to each word are combined according to the arrangement order of the words in the first data, so as to obtain the feature matrix corresponding to the target data. For example, the feature matrix may be:

S14, calling a preset code detection model to process the feature matrix to obtain the vulnerability detection result of the target data.

The preset code detection model is obtained based on training of a training sample; the training sample comprises a vulnerability type and a characteristic matrix of the vulnerability code sample.

After a preset code detection model is obtained based on training of the training sample, the feature matrix is input into the preset code detection model, and a vulnerability detection result of the target data can be obtained. The vulnerability detection result may be whether a vulnerability exists or not, and when a vulnerability exists, the vulnerability detection result further includes a vulnerability type.
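Steps S13 and S14 can be pictured with the sketch below, which stacks the per-word feature values in word order and hands the result to a model. Everything named here is hypothetical: `feature_value`, `build_feature_matrix` and `detect` are illustrative helpers, the model is represented only as a callable that returns a vulnerability type or `None`, and keeping only the first `threshold` words is one possible reading of the truncation rule.

```python
from typing import Callable, Optional

Matrix = list[list[int]]

def feature_value(word: str, word_bank: dict[str, int],
                  n_words: int, threshold: int) -> list[int]:
    """One-hot vector of length n_words, padded with -1 up to threshold."""
    vector = [0] * n_words
    position = word_bank.get(word)   # 1-based position, None if absent
    if position is not None and position <= n_words:
        vector[position - 1] = 1
    return vector + [-1] * (threshold - n_words)

def build_feature_matrix(words: list[str], word_bank: dict[str, int],
                         threshold: int = 50) -> Matrix:
    """Stack the feature values in the order the words appear (S13)."""
    words = words[:threshold]        # assumption: keep the first words only
    return [feature_value(w, word_bank, len(words), threshold) for w in words]

def detect(words: list[str], word_bank: dict[str, int],
           model: Callable[[Matrix], Optional[str]],
           threshold: int = 50) -> Optional[str]:
    """Pass the feature matrix to the trained model (S14)."""
    return model(build_feature_matrix(words, word_bank, threshold))
```

In this sketch a return value of `None` stands for "no vulnerability found", and any string stands for the detected vulnerability type.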

In this embodiment, target data to be subjected to code detection is acquired, and a data cleaning operation is performed on the target data to obtain first data. Each word in the first data is compared by position with the words in a preset word bank to determine a feature value corresponding to each word in the first data. The feature values are combined according to the word arrangement sequence to obtain a feature matrix corresponding to the target data, and a preset code detection model is called to process the feature matrix to obtain a vulnerability detection result of the target data. The preset code detection model is obtained by training on training samples, which comprise the vulnerability types and feature matrices of vulnerability code samples. Because the preset code detection model is trained on a large number of training samples, analyzing the target data with the trained model can improve the accuracy of vulnerability detection and meet information security requirements.

The above embodiment mentioned the preset code detection model, and now the generation process of the preset code detection model is described, referring to fig. 3, which may include:

S31, acquiring a code sample and a code repair sample corresponding to the code sample in a preset data acquisition mode.

In this embodiment, vulnerability data comes from two sources: it is obtained from CVE Details (CVE security vulnerability database: security vulnerabilities, exploits, references and more) or downloaded directly from SARD (Software Assurance Reference Dataset Project). Three different software applications, OpenSSL, Binutils and Linux, are selected from the Cvedetails website, and vulnerability code C files belonging to different CWE types are acquired; these are called code samples. An acquired vulnerability C file usually contains statements irrelevant to the known vulnerability, which seriously affect the accuracy of the model and cause false positives and missed detections. Therefore, in addition to crawling the vulnerability code C file, the C file obtained by repairing that vulnerability code C file also needs to be acquired; in this embodiment it is referred to as the code repair sample corresponding to the code sample.

S32, performing difference analysis on the code sample and the code repairing sample to obtain a vulnerability code sample which corresponds to the code sample and exists in the code sample but does not exist in the code repairing sample.

In this embodiment, the differences between the code sample and the code repair sample are analyzed; a difference that exists in the code sample and does not exist in the code repair sample is a vulnerability code sample.

Specifically, step S32 includes:

1) Performing difference analysis on the code sample and the code repairing sample, and labeling the code sample based on the difference analysis result to obtain a labeling result.

Specifically, a diff technique is used to analyze the code sample and the code repair sample, and a denoised difference patch is obtained through processing; the patch comprises the diff parts marked with "+" and "-". The diff parts fall into two groups, the differing part of the code sample and the differing part of the code repair sample, which are marked with "+" and "-": "+" indicates that a statement belongs to the vulnerability repair part, and "-" indicates that a statement belongs to the vulnerability part. The patch carrying these marks is the labeling result.
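One way to picture this labeling is with Python's standard `difflib` module, used here as a stand-in because the text does not name a specific diff tool. The function name `label_differences` is hypothetical. The sketch keeps only the lines that differ and pairs each one with its "+" or "-" mark.

```python
import difflib

def label_differences(code_sample: str, repair_sample: str) -> list[tuple[str, str]]:
    """Return (mark, statement) pairs for the lines that differ.

    "-" marks a statement found only in the code sample (vulnerability part);
    "+" marks a statement found only in the repair sample (repair part).
    """
    labelled = []
    for line in difflib.ndiff(code_sample.splitlines(),
                              repair_sample.splitlines()):
        mark, text = line[:2], line[2:]
        # ndiff also emits "  " (unchanged) and "? " (hint) lines; skip them.
        if mark in ("- ", "+ "):
            labelled.append((mark[0], text.strip()))
    return labelled
```

Dropping the unchanged and hint lines plays the role of the denoising mentioned above, so that only the marked statements remain in the result.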

2) Screening the labeling result for the marks indicating content that exists in the code sample and does not exist in the code repairing sample, and taking the code part corresponding to those marks in the code sample as the vulnerability code sample.

In this embodiment, xpath regular matching is used to screen out the code parts marked with "-", which are taken as vulnerability code segments. To weaken the interference of vulnerability-irrelevant statements and improve the accuracy of the model, data cleaning operations such as removing comments may be performed on the vulnerability code segments to obtain the vulnerability code samples. Finally, 407 and 3628 vulnerability code samples were obtained from the Cvedetails website and the SARD website, respectively.
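The screening of "-" statements can be illustrated with a plain prefix filter. This is a simplification, not the xpath regular matching named in the text, and it assumes the patch has already been denoised so that it contains no file header lines. The function name `extract_vulnerable_code` is hypothetical.

```python
def extract_vulnerable_code(patch_lines: list[str]) -> str:
    """Join the statements marked "-" in a denoised patch.

    Assumption: `patch_lines` holds only marked statements and context lines,
    with no "---" or "+++" file headers left in it.
    """
    vulnerable = [line[1:].strip() for line in patch_lines
                  if line.startswith("-")]
    # Drop removed lines that were empty apart from the mark itself.
    return "\n".join(statement for statement in vulnerable if statement)
```

For a patch containing `-strcpy(buf, src);` and `+strncpy(buf, src, sizeof(buf) - 1);`, only the `strcpy` statement is returned as the vulnerability code segment.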

S33, determining a patching code sample corresponding to the vulnerability code sample from the code repair sample.

In this step, the statements marked with "+" in the patch are the patching code sample corresponding to the vulnerability code sample.

S34, determining a feature matrix and a vulnerability type corresponding to the vulnerability code sample, and determining a feature matrix and a vulnerability type corresponding to the patch code sample.

In another implementation manner of the present invention, referring to fig. 4, step S34 may include:

S41, combining each word in the vulnerability code sample to obtain a preset word bank.

The generation process of the preset lexicon in the embodiment has been described in the above corresponding parts, please refer to the corresponding description in the above embodiment.

S42, performing position comparison on each word in the vulnerability code sample and words in a preset word bank to determine a characteristic value corresponding to each word in the vulnerability code sample.

S43, combining the characteristic values corresponding to each word in the vulnerability code sample according to a word arrangement sequence to obtain an initial characteristic matrix corresponding to the vulnerability code sample.

The generation process of the initial feature matrix has been explained in the above embodiments, and please refer to the corresponding descriptions in the above embodiments.

S44, performing clustering analysis on the initial feature matrix corresponding to the vulnerability code sample to obtain a clustering result and a feature matrix corresponding to the vulnerability code sample.

Before the cluster analysis is carried out, a deduplication operation may be performed as data cleaning on the initial feature matrix corresponding to the vulnerability code sample, so as to reduce repeated data.

For feature selection, K-Means is used to determine the feature matrix. The core process of K-Means is as follows: to divide the samples into K classes, K samples are randomly selected, and the initial feature matrices of these K samples are used as the initial cluster centers, with different cluster centers corresponding to different classes. The distance between each sample and each of the selected cluster centers is then calculated, so that each sample has K distances; the minimum distance identifies the nearest cluster center, and the sample is assigned to the class of that center. The mean of all samples in each class is then calculated and used as the new cluster center. This process is repeated until the new cluster centers equal the old ones, at which point the iteration ends.
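The iteration described above can be sketched in plain Python as follows. This is a minimal illustration only: it assumes each sample has been flattened into a numeric vector of equal length, uses squared Euclidean distance, and keeps the old center when a cluster becomes empty. None of these choices is specified in the embodiment.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(samples, k, seed=0, max_iter=100):
    """Return (centers, labels) for a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    # K randomly selected samples serve as the initial cluster centers.
    centers = [list(s) for s in rng.sample(samples, k)]
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Assign every sample to the class of its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(s, centers[c]))
            for s in samples
        ]
        # The mean of all samples in a class becomes the new center.
        new_centers = []
        for c in range(k):
            members = [s for s, label in zip(samples, labels) if label == c]
            if not members:  # assumed handling of an empty cluster
                new_centers.append(centers[c])
                continue
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        if new_centers == centers:  # new and old centers are equal: stop
            break
        centers = new_centers
    return centers, labels


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
centers, labels = k_means(points, k=2)
print(labels)
```

On this toy data the three points near the origin end up in one class and the three points near (9, 9) in the other.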

After the iteration ends, the clustering results are obtained. The vulnerability code samples in each clustering result correspond to the same vulnerability type, and the specific vulnerability type of each clustering result, such as memory leak, is then determined manually.

In addition, through the continuous iterative clustering, the initial feature matrix corresponding to the vulnerability code sample changes continuously, and the values in the resulting feature matrix lie between 0 and 1.

The K-Means-based clustering is also used to verify whether samples of the same CWE type share a certain similarity, which makes the accuracy evaluation of the prediction model more convincing.

S45, acquiring the vulnerability type corresponding to the clustering result, and taking the vulnerability type corresponding to the clustering result as the vulnerability type corresponding to the vulnerability code sample in the clustering result.

The vulnerability type here is the manually determined vulnerability type described above. The vulnerability type corresponding to a clustering result is taken as the vulnerability type of every vulnerability code sample in that clustering result, which determines the vulnerability type of each vulnerability code sample; these are all known vulnerability types.

The feature matrix corresponding to the patch code sample is determined in the same way as the feature matrix corresponding to the vulnerability code sample, and the vulnerability type corresponding to every patch code sample is uniformly set to "no vulnerability".

S35, training a preset code detection model by using the feature matrices and vulnerability types respectively corresponding to the vulnerability code sample and the patch code sample, and stopping training when the training stop condition is met.

In this embodiment, so as not to depend on program-related prior knowledge, code similarity matching is performed with a deep learning method, and the preset code detection model may be an LSTM model. By continuously adjusting the hyper-parameters and verifying the model accuracy, it is evaluated which detection model suits the different vulnerability codes collected, and the experimental results are analyzed.

The LSTM model in the invention uses the Keras framework, and an eight-layer neural network model is designed. With reference to FIG. 5, each layer of the neural network is described below:

First layer: Masking layer. One advantage of LSTM is that it can handle variable-length sequences. When a model is built with Keras, however, the input size must be specified if the LSTM layer is used directly as the first layer receiving the network input. To use variable-length sequences, a Masking layer or an Embedding layer only needs to be added in front of the LSTM layer. First, the sequences are converted into fixed-length sequences, for example by selecting the maximum sequence length and padding every shorter sequence with -1. The padding value is then specified as mask_value in the Masking layer.

It should be noted that the above data complementing operation may be implemented by the Masking layer. If the data complementing operation has already been performed when determining the feature values, the Masking layer may be omitted; otherwise, the data complementing operation is performed by the Masking layer. X1 ... Xn in FIG. 5 denote the model inputs, such as the feature matrices described above.
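The padding and masking idea can be sketched in plain Python as follows. This is a minimal illustration with hypothetical helper names (`pad_rows`, `step_mask`): a feature matrix is padded with rows of -1 up to a fixed length, and a time step is treated as masked when every value in its row equals the padding value, which mirrors how the Keras Masking layer interprets mask_value.

```python
PAD = -1  # padding value named in the text; also the assumed mask_value


def pad_rows(matrix, max_len, pad=PAD):
    """Append all-`pad` rows until the matrix has `max_len` rows."""
    width = len(matrix[0])
    filler = [[pad] * width for _ in range(max_len - len(matrix))]
    return [list(row) for row in matrix] + filler


def step_mask(matrix, pad=PAD):
    """True for a real time step, False for a row made only of padding."""
    return [any(value != pad for value in row) for row in matrix]


short = [[1, 0, 0], [0, 0, 1]]        # a two-word sample, three-word bank
padded = pad_rows(short, max_len=4)
keep = step_mask(padded)
print(padded)
print(keep)
```

Downstream layers would then skip the time steps for which `keep` is False. The sketch does not truncate samples longer than `max_len`.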

Second to fifth layers: LSTM layers. The long short-term memory (LSTM) network has a unique gate mechanism and can overcome the inability of MLP and CNN to process variable-length sequences and long-distance dependencies. The encoding of the current word depends on the intermediate state and output value produced by the LSTM encoding of the previous word. A multilayer LSTM is formed by stacking LSTMs; its advantages are that higher layers express features more abstractly, the number of neurons is reduced, the recognition accuracy is increased, and the training time is reduced. In addition, part of the weights are discarded when constructing each LSTM layer to prevent overfitting.

Sixth layer: Dropout layer. Dropout means that, during deep learning training, some neurons are randomly selected with a certain probability and their role in the network is temporarily ignored, which mitigates overfitting. Each application of Dropout is equivalent to extracting a thinner network from the original network.

Seventh layer: Batch Normalization layer. Learning the data distribution is the essence of the neural network learning process, and a difference between the distributions of the training and test sets reduces the generalization capability of the network. If each batch of training data has a different distribution, the network has to learn a different distribution at every iteration, which also lowers the training speed. The data therefore needs to be normalized. The Batch Normalization algorithm was introduced to solve the problem that the distribution of intermediate-layer data changes during training.

Eighth layer: fully connected Dense layer. Through a linear operation and a Softmax activation function, it outputs the probability of the sample belonging to each category, and the category is determined from these probability values.
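The computation of the eighth layer can be illustrated with a small plain-Python sketch. The weights, bias, and input below are made-up numbers, and taking the category with the highest probability as the prediction is an assumption based on common practice rather than something stated in the embodiment.

```python
import math


def dense(x, weights, bias):
    """Linear operation: weights[j] is the weight vector of output unit j."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn scores into probabilities that sum to 1 (shifted for stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


x = [1.0, 2.0]                                   # hypothetical input features
weights = [[0.5, 0.5], [1.0, -1.0], [0.0, 0.0]]  # three categories
bias = [0.0, 0.0, 0.0]

probs = softmax(dense(x, weights, bias))
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

The linear scores here are 1.5, -1.0 and 0.0, so the first category receives the largest probability.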

With the eight-layer neural network model designed above, the preset code detection model is trained using the feature matrices and vulnerability types respectively corresponding to the vulnerability code sample and the patch code sample, and training stops when the training stop condition is met.
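One possible way to express the eight-layer stack in Keras is sketched below. The layer order follows the description; every size and rate (64 units, dropout 0.2 and 0.5, 5 output categories), the optimizer, and the loss are assumptions chosen for illustration, and `LAYER_PLAN` and `build_model` are hypothetical names. The Keras import is deferred into the function so that the layer plan can be inspected without TensorFlow installed.

```python
# (Keras layer class name, keyword arguments) for the eight layers.
# All hyper-parameter values are illustrative assumptions.
LAYER_PLAN = [
    ("Masking", {"mask_value": -1.0}),
    ("LSTM", {"units": 64, "return_sequences": True, "dropout": 0.2}),
    ("LSTM", {"units": 64, "return_sequences": True, "dropout": 0.2}),
    ("LSTM", {"units": 64, "return_sequences": True, "dropout": 0.2}),
    ("LSTM", {"units": 64, "dropout": 0.2}),  # last LSTM returns one vector
    ("Dropout", {"rate": 0.5}),
    ("BatchNormalization", {}),
    ("Dense", {"units": 5, "activation": "softmax"}),
]


def build_model(vocab_size):
    """Assemble the planned layers into a Keras Sequential model."""
    from tensorflow import keras  # deferred import: requires TensorFlow

    # None as the time dimension allows sequences of any (padded) length.
    model = keras.Sequential([keras.Input(shape=(None, vocab_size))])
    for name, kwargs in LAYER_PLAN:
        model.add(getattr(keras.layers, name)(**kwargs))
    model.compile(
        optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]
    )
    return model


for number, (name, kwargs) in enumerate(LAYER_PLAN, start=1):
    print(number, name, kwargs)
```

Calling `build_model(vocab_size)` with the size of the preset word bank would return a compiled model that can be fitted on the padded feature matrices and one-hot encoded vulnerability types.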

During modeling, 80% of the vulnerability code samples from three applications, Linux, OpenSSL and Binutils, are used as the training set, and the rest are used as the validation set to verify the prediction accuracy.
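An 80/20 split of this kind can be sketched as follows. Shuffling with a fixed seed before cutting is an assumption, since the embodiment does not state how the 80% is selected, and `split_samples` is a hypothetical helper name.

```python
import random


def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into training and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_set, validation_set = split_samples(range(10))
print(len(train_set), len(validation_set))
```

With ten samples this yields eight for training and two for validation, and every sample appears in exactly one of the two sets.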

In order to verify the accuracy of the preset code detection model, in this embodiment, an MLP model and a CNN model are used for comparison, so as to prove the validity of the LSTM model.

On the basis of this embodiment, a front-end page is built as a web page to better demonstrate the algorithm process. For the Web server framework, Flask, a typical Python-based Web micro-framework, was chosen. The visualization is implemented as follows. A shell file calls the app.py file to find the Flask entry point. An instance of the Flask class is then created, the static file location is specified, and the different services are registered with Flask using register_blueprint. Generating HTML directly in Python is rather cumbersome because the HTML must be escaped manually to keep the application secure, so the template is specified in the first blueprint, view.py, and rendered with the render_template() function for display on the front end. The location of the js files specified in the html corresponds to the static file location specified in Flask. After the html page is loaded, the browser calls back the functions in the js, which bind trigger functions to elements by their ids and respond accordingly. The functions provided include: crawling the detailed information from cvedetails; obtaining, via git_url, the vulnerable C file and the repaired C file; processing the data; switching the data source between crawled (spider) data and downloaded data; and switching the model service among CNN, LSTM and a simple MLP. The parameters required by these functions are entered on the html page; the js reads the input values and passes them to the second blueprint; the Flask blueprint passes the parameters on to the other functions, such as the model; and the js returns the result to the front-end page. The obtained vulnerable C file and repaired C file are displayed together with the result.

To crawl the vulnerable C file and the repaired C file of an application, the user clicks the crawler button; the log output by the console is displayed on the right side, and the path where the output is saved and the visualized data are displayed as the result.

In this embodiment, LSTM is used to detect vulnerabilities, which improves the detection accuracy. Flask is used to interact with the front end, which realizes interface visualization and makes the method easier to use. In addition, a Scrapy crawler is used to complete the preliminary construction of the vulnerability library instead of directly using existing data. A structured representation of the vulnerability code segments is also realized: for the collected vulnerability codes, vulnerability-related blacklist statistics are compiled based on the jieba word segmentation technique from natural language processing, and a blacklist-based way of determining the feature values of vulnerability code segments is then designed using TF-IDF.

Optionally, on the basis of the above embodiment of the code detection method, another implementation manner of the present invention provides a code detection apparatus, and with reference to fig. 6, the code detection apparatus may include:

the data processing module 11 is configured to acquire target data to be subjected to code detection, and perform data cleaning operation on the target data to obtain first data;

the characteristic determining module 12 is configured to compare positions of each word in the first data with words in a preset word bank to determine a characteristic value corresponding to each word in the first data;

a matrix determining module 13, configured to combine feature values corresponding to each word in the first data according to a word arrangement order to obtain a feature matrix corresponding to the target data;

the vulnerability detection module 14 is used for calling a preset code detection model to process the feature matrix to obtain a vulnerability detection result of the target data; the preset code detection model is obtained based on training of a training sample; the training sample comprises a vulnerability type and a characteristic matrix of the vulnerability code sample.

Further, the apparatus also comprises a model generation module, wherein the model generation module comprises:

Further, the vulnerability determination submodule includes:

Further, the sample analysis submodule is specifically configured to:

Further, the feature determination module 12 includes:

the position determining submodule is used for determining the position of each word in the first data in a preset word bank;

the numerical value determining submodule is used for constructing initial characteristic information corresponding to each word in the first data, setting an identifier corresponding to the position in the initial characteristic information as a first numerical value, and setting an identifier not corresponding to the position as a second numerical value;

the data complementing sub-module is used for carrying out data complementing operation on the initial characteristic information under the condition that the number of the words in the first data is smaller than a preset threshold value, so that the data amount of the initial characteristic information is equal to the preset threshold value;

and the characteristic value determining submodule is used for taking initial characteristic information corresponding to each word in the first data after data complementation as a characteristic value corresponding to the word.

It should be noted that, for the working processes of each module, sub-module, and unit in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.

Optionally, on the basis of the embodiments of the code detection method and apparatus, another implementation manner of the present invention provides an electronic device, including: a memory and a processor;

wherein the memory is used for storing programs;

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A code detection method, characterized by comprising:

acquiring target data to be subjected to code detection, and performing a data cleaning operation on the target data to obtain first data;

performing a position comparison between each word in the first data and the words in a preset word bank to determine a feature value corresponding to each word in the first data;

combining the feature values corresponding to each word in the first data according to the word arrangement order to obtain a feature matrix corresponding to the target data;

calling a preset code detection model to process the feature matrix to obtain a vulnerability detection result of the target data; wherein the preset code detection model is obtained by training based on training samples, and the training samples include vulnerability types and feature matrices of vulnerability code samples.

2. The code detection method according to claim 1, wherein the generation process of the preset code detection model comprises:

obtaining a code sample and a code repair sample corresponding to the code sample by using a preset data acquisition method;

performing a difference analysis on the code sample and the code repair sample to obtain a vulnerability code sample corresponding to the code sample, the vulnerability code sample existing in the code sample and not existing in the code repair sample;

determining a patch code sample corresponding to the vulnerability code sample from the code repair sample;

determining the feature matrix and vulnerability type corresponding to the vulnerability code sample, and determining the feature matrix and vulnerability type corresponding to the patch code sample;

training the preset code detection model by using the feature matrices and vulnerability types respectively corresponding to the vulnerability code sample and the patch code sample, and stopping the training when a training stop condition is met.

3. The code detection method according to claim 2, wherein determining the feature matrix and vulnerability type corresponding to the vulnerability code sample comprises:

combining each word in the vulnerability code sample to obtain a preset word bank;

performing a position comparison between each word in the vulnerability code sample and the words in the preset word bank to determine the feature value corresponding to each word in the vulnerability code sample;

combining the feature values corresponding to each word in the vulnerability code sample according to the word arrangement order to obtain an initial feature matrix corresponding to the vulnerability code sample;

performing cluster analysis on the initial feature matrix corresponding to the vulnerability code sample to obtain a clustering result and a feature matrix corresponding to the vulnerability code sample;

acquiring the vulnerability type corresponding to the clustering result, and taking the vulnerability type corresponding to the clustering result as the vulnerability type corresponding to the vulnerability code sample in the clustering result.

4. The code detection method according to claim 2, wherein performing a difference analysis on the code sample and the code repair sample to obtain a vulnerability code sample corresponding to the code sample, which exists in the code sample and does not exist in the code repair sample, comprises:

performing a difference analysis on the code sample and the code repair sample, and labeling the code sample based on the difference analysis result to obtain a labeling result;

screening out, from the labeling result, the identifier indicating presence in the code sample and absence from the code repair sample, and taking the code part corresponding to the identifier in the code sample as the vulnerability code sample.

5. The code detection method according to claim 1, wherein performing a position comparison between each word in the first data and the words in a preset word bank to determine the feature value corresponding to each word in the first data comprises:

determining the position of each word in the first data in the preset word bank;

constructing initial feature information corresponding to each word in the first data, setting the identifier corresponding to the position in the initial feature information to a first value, and setting the identifiers not corresponding to the position to a second value;

when the number of words in the first data is less than a preset threshold, performing a data complementing operation on the initial feature information so that the data amount of the initial feature information equals the preset threshold;

taking the initial feature information, after data complementing, corresponding to each word in the first data as the feature value corresponding to the word.

6. A code detection device, characterized by comprising:

a data processing module, configured to acquire target data to be subjected to code detection, and perform a data cleaning operation on the target data to obtain first data;

a feature determination module, configured to perform a position comparison between each word in the first data and the words in a preset word bank to determine a feature value corresponding to each word in the first data;

a matrix determination module, configured to combine the feature values corresponding to each word in the first data according to the word arrangement order to obtain a feature matrix corresponding to the target data;

a vulnerability detection module, configured to call a preset code detection model to process the feature matrix to obtain a vulnerability detection result of the target data; wherein the preset code detection model is obtained by training based on training samples, and the training samples include vulnerability types and feature matrices of vulnerability code samples.

7. The code detection device according to claim 6, further comprising a model generation module, the model generation module comprising:

a sample acquisition submodule, configured to acquire a code sample and a code repair sample corresponding to the code sample through a preset data acquisition method;

a sample analysis submodule, configured to perform a difference analysis on the code sample and the code repair sample to obtain a vulnerability code sample corresponding to the code sample, the vulnerability code sample existing in the code sample and not existing in the code repair sample;

a sample determination submodule, configured to determine a patch code sample corresponding to the vulnerability code sample from the code repair sample;

a vulnerability determination submodule, configured to determine the feature matrix and vulnerability type corresponding to the vulnerability code sample, and determine the feature matrix and vulnerability type corresponding to the patch code sample;

a model training submodule, configured to train the preset code detection model by using the feature matrices and vulnerability types respectively corresponding to the vulnerability code sample and the patch code sample, and stop the training when a training stop condition is met.

8. The code detection device according to claim 7, wherein the vulnerability determination submodule comprises:

a first combining unit, configured to combine each word in the vulnerability code sample to obtain a preset word bank;

a comparison unit, configured to perform a position comparison between each word in the vulnerability code sample and the words in the preset word bank to determine the feature value corresponding to each word in the vulnerability code sample;

a second combining unit, configured to combine the feature values corresponding to each word in the vulnerability code sample according to the word arrangement order to obtain an initial feature matrix corresponding to the vulnerability code sample;

a clustering unit, configured to perform cluster analysis on the initial feature matrix corresponding to the vulnerability code sample to obtain a clustering result and a feature matrix corresponding to the vulnerability code sample;

a type determination unit, configured to acquire the vulnerability type corresponding to the clustering result, and take the vulnerability type corresponding to the clustering result as the vulnerability type corresponding to the vulnerability code sample in the clustering result.

9. The code detection device according to claim 7, wherein the sample analysis submodule is specifically configured to:

perform a difference analysis on the code sample and the code repair sample, label the code sample based on the difference analysis result to obtain a labeling result, screen out from the labeling result the identifier indicating presence in the code sample and absence from the code repair sample, and take the code part corresponding to the identifier in the code sample as the vulnerability code sample.

10. An electronic device, comprising: a memory and a processor;

wherein the memory is configured to store a program;

the processor is configured to invoke the program to execute the code detection method according to any one of claims 1-5.