CN108549817A

CN108549817A - A software security vulnerability prediction method based on text deep learning

Info

Publication number: CN108549817A
Application number: CN201810353774.0A
Authority: CN
Inventors: 危胜军; 钟浩; 单纯; 胡昌振; 牛中盈
Original assignee: Beijing Institute of Technology BIT; Beijing Institute of Computer Technology and Applications
Current assignee: Beijing Institute of Technology BIT; Beijing Institute of Computer Technology and Applications
Priority date: 2018-04-19
Filing date: 2018-04-19
Publication date: 2018-09-18

Abstract

The present invention is a software security vulnerability prediction method based on text deep learning. It uses a deep neural network model together with a shallow machine learning algorithm to learn features and knowledge from the text of historical software source code, and can be used to predict security vulnerabilities in new software source code. The deep neural network model learns the structural features of the word features of the source code text. The learned features serve as the input of a classifier, and the classifier is trained and tuned to obtain an optimal vulnerability prediction model, which is used to predict vulnerabilities in new software modules of the same software.

Description

A software security vulnerability prediction method based on text deep learning

Technical field

The present invention relates to a software security vulnerability prediction method based on text deep learning, and belongs to the technical field of software security vulnerability prediction.

Background art

Software security vulnerability prediction estimates in advance the likelihood that a software source code module contains vulnerabilities, or the number of vulnerabilities it contains. Guided by the prediction results, software developers can direct limited time and funding toward testing the modules that are most likely to contain vulnerabilities or are predicted to contain many of them, which improves the efficiency of software testing.

At present, software vulnerability prediction models are commonly built with shallow machine learning methods. The process is shown in Fig. 1:

1. Establishing the metrics of the software source code modules

There are currently two main ways to establish the metrics of a software source code module. The first uses software code quality metrics, for example the CK metrics for object-oriented programs (WMC, DIT, NOC, CBO, RFC, LCOM, etc.), code change metrics from the software development process, developer experience metrics, inter-module dependency metrics, and metrics of the soundness of the project team's organizational structure. We call these metrics based on software indicators. The second treats the software source code as text and uses the frequency with which each word occurs in the text as a metric. We call these metrics based on code text words.

2. Establishing the historical software vulnerability database

All vulnerabilities disclosed to date for a given software project are collected from public software vulnerability databases to build a vulnerability database for that project. The database specifies the location and number of vulnerabilities in each software module of the project.

The software vulnerability database provides historical knowledge of how vulnerabilities are distributed in the software project.

3. Training and testing the machine learning model for software vulnerability prediction

For a given software project, the metric values of each software module are computed, and the historical vulnerability database is used to obtain, for each module, either a label indicating whether it contains vulnerabilities or the number of vulnerabilities it contains. A machine learning algorithm suited to the project is then selected (no published data so far indicate that deep learning algorithms have been used; the algorithms used are all based on shallow learning). With the metric values as input and the vulnerability label or vulnerability count as output, a machine learning model for predicting vulnerabilities in the project is established through training, testing, and parameter tuning.

4. Applying the machine learning model for software vulnerability prediction

A vulnerability prediction model that has been trained and has passed testing can predict the vulnerability status of new software modules of the project. The metric values of the new software module are computed first and fed into the prediction model. The model outputs the likelihood that the module contains vulnerabilities, or the number of vulnerabilities it contains.

Throughout the modeling process, three factors affect the performance of the prediction model: the choice of metrics, the quality of the vulnerability database, and the specific machine learning algorithm. The chosen metrics should reflect the essential characteristics that separate vulnerable modules from non-vulnerable ones; that is, the metrics must be able to discriminate between the two. The quality of the software vulnerability database also strongly affects model performance, so the database should be accurate and reasonably complete. Machine learning algorithms differ in performance as well, so an algorithm suited to the project at hand should be selected. These three factors are interrelated and interact, and together they determine the performance of the prediction model.

When the quality of the historical vulnerability database is fixed, different machine learning algorithms perform quite differently for a given class of metrics. Many machine learning algorithms have been tried, but all of them are shallow machine learning methods, and no published data indicate that anyone has used deep learning methods.

Existing data show that, in most cases, metrics based on code text words predict markedly better than metrics based on software indicators. In practice, however, prediction methods that use code text words as metrics usually have very high-dimensional feature vectors, sometimes exceeding 10,000 dimensions, and these features are sparse. In such cases the commonly used shallow machine learning algorithms learn features poorly compared with deep learning, which directly limits the quality of vulnerability prediction. A large body of research and practice has shown that, when the feature dimension is very high and the features are very sparse, deep learning learns substantially better than shallow learning, because deep learning can automatically learn the structural features within high-dimensional features and thereby compress the features and reduce their dimensionality.

Summary of the invention

The present invention is a software security vulnerability prediction method based on text deep learning. It uses a deep neural network model together with a shallow machine learning algorithm to learn features and knowledge from the text of historical software source code, and can be used to predict security vulnerabilities in new software source code.

The invention is realized through the following technical solution:

A software security vulnerability prediction method based on text deep learning operates on the source code of a software project, with each software module as the processing unit. The number of times each word occurs in the entire text is counted, and the counts are normalized to obtain the frequency of each word. The words and their frequencies form the feature vector of the source code text. This feature vector is the input of a feature learner built on a deep neural network structure, which learns the structural features of the feature vector. The structural features are the input of a classifier, whose parameters are trained and tuned to obtain an optimal vulnerability prediction model, which is used to predict vulnerabilities in new software modules of the software.

Further, the feature vector is extracted as follows: with each software module of the prediction object as the processing unit, the punctuation marks and code comments in the source code text are removed first; each word is then extracted from the remaining text using spaces as separators; the number of times each word occurs in the entire text is counted; finally the counts are normalized, and the result is the frequency of each word, which yields the feature vector of the source code text.

Further, the feature vector of the source code text uses the following representation:

ComponentName: (Item_1: Number_1; Item_2: Number_2; …; Item_n: Number_n)

Here ComponentName is the name of the module, Item_1 is the 1st word and Number_1 is the frequency with which the 1st word occurs, Item_2 is the 2nd word and Number_2 is the frequency with which the 2nd word occurs, and so on; Item_n is the n-th word and Number_n is the frequency with which the n-th word occurs.
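The extraction and representation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `extract_word_frequencies` and `format_feature_vector` are hypothetical, the comment patterns assume C-style comments, and "normalization" is taken to mean dividing each count by the total number of words, which the text does not specify.

```python
import re
from collections import Counter


def extract_word_frequencies(source):
    """Return {word: relative frequency} for one module's source text."""
    # Assumption: C-style comments. A regex does not understand string
    # literals, so a "//" inside a string would be treated as a comment.
    no_block = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    no_comments = re.sub(r"//[^\n]*", " ", no_block)
    # Replace punctuation with spaces; letters, digits and "_" are kept.
    cleaned = re.sub(r"[^\w\s]", " ", no_comments)
    counts = Counter(cleaned.split())  # split on whitespace
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: normalize by the total word count of the module.
    return {word: count / total for word, count in counts.items()}


def format_feature_vector(component_name, frequencies):
    """Render 'ComponentName: (Item_1: Number_1; ...; Item_n: Number_n)'."""
    items = "; ".join(f"{word}: {freq:.4f}" for word, freq in frequencies.items())
    return f"{component_name}: ({items})"


code = "int a = b; // set a\n/* note */ a = a + b;"
print(format_feature_vector("demo.c", extract_word_frequencies(code)))
```

Words appear in order of first occurrence, so the comment words `set` and `note` never reach the vector while `a`, which occurs three times out of six words, receives the frequency 0.5.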

Further, the classifier uses a shallow machine learning algorithm.

Further, the output of the classifier is a binary classification, i.e. whether or not the module of the prediction object contains vulnerabilities.

Further, multiple vulnerability prediction models are established for the prediction object using different deep neural network models and different classifier algorithms, and their performance metric values are compared to determine the optimal vulnerability prediction model.

Beneficial effects of the present invention: the invention uses a deep neural network model to learn the structural features of the word features of software source code text, and then uses the learned features as the input of the classifier. The advantage is as follows. Word feature vectors of source code text are very high-dimensional and very sparse, and existing shallow machine learning algorithms have limited ability to handle them. The deep learning algorithm first learns the structural features, which reduces the feature dimension, and the learned features are then used as the input of the shallow learning algorithm. This improves the performance metrics of the vulnerability prediction model, giving it high precision and relatively low false alarm and miss rates.

Brief description of the drawings

Fig. 1 is a flow chart of software vulnerability prediction using a shallow machine learning method in the prior art;

Fig. 2 is a flow chart of the software security vulnerability prediction method based on text deep learning of the present invention;

Fig. 3 is a structure diagram of the vulnerability prediction model in the present invention;

Fig. 4 shows the feature learner based on a stacked autoencoder structure in the present invention.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawings.

As shown in Fig. 2, the software security vulnerability prediction method based on text deep learning of the present invention includes the following steps:

Step 1: Extracting word features from the software source code text

For the source code of a given software project (the prediction object), each software module is taken as the processing unit. The punctuation marks and code comments in the module's source code text are removed first; each word is then extracted from the remaining text using spaces as separators; the number of times each word occurs in the entire text is counted; finally the counts are normalized, and the result is the frequency of each word. This yields the feature vector of the module's source code text, expressed in the following representation:

ComponentName: (Item_1: Number_1; Item_2: Number_2; …; Item_n: Number_n)

Step 2: Establishing the historical software vulnerability database

All disclosed software vulnerabilities for the software project (the prediction object) are collected from public software vulnerability databases. These disclosed vulnerability data also identify the software module in which each vulnerability is located, so the number of historical vulnerabilities in each software module can be determined; this number is called the vulnerability label of the module. If no disclosed vulnerability is found for a module, its vulnerability count is taken to be zero. The vulnerability label is then processed further: a label not equal to 0 is set to 1, and a label equal to 0 remains 0. The vulnerability label is thus either 1 or 0 and only indicates whether vulnerabilities are present.
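The labeling rule of this step can be sketched as follows. The function name `build_labels` and the record layout (pairs of vulnerability identifier and module name) are assumptions made for illustration; the text only requires that each disclosed vulnerability can be attributed to a module.

```python
from collections import Counter


def build_labels(module_names, vulnerability_records):
    """Map every module to 1 (known vulnerability) or 0 (none found).

    vulnerability_records is assumed to be an iterable of
    (vulnerability_id, module_name) pairs taken from public disclosures.
    """
    # Historical vulnerability count per module; modules that never
    # appear in the records get a count of zero.
    counts = Counter(module for _vuln_id, module in vulnerability_records)
    # Any non-zero count collapses to 1, so the label only says
    # whether vulnerabilities are present.
    return {name: 1 if counts[name] > 0 else 0 for name in module_names}


# Hypothetical identifiers and file names, for illustration only.
records = [("V-1", "a.c"), ("V-2", "a.c"), ("V-3", "b.c")]
print(build_labels(["a.c", "b.c", "c.c"], records))
```

Here `a.c` has two recorded vulnerabilities and `b.c` has one, so both receive label 1, while `c.c` has none and receives label 0.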

Step 3: Establishing the feature learner based on a deep neural network structure

On the basis of Step 1, a deep neural network model suited to learning text word features is selected. An important property of the words in software source code text is that there is no semantic relationship between words and the correlation among these features is weak, so many deep neural network models are suitable. Take one of them, a deep neural network with a stacked autoencoder structure, as an example; its structure is shown in Fig. 4. The input layer has n input features x₁, x₂, …, x_n, which correspond to the word features Item_1, Item_2, …, Item_n extracted in Step 1. Once the number n of word features is determined, the matching number of network inputs is chosen; for example, if there are 5 word features, 5 inputs are used. There are h hidden layers with k₁, k₂, …, k_h neurons respectively, and the number of hidden neurons decreases layer by layer, i.e. k₁ > k₂ > … > k_h. The output layer has m neurons. Adjacent layers are fully connected, and the input-layer neurons perform no computation. The neurons use the Sigmoid activation function. The number of hidden layers and the number of neurons in each layer of the feature learner are determined according to the actual situation.
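The layer structure described in this step can be sketched as a plain forward pass. This is an illustrative sketch only: the names `init_layers` and `encode`, the uniform weight initialization, and the fixed random seed are assumptions, and the weights here are untrained.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layers(n_inputs, hidden_sizes, n_outputs, seed=0):
    """Create (weights, biases) for each hidden layer and the output layer."""
    # The hidden widths must shrink layer by layer so that the network
    # compresses the high-dimensional word features.
    if any(a <= b for a, b in zip(hidden_sizes, hidden_sizes[1:])):
        raise ValueError("hidden layer sizes must strictly decrease")
    rng = random.Random(seed)
    layers, fan_in = [], n_inputs
    for size in list(hidden_sizes) + [n_outputs]:
        # Full connection: every neuron has one weight per incoming value.
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(fan_in)]
                   for _ in range(size)]
        layers.append((weights, [0.0] * size))
        fan_in = size
    return layers


def encode(x, layers):
    """Forward pass; the input layer only forwards x without computing."""
    activation = list(x)
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


# n = 5 word features, h = 2 hidden layers (4 then 3 neurons), m = 2 outputs.
layers = init_layers(5, [4, 3], 2)
print(encode([0.2, 0.1, 0.0, 0.4, 0.3], layers))
```

Because every neuron uses the Sigmoid function, each learned feature lies strictly between 0 and 1, and the output has m values regardless of how large n is.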

Step 4: Establishing the classifier

On the basis of Step 3, the classifier of the vulnerability prediction model is established. The input of the classifier is the output of the feature learner established in Step 3, i.e. the features learned in Step 3. The classifier uses a shallow machine learning algorithm, and many such algorithms can be chosen, for example logistic regression, SVM, NN, or Bayesian networks. The output of the classifier is a binary classification, i.e. 0 or 1, indicating whether the module of the prediction object contains vulnerabilities. When the output is not 0 or 1, the parameters of the neural network model in the feature learner of Step 3 or the model parameters of the classifier are adjusted until the output is 0 or 1. Experiments show that adjusting the model parameters of the classifier is the easier way to reach an output of 0 or 1.

Step 5: Training and testing the prediction model

On the basis of Steps 1, 2, 3, and 4, the prediction model is trained and tested as follows. The text feature vector of each software module is extracted with the method of Step 1, and the historical vulnerability database of the prediction object is established with the method of Step 2 to obtain the vulnerability label of each module of the prediction object. All feature vectors and their corresponding labels make up the sample database of the software project. The sample database is divided into two parts: the training samples, used for training, and the test samples, used for testing. The text feature vectors of the training samples are used as the input of Step 3, and each layer of the feature learner established in Step 3 is trained individually. The training method is the training algorithm that corresponds to the deep neural network structure in use. This process is called pre-training; it is the process of reducing the dimensionality of the original features and learning their structural features. Taking a deep neural network with a stacked autoencoder structure as an example, the layers are trained one at a time with a greedy method based on the BP algorithm, and the target output value used when computing the back-propagated error during training equals the input value.
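Greedy layer-wise pre-training of a stacked autoencoder can be sketched as below. This is a simplified illustration, not the patented procedure: the function names, the squared-error loss, the per-sample updates, the learning rate, and the epoch count are all assumptions. What it does take from the text is that each layer is trained on its own with backpropagation and that the training target equals the input.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode_layer(x, W, b):
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]


def train_autoencoder(data, hidden_size, epochs=200, lr=0.5, seed=0):
    """Train one sigmoid autoencoder; return its encoder weights and biases."""
    rng = random.Random(seed)
    n = len(data[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(hidden_size)]
    b1 = [0.0] * hidden_size
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(n)]
    b2 = [0.0] * n
    for _ in range(epochs):
        for x in data:
            h = encode_layer(x, W1, b1)   # code
            y = encode_layer(h, W2, b2)   # reconstruction
            # Backpropagated error: the target output is the input x itself.
            d_out = [(yi - xi) * yi * (1 - yi) for yi, xi in zip(y, x)]
            d_hid = [hj * (1 - hj) * sum(d_out[i] * W2[i][j] for i in range(n))
                     for j, hj in enumerate(h)]
            for i in range(n):
                for j in range(hidden_size):
                    W2[i][j] -= lr * d_out[i] * h[j]
                b2[i] -= lr * d_out[i]
            for j in range(hidden_size):
                for i in range(n):
                    W1[j][i] -= lr * d_hid[j] * x[i]
                b1[j] -= lr * d_hid[j]
    return W1, b1  # the decoder is discarded after pre-training


def pretrain_stack(data, hidden_sizes):
    """Train layers one at a time, each on the codes of the previous layer."""
    layers, current = [], data
    for k, size in enumerate(hidden_sizes):
        W, b = train_autoencoder(current, size, seed=k)
        layers.append((W, b))
        current = [encode_layer(x, W, b) for x in current]
    return layers, current


# Made-up word-frequency vectors for four modules, for illustration only.
samples = [[0.5, 0.5, 0.0, 0.0], [0.4, 0.6, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.5], [0.0, 0.1, 0.4, 0.5]]
layers, codes = pretrain_stack(samples, [3, 2])
print(codes)
```

The returned `codes` are the reduced-dimension features of the training samples, which is what the classifier of Step 4 receives as input.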

After pre-training, the output of the feature learner is used as the input of the classifier from Step 4, and the parameters of the classifier are trained and tuned. The optimal classifier parameter values are obtained according to the criterion of the specific learning algorithm. Taking a logistic regression classifier as an example, the output of the feature learner is the input of the logistic regression classifier, and the classifier parameters are adjusted with the gradient descent algorithm. When the cost function reaches its minimum, the optimal parameters of the logistic regression classifier have been obtained, and the training of the classifier ends.
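The classifier training in this example can be sketched as batch gradient descent on the cross-entropy cost. The function names, learning rate, and fixed epoch count are assumptions; a fixed number of epochs stands in for "until the cost function reaches its minimum", and the 0.5 decision threshold is likewise an assumption.

```python
import math


def probability(x, weights, bias):
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean cross-entropy cost."""
    n_features, m = len(features[0]), len(features)
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = probability(x, weights, bias) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        # Step against the averaged gradient.
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def predict(x, weights, bias, threshold=0.5):
    """Binary output: 1 = vulnerable, 0 = not vulnerable."""
    return 1 if probability(x, weights, bias) >= threshold else 0


# Made-up learned features (feature learner output) and labels.
learned = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
labels = [0, 0, 1, 1]
w, b = train_logistic_regression(learned, labels)
print([predict(x, w, b) for x in learned])
```

Thresholding the probability is what turns the continuous classifier output into the 0 or 1 label required in Step 4.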

Once pre-training and the tuning of the classifier parameters are finished, the training of the whole vulnerability prediction model is complete. The test samples are then used to test the performance of the model and obtain the values of its performance metrics, which include precision, accuracy, false alarm rate, and miss rate.
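The performance metrics named above can be computed from the confusion matrix of the test samples. The text does not define the metrics, so the definitions below are the conventional ones and should be read as assumptions: precision = TP/(TP+FP), accuracy = (TP+TN)/total, false alarm rate = FP/(FP+TN), miss rate = FN/(FN+TP). The function name `evaluate` is hypothetical.

```python
def evaluate(true_labels, predicted_labels):
    """Return precision, accuracy, false alarm rate and miss rate."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 0 and predicted == 1:
            fp += 1
        elif truth == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Avoid division by zero when a class is absent from the test set.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "false_alarm_rate": ratio(fp, fp + tn),
        "miss_rate": ratio(fn, fn + tp),
    }


# Made-up test labels and predictions, for illustration only.
print(evaluate([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```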

Step 6: Determining the optimal prediction model according to the performance metrics

For the prediction object, multiple vulnerability prediction models are established using different deep neural network structures and different classifier algorithms, and the performance metric values of each prediction model are obtained as in Step 5. The metrics are compared, and the optimal vulnerability prediction model for the software project is determined according to actual needs. For example, when precision matters most, the vulnerability prediction model with the highest precision is selected; when accuracy matters most, the model with the highest accuracy should be selected.
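The selection rule of this step amounts to picking the best candidate under whichever metric the project prioritizes. A minimal sketch follows; the function name `select_best_model`, the model names, and every metric value are made up for illustration and are not results from the patent.

```python
def select_best_model(results, metric, higher_is_better=True):
    """results maps a model name to its {metric name: value} dictionary."""
    chooser = max if higher_is_better else min
    return chooser(results, key=lambda name: results[name][metric])


# Hypothetical candidates with invented metric values.
candidates = {
    "stacked_autoencoder+logistic_regression": {
        "precision": 0.80, "accuracy": 0.70, "false_alarm_rate": 0.20},
    "stacked_autoencoder+svm": {
        "precision": 0.75, "accuracy": 0.78, "false_alarm_rate": 0.10},
}
print(select_best_model(candidates, "precision"))
print(select_best_model(candidates, "accuracy"))
print(select_best_model(candidates, "false_alarm_rate", higher_is_better=False))
```

Rates such as the false alarm rate are better when lower, which is why the direction of comparison is a parameter.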

Step 7: Applying the prediction model to new software modules

The best-performing prediction model obtained in Step 6 is applied to vulnerability prediction for new software modules of the software project. The feature vector values of the new software module are computed first and used as the input of the prediction model. The output of the model is a label indicating whether the new software module contains vulnerabilities.
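A practical detail when applying the model is that the feature vector of a new module must have the same n inputs, in the same order, as the vectors used for training. The sketch below shows one way to do this; the function names are hypothetical, and the handling of words that never occurred in training (they are dropped) and of training words absent from the new module (frequency 0) is an assumption, since the text does not address it.

```python
def build_vocabulary(training_frequencies):
    """Fix a word order from the per-module {word: frequency} dictionaries."""
    return sorted({word for freqs in training_frequencies for word in freqs})


def to_input_vector(word_frequencies, vocabulary):
    """Arrange one module's frequencies in the fixed vocabulary order.

    Assumption: words outside the vocabulary are ignored, and vocabulary
    words missing from the module get frequency 0.0.
    """
    return [word_frequencies.get(word, 0.0) for word in vocabulary]


# Made-up frequencies for two training modules and one new module.
training = [{"int": 0.5, "a": 0.5}, {"a": 0.25, "b": 0.75}]
vocabulary = build_vocabulary(training)
print(vocabulary)
print(to_input_vector({"a": 0.5, "unseen": 0.5}, vocabulary))
```

The resulting list can be passed to the trained feature learner and classifier in place of a training vector.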

Claims

1. A software security vulnerability prediction method based on text deep learning, characterized in that, for the source code of a software project, with each software module as the processing unit, the frequency with which each word occurs in the source code text is counted; the words and their frequencies form the feature vector of the source code text; the feature vector is the input of a feature learner based on a deep neural network structure, which learns the structural features of the feature vector; the structural features are the input of a classifier, whose parameters are trained and tuned to obtain an optimal vulnerability prediction model, which is used to predict vulnerabilities in new software modules of the software.

2. The software security vulnerability prediction method based on text deep learning according to claim 1, characterized in that the feature vector is extracted as follows: with each software module of the prediction object as the processing unit, the punctuation marks and code comments in the source code text are removed first; each word is then extracted from the remaining text using spaces as separators; the number of times each word occurs in the entire text is counted; finally the counts are normalized, and the result is the frequency of each word, which yields the feature vector of the source code text.

3. The software security vulnerability prediction method based on text deep learning according to claim 1 or 2, characterized in that the feature vector of the source code text uses the following representation:

ComponentName: (Item_1: Number_1; Item_2: Number_2; …; Item_n: Number_n)

where ComponentName is the name of the module, Item_1 is the 1st word and Number_1 is the frequency with which the 1st word occurs, Item_2 is the 2nd word and Number_2 is the frequency with which the 2nd word occurs, and so on; Item_n is the n-th word and Number_n is the frequency with which the n-th word occurs.

4. The software security vulnerability prediction method based on text deep learning according to claim 1 or 2, characterized in that the classifier uses a shallow machine learning algorithm.

5. The software security vulnerability prediction method based on text deep learning according to claim 4, characterized in that the output of the classifier is a binary classification, i.e. whether or not the module of the prediction object contains vulnerabilities.

6. The software security vulnerability prediction method based on text deep learning according to claim 1 or 2, characterized in that the method further comprises establishing multiple vulnerability prediction models for the prediction object using different deep neural network models and different classifier algorithms, and comparing their performance metric values to determine the optimal vulnerability prediction model.