
CN112001484A - Safety defect report prediction method based on multitask deep learning

Info

Publication number: CN112001484A
Application number: CN202010853000.1A
Authority: CN
Inventors: 苏小红; 蒋远; 牟辰光; 王甜甜
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2020-08-22
Filing date: 2020-08-22
Publication date: 2020-11-27

Abstract

The invention discloses a security defect report prediction method based on multitask deep learning. The text content of the defect reports in a data set is preprocessed to generate a domain-specific corpus, which is used to train a word2vec model. A multitask deep learning model is then established: a deep neural network at the bottom layer of the model extracts the shared semantic features of a defect report, the sub-networks at the top layer learn discriminative features for the different tasks, and the feature vectors output by the top-layer networks serve as the input of each subtask prediction network to complete the security defect report identification and severity level prediction tasks. The method applies multitask learning to security defect report prediction for the first time and uses information from an auxiliary task related to the target task to guide the model toward features that generalize better, which improves the generalization ability of the model and reduces the influence of noisy data.

Description

Safety defect report prediction method based on multitask deep learning

Technical Field

The invention relates to a safety defect report prediction method, in particular to a safety defect report prediction method based on multitask deep learning.

Background

As software grows in scale and complexity, software defects inevitably occur. Security-related defects, once exploited by an attacker, can cause significant damage and loss to a software system. To facilitate the collection and management of software defects, more and more software organizations, such as Google and Mozilla, have built their own defect report tracking systems, to which users can submit the defects they discover so that developers can be assigned to repair them promptly. Because they lack security-related domain knowledge, defect report submitters often find it difficult to judge accurately whether a defect is security-related. If a security-related defect is marked as non-security-related when the report is submitted, its repair is delayed, which poses a serious security threat to the system. Identifying security-related defect reports (hereinafter "security defect reports") manually is time-consuming and impractical. Automatically identifying security defect reports is therefore of great significance.

Defect reports vary widely in their textual descriptions. Positive samples, namely security bug reports (SBRs), make up only a small proportion of a data set, which leads to class imbalance and a scarcity of security-related features, so such features are hard to extract. In addition, because developers or testers may lack security knowledge, a small proportion of SBRs are not marked as security-related and remain in the data set as non-security bug reports (NSBRs), which introduces noise into the data set. These problems make the automatic identification of security bug reports difficult and challenging.

The approach in common use today combines text mining with machine learning. FARSEC and LTRWES are typical representatives of this class of methods. FARSEC, proposed by Peters et al. (Peters F, Tun T, Yu Y, et al. Text Filtering and Ranking for Security Bug Report Prediction [J]. IEEE Transactions on Software Engineering, 2017: 1-1), extracts the 100 words with the highest tf-idf values from security defect reports as security-related keywords, filters the non-security defect reports using these 100 keywords, and represents each historical defect report as a 100-dimensional feature vector over the keywords for training an automatic SBR identification model. The main problem with this method is that words with high tf-idf values are not necessarily security-related, which weakens the filtering of noisy data. Moreover, because only a few security-related keywords may appear in a given defect report, the feature vector contains a large number of zero elements, and this sparsity prevents the semantic information of the defect report from being expressed accurately. To address these problems, Jiang et al. (Jiang Y, Lu P, Su X, Wang T. LTRWES: A new framework for security bug report detection [J]. Information and Software Technology, 2020: 106314) proposed using the ranking model BM25F_ext to calculate the content relevance between each NSBR and all SBRs, filtering out the NSBRs whose content is highly relevant to that of the SBRs, and representing each defect report as a low-dimensional continuous real-valued vector using a word2vec model trained on a large corpus of defect report text, thereby achieving a more accurate vector representation of defect reports.

Security defects in real projects have complex characteristics (there are many types of security defects, and the characteristics of different types differ greatly) and are few in number. Traditional machine learning methods therefore struggle to extract deep security-related semantic features, and the performance of the prediction model hits a bottleneck. In recent years, deep learning has shown unique advantages and potential in feature extraction and pattern recognition, and research applying deep learning to complex text processing tasks has produced results. At present, only one publication on deep-learning-based security defect report prediction has been retrieved (Zheng Wei et al. An empirical study of security defect report prediction methods based on deep learning. Journal of Software, 2020, 31(5): 1294-1313). That paper uses the deep text mining models TextCNN and TextRNN to build security defect report prediction models and obtains better prediction results than traditional machine learning classification algorithms. However, training deep learning models relies on massive labeled data sets, and in security defect report prediction, labeling massive amounts of data requires substantial manpower and material resources, so the scale of the available data generally falls short of what is required.

In recent years, multitask learning has achieved great success in many practical applications such as natural language processing, text recognition, and computer vision. Multitask learning uses information from auxiliary tasks related to the target task to guide the model toward features that generalize better, which improves the generalization ability of the model and its performance on unseen data and compensates for the shortage of training data for the target task. Multitask learning can also use the additional information provided by related tasks to help the model focus on the features that actually matter, which reduces the influence of noise in the target task data on model training.

No publication on identifying security defect reports with multitask learning has been retrieved so far. Existing machine learning and deep learning methods all perform single-task learning on the security labels of defect reports, and no related research uses multitask learning to predict security defect reports and defect severity levels simultaneously. Predicting whether a defect is security-related and predicting its severity level are both important in practice, and the two are related tasks. Learning them jointly can improve the generalization performance of the original task, namely security defect report prediction.

Disclosure of Invention

The invention provides a security defect report prediction method based on multitask deep learning. It takes into account the difficulty of collecting and labeling security defect reports, and it aims to improve the generalization performance of deep-learning-based security defect report prediction models and to avoid the risk that software is attacked because a security defect report was not marked as security-related, was not repaired in time, and left a vulnerability open to exploitation. The method introduces multitask learning into the field of security defect report prediction for the first time and jointly predicts security defect reports and the severity levels of their defects. Multitask learning can be implemented on top of a variety of deep neural networks. By simultaneously learning two related tasks, security defect report identification and defect severity level prediction, and exploiting the latent correlation between them, the method extracts features that generalize better, reduces the overfitting risk of single-task models, reduces the deep learning model's need for massive labeled data, reduces the influence of data noise on model training, and improves the generalization performance of the security defect report prediction model.

The purpose of the invention is realized by the following technical scheme:

A security defect report prediction method based on multitask deep learning comprises the following steps:

Step 1: mining a defect report repository and a related security vulnerability management website, finding the defect reports marked as security-related or non-security-related by developers or maintainers, adding a severity label to each instance according to the severity field of the defect report, and constructing a defect report data set for training and testing a multitask deep learning model;

Step 2: preprocessing the text content of the defect reports in the data set constructed in step 1, including noise removal, tokenization, stop-word removal, lowercasing, stemming, and the like, to generate a domain-specific corpus of defect reports;

Step 3: training a word2vec model on the domain-specific corpus generated in step 2 to generate a vector representation of each word, namely a word vector dictionary;

Step 4: establishing a multitask deep learning model for security defect report identification and severity level prediction of the related defects, wherein the model is divided into a feature sharing layer and a task-specific layer:

the feature sharing layer is located at the bottom of the multitask learning model, is implemented with a deep neural network, and extracts the shared semantic features of the preprocessed defect report;

the task-specific layer is located at the top of the multitask learning model, each task corresponds to a sub-network, and each sub-network is implemented with a fully connected network having several hidden layers and a softmax layer and learns discriminative features for its own task;

Step 5: training the multitask deep learning model established in step 4, and improving the generalization performance of the security defect report prediction model by exploiting the latent correlation among the tasks;

Step 6: replacing the deep neural network that implements the feature sharing layer, repeating step 5, and selecting the multitask deep learning model that performs best at security defect report identification and severity level prediction of the related defects;

Step 7: given a newly submitted defect report, using the multitask deep learning model selected in step 6 to identify whether it is a security defect report and to predict the severity level of its associated defect.

Compared with existing single-task learning models based on machine learning or deep learning, the invention has the following advantages:

1. The invention introduces multitask learning into the field of security defect report prediction for the first time and jointly predicts security defect reports and the severity levels of their related defects.

2. The invention uses information from an auxiliary task (defect severity level prediction) related to the target task (security defect report identification) to guide the model toward features that generalize better. Compared with existing single-task learning models, the multitask deep learning model can effectively improve generalization to unseen data and thereby improve the prediction ability of the target task model.

3. The invention exploits the latent correlation between the two tasks, security defect report identification and defect severity level prediction, and learns both related tasks simultaneously. This reduces the dependence of a single-task model on massive labeled data and reduces the influence of data noise in the target task on model training.

Drawings

FIG. 1 is a diagram of a multitasking security flaw report prediction framework of the present invention.

FIG. 2 is a diagram of the multi-tasking security defect report prediction model architecture based on TextCNN of the present invention.

FIG. 3 is a diagram of a DCNN + LSTM-based multi-tasking security flaw report prediction model architecture of the present invention.

FIG. 4 is a diagram of the LSTM-based multitask security flaw report prediction model architecture of the present invention.

FIG. 5 is a diagram of a GRU-based multitask security flaw report prediction model architecture according to the present invention.

FIG. 6 is a diagram of a BiLSTM-based multitask security flaw report prediction model architecture according to the present invention.

FIG. 7 is a diagram of a BiGRU-based multitask security flaw report prediction model architecture according to the present invention.

FIG. 8 shows security defect report #1436241 of Example 1.

FIG. 9 shows the ID of the defect instance of Example 1.

FIG. 10 shows non-security defect report #129763 of Example 2.

Detailed Description

The technical solution of the present invention is further described below with reference to the accompanying drawings, but is not limited thereto. Any modification or equivalent replacement of the technical solution that does not depart from its spirit and scope shall fall within the protection scope of the present invention.

The invention provides a security defect report prediction method based on multitask deep learning. First, defect report repositories and related security vulnerability management websites are mined to construct a defect report data set. Second, the text content of the defect reports in the data set is preprocessed to generate a domain-specific corpus of defect reports, and a word2vec model is trained on this corpus to generate a vector representation of each word. A multitask deep learning model is then established: a deep neural network at the bottom layer of the model (the feature sharing layer) extracts the shared semantic features of the preprocessed defect report, the sub-networks at the top layer (the task-specific layer) learn discriminative features for the different tasks (the security defect report identification task and the defect severity level prediction task), and the feature vectors output by the top-layer networks serve as the input of each subtask prediction network to complete the security defect report identification and severity level prediction tasks. The framework is shown in FIG. 1, and the specific steps are as follows:

Step 1: mining a defect report repository and a related security vulnerability management website (such as Mozilla Foundation Security Advisory, MFSA), finding the defect reports marked as security-related or non-security-related by developers or maintainers, adding a severity label to each instance according to the severity field of the defect report, and constructing a defect report data set for training and testing the multitask deep learning model. The specific steps are as follows:

Step 11: constructing a security defect report data set for the real-world project Chromium: crawling each defect report in the Chromium defect report repository in order of submission time using the open-source web crawling framework Scrapy, and using a regular expression to check whether the defect report contains the "Bug-Security" field; if so, marking it as a security defect report, and otherwise marking it as a non-security defect report;

Step 12: constructing a security defect report data set for the real-world project Mozilla: mining the security vulnerability management website (MFSA) of the Mozilla project to obtain defect report information related to security problems, and extracting the IDs of the defect reports from the relevant web pages to form a set of security defect report IDs; if the ID of a defect report crawled from Mozilla's defect report tracking system is in this set, marking it as a security defect report, and otherwise marking it as a non-security defect report;

Step 13: converting the Severity field of each defect report in the security defect report data set into a corresponding severity label. The severity of a defect report falls into six levels, namely trivial, minor, normal, major, critical, and blocker, which are replaced by the six labels 0, 1, 2, 3, 4, and 5 respectively, producing a multitask deep learning data set that can be used for both security defect report identification and severity prediction.
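
The labeling rules of steps 11 to 13 can be sketched as below. This is only an illustration under stated assumptions: the function names, the plain-text form of a crawled report, and the `mfsa_ids` set are hypothetical, and the crawling itself is not shown. The severity order follows the list in step 13, so that normal maps to label 2.

```python
import re

# Severity names in the order listed in step 13, so the list index is the label.
SEVERITY_LEVELS = ["trivial", "minor", "normal", "major", "critical", "blocker"]
SEVERITY_TO_LABEL = {name: label for label, name in enumerate(SEVERITY_LEVELS)}

# Step 11: a Chromium report counts as security-related if it carries "Bug-Security".
BUG_SECURITY_PATTERN = re.compile(r"Bug-Security")


def label_chromium_report(report_text):
    """Return 1 for a security defect report, 0 otherwise (step 11)."""
    return 1 if BUG_SECURITY_PATTERN.search(report_text) else 0


def label_mozilla_report(report_id, mfsa_ids):
    """Return 1 if the report ID is in the ID set mined from MFSA (step 12)."""
    return 1 if report_id in mfsa_ids else 0


def severity_label(severity):
    """Map a Severity field value to its integer label (step 13)."""
    return SEVERITY_TO_LABEL[severity.strip().lower()]
```

Each report then carries two labels, one per task, which is what allows a single data set to serve both security defect report identification and severity prediction.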

Step 2: preprocessing the text content of the defect reports in the data set constructed in step 1, including noise removal, tokenization, stop-word removal, lowercasing, stemming, and the like, to generate a domain-specific corpus of defect reports. The specific steps are as follows:

Step 21: parsing the defect reports in the data set and extracting the brief description (Summary) and the detailed description (Description) of each defect report to form its text content;

Step 22: filtering the text content of the defect report with regular expressions to remove noise, which mainly includes URL links, stack traces, file names, non-alphabetic characters, and other irrelevant information;

Step 23: applying tokenization, stop-word removal, stemming, and similar preprocessing to the text content after noise removal, so as to generate a domain-specific text corpus of defect reports.
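
A minimal sketch of steps 21 to 23 follows. The regular expressions, the short stop-word list, and the `toy_stem` suffix stripper are all illustrative assumptions rather than the patterns or stemmer used by the invention; a real pipeline would substitute a full stop-word list and a proper stemming algorithm.

```python
import re

# Illustrative noise patterns (assumptions): URLs, Java-style stack-trace lines, file names.
URL_RE = re.compile(r"https?://\S+")
STACK_RE = re.compile(r"^\s*at\s+\S+\(.*\)\s*$", re.MULTILINE)
FILE_RE = re.compile(r"\b[\w/\\.-]+\.(?:cpp|h|js|py|java|html|txt)\b")
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "it", "when"}


def toy_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(summary, description):
    """Turn one defect report into a list of cleaned tokens."""
    text = summary + "\n" + description          # step 21: Summary + Description
    for pattern in (URL_RE, STACK_RE, FILE_RE):  # step 22: remove noise
        text = pattern.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text).lower()  # drop non-letters, lowercase
    tokens = text.split()                        # step 23: tokenize
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]
```

Applying `preprocess` to every report in the data set yields the token lists that make up the corpus used to train word2vec in step 3.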

Step 3: training a word2vec model on the domain-specific corpus of defect reports generated in step 2 to generate a vector representation of each word, namely a word vector dictionary. The specific steps are as follows:

Step 31: converting the words in the corpus into one-hot vector representations;

Step 32: constructing a CBOW word2vec model trained with negative sampling, taking the one-hot vectors of the words as the input of the model, and outputting low-dimensional continuous real-valued vector representations of the words;

Step 33: establishing the word-index and index-vector mapping tables, wherein padding is represented by the special symbol "pad" and every word that does not appear in the vocabulary is represented by the special symbol "unk".
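
The word-index table of step 33 can be sketched as follows. Placing "pad" at index 0 and "unk" at index 1, and padding or truncating to a fixed `max_len`, are assumptions made for illustration. The index-vector table would hold the trained word2vec vectors in the same index order and is not shown.

```python
PAD, UNK = "pad", "unk"


def build_vocab(corpus):
    """Build the word -> index table; index 0 is padding, index 1 is unknown (assumed)."""
    word_to_index = {PAD: 0, UNK: 1}
    for tokens in corpus:
        for token in tokens:
            # Assign the next free index the first time a word is seen.
            word_to_index.setdefault(token, len(word_to_index))
    return word_to_index


def encode(tokens, word_to_index, max_len):
    """Map tokens to indices, truncating or padding to max_len."""
    ids = [word_to_index.get(t, word_to_index[UNK]) for t in tokens[:max_len]]
    return ids + [word_to_index[PAD]] * (max_len - len(ids))
```

Index sequences of this kind are what the embedding layer of the model consumes; the index lists shown in Examples 1 and 2 have this form.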

Step 4: establishing a multitask deep learning model for security defect report identification and severity level prediction of the related defects. The specific steps are as follows:

Step 41: designing the multitask deep learning model, which is divided into two parts, namely the feature sharing layer (shared layers) and the task-specific layer (task-specific layers), wherein:

the feature sharing layer is located at the bottom of the multitask learning model, can be implemented with various deep neural networks, such as TextCNN (FIG. 2), DCNN + LSTM (FIG. 3), LSTM (FIG. 4), GRU (FIG. 5), BiLSTM (FIG. 6), and BiGRU (FIG. 7), and is mainly used to extract the shared semantic features of the preprocessed defect report;

the task-specific layer is located at the top of the multitask learning model, and each task corresponds to a sub-network (the invention has two tasks, security defect report identification and defect severity level prediction, and therefore two sub-networks); each sub-network can use a fully connected network with several hidden layers and a softmax layer to extract discriminative task-specific features and perform its classification task;

Step 42: setting the parameters of the deep neural network for the feature sharing layer and initializing the parameters of the embedding layer with the word2vec model parameters trained in step 3; in addition, setting network-specific parameter ranges according to the type of network used for the sharing layer;

Step 43: setting the numbers of input and output neurons of the fully connected network for the task-specific layer, wherein the number of input neurons is determined by the dimension of the semantic feature vector output by the sharing layer and the number of output neurons is determined by the number of categories of each subtask. For example, security defect reports fall into two categories, security-related and non-security-related, so the number of output neurons of the sub-network used for security defect report prediction (the security task layer) is set to 2. Likewise, the severity of a defect report falls into six levels, namely trivial, minor, normal, major, critical, and blocker, so the number of output neurons of the sub-network used for severity level prediction (the severity task layer) is set to 6.
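
The layout described in steps 41 to 43, one shared feature extractor feeding two softmax heads with 2 and 6 outputs, can be sketched in plain Python as below. This is a structural illustration only: the shared layer here is a mean of word embeddings followed by a tanh layer, a deliberately simple stand-in for the TextCNN, LSTM, or GRU variants named above; the weights are random and untrained; and the class name `MultiTaskSketch` and all dimensions are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def random_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiTaskSketch:
    def __init__(self, vocab_size, embed_dim=8, shared_dim=16, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.embedding = [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
                          for _ in range(vocab_size)]
        self.shared = random_layer(rng, embed_dim, shared_dim)
        # One sub-network per task: a hidden layer, then an output layer whose
        # size equals the number of classes (2 for security, 6 for severity).
        self.security_head = [random_layer(rng, shared_dim, hidden_dim),
                              random_layer(rng, hidden_dim, 2)]
        self.severity_head = [random_layer(rng, shared_dim, hidden_dim),
                              random_layer(rng, hidden_dim, 6)]

    def shared_features(self, token_ids):
        """Stand-in shared layer: average the embeddings, then one tanh layer."""
        vectors = [self.embedding[i] for i in token_ids]
        mean = [sum(col) / len(vectors) for col in zip(*vectors)]
        return [math.tanh(v) for v in dense(mean, *self.shared)]

    @staticmethod
    def head(features, layers):
        (w1, b1), (w2, b2) = layers
        hidden = [max(0.0, v) for v in dense(features, w1, b1)]  # ReLU hidden layer
        return softmax(dense(hidden, w2, b2))

    def forward(self, token_ids):
        features = self.shared_features(token_ids)
        return self.head(features, self.security_head), self.head(features, self.severity_head)
```

Both heads read the same `features` vector, so the input size of each head equals the output dimension of the shared layer, as step 43 requires.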

Step 5: training the multitask deep learning model established in step 4, and improving the generalization performance of the security defect report prediction model by exploiting the latent correlation among the tasks. The specific steps are as follows:

Step 51: setting the hyperparameter ranges for training the multitask deep learning model;

Step 52: selecting, by grid search, a set of hyperparameters for training the multitask deep learning model;

Step 53: converting each word of each batch of training examples into a word vector in the embedding layer of the multitask deep learning model, feeding the word vectors into the feature sharing layer and the task-specific layer for forward computation, and outputting, for each example, the predicted probability that it is a security defect report and the predicted probability of each severity level;

Step 54: taking the predicted probabilities and the true labels as the input of the cross-entropy loss function (namely $H(p, q) = -\sum_{x} p(x) \log q(x)$), calculating the loss of each task separately, and taking the weighted sum of the loss values as the total loss;

Step 55: performing back propagation and updating the parameters of the multitask deep learning model using a gradient-based optimization algorithm that adapts the learning rate (such as the Adam algorithm);

Step 56: repeating steps 52 to 55 and selecting the set of hyperparameters with which the multitask deep learning model performs best on the validation set.
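
The total loss of step 54 can be sketched as below for a single example. With a one-hot true label, the cross-entropy reduces to the negative log of the probability assigned to the true class. The equal task weights of 0.5 are an assumption chosen for illustration; the source does not state the weights used.

```python
import math


def cross_entropy(predicted, true_label):
    """H(p, q) = -sum_x p(x) log q(x); with one-hot p this is -log q(true_label)."""
    return -math.log(max(predicted[true_label], 1e-12))  # clamp to avoid log(0)


def total_loss(security_probs, security_label, severity_probs, severity_label,
               security_weight=0.5, severity_weight=0.5):
    """Weighted sum of the two task losses (the weights are illustrative)."""
    security_loss = cross_entropy(security_probs, security_label)
    severity_loss = cross_entropy(severity_probs, severity_label)
    return security_weight * security_loss + severity_weight * severity_loss
```

Because both terms are back-propagated through the same shared layer, errors on the severity task also shape the features used for security defect report identification.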

Step 6: replacing the deep neural network that implements the feature sharing layer, repeating step 5, and selecting the model that performs best at security defect report identification and severity level prediction of the related defects.
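
The selection loop formed by step 56 and step 6 can be sketched as follows. The `evaluate` callable is hypothetical: it stands for training one model with the given shared-layer type and hyperparameters and returning its validation score. The hyperparameter names in any grid passed in are likewise assumptions.

```python
from itertools import product

# Shared-layer candidates named in step 41.
SHARED_LAYER_TYPES = ["TextCNN", "DCNN+LSTM", "LSTM", "GRU", "BiLSTM", "BiGRU"]


def select_best_model(evaluate, hyperparameter_grid, shared_layer_types=SHARED_LAYER_TYPES):
    """Try every shared-layer type with every grid point; keep the best validation score."""
    names = sorted(hyperparameter_grid)
    best = None
    for layer_type in shared_layer_types:
        for values in product(*(hyperparameter_grid[n] for n in names)):
            params = dict(zip(names, values))
            score = evaluate(layer_type, params)  # hypothetical train-and-validate call
            if best is None or score > best[0]:
                best = (score, layer_type, params)
    return best
```

The returned tuple holds the best score, the shared-layer type that achieved it, and the corresponding hyperparameters.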

Example 1:

The prediction process is illustrated with security defect report #1436241 from the Mozilla project data set (as shown in FIG. 8).

First, the defect report is labeled according to the procedure of step 1 for later comparison with the prediction of the model. Since the ID of the defect instance can be found on MFSA (as shown in FIG. 9), the security label of the defect report is determined to be 1 (indicating a security-related defect). In addition, the reporter marked the Severity of the defect as normal when submitting the report, so the severity label of the defect report is determined to be 2 according to the correspondence in step 13.

Assume that the defect report is newly submitted and that neither its security label nor its severity label is known. The prediction process and prediction result of the method are described below using this defect report as an example.

First, text preprocessing is performed on the summary and description of the defect report. The summary and description before preprocessing are shown in Table 1:

TABLE 1

The text information after text preprocessing is shown in table 2:

TABLE 2

The preprocessed words of the defect report are mapped to index values using the word-index mapping table obtained from word2vec training, as follows:

[80 8 462 146 489 366 84 164 65 7 700 397 134 1467 1468 113 1324 1 191 177 230 3 396 350 2 462 382 519 52 290 694 421 819 164 65 274 69 52 1091 80 73 69 52 689 1091 164 65 8 421 819 164 65 284 2509 6217 52 67]

In the Embedding layer of the model, the word indexes are converted into vector representations using the index-vector mapping table obtained from word2vec training. The shared layers of the model then output the shared semantic features of the defect report, and these features are passed to the security task layer and the severity task layer for classification. The security task layer outputs [0.0037, 0.9963]; the category with the maximum value is 1, so the report is predicted to be a security defect report. The severity task layer outputs [0.0037, 0.0089, 0.8152, 0.0552, 0.1104, 0.0066]; the category with the maximum value is 2, namely normal, so the predicted severity level is normal. Comparison with the true labels of the defect report shows that the model correctly predicts both its security label and its severity.
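
The final decoding step of this example, taking the category with the maximum probability from each task layer, can be reproduced with a few lines. The function and constant names are illustrative; the probability vectors are the outputs reported above.

```python
SECURITY_CLASSES = ["non-security defect report", "security defect report"]
SEVERITY_CLASSES = ["trivial", "minor", "normal", "major", "critical", "blocker"]


def decode(probabilities, class_names):
    """Return (label index, class name) for the highest-probability class."""
    label = max(range(len(probabilities)), key=probabilities.__getitem__)
    return label, class_names[label]


security_output = [0.0037, 0.9963]
severity_output = [0.0037, 0.0089, 0.8152, 0.0552, 0.1104, 0.0066]

print(decode(security_output, SECURITY_CLASSES))  # (1, 'security defect report')
print(decode(severity_output, SEVERITY_CLASSES))  # (2, 'normal')
```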

Example 2:

The prediction process is illustrated with non-security defect report #129763 from the Mozilla project data set (as shown in FIG. 10).

First, the defect report is labeled according to the procedure of step 1 for later comparison with the prediction of the model. Since the ID of the defect instance is not found on MFSA, the security label of the defect report is determined to be 0 (indicating a non-security-related defect). In addition, the reporter marked the Severity of the defect as normal when submitting the report, so the severity label of the defect report is determined to be 2 according to the correspondence in step 13.

Next, text preprocessing is performed on the summary and description of the defect report. The summary and description before preprocessing are shown in Table 3:

TABLE 3

The text information after text preprocessing is shown in table 4:

TABLE 4

The preprocessed words of the defect report are mapped to index values using the word-index mapping table obtained from word2vec training, with the following result:

[292 15 263 291 578 82 6966 3707 3 34 539 6158 1619 15 10 292 15 491 263 3707 291 669 564 458 3707 301 5966 7609 291 263 570 409 20 759 3707 75 366 1771]

In the Embedding layer of the model, the word indexes are converted into vector representations using the index-vector mapping table obtained from word2vec training. The shared layers of the model then output the shared semantic features of the defect report, and these features are passed to the security task layer and the severity task layer for classification. The security task layer outputs [0.9403, 0.0597]; the category with the maximum value is 0, so the report is predicted to be a non-security defect report. The severity task layer outputs [0.0062, 0.0149, 0.8943, 0.0513, 0.0121, 0.0212]; the category with the maximum value is 2, namely normal, so the predicted severity level is normal. Comparison with the true labels of the defect report shows that the model correctly predicts both its security label and its severity.

Claims

1. A safety defect report prediction method based on multitask deep learning is characterized by comprising the following steps:

Step 2: preprocessing the text content of the defect reports in the defect report data set constructed in step 1 to generate a domain-specific corpus of defect reports;

Step 3: training a word2vec model on the domain-specific corpus of defect reports generated in step 2 to generate a word vector dictionary;

Step 4: establishing a multitask learning model for security defect report identification and severity level prediction of the related defects, wherein the multitask learning model is divided into a feature sharing layer and a task-specific layer, wherein:

the feature sharing layer is located at the bottom of the multitask learning model and is used to extract the shared semantic features of the preprocessed defect report;

the task-specific layer is located at the top of the multitask learning model, each task corresponds to a sub-network, and each sub-network uses a fully connected network with several hidden layers and a softmax layer to extract discriminative task-specific features and perform its classification task;

Step 5: training the multitask learning model established in step 4, and improving the generalization performance of the security defect report prediction model by exploiting the latent correlation among the tasks;

2. The method for predicting the safety defect report based on the multitask deep learning as claimed in claim 1, wherein the specific steps of the step 1 are as follows:

step 12: constructing a security flaw report data set aiming at an actual project Mozilla: mining a security vulnerability management website of a Mozilla project, acquiring defect report information related to security problems, extracting an ID of a defect report from a related webpage to form a security defect report ID set, if the ID of the defect report crawled from a defect report tracking system of the Mozilla is in the ID set, marking the defect report as a security defect report, and if not, marking the defect report as a non-security defect report;

step 13: the severity field of each defect report in the security defect report dataset is translated into a corresponding severity label.

3. The method for predicting the safety defect report based on the multitask deep learning as claimed in claim 1, wherein the specific steps of the step 4 are as follows:

step 41: designing a multi-task deep learning model;

step 43: the number of input and output neurons of the fully connected network for a particular task layer is set.

4. The security flaw report prediction method based on multitask deep learning as claimed in claim 3, characterized in that the feature sharing layer can be implemented with various deep neural networks, and the deep neural network implementing the feature sharing layer is one of LSTM, BiLSTM, DCNN + LSTM, TextCNN, GRU, and BiGRU.

5. The method of claim 3, wherein the number of input neurons of the task-specific layer is determined by the dimension of the semantic feature vector output by the shared layer, and the number of output neurons is determined by the category of each subtask.

6. The method for predicting the safety defect report based on the multitask deep learning as claimed in claim 1, wherein the specific steps of the step 5 are as follows:

step 54: taking the predicted probability and the real label as the input of a cross entropy loss function, respectively calculating the loss of a plurality of tasks, and performing weighted summation on a plurality of loss values to serve as the total loss;

step 55: adjusting the learning rate by adopting a gradient optimization algorithm, performing back propagation, and updating parameters of the multi-task deep learning model;