CN112001484A - Safety defect report prediction method based on multitask deep learning - Google Patents

Safety defect report prediction method based on multitask deep learning

Info

Publication number
CN112001484A
Authority
CN
China
Prior art keywords
defect report
task
report
deep learning
defect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010853000.1A
Other languages
Chinese (zh)
Inventor
苏小红
蒋远
牟辰光
王甜甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202010853000.1A
Publication of CN112001484A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a security defect report prediction method based on multi-task deep learning. The text content of the defect reports in a data set is preprocessed to generate a domain-specific corpus, and a word2vec model is trained on this corpus. A multi-task deep learning model is established in which a deep neural network at the bottom extracts the shared semantic features of a defect report, each sub-network at the top learns discriminative features for its own task, and the feature vectors output by the top-level networks serve as the input of each sub-task prediction network to complete security defect report identification and severity level prediction. The method applies multi-task learning to security defect report prediction for the first time and uses information from the auxiliary task related to the target task to guide the model toward features that generalize better, which improves the generalization ability of the model and reduces the influence of noisy data.

Description

Safety defect report prediction method based on multitask deep learning
Technical Field
The invention relates to a safety defect report prediction method, in particular to a safety defect report prediction method based on multitask deep learning.
Background
As software grows in scale and complexity, software defects inevitably occur. Security-related defects, once exploited by an attacker, can cause significant damage and loss to a software system. To facilitate the collection and management of software defects, more and more software organizations, such as Google and Mozilla, have built their own defect report tracking systems, to which users can submit the defects they discover so that maintainers can be assigned to repair them in time. Lacking security domain knowledge, a defect report submitter often finds it difficult to judge accurately whether a defect is security-related. If a security-related defect is marked as non-security-related when the report is submitted, its repair is delayed, which poses a serious security threat to the system. Identifying security-related defect reports (hereinafter "security defect reports") manually is time-consuming and impractical, so automatically identifying security defect reports is of great significance.
Defect reports vary widely in their textual descriptions. Positive samples, namely security bug reports (SBR), make up only a small proportion of a data set, which leads to class imbalance and a scarcity of security features that makes such features hard to extract. In addition, because developers or testers lack security knowledge, a small proportion of SBRs are not marked as security-related and exist in the data set as non-security bug reports (NSBR), which introduces noise into the data set. These problems make the automatic identification of security bug reports difficult and challenging.
The most common approach at present combines text mining with machine learning, and FARSEC and LTRWES are typical representatives. In the FARSEC method, Peters et al. (Peters F, Tun T, Yu Y, et al. Text Filtering and Ranking for Security Bug Report Prediction [J]. IEEE Transactions on Software Engineering, 2017: 1-1) extract the 100 words with the highest tf-idf values from the defect reports as security-related keywords, use these 100 keywords to filter non-security defect reports, and represent each historical defect report as a 100-dimensional feature vector over the keywords for training an automatic SBR identification model. The main problems of this method are that words with high tf-idf values are not necessarily security-related, which weakens the filtering of noisy data, and that, since only a few security keywords may appear in any one defect report, the feature vector contains many zero elements. This sparsity prevents the vector from accurately expressing the semantic information of the defect report. To address these problems, Jiang et al. (Y Jiang, P Lu, X Su, T Wang. LTRWES: A new framework for security bug report detection [J]. Information and Software Technology, 2020: 106314) propose using the ranking model BM25F_ext to compute the content relevance between each NSBR and all SBRs, filtering out the NSBRs whose content is highly relevant to the SBRs, and representing each defect report as a low-dimensional continuous real-valued vector using a word2vec model trained on a large corpus of defect report text, thereby achieving a more accurate vector representation of defect reports.
Security defects are complex (their types are diverse and the characteristics of different types differ greatly) and few in number in real projects, so traditional machine learning methods struggle to extract deep security-related semantic features, and improving the performance of the prediction model has hit a bottleneck. In recent years, deep learning has shown unique advantages and potential in feature extraction and pattern recognition, and research has applied it to complex text processing tasks. At present, only one document on deep-learning-based security defect report prediction has been retrieved (Zheng Wei et al. Empirical study of security defect report prediction methods based on deep learning. Journal of Software, 2020, 31(5): 1294-1313). That paper uses the deep text mining models TextCNN and TextRNN to build a security defect report prediction model and obtains better predictions than traditional machine learning classification algorithms. However, training deep learning models relies on massive labeled data sets, and labeling such data for security defect report prediction requires a great deal of manpower and material resources, so the scale of the available data generally cannot meet practical requirements.
In recent years, multi-task learning has achieved great success in many practical applications such as natural language processing, text recognition and computer vision. Multi-task learning uses information from auxiliary tasks related to the target task to guide the model toward features that generalize better, which improves the model's performance on unseen data and compensates for the shortage of training data for the target task. The additional information provided by related tasks also helps the model focus on the features that actually matter, which reduces the influence of noise in the target task data on model training.
No literature on identifying security defect reports based on multi-task learning has been retrieved so far. Existing machine learning and deep learning methods all perform single-task learning on the security label of the defect report, and no research uses multi-task learning to predict security defect reports and defect severity levels simultaneously. Predicting whether a defect is security-related and predicting its severity level are both important in practice, and they are related tasks; learning them jointly can improve the generalization performance of the original task, namely security defect report prediction.
Disclosure of Invention
The invention provides a security defect report prediction method based on multi-task deep learning. It takes into account the difficulty of collecting and labeling security defect reports, and aims to improve the generalization performance of deep-learning-based security defect report prediction models and to avoid the risk that software is attacked because a security defect report was not marked as security-related, was not repaired in time, and its vulnerability was exploited by a hacker. The method introduces multi-task learning into the field of security defect report prediction for the first time and jointly predicts security defect reports and the severity levels of their defects. Multi-task learning can be realized on a variety of deep neural networks. By simultaneously learning two related tasks, security defect report identification and defect severity level prediction, and exploiting the potential correlation between them, the method extracts features with stronger generalization ability, reduces the overfitting risk of a single-task model, reduces the deep learning model's need for massive labeled data, reduces the influence of data noise on model training, and improves the generalization performance of the security defect report prediction model.
The purpose of the invention is realized by the following technical scheme:
A security defect report prediction method based on multi-task deep learning comprises the following steps:
Step 1: mining a defect report repository and a related security vulnerability management website, finding the defect reports marked as security-related or non-security-related by developers or maintainers, adding a severity label to each instance according to the severity field of the defect report, and constructing a defect report data set for training and testing the multi-task deep learning model;
Step 2: preprocessing the text content of the defect reports in the data set constructed in step 1, including noise removal, word segmentation, stop-word removal, lowercase conversion and stemming, to generate a domain-specific corpus of defect reports;
Step 3: training a word2vec model on the domain-specific corpus generated in step 2 to produce a vector representation of each word, i.e. a word vector dictionary;
Step 4: establishing a multi-task deep learning model for security defect report identification and severity level prediction of the related defects, the model being divided into a feature sharing layer and a task-specific layer, wherein:
the feature sharing layer is located at the bottom of the multi-task learning model, is implemented with a deep neural network, and extracts the shared semantic features of the preprocessed defect report;
the task-specific layer is located at the top of the multi-task learning model; each task corresponds to one sub-network, and each sub-network is implemented with a fully connected network having several hidden layers and a softmax layer and learns discriminative features for its own task;
Step 5: training the multi-task deep learning model established in step 4, using the potential correlation among the tasks to improve the generalization performance of the security defect report prediction model;
Step 6: changing the deep neural network that implements the feature sharing layer, repeating step 5, and selecting the multi-task deep learning model that performs best for identifying security defect reports and predicting the severity level of the related defects;
Step 7: given a newly submitted defect report, using the multi-task deep learning model selected in step 6 to identify whether it is a security defect report and to predict the severity level of its defect.
Compared with the existing single-task learning model based on machine learning or deep learning, the method has the following advantages:
1. The invention introduces multi-task learning into the field of security defect report prediction for the first time and jointly predicts security defect reports and the severity levels of their related defects.
2. The method uses information from the auxiliary task (defect severity level prediction) related to the target task (security defect report identification) to guide the model toward features that generalize better. Compared with existing single-task learning models, the multi-task deep learning model generalizes better to unseen data, which improves the prediction ability of the target task model.
3. The method exploits the potential correlation between security defect report identification and defect severity level prediction and learns the two related tasks simultaneously, which reduces the dependence of a single-task model on massive labeled data and reduces the influence of data noise in the target task on model training.
Drawings
FIG. 1 is a diagram of a multitasking security flaw report prediction framework of the present invention.
FIG. 2 is a diagram of the multi-tasking security defect report prediction model architecture based on TextCNN of the present invention.
FIG. 3 is a diagram of a DCNN + LSTM-based multi-tasking security flaw report prediction model architecture of the present invention.
FIG. 4 is a diagram of the LSTM-based multitask security flaw report prediction model architecture of the present invention.
FIG. 5 is a diagram of a GRU-based multitask security flaw report prediction model architecture according to the present invention.
FIG. 6 is a diagram of a BiLSTM-based multitask security flaw report prediction model architecture according to the present invention.
FIG. 7 is a diagram of a BiGRU-based multitask security flaw report prediction model architecture according to the present invention.
FIG. 8 shows security defect report #1436241 of Example 1.
FIG. 9 shows the ID of the defect instance of Example 1.
FIG. 10 shows non-security defect report #129763 of Example 2.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but not limited thereto, and any modification or equivalent replacement of the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The invention provides a security defect report prediction method based on multi-task deep learning. First, a defect report data set with security and severity labels is constructed. Second, the text content of the defect reports in the data set is preprocessed to generate a domain-specific corpus of defect reports, and a word2vec model is trained on this corpus to generate a vector representation of each word. A multi-task deep learning model is then established: a deep neural network at the bottom of the model (the feature sharing layer) extracts the shared semantic features of the preprocessed defect report, each sub-network at the top (the task-specific layer) learns discriminative features for its own task (security defect report identification or defect severity level prediction), and the feature vectors output by the top-level networks are finally used as the input of each sub-task prediction network to complete the security defect report identification and severity level prediction tasks. The framework is shown in FIG. 1, and the specific steps are as follows:
Step 1: mining a defect report repository and a related security vulnerability management website (such as the Mozilla Foundation Security Advisory, MFSA), finding the defect reports marked as security-related or non-security-related by developers or maintainers, adding a severity label to each instance according to the severity field of the defect report, and constructing a defect report data set for training and testing the multi-task deep learning model. The specific steps are as follows:
Step 11: constructing a security defect report data set for the real-world project Chromium: each defect report in the Chromium defect report repository is crawled in order of submission time with the open-source Web crawler platform Scrapy, and a regular expression is used to judge whether the report contains a 'Bug-Security' field; if so, the report is marked as a security defect report, otherwise as a non-security defect report;
Step 12: constructing a security defect report data set for the real-world project Mozilla: the security vulnerability management website (MFSA) of the Mozilla project is mined to obtain information on defect reports related to security problems, and the IDs of these defect reports are extracted from the relevant web pages to form a security defect report ID set; if the ID of a defect report crawled from the Mozilla defect report tracking system is in this ID set, the report is marked as a security defect report, otherwise as a non-security defect report;
Step 13: converting the Severity field of each defect report in the data set into a corresponding severity label. The severity level of a defect report is subdivided into six types, namely trivial, minor, normal, major, critical and blocker, which are replaced by the six labels 0, 1, 2, 3, 4 and 5 respectively, generating a multi-task deep learning data set that can be used for security defect report identification and severity prediction at the same time.
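The labeling rules of steps 11 to 13 can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, the crawling itself is omitted, and the lowercase spelling of the six severity names is an assumption about how the Severity field is stored.

```python
import re

# Severity names in increasing order; list index = label (0..5), as in step 13.
SEVERITY_LEVELS = ["trivial", "minor", "normal", "major", "critical", "blocker"]
SEVERITY_TO_LABEL = {name: i for i, name in enumerate(SEVERITY_LEVELS)}

# Step 11: a Chromium report counts as security-related if it carries "Bug-Security".
BUG_SECURITY_RE = re.compile(r"Bug-Security")


def label_chromium(report_text: str) -> int:
    """Return 1 for a security defect report, 0 otherwise."""
    return 1 if BUG_SECURITY_RE.search(report_text) else 0


def label_mozilla(report_id: int, mfsa_ids: set) -> int:
    """Step 12: a Mozilla report is security-related if its ID is in the MFSA ID set."""
    return 1 if report_id in mfsa_ids else 0


def severity_label(severity_field: str) -> int:
    """Step 13: map the Severity field to an integer label."""
    return SEVERITY_TO_LABEL[severity_field.strip().lower()]
```

For instance, with an ID set containing 1436241, `label_mozilla(1436241, ids)` yields 1 and `severity_label("normal")` yields 2.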
Step 2: preprocessing the text content of the defect reports in the data set constructed in step 1, including noise removal, word segmentation, stop-word removal, lowercase conversion and stemming, to generate a domain-specific corpus of defect reports. The specific steps are as follows:
Step 21: parsing each defect report in the data set and extracting its brief description (Summary) and detailed description (Description), which together form the text content of the defect report;
Step 22: filtering the text content of the defect report with regular expressions to remove noise, which mainly comprises URL links, stack traces, file names, non-alphabetic characters and other irrelevant information;
Step 23: applying word segmentation, stop-word removal, stemming and similar preprocessing to the denoised text to generate the domain-specific text corpus of defect reports.
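A rough sketch of the preprocessing pipeline of steps 21 to 23 follows. Everything concrete in it is an assumption made for illustration: the regular expressions, the tiny stop-word list and the crude suffix-stripping function (which stands in for a real stemmer such as Porter's) are not taken from the patent, and stack-trace removal is left out.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Hypothetical pattern for file names such as "nsFoo.cpp" or "a/b/test.js".
FILE_RE = re.compile(r"\b[\w/\\.-]+\.(?:cpp|cc|h|js|py|java|html|xml|txt)\b")
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "when", "it"}


def naive_stem(word: str) -> str:
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(summary: str, description: str) -> list:
    text = summary + " " + description      # step 21: Summary + Description
    text = URL_RE.sub(" ", text)            # step 22: drop URL links
    text = FILE_RE.sub(" ", text)           #          drop file names
    text = NON_LETTER_RE.sub(" ", text)     #          drop non-alphabetic characters
    tokens = text.lower().split()           # step 23: lowercase and segment into words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]
```

The list of tokens returned for each report would form one document of the corpus used to train word2vec.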
Step 3: training a word2vec model on the domain-specific corpus generated in step 2 to produce a vector representation of each word, i.e. a word vector dictionary. The specific steps are as follows:
Step 31: converting the words in the corpus into one-hot vector representations;
Step 32: constructing a CBOW word2vec model based on negative sampling, taking the one-hot vectors of words as the input of the model, and outputting low-dimensional continuous real-valued vector representations of the words;
Step 33: establishing word-index and index-vector mapping tables, in which padding words are represented by the special symbol 'pad' and all words absent from the vocabulary are represented by the special symbol 'unk'.
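The mapping tables of step 33 might look like the sketch below. It is only an illustration: random vectors stand in for trained word2vec vectors, giving 'pad' an all-zero vector is an assumption, and the function names are hypothetical.

```python
import random

PAD, UNK = "pad", "unk"  # special symbols named in step 33


def build_tables(corpus, dim=4, seed=0):
    """Build word->index and index->vector tables.

    Random vectors are placeholders for the vectors a trained word2vec
    model would provide; 'pad' gets a zero vector by assumption.
    """
    rng = random.Random(seed)
    vocab = [PAD, UNK] + sorted({w for doc in corpus for w in doc})
    word_to_index = {w: i for i, w in enumerate(vocab)}
    index_to_vector = {
        i: [0.0] * dim if w == PAD else [rng.uniform(-1, 1) for _ in range(dim)]
        for i, w in enumerate(vocab)
    }
    return word_to_index, index_to_vector


def encode(tokens, word_to_index, max_len):
    """Map tokens to indices, using 'unk' for unknown words and padding to max_len."""
    ids = [word_to_index.get(t, word_to_index[UNK]) for t in tokens[:max_len]]
    return ids + [word_to_index[PAD]] * (max_len - len(ids))
```

An index sequence produced this way is what the embedding layer of the model later turns back into vectors through the index-vector table.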
Step 4: establishing a multi-task deep learning model for security defect report identification and severity level prediction of the related defects. The specific steps are as follows:
Step 41: designing a multi-task deep learning model divided into two parts, a feature sharing layer (shared layers) and a task-specific layer (task-specific layers), wherein:
the feature sharing layer is located at the bottom of the multi-task learning model and can be implemented with various deep neural networks, such as TextCNN (FIG. 2), DCNN + LSTM (FIG. 3), LSTM (FIG. 4), GRU (FIG. 5), BiLSTM (FIG. 6) and BiGRU (FIG. 7); it is mainly used to extract the shared semantic features of the preprocessed defect report;
the task-specific layer is located at the top of the multi-task learning model, and each task corresponds to one sub-network (the invention has only two tasks, security defect report identification and defect severity level prediction, and therefore two sub-networks); each sub-network can use a fully connected network with several hidden layers and a softmax layer to extract discriminative task-specific features and perform the specific classification task;
Step 42: setting the parameters of the deep neural network of the feature sharing layer, and initializing the parameters of the embedding layer with the word2vec model parameters trained in step 3; in addition, the specific parameter ranges of the network are set according to the network type adopted by the sharing layer;
Step 43: setting the numbers of input and output neurons of the fully connected network of each task-specific layer, where the number of input neurons is determined by the dimension of the semantic feature vector output by the sharing layer and the number of output neurons is determined by the number of categories of each sub-task. For example, security defect reports fall into two categories, security-related and non-security-related, so the number of output neurons of the sub-network used for security defect report prediction (the security task layer) is set to 2. Likewise, the severity level of a defect report is subdivided into six categories (trivial, minor, normal, major, critical and blocker), so the number of output neurons of the sub-network used for severity level prediction (the severity task layer) is set to 6.
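The shared-bottom structure described in step 4 can be illustrated with a dependency-free forward pass. This is a deliberately simplified sketch: mean pooling followed by one tanh layer stands in for the TextCNN/LSTM/GRU feature sharing layer, each task head is a single linear layer plus softmax with no hidden layers, weights are random and untrained, and the class name and dimensions are assumptions.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def dense(x, weights, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init(rows, cols, rng):
    """Small random weight matrix and zero bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)], [0.0] * rows


class MultiTaskSketch:
    def __init__(self, embed_dim=8, shared_dim=16, seed=0):
        rng = random.Random(seed)
        self.shared = init(shared_dim, embed_dim, rng)   # feature sharing layer
        self.security_head = init(2, shared_dim, rng)    # 2 output neurons
        self.severity_head = init(6, shared_dim, rng)    # 6 output neurons

    def forward(self, word_vectors):
        # Stand-in for the shared deep network: mean pooling + tanh dense layer.
        n = len(word_vectors)
        pooled = [sum(col) / n for col in zip(*word_vectors)]
        features = [math.tanh(v) for v in dense(pooled, *self.shared)]
        # Both task heads read the same shared feature vector.
        security = softmax(dense(features, *self.security_head))
        severity = softmax(dense(features, *self.severity_head))
        return security, severity
```

The point of the sketch is the wiring: one set of shared parameters feeds two heads whose output sizes (2 and 6) follow step 43. In practice a deep learning framework would supply the layers and the training machinery.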
Step 5: training the multi-task deep learning model established in step 4, using the potential correlation among the tasks to improve the generalization performance of the security defect report prediction model. The specific steps are as follows:
Step 51: setting the hyper-parameter ranges for training the multi-task deep learning model;
Step 52: randomly selecting a group of hyper-parameters from the grid search space for training the multi-task deep learning model;
Step 53: in the embedding layer of the multi-task deep learning model, converting each word of each batch of training instances into a word vector, feeding the word vectors through the feature sharing layer and the task-specific layers for forward computation, and outputting, for each instance, the predicted probability that it is a security defect report and the predicted probability of each severity level;
Step 54: taking the predicted probabilities and the true labels as the input of the cross-entropy loss function, $H(p, q) = -\sum_x p(x) \log q(x)$, calculating the loss of each task separately, and computing the weighted sum of the loss values to obtain the total loss;
Step 55: performing back-propagation and updating the parameters of the multi-task deep learning model with a gradient optimization algorithm that adapts the learning rate (such as the Adam algorithm);
Step 56: repeating steps 52 to 55 and selecting the optimal group of hyper-parameters, so that the multi-task deep learning model performs best on the validation set.
Step 6: modifying the deep neural network that implements the feature sharing layer, repeating step 5, and selecting the model that performs best for identifying security defect reports and predicting the severity level of the related defects.
Step 7: given a newly submitted defect report, using the multi-task deep learning model selected in step 6 to identify whether it is a security defect report and to predict the severity level of its defect.
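The loss computation of step 54 can be written out for a single training instance as follows. With a one-hot true distribution p, the cross-entropy -Σ p(x) log q(x) reduces to the negative log of the probability assigned to the true class. The task weights `w_sec` and `w_sev` are hypothetical parameters; the patent says only that the per-task losses are combined by a weighted sum.

```python
import math


def cross_entropy(probs, true_label):
    """Cross-entropy against a one-hot target: -log q(true_label)."""
    return -math.log(probs[true_label])


def total_loss(sec_probs, sec_label, sev_probs, sev_label, w_sec=1.0, w_sev=1.0):
    """Weighted sum of the security-task loss and the severity-task loss."""
    return (w_sec * cross_entropy(sec_probs, sec_label)
            + w_sev * cross_entropy(sev_probs, sev_label))


# Made-up probabilities for one instance whose true labels are (1, 2).
loss = total_loss([0.2, 0.8], 1, [0.05, 0.05, 0.7, 0.1, 0.05, 0.05], 2, w_sec=1.0, w_sev=0.5)
```

Because both task losses are back-propagated through the same feature sharing layer, the severity labels influence the features used for security identification, which is the mechanism the method relies on.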
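The selection loop implied by steps 52, 56 and 6 can be sketched as below. It is an assumption-laden outline: the sketch enumerates the whole grid rather than sampling from it, `train_and_validate` is a hypothetical callback that would train one model and return its validation score, and the hyper-parameter names are placeholders.

```python
from itertools import product


def select_best(shared_layer_types, grid, train_and_validate):
    """Try every shared-layer type with every hyper-parameter combination.

    train_and_validate(layer_type, params) must return a validation score
    where higher is better. Returns (score, layer_type, params) of the best run.
    """
    keys = sorted(grid)
    best = None
    for layer_type in shared_layer_types:
        for values in product(*(grid[k] for k in keys)):
            params = dict(zip(keys, values))
            score = train_and_validate(layer_type, params)
            if best is None or score > best[0]:
                best = (score, layer_type, params)
    return best
```

A caller would pass the candidate network types (for example the six listed for the feature sharing layer) together with a grid such as `{"lr": [...], "batch": [...]}`, and then use the returned configuration for the prediction in step 7.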
Example 1:
The defect prediction process is illustrated with security defect report #1436241 from the Mozilla project data set (shown in FIG. 8).
First, the defect report is labeled according to the flow of step 1 for later comparison with the model's prediction. Since the ID of the defect instance can be found on the MFSA (as shown in FIG. 9), the security label of the defect report is determined to be 1 (indicating a security-related defect). In addition, the reporter marked the Severity of the defect as normal when submitting the report, so the severity label of the defect report is determined to be 2 according to the correspondence in step 13.
Assume that the defect report is newly submitted and that neither its security label nor its severity label is known. The prediction process and prediction result of the method of the invention are described below using this defect report as an example.
First, text preprocessing is performed on the summary and description of the defect report. The summary and description before preprocessing are shown in Table 1:
TABLE 1
Figure BDA0002645415090000131
The text information after text preprocessing is shown in table 2:
TABLE 2
Figure BDA0002645415090000132
Mapping the words after the defect report preprocessing into index values by using a vocabulary-index mapping table obtained by word2vec training, as follows:
[80 8 462 146 489 366 84 164 65 7 700 397 134 1467 1468 113 1324 1 191 177 230 3 396 350 2 462 382 519 52 290 694 421 819 164 65 274 69 52 1091 80 73 69 52 689 1091 164 65 8 421 819 164 65 284 2509 6217 52 67]
In the Embedding layer of the model, the word indexes are converted into vector representations using the index-vector mapping table obtained by word2vec training. The shared layers of the model then output the shared semantic features of the defect report, and these shared features are fed into the security task layer and the severity task layer respectively for classification. The output of the security task layer is [0.0037, 0.9963]; the category with the maximum value is 1, i.e. security defect report, so the report is predicted to be a security defect report. The output of the severity task layer is [0.0037, 0.0089, 0.8152, 0.0552, 0.1104, 0.0066]; the category with the maximum value is 2, i.e. normal, so the predicted severity level is normal. Comparison with the true labels of the defect report shows that the model correctly predicts both its security label and its severity.
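Turning the two output vectors of this example into labels is a plain argmax over each vector. The short sketch below uses the probabilities quoted above; the function names and the lowercase severity names are illustrative assumptions.

```python
SEVERITY_LEVELS = ["trivial", "minor", "normal", "major", "critical", "blocker"]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def decode(security_probs, severity_probs):
    """Return (is_security_report, severity_name) from the two softmax outputs."""
    is_security = argmax(security_probs) == 1
    return is_security, SEVERITY_LEVELS[argmax(severity_probs)]


print(decode([0.0037, 0.9963], [0.0037, 0.0089, 0.8152, 0.0552, 0.1104, 0.0066]))
# prints (True, 'normal')
```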
Example 2:
The defect prediction process is illustrated with non-security defect report #129763 from the Mozilla project data set (shown in FIG. 10).
First, the defect report is labeled according to the flow of step 1 for later comparison with the model's prediction. Since the ID of the defect instance is not found on the MFSA, the security label of the defect report is determined to be 0 (indicating a non-security-related defect). In addition, the reporter marked the Severity of the defect as normal when submitting the report, so the severity label of the defect report is determined to be 2 according to the correspondence in step 13.
Assume that the defect report is newly submitted and that neither its security label nor its severity label is known. The prediction process and prediction result of the method of the invention are described below using this defect report as an example.
First, text preprocessing is performed on the summary and description of the defect report. The summary and description before preprocessing are shown in Table 3:
TABLE 3
Figure BDA0002645415090000141
Figure BDA0002645415090000151
The text information after text preprocessing is shown in table 4:
TABLE 4
Figure BDA0002645415090000152
The preprocessed words of the defect report are mapped to index values using the vocabulary-index mapping table obtained from word2vec training, with the following result:
[292 15 263 291 578 82 6966 3707 3 34 539 6158 1619 15 10 292 15 491 263 3707 291 669 564 458 3707 301 5966 7609 291 263 570 409 20 759 3707 75 366 1771]
In the Embedding layer of the model, the word indexes are converted into vector representations using the index-vector mapping table obtained from word2vec training. The shared layers of the model then output the shared semantic features of the defect report, and these shared features are fed separately into the security task layer and the severity task layer for classification. The output of the security task layer is [0.9403, 0.0597]; the category with the maximum value is 0, i.e., a non-security defect report, so the report is predicted to be a non-security defect report. The output of the severity task layer is [0.0062, 0.0149, 0.8943, 0.0513, 0.0121, 0.0212]; the category with the maximum value is 2, i.e., normal, so the predicted severity level is normal. Comparison with the true labels of the defect report shows that the model correctly predicts both its security label and its severity.
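The word-to-index mapping performed at the start of this example can be illustrated with a small sketch. The vocabulary, the tokens, and the choice of reserving index 0 for out-of-vocabulary words are all hypothetical; the actual vocabulary-index table comes from word2vec training and is not reproduced in the text.

```python
def build_vocab_index(vocabulary, first_index=1):
    """Assign consecutive indices to words, starting at first_index."""
    return {word: i for i, word in enumerate(vocabulary, start=first_index)}


def words_to_indices(words, vocab_index, unknown_index=0):
    """Map each word to its index; unseen words fall back to unknown_index."""
    return [vocab_index.get(word, unknown_index) for word in words]


# Hypothetical toy vocabulary standing in for the word2vec vocabulary.
vocab_index = build_vocab_index(["crash", "page", "load", "browser"])

# Hypothetical preprocessed tokens of a defect report.
tokens = ["browser", "crash", "page", "reload"]
indices = words_to_indices(tokens, vocab_index)

print(indices)  # "reload" is not in the toy vocabulary, so it maps to 0
```

The resulting index list is what the Embedding layer consumes: each index selects one row of the index-vector table.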

Claims (6)

1. A security defect report prediction method based on multitask deep learning, characterized by comprising the following steps:
step 1: mining a defect report repository and the related security vulnerability management website, finding the defect reports marked as security-related or non-security-related by developers or maintainers, adding a severity label to each instance according to the severity field of the defect report, and constructing a defect report dataset for training and testing a multitask deep learning model;
step 2: preprocessing the text content of the defect reports in the dataset constructed in step 1 to generate a domain-specific corpus of defect reports;
step 3: training a word2vec model on the domain-specific corpus generated in step 2 to generate a word vector dictionary;
step 4: building a multi-task learning model for security defect report identification and severity level prediction of the related defects, the multi-task learning model being divided into a feature sharing layer and a specific task layer, wherein:
the feature sharing layer is located at the bottom of the multi-task learning model and extracts shared semantic features from the preprocessed defect report;
the specific task layer is located at the top of the multi-task learning model, each task corresponds to a sub-network, and each sub-network uses a fully connected network with a plurality of hidden layers and a softmax layer to extract discriminative task-specific features and perform the specific classification task;
step 5: training the multi-task learning model built in step 4, using the latent correlation among the multiple tasks to improve the generalization performance of the security defect report prediction model;
step 6: changing the deep neural network used to implement the feature sharing layer, repeating step 5, and selecting the best-performing multitask deep learning model for identifying security defect reports and predicting the severity level of the related defects;
step 7: given a newly submitted defect report, using the multitask deep learning model selected in step 6 to identify whether it is a security defect report and to predict the severity level of its related defect.
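The two-level structure described in step 4 of claim 1 (a feature sharing layer followed by one sub-network per task) can be sketched in plain Python as below. This is only an illustration under stated assumptions: the class name `MultiTaskSketch`, the random initialization, and all sizes are hypothetical; mean pooling of word vectors stands in for the deep network of the feature sharing layer; and each sub-network has a single hidden layer for brevity, whereas the claim recites a plurality of hidden layers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(vector, weights, bias):
    """Fully connected layer; weights holds one row per output neuron."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def random_layer(rng, n_in, n_out):
    """Randomly initialized (weights, bias) pair for a dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiTaskSketch:
    def __init__(self, vocab_size, embed_dim, hidden_dim, task_classes, seed=0):
        rng = random.Random(seed)
        # Embedding table: one vector per word index.
        self.embedding = [[rng.uniform(-0.5, 0.5) for _ in range(embed_dim)]
                          for _ in range(vocab_size)]
        # One sub-network per task: hidden layer, then softmax output layer.
        self.heads = {
            task: (random_layer(rng, embed_dim, hidden_dim),
                   random_layer(rng, hidden_dim, n_classes))
            for task, n_classes in task_classes.items()
        }

    def shared_features(self, word_indices):
        # Assumption: mean pooling replaces the shared deep network.
        vectors = [self.embedding[i] for i in word_indices]
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    def forward(self, word_indices):
        shared = self.shared_features(word_indices)  # computed once, reused by all tasks
        outputs = {}
        for task, ((w1, b1), (w2, b2)) in self.heads.items():
            hidden = [max(0.0, h) for h in dense(shared, w1, b1)]  # ReLU
            outputs[task] = softmax(dense(hidden, w2, b2))
        return outputs


model = MultiTaskSketch(vocab_size=10, embed_dim=4, hidden_dim=3,
                        task_classes={"security": 2, "severity": 6})
outputs = model.forward([1, 5, 7, 2])
print({task: len(probs) for task, probs in outputs.items()})
```

The point of the layout is that the shared features are computed once per report and consumed by every task head, so gradients from both tasks would update the same shared parameters during training.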
2. The security defect report prediction method based on multitask deep learning as claimed in claim 1, wherein the specific steps of step 1 are as follows:
step 11: constructing a security defect report dataset for the real-world project Chromium: crawling each defect report in the Chromium defect report repository in order of submission time with the open-source Web crawler platform Scrapy, and using a regular expression to judge whether the defect report contains a 'Bug-Security' field; if so, marking the defect report as a security defect report, and otherwise marking it as a non-security defect report;
step 12: constructing a security defect report dataset for the real-world project Mozilla: mining the security vulnerability management website of the Mozilla project, acquiring the defect report information related to security problems, and extracting the IDs of the defect reports from the relevant web pages to form a security defect report ID set; if the ID of a defect report crawled from the Mozilla defect report tracking system is in the ID set, marking the defect report as a security defect report, and otherwise marking it as a non-security defect report;
step 13: converting the severity field of each defect report in the security defect report dataset into the corresponding severity label.
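The labeling rules of steps 11 to 13 can be sketched as three small functions. The function names, the sample report text, and the security ID set are hypothetical; only the 'Bug-Security' marker, the membership test against the security ID set, report #129763, and the correspondence normal → 2 come from the text. The full severity-to-label table is not given in this section, so it is passed in as a parameter.

```python
import re

# Marker that step 11 searches for in a Chromium defect report.
BUG_SECURITY_PATTERN = re.compile(r"Bug-Security")


def label_chromium_report(report_text):
    """Return 1 if the report contains the 'Bug-Security' field, else 0."""
    return 1 if BUG_SECURITY_PATTERN.search(report_text) else 0


def label_mozilla_report(report_id, security_ids):
    """Return 1 if the report ID is in the security defect report ID set, else 0."""
    return 1 if report_id in security_ids else 0


def severity_label(severity_field, severity_to_label):
    """Translate a severity field into its numeric label via the given table."""
    return severity_to_label[severity_field.strip().lower()]


# Hypothetical Chromium report snippets.
chromium_security = label_chromium_report("Type: Bug-Security\nPri: 1")
chromium_other = label_chromium_report("Type: Bug\nPri: 2")

# Hypothetical ID set; report #129763 is absent, as in Example 2.
mozilla_label = label_mozilla_report(129763, {1000001, 1000002})

# Only the entry for "normal" is stated in the text.
normal_label = severity_label("Normal", {"normal": 2})

print(chromium_security, chromium_other, mozilla_label, normal_label)
```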
3. The security defect report prediction method based on multitask deep learning as claimed in claim 1, wherein the specific steps of step 4 are as follows:
step 41: designing a multi-task deep learning model;
step 42: setting the parameters of the deep neural network for the feature sharing layer, and initializing the parameters of the embedding layer with the word2vec model parameters obtained by training in step 3; in addition, setting network-specific parameter ranges according to the type of network adopted by the sharing layer;
step 43: setting the number of input and output neurons of the fully connected network for the specific task layer.
4. The security defect report prediction method based on multitask deep learning as claimed in claim 3, characterized in that the feature sharing layer is implemented with several kinds of deep neural networks, and the deep neural network used to implement the feature sharing layer is one of LSTM, BiLSTM, DCNN + LSTM, TextCNN, GRU, and BiGRU.
5. The method of claim 3, wherein the number of input neurons of the specific task layer is determined by the dimension of the semantic feature vector output by the sharing layer, and the number of output neurons is determined by the number of categories of each subtask.
6. The security defect report prediction method based on multitask deep learning as claimed in claim 1, wherein the specific steps of step 5 are as follows:
step 51: setting the hyper-parameter ranges for training the multi-task deep learning model;
step 52: randomly selecting a group of hyper-parameters for training the multi-task deep learning model according to grid search;
step 53: converting each word of each batch of training instances into a word vector in the embedding layer of the multi-task deep learning model, feeding the word vectors into the feature sharing layer and the specific task layer of the model for forward computation, and outputting the predicted probability that each instance is a security defect report and the probability of each severity level of the instance;
step 54: taking the predicted probabilities and the true labels as the input of a cross-entropy loss function, calculating the loss of each task separately, and taking the weighted sum of the loss values as the total loss;
step 55: adjusting the learning rate with a gradient optimization algorithm, performing back propagation, and updating the parameters of the multi-task deep learning model;
step 56: repeating steps 52 to 55 and selecting the optimal group of hyper-parameters, so that the multi-task deep learning model achieves its best performance on the validation set.
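Steps 52 and 54 can be illustrated with a short sketch: the total loss is a weighted sum of per-task cross-entropy losses, and a group of hyper-parameters is drawn from a grid. The task weights (0.5 each), the hyper-parameter names and values, and the function names are assumptions made for illustration; the probabilities and true labels are those quoted in Example 1.

```python
import itertools
import math
import random


def cross_entropy(probabilities, true_label):
    """Cross-entropy loss of one instance: -log of the true class probability."""
    return -math.log(probabilities[true_label])


def total_loss(task_probs, task_labels, task_weights):
    """Weighted sum of the per-task cross-entropy losses (step 54)."""
    return sum(task_weights[task] * cross_entropy(task_probs[task], task_labels[task])
               for task in task_probs)


def hyperparameter_grid(ranges):
    """Enumerate every combination of the given hyper-parameter values."""
    names = sorted(ranges)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(ranges[name] for name in names))]


# Predicted probabilities and true labels from Example 1.
task_probs = {
    "security": [0.0037, 0.9963],
    "severity": [0.0037, 0.0089, 0.8152, 0.0552, 0.1104, 0.0066],
}
task_labels = {"security": 1, "severity": 2}
task_weights = {"security": 0.5, "severity": 0.5}  # hypothetical weights

loss = total_loss(task_probs, task_labels, task_weights)

# Hypothetical hyper-parameter ranges (step 51) and one random draw (step 52).
grid = hyperparameter_grid({"learning_rate": [0.001, 0.01], "batch_size": [32, 64]})
chosen = random.Random(0).choice(grid)

print(round(loss, 4), chosen)
```

Because both predictions in Example 1 are correct and confident, the total loss is small; a wrong or uncertain prediction on either task would raise it in proportion to that task's weight.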

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010853000.1A CN112001484A (en) 2020-08-22 2020-08-22 Safety defect report prediction method based on multitask deep learning

Publications (1)

Publication Number Publication Date
CN112001484A (en) 2020-11-27

Family

ID=73473183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010853000.1A Pending CN112001484A (en) 2020-08-22 2020-08-22 Safety defect report prediction method based on multitask deep learning

Country Status (1)

Country Link
CN (1) CN112001484A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615533A (en) * 2015-01-15 2015-05-13 南京大学 Intelligent software defect tracking management method based on mobile instant messaging software
CN105068925A (en) * 2015-07-29 2015-11-18 北京理工大学 Software security flaw discovering system
US20170277583A1 (en) * 2016-03-28 2017-09-28 Wipro Limited System and method for classifying defects occurring in a software environment
CN107844414A (en) * 2016-09-21 2018-03-27 南京大学 Cross-project, parallelized defect localization method based on defect report analysis
US20190179727A1 (en) * 2017-12-13 2019-06-13 The Mathworks, Inc. Automatic setting of multitasking configurations for a code-checking system
CN110347839A (en) * 2019-07-18 2019-10-18 湖南数定智能科技有限公司 Text classification method based on generative multi-task learning model
CN110708279A (en) * 2019-08-19 2020-01-17 中国电子科技网络信息安全有限公司 Vulnerability mining model construction method based on group intelligence

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
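JIANG YUAN et al.: "LTRWES: A new framework for security bug report detection", INFORMATION AND SOFTWARE TECHNOLOGY *
XI GONG et al.: "Joint Prediction of Multiple Vulnerability Characteristics Through Multi-Task Learning", 2019 24TH INTERNATIONAL CONFERENCE ON ENGINEERING OF COMPLEX COMPUTER SYSTEMS *
Ren Xin (任昕): "Research on Web security detection based on hidden Markov models and convolutional neural networks", China Master's and Doctoral Theses Full-text Database (Master's), Information Science and Technology *
Lu Pengcheng (路鹏程): "Security defect report identification and defect localization based on deep learning", China Master's and Doctoral Theses Full-text Database (Master's), Information Science and Technology *
Zheng Wei (郑炜) et al.: "An empirical study of security defect report prediction methods based on deep learning", Journal of Software (软件学报) *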

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766490A (en) * 2021-01-13 2021-05-07 深圳前海微众银行股份有限公司 Characteristic variable learning method, device, equipment and computer readable storage medium
CN113128671A (en) * 2021-04-19 2021-07-16 烟台大学 Service demand dynamic prediction method and system based on multi-mode machine learning
CN113128671B (en) * 2021-04-19 2022-08-02 烟台大学 Service demand dynamic prediction method and system based on multi-mode machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201127