CN116775871A - Deep learning software defect report classification method based on seBERT pre-training model - Google Patents

Deep learning software defect report classification method based on seBERT pre-training model

Info

Publication number
CN116775871A
CN116775871A (Application No. CN202310711807.5A)
Authority
CN
China
Prior art keywords
defect report
defect
data
deep learning
text
Prior art date
Legal status
Pending
Application number
CN202310711807.5A
Other languages
Chinese (zh)
Inventor
宫丽娜
曾子璇
张静宣
魏明强
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202310711807.5A priority Critical patent/CN116775871A/en
Publication of CN116775871A publication Critical patent/CN116775871A/en
Pending legal-status Critical Current


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a deep learning software defect report classification method based on the seBERT pre-training model, which comprises the following steps: collecting defect reports from software repositories developed on deep learning frameworks, adding a label category to each defect report, and forming sample data; merging the text data extracted from each sample and inputting it into the pre-training model for fine-tuning; the fine-tuned model takes the feature vector corresponding to the cls identifier at the start of the text data as the semantic feature of the input text, and this feature vector is input into a Softmax layer for normalization; the feature vector output by the fine-tuned pre-training model is fed into a fully connected layer, and a linear layer maps the feature vector containing semantic information to the corresponding category according to its dimension; finally, the output with the highest probability is taken as the predicted category of the defect report, completing the classification of each defect report. The application helps to improve the efficiency and accuracy of classifying deep learning software defect reports.

Description

Deep learning software defect report classification method based on seBERT pre-training model
Technical Field
The application relates to a deep learning software defect report classification method, in particular to a deep learning software defect report classification method based on a seBERT pre-training model.
Background
Deep learning software has permeated various industries and application fields, but security, quality and other vulnerabilities inevitably remain. To ensure the quality of deep learning software systems and prevent serious economic loss, techniques for identifying and predicting such defects have important engineering application value. During software development, defect reports submitted by developers and users reflect problems such as vulnerabilities and performance requirements of the software.
However, manually sorting and identifying defect reports consumes considerable manpower and time. Compared with conventional traditional software, deep learning software suffers from problems such as randomness in the training stage and dense interdependence within the neural network, which make its defects more difficult to observe and reproduce; new testing techniques are therefore needed to support its testing.
Pre-trained contextual language representation models have been successful in natural language processing and in improving the effectiveness of tag recommendation, but models pre-trained on general-domain corpora do not perform as well on certain tasks as models trained on domain-specific corpora. Moreover, the defect report texts published in public deep learning software repositories suffer from obvious class imbalance and redundancy, so a general pre-trained model alone yields poor training and prediction results.
Patent Document 1 discloses a software defect prediction device and method based on open-source community knowledge. The method constructs and trains a BP neural network and an LSTM neural network from open-source community code, first predicts whether a software defect exists using the trained BP neural network, and, if a defect is predicted, further predicts the defect type using the trained LSTM neural network, thereby improving the accuracy of code defect prediction. However, the method is not tailored to the defect characteristics of existing deep learning software, and it does not consider data processing for deep learning software defects.
In summary, existing research provides a good foundation for vulnerability prediction from software defect reports, but the capability of classifying defect reports for current deep learning software has not been fully exploited, mainly in the following respects:
1. There is no defect report classification method dedicated to deep learning software, and the accuracy of fine-grained subdivision of defect reports is poor.
2. Because deep learning software suffers from randomness in the training stage, dense interdependence within the neural network and similar problems, the resulting software defects are difficult to reproduce, the defect report data are class-imbalanced, and their content is disordered.
Reference to the literature
Patent Document 1: Chinese patent application publication No. CN111949535A, publication date: 2020.11.17.
Disclosure of the Invention
The application aims to provide a deep learning software defect report classification method based on the seBERT pre-training model, which fully considers the defect characteristics of deep learning software and the class imbalance of defect report data, and helps to improve the efficiency and accuracy of the defect report classification task for deep learning software.
In order to achieve the above purpose, the application adopts the following technical scheme:
the deep learning software defect report classification method based on the seBERT pre-training model comprises the following steps:
step 1, collecting defect reports corresponding to software repositories developed based on deep learning frameworks from the hosting platform of software projects, and adding a label category for each defect report according to information such as the title, textual description and follow-up comments in the report;
forming sample data from the defect report information and the label category corresponding to each defect report;
step 2, merging the text data extracted from each sample, inputting the merged text into the pre-training model, and fine-tuning the pre-training model; the fine-tuned model takes the feature vector corresponding to the <cls> identifier at the beginning of the text data as the semantic feature of the input text, and this feature vector is then input into a Softmax layer for normalization;
step 3, inputting the feature vector output by the fine-tuned pre-training model in step 2 into a fully connected layer, and using a linear layer to map the feature vector containing semantic information to the corresponding category according to its dimension;
and finally, taking the output with the highest probability as the final prediction category of the defect report, thereby completing the classification of each defect report.
The application has the following advantages:
As described above, the application relates to a deep learning software defect report classification method based on the seBERT pre-training model. The application uses the seBERT model, pre-trained on a software engineering corpus, to better extract the text information in defect reports submitted by users and developers, and to improve the ability to identify, from the report text, the defect category to which each defect report belongs.
In addition, when classifying defects of deep learning software, unlike the traditional scheme of labeling defect reports merely as bug or non-bug, the application analyzes the causes and classification of deep learning software bugs and divides the software defect labels into four types: Error, Deployment, Performance and Tensors & Inputs, thereby predicting the specific category of each defect report and improving the model's ability to classify and predict defect reports.
Furthermore, the defect report data of a software project exhibit an obvious class imbalance problem, which greatly affects the fine-tuning of the pre-training model. With the rise of pre-trained language models, data augmentation using a masked language model (Masked Language Model, MLM) has shown excellent performance. The application therefore adopts the BERT pre-training model and uses the MLM method in BERT to predict masked words in the data and replace them with synonyms, thereby generating new training data that is added to the original training data, so that text semantic information of each defect class can be extracted from a small amount of imbalanced data.
Drawings
FIG. 1 is a flow chart of a deep learning software defect report classification method according to an embodiment of the application.
FIG. 2 is a flow chart of a method for data augmentation of the data set based on the BERT pre-training model in an embodiment of the application.
FIG. 3 is a model diagram of a deep learning software defect report classification method according to an embodiment of the application.
Detailed Description
The application is described in further detail below with reference to the attached drawings and detailed description:
As deep learning is increasingly used in mission-critical applications, defective deep learning application software can lead to catastrophic consequences. Issue reports (i.e., defect reports) submitted by developers, maintainers and users of the software reflect the development status of the software, from which information can be extracted to provide data references for defect prediction.
The application uses the text information of software defect reports and a contextual word-embedding language model to complete the multi-class label classification of defect reports, thereby ensuring the maintainability and defect traceability of deep learning software.
The application selects the seBERT model, pre-trained on software engineering text, for fine-tuning, so that semantic information in defect reports can be better extracted and the model can obtain better results under imbalanced or insufficient data. The multi-class classification is finally completed through a feed-forward neural network, which determines the correct label of the real defect in the defect report.
As shown in fig. 1, the deep learning software defect report classification method based on the seBERT pre-training model in this embodiment includes the following steps:
step 1, collecting defect reports corresponding to software repositories developed based on deep learning frameworks from the hosting platform of software projects, and adding a label category for each defect report according to information such as the title, textual description and follow-up comments in the report.
The text information of each defect report and its corresponding label category are combined to form sample data.
The step 1 specifically comprises the following steps:
and step 1.1, screening out a mature software system which has high activity and is developed based on a deep learning framework according to the Stars number and the project development time information.
The software system data to be collected mainly comprise information such as the titles and descriptions of defect reports (issues) on the hosting platforms (such as GitHub and Gitee) of software projects.
The defect type is labeled according to the title text of the report and the follow-up replies and submissions of developers and users.
Although GitHub provides an off-the-shelf bug report label that project developers and users can add, it lacks any subdivision into specific bug types: there is only a single "bug" label to mark reports describing software bugs.
Moreover, most defect reports lack labels or are labeled incorrectly, which seriously affects later project maintenance, defect localization and other work, and invisibly creates additional labor and time costs.
There is therefore a need for re-labeling defect reports, in particular reports containing real defects.
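As an illustration of this collection step, the sketch below pulls closed issues from a repository via the public GitHub REST API. It is a minimal sketch, not part of the patent itself: the repository name shown in the usage comment, the page size, and the returned fields are assumptions for demonstration, and real projects may need an authentication token and full pagination.

```python
import requests

def fetch_closed_issues(owner: str, repo: str, per_page: int = 100):
    """Fetch closed issues (defect reports) from one GitHub repository."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    params = {"state": "closed", "per_page": per_page}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    issues = []
    for item in resp.json():
        if "pull_request" in item:      # skip pull requests, keep only issues
            continue
        issues.append({
            "title": item.get("title", ""),
            "body": item.get("body") or "",
            "labels": [lb["name"] for lb in item.get("labels", [])],
        })
    return issues

# Example usage (repository name is illustrative only):
# reports = fetch_closed_issues("tensorflow", "tensorflow")
```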
Step 1.2, performing data filtering and text data preprocessing on the collected software system data.
Step 1.2.1, data filtering: filtering out invalid defect reports submitted to the software repository, including reports whose title or body is empty and defect reports that have not been closed.
Step 1.2.2. Text data preprocessing.
Preprocessing operations are performed on the text data contained in the collected data, including the title and body of each report: word segmentation, stop-word removal, foreign-language word removal, and removal of pictures, links and code.
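A minimal preprocessing sketch consistent with this step is shown below; the regular expressions and the small stop-word list are illustrative assumptions, not the patent's exact rules (a real pipeline would use a fuller stop-word list, e.g. NLTK's).

```python
import re

# Illustrative stop-word list; replace with a full list in practice.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of", "in"}

def preprocess_report_text(text: str) -> list[str]:
    """Clean one defect report's text and return its remaining tokens."""
    text = re.sub(r"`{3}[\s\S]*?`{3}", " ", text)       # drop fenced code blocks
    text = re.sub(r"`[^`]*`", " ", text)                 # drop inline code
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", text)    # drop embedded images
    text = re.sub(r"https?://\S+", " ", text)            # drop links
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # keep letters only (removes foreign-language words)
    tokens = text.lower().split()                         # simple word segmentation
    return [t for t in tokens if t not in STOP_WORDS]     # stop-word removal

# Example:
# preprocess_report_text("Model crashes, see https://example.com ![log](img.png)")
```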
Step 1.3, manually adding a specific defect category corresponding to the report according to the title text and the follow-up comments of the defect report; if the defect report does not contain a corresponding real defect, the added label is of the "other" class.
For defect reports containing real defects, the label class of the defect report will continue to be subdivided according to the content of the defect report.
Specifically, for defect reports containing real defects, the present embodiment classifies the defect reports into Error class, Deployment class, Performance class and Tensors & Inputs class according to defect type, for a total of 4 classes.
Error tags represent defects arising from code-writing problems and API usage.
Deployment tags represent software defects related to installation and hardware deployment.
Performance tags cover problems of low efficiency and poor performance in the software.
Tensors & Inputs tags represent problems caused by erroneous data types, data shapes or data formats.
These four defect type labels for defect reports containing real defects are consistent with current mainstream research, so the specific problem of each defect report can be divided completely and accurately.
Defect reports that do not contain real defects (e.g., reports that only raise requirements, user questions, file version declarations, etc.) are categorized under the "Other" label.
For example, if the report content actually describes a usage problem on the user's side rather than a defect in the software itself, or raises questions about the use and installation of the software, suggestions on its performance, etc., such reports are collectively classified into the "Other" category.
The information in all defect reports is extracted and preprocessed, the title and description data are extracted, and together with the annotated classification labels they form a labeled defect data set for subsequent model training.
Step 1.4. Because only a small proportion of the defect reports submitted to public software project repository platforms contain real defects, the category distribution is imbalanced, which seriously affects the fine-tuning of the pre-training model.
The present application therefore employs data augmentation techniques.
Specifically, an original token is replaced with the '[MASK]' token, the pre-trained BERT model is used for prediction, tokens with higher predicted probability are selected to replace the original token, and the replaced texts are added to the training data set, so that during training the pre-training model can fully extract the text semantic information of each defect class from a small amount of imbalanced data.
As shown in Fig. 2, taking the text "This is very cool" as an example, after the word "very" is masked, it is replaced by "pretty", "really" and "super", giving the following three replacement texts:
"This is pretty cool", "This is really cool", and "This is super cool".
The three replacement texts are added to the training data set to achieve data augmentation, so that during training the pre-training model can fully extract the text semantic information of each defect class from a small amount of imbalanced data.
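A minimal sketch of this MLM-based augmentation is given below, assuming the Hugging Face transformers fill-mask pipeline with the standard bert-base-uncased checkpoint; the model choice and top-k value are illustrative, not prescribed by the patent.

```python
from transformers import pipeline

# Masked-language-model pipeline built on a standard BERT checkpoint (illustrative choice).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def augment_sentence(sentence: str, target_word: str, top_k: int = 3) -> list[str]:
    """Mask one word and let BERT propose high-probability replacements."""
    masked = sentence.replace(target_word, fill_mask.tokenizer.mask_token, 1)
    predictions = fill_mask(masked, top_k=top_k)
    # Each prediction carries the full sentence with the mask filled in.
    return [p["sequence"] for p in predictions]

# Example from the description: masking "very" in "This is very cool"
# may yield "this is pretty cool", "this is really cool", "this is super cool".
augmented = augment_sentence("This is very cool", "very")
```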
Step 2, merging the text data extracted from each sample, inputting the merged text into the pre-training model, and fine-tuning the pre-training model; the fine-tuned model takes the feature vector corresponding to the <cls> identifier at the beginning of the text data as the semantic feature of the input text, and this feature vector is then input into a Softmax layer for normalization.
As shown in Fig. 3, the concatenated issue text is converted, by querying a word-vector table, into embedding vectors of the corresponding dimension (t1 … tn in Fig. 3), which serve as the input to the seBERT model.
seBERT outputs the feature corresponding to the <cls> token (C in Fig. 3) together with the features corresponding to each word in the text (T1 … Tn in Fig. 3), and the extracted issue features are input into the fully connected layer to obtain the probabilities that the issue belongs to the different categories.
The application uses seBERT, which performs excellently in the natural language processing field, as the language model: it is first trained in a self-supervised manner on large-scale unlabeled text and then fine-tuned on the downstream task, and the fine-tuned model can complete various downstream tasks.
Unlike models pre-trained on a generic corpus, the seBERT model is trained from scratch on data from the software engineering domain and has been demonstrated to achieve higher performance and efficiency on some software engineering related downstream tasks.
Step 2.1, selecting seBERT as the pre-training model; the model shows strong performance in the task of report (issue) type prediction, outperforming smaller models. In particular, the corpus used in seBERT's pre-training is drawn from the software engineering domain, so the model can effectively process the defect report text submitted in deep learning software repositories.
Step 2.2, for each closed defect report, merging the extracted text data, namely the title and body of the defect report, as the input of the seBERT model, and fine-tuning the seBERT model so that the fine-tuned model better fits the downstream task of label classification.
By updating the parameters of the original pre-training model, the fine-tuned seBERT model better fits the downstream label classification task, and it outputs the feature vector corresponding to <cls> for the subsequent classification task.
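As a hedged illustration of this step, the sketch below wraps a BERT-compatible encoder (a stand-in for the seBERT checkpoint, whose exact identifier is left as a placeholder) with a linear head over the <cls> feature; the five output labels follow the categories defined in step 1.3, and the use of the transformers AutoModel loader is an assumption of this sketch.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class DefectReportClassifier(nn.Module):
    """Sketch: BERT-style encoder plus a linear classification head over the <cls> feature."""

    def __init__(self, checkpoint: str, num_labels: int = 5):
        super().__init__()
        # `checkpoint` is a placeholder for a BERT-compatible seBERT checkpoint.
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_feature = outputs.last_hidden_state[:, 0, :]   # feature of the leading <cls> token
        return self.classifier(cls_feature)                # logits over the defect categories

# tokenizer = AutoTokenizer.from_pretrained("<sebert-checkpoint>")  # placeholder name
# model = DefectReportClassifier("<sebert-checkpoint>", num_labels=5)
```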
To further address the class imbalance in the training set, a cross-entropy loss function (cross-entropy loss) is used during fine-tuning. In deep learning, the cross-entropy loss is a commonly used loss function for classification problems. It measures the difference between the model's prediction and the actual result and is one of the key quantities used to optimize the model parameters.
The cross-entropy loss function is calculated as follows:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\,\log\left(p_{i,k}\right)$$

where $y_{i,k}$ equals 1 if the true label of the i-th sample is the k-th label value and 0 otherwise, $K$ is the total number of label values, $N$ is the number of samples, and $p_{i,k}$ is the probability that the i-th sample is predicted to be the k-th label value.
During training, the seBERT model is optimized using torch.nn.CrossEntropyLoss as the loss function: the gradients are cleared with optimizer.zero_grad(), the seBERT model output and the loss are computed, the gradients are computed with loss.backward(), and the seBERT model parameters are updated with optimizer.step().
At the end of each epoch, the model is evaluated on the test set to check its generalization ability on new data.
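A minimal fine-tuning loop consistent with the calls named above is sketched here; the optimizer choice, learning rate, batch size and DataLoader construction are illustrative assumptions, and `model` is the DefectReportClassifier sketched earlier.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def fine_tune(model, input_ids, attention_mask, labels, epochs: int = 3, lr: float = 2e-5):
    """Sketch of the fine-tuning loop; tensors hold tokenized reports and integer labels."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    loader = DataLoader(TensorDataset(input_ids, attention_mask, labels),
                        batch_size=16, shuffle=True)
    criterion = nn.CrossEntropyLoss()                      # cross-entropy loss from the description
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        for ids, mask, y in loader:
            ids, mask, y = ids.to(device), mask.to(device), y.to(device)
            optimizer.zero_grad()                          # clear gradients
            logits = model(ids, mask)                      # model output
            loss = criterion(logits, y)                    # compute loss
            loss.backward()                                # compute gradients
            optimizer.step()                               # update parameters
        # At the end of each epoch the model would be evaluated on a held-out set here.
```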
The application adopts a model pre-trained on a software engineering corpus as the training model of the classification task, in order to improve the model's ability to process and learn from the defect report text of deep learning software repositories and to improve its efficiency and classification accuracy.
Step 3, inputting the feature vector output by the fine-tuned pre-training model in step 2 into a fully connected layer, and using a linear layer to map the feature vector containing semantic information to the corresponding category according to its dimension.
And finally, taking the output with the highest probability as the final prediction category of the defect report, thereby completing the classification of each defect report.
The step 3 specifically comprises the following steps:
Normalizing the feature vector output by the pre-training model in step 2 with a Softmax activation function and then feeding it into the fully connected layer for classification; the Softmax activation function is calculated as follows:

$$\mathrm{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

where $x_i$ is the output value of the i-th node in the neural network and the denominator $\sum_{j} e^{x_j}$ is the normalization term, which ensures that all output values of the function sum to 1 and that each value lies in the (0, 1) range, thus constituting a valid probability distribution.
And finally, taking the output with the highest probability as the final prediction category of the defect report, thereby completing the classification of the defect report.
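The final prediction step can be sketched as follows; the label names follow the categories defined in step 1.3, and the trained model is assumed from the earlier sketches.

```python
import torch

LABELS = ["Error", "Deployment", "Performance", "Tensors & Inputs", "Other"]

@torch.no_grad()
def predict_category(model, input_ids, attention_mask) -> str:
    """Return the defect category with the highest Softmax probability."""
    model.eval()
    logits = model(input_ids, attention_mask)          # output of the fully connected layer
    probs = torch.softmax(logits, dim=-1)              # normalize into a probability distribution
    return LABELS[int(probs.argmax(dim=-1)[0])]        # highest-probability category
```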
The foregoing description is, of course, merely illustrative of preferred embodiments of the present application, and it should be understood that the present application is not limited to the above-described embodiments, but is intended to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present application as defined by the appended claims.

Claims (10)

1. The deep learning software defect report classification method based on the seBERT pre-training model is characterized in that,
the method comprises the following steps:
step 1, collecting defect reports corresponding to software repositories developed based on deep learning frameworks, and adding a label category for each defect report according to the title, textual description and follow-up comment information in the report;
forming text information of the defect report and label types corresponding to the defect report into sample data;
step 2, merging the text data extracted from each sample and inputting the text data into a pre-training model, and fine-tuning the pre-training model;
the fine-tuned pre-training model uses the feature vector corresponding to the <cls> identifier at the start of the text data as the semantic feature of the input text, and the feature vector is then input into a Softmax layer for normalization;
step 3, inputting the feature vector output by the fine-tuned pre-training model in step 2 into a fully connected layer, and using a linear layer to map the feature vector containing semantic information to the corresponding category according to its dimension;
and finally, taking the output with the highest probability as the final prediction category of the defect report, thereby completing the classification of the defect report.
2. The deep learning software defect report classification method of claim 1 wherein,
in step 1, for a defect report containing a real defect, the label category of the defect report is further subdivided according to the text content of the defect report; defect reports that do not contain real defects are classified under the "other" label.
3. The deep learning software defect report classification method of claim 1 wherein,
in step 1, the title and body text of the collected closed defect reports are combined with the true classification label corresponding to each defect report to form sample data, which is used as the training data set for the subsequent pre-training model.
4. The deep learning software defect report classification method of claim 2 wherein,
in step 1, for defect reports containing real defects, the defect reports are classified into Error class, Deployment class, Performance class and Tensors & Inputs class according to the defect type;
Error tags represent defects arising from code-writing problems and API usage;
Deployment tags represent defects in software installation and hardware deployment;
Performance tags cover problems of low efficiency and poor performance in the software;
Tensors & Inputs tags represent problems caused by erroneous data types, data shapes or data formats.
5. The deep learning software defect report classification method of claim 1 wherein,
the step 1 specifically comprises the following steps:
step 1.1, screening out a mature software system which has high activity and is developed based on a deep learning framework;
step 1.2, data filtering and text type data preprocessing are carried out on the collected software system data;
step 1.3, manually adding a specific defect category corresponding to the report according to the title text and the follow-up comments of the defect report; if the defect report does not contain a corresponding real defect, the added label is of the "other" class.
6. The deep learning software defect report classification method of claim 5 wherein,
the step 1.2 specifically comprises the following steps:
step 1.2.1, data filtering: filtering out invalid defect reports submitted to the software repository, including reports with a blank title or body and defect reports that have not been closed;
step 1.2.2, text data preprocessing: performing word segmentation, stop-word removal, foreign-language word removal, and removal of pictures, links and code on the text data contained in the collected data.
7. The deep learning software defect report classification method of claim 5 wherein,
after step 1.3, the method further comprises a data augmentation step, namely:
step 1.4, randomly replacing words in the texts labeled with real defect types;
specifically, an original token is replaced with the '[MASK]' token, the pre-trained BERT model is used for prediction, tokens with higher predicted probability are selected to replace the original token, and the replaced text is added to the training data set.
8. The deep learning software defect report classification method of claim 1 wherein,
the step 2 specifically comprises the following steps:
step 2.1, selecting seBERT as the pre-training model;
step 2.2, for each closed defect report, merging the extracted text data, namely the title and body of the defect report, as the input of the seBERT model, and fine-tuning the seBERT model;
by updating the parameters of the original pre-training model, the fine-tuned seBERT model better fits the downstream label classification task, and it outputs the feature vector corresponding to <cls> for the subsequent classification task.
9. The deep learning software defect report classification method of claim 8 wherein,
in step 2.2, a cross-entropy loss function is adopted to optimize the model during fine-tuning of the seBERT model;
the cross-entropy loss function is calculated as follows:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\,\log\left(p_{i,k}\right)$$

where $y_{i,k}$ equals 1 if the true label of the i-th sample is the k-th label value and 0 otherwise, $K$ is the total number of label values, $N$ is the number of samples, and $p_{i,k}$ is the probability that the i-th sample is predicted to be the k-th label value.
10. The deep learning software defect report classification method of claim 1 wherein,
the step 3 specifically comprises the following steps:
normalizing the feature vector output by the pre-training model in step 2 with a Softmax activation function and then feeding it into the fully connected layer for classification; the Softmax activation function is calculated as follows:

$$\mathrm{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

where $x_i$ is the output value of the i-th node in the neural network and the denominator $\sum_{j} e^{x_j}$ is the normalization term, which ensures that all output values of the function sum to 1 and that each value lies in the (0, 1) range, thus constituting a valid probability distribution;
and finally, taking the output with the highest probability as the final prediction category of the defect report, thereby completing the classification of the defect report.
CN202310711807.5A 2023-06-15 2023-06-15 Deep learning software defect report classification method based on seBERT pre-training model Pending CN116775871A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310711807.5A CN116775871A (en) 2023-06-15 2023-06-15 Deep learning software defect report classification method based on seBERT pre-training model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310711807.5A CN116775871A (en) 2023-06-15 2023-06-15 Deep learning software defect report classification method based on seBERT pre-training model

Publications (1)

Publication Number Publication Date
CN116775871A true CN116775871A (en) 2023-09-19

Family

ID=87992401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310711807.5A Pending CN116775871A (en) 2023-06-15 2023-06-15 Deep learning software defect report classification method based on seBERT pre-training model

Country Status (1)

Country Link
CN (1) CN116775871A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180307904A1 (en) * 2017-04-19 2018-10-25 Tata Consultancy Services Limited Systems and methods for classification of software defect reports
CN108804558A (en) * 2018-05-22 2018-11-13 北京航空航天大学 A kind of defect report automatic classification method based on semantic model
CN109492106A (en) * 2018-11-13 2019-03-19 扬州大学 Text code combined automatic classification method for defect reasons
CN111507990A (en) * 2020-04-20 2020-08-07 南京航空航天大学 Tunnel surface defect segmentation method based on deep learning
CN112328469A (en) * 2020-10-22 2021-02-05 南京航空航天大学 Function level defect positioning method based on embedding technology
CN114782967A (en) * 2022-03-21 2022-07-22 南京航空航天大学 Software defect prediction method based on code visualization learning
CN114816497A (en) * 2022-04-18 2022-07-29 南京航空航天大学 Link generation method based on BERT pre-training model
CN115617990A (en) * 2022-09-28 2023-01-17 浙江大学 Electric power equipment defect short text classification method and system based on deep learning algorithm
CN116186506A (en) * 2023-03-13 2023-05-30 南京航空航天大学 Automatic identification method for accessibility problem report based on BERT pre-training model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Alexander Trautsch: "Predicting Issue Types with seBERT", 2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE), pages 1-3 *
Eeshita Biswas et al.: "Achieving Reliable Sentiment Analysis in the Software Engineering Domain using BERT", 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 1-12 *
张楠: "深度学习自然语言处理实战" [Deep Learning Natural Language Processing in Practice], 机械工业出版社 (China Machine Press), pages 165-168 *
田园; 原野; 刘海斌; 满志博; 毛存礼: "基于BERT预训练语言模型的电网设备缺陷文本分类" [Text classification of power grid equipment defects based on the BERT pre-trained language model], 南京理工大学学报 (Journal of Nanjing University of Science and Technology), no. 04 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230919