CN113011180A

CN113011180A - Defect report severity prediction method based on description keyword extraction

Info

Publication number: CN113011180A
Application number: CN202110412776.4A
Authority: CN
Inventors: 陈翔; 贾焱鑫; 成昌姝
Original assignee: Nantong University
Current assignee: Nantong University
Priority date: 2021-04-16
Filing date: 2021-04-16
Publication date: 2021-06-22
Anticipated expiration: 2041-04-16
Also published as: CN113011180B

Abstract

The invention provides a defect report severity prediction method based on description keyword extraction. The method selects the defect abstract, defect description and severity of the defect reports of a software project from a defect tracking system; performs word segmentation, stop-word removal and lemmatization on the defect abstract; performs string replacement, word segmentation, stop-word removal, lemmatization and keyword extraction on the defect description; trains a word vector model for the defect abstract and for the defect description, each based on the severity, to obtain the corresponding vectors; trains a defect report severity prediction model on these vectors using the logistic regression classification method; and uses the model to predict the severity of defect reports in the software project. The beneficial effect of the invention is that the keywords extracted from the defect description supplement the defect abstract, which yields better model prediction performance.

Description

Defect report severity prediction method based on description keyword extraction

Technical Field

The invention relates to the technical field of software quality assurance, in particular to a method for predicting the severity of a defect report based on extraction of description keywords.

Background

With the development of Internet technology, software engineering has developed accordingly. As the number of software projects grows day by day, software projects inevitably contain defects of varying size, and 90% of these defects seriously affect the user experience, so tracking and managing the defects in software projects is particularly important. In a defect tracking system, users submit defect reports to give feedback on the problems they encounter. The severity recorded in a defect report helps the triager assign the report appropriately and helps the developer repair the defect quickly, which reduces the workload of manual assignment.

To address this, defect report severity prediction preprocesses the text content of a defect report and then predicts the value of its severity attribute. At present, the defect abstract is generally used as the training data set for severity prediction; because the defect abstract contains little text, the performance of the severity prediction model is limited.

How to solve the above technical problems is the subject of the present invention.

Disclosure of Invention

At present, defect report severity prediction usually uses the defect abstract as the training data set; the defect abstract contains little text, which limits the performance of the severity prediction model. To solve this problem, the invention provides a defect report severity prediction method based on description keyword extraction, which supplements the defect abstract with the remaining content of the defect report to further improve model performance.

The invention provides a defect report severity prediction method based on description keyword extraction, which comprises the following steps:

(1) selecting, from the defect tracking system in which the project is located, defect reports whose status is CLOSED and FIXED and whose severity is Blocker, Critical, Major, Minor or Trivial; downloading the data of these defect reports, the downloaded fields comprising the defect abstract, defect description and severity of each defect report; and forming a data set from the downloaded fields;

(2) sequentially performing word segmentation, stop-word removal and lemmatization on the text of the defect abstract field in the data set to obtain the corresponding word segmentation set T_s;

(3) using the word segmentation set T_s and the severity of the defect reports in the data set, training an abstract word vector model F_s with the word embedding method FastText, and using this model to represent the defect abstract as a vector, specifically: obtaining the vector of each segmented word in the defect abstract from the abstract word vector model F_s, and summing these vectors to obtain the defect abstract vector E_s;

(4) performing keyword extraction and representation on the defect description field in the data set to obtain the defect description vector E_d;

(5) merging the defect abstract vector E_s and the defect description vector E_d into an input vector E_input;

(6) based on the input vector E_input and the severity of the defect reports in the data set, training a defect report severity prediction model using the logistic regression classification method;

(7) inputting a new defect report, processing its defect abstract as in step (2), processing its defect description as in step (4), merging the two vectors as in step (5), and inputting the merged vector into the defect report severity prediction model obtained in step (6) to obtain the final prediction result;

wherein, in step (4), performing keyword extraction and representation on the defect description field in the data set to obtain the defect description vector E_d specifically comprises the following steps:

1) replacing strings in the defect description field in the data set using regular expressions, comprising: matching content containing a URL and replacing it with the string 'URL', matching console output and replacing it with the string 'console', and matching code fragments and replacing them with the string 'code'; then performing word segmentation, stop-word removal and lemmatization to obtain the corresponding word segmentation set;

2) based on the word segmentation set, extracting the keyword set T_d of the defect description using the keyword extraction method TextRank;

3) based on the keyword set T_d and the corresponding severity in the data set, training a description word vector model F_d with the word embedding method FastText, and using this model to represent the defect description as a vector, specifically: obtaining the vector of each keyword in the defect description from the description word vector model F_d, and summing these vectors to obtain the defect description vector E_d.

Compared with the prior art, the beneficial effects of the embodiments of the invention are as follows: the invention uses keywords extracted from the defect description to supplement the data of the defect abstract, which increases the amount of data available for training and thereby improves the prediction performance of the defect report severity prediction model. The keywords obtained by the keyword extraction method TextRank improve the performance of the model while compressing the information in the defect description; that is, better model performance is achieved with fewer keywords.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.

FIG. 1 is a flowchart illustrating a method for predicting severity of defect reports based on extraction of description keywords according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the invention is described in further detail below with reference to embodiments. The specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.

Example 1

Referring to fig. 1, the present invention provides a method for predicting severity of a defect report based on extraction of description keywords, the method comprising the steps of:

(1) Defect reports whose status is CLOSED and FIXED and whose severity is Blocker, Critical, Major, Minor or Trivial are selected from the defect tracking system in which the project is located; the data of these defect reports are downloaded, the downloaded fields comprising the defect abstract, defect description and severity of each defect report; and a data set is finally formed.

This embodiment takes the Eclipse project in the Bugzilla defect tracking system as the experimental subject and downloads a data set comprising the defect abstract, defect description and severity of each defect report. Only 5 severity levels are considered in the data set; the default option Normal and the option Enhancement, which does not represent a true defect, are removed, because in the field of defect report severity prediction both are regarded as noisy data that do not help in building the prediction model. Of the remaining 5 severity levels, Blocker, Critical and Major are merged into a 'severe' category and Minor and Trivial are merged into a 'non-severe' category, and these two categories are used to train the defect report severity prediction model. The number of defect reports at each severity level is shown in Table 1.

TABLE 1 Number of defect reports at each severity level
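The report selection and label grouping described in step (1) can be sketched as follows. This is a minimal illustration, not the embodiment's code: the dictionary keys (`status`, `resolution`, `summary`, `description`, `severity`) are hypothetical field names, and the sketch assumes FIXED is stored in a separate resolution field, as Bugzilla typically does.

```python
# Severity levels kept by the method; Normal and Enhancement are dropped as noise.
SEVERE = {"blocker", "critical", "major"}
NON_SEVERE = {"minor", "trivial"}


def to_binary_label(severity):
    """Map a severity level to 'severe' / 'non-severe', or None to discard it."""
    level = severity.strip().lower()
    if level in SEVERE:
        return "severe"
    if level in NON_SEVERE:
        return "non-severe"
    return None  # e.g. Normal, Enhancement


def build_dataset(reports):
    """Keep CLOSED/FIXED reports that carry one of the five usable severities.

    `reports` is a hypothetical list of dicts; the key names are assumptions.
    """
    dataset = []
    for report in reports:
        label = to_binary_label(report["severity"])
        if report["status"] == "CLOSED" and report["resolution"] == "FIXED" and label:
            dataset.append((report["summary"], report["description"], label))
    return dataset
```

Each retained item is a (defect abstract, defect description, label) triple, which is the shape the later steps consume.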

(2) The text of the defect abstract field in the data set is sequentially subjected to word segmentation, stop-word removal and lemmatization to obtain the corresponding word segmentation set T_s.

Defect report severity prediction can be modeled as a text classification problem, which first requires word segmentation, stop-word removal and lemmatization of the text content to obtain the corresponding word segmentation set T_s.

Wherein,

Word segmentation: a defect report is split into a sequence of words, and content that is not a word, such as punctuation marks, is removed; this is the preliminary processing of the raw text data.

Stop-word removal: words that appear frequently but carry little practical meaning, such as 'the', 'is', 'at', 'which' and 'on', are removed; removing stop words can improve model performance and reduce text size.

Lemmatization: words in their various inflected forms are restored to their base form. In real text data a word appears in different tenses and forms, yet its meaning is similar across them, for example "make", "makes" and "making". Keeping the different forms increases the redundancy of the text information and thereby degrades model performance, so the words must be lemmatized.
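The three preprocessing operations above can be illustrated with a small sketch. The stop-word list and the lemma table here are tiny hand-written stand-ins (assumptions for illustration only); the source does not name the resources or library used, and a real pipeline would rely on a full stop-word list and a proper lemmatizer.

```python
import re

# Tiny illustrative resources (assumptions, not those used in the embodiment).
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "in", "to", "of"}
LEMMAS = {
    "makes": "make", "making": "make", "made": "make",
    "crashes": "crash", "crashed": "crash", "crashing": "crash",
}


def preprocess(text):
    """Segment text into words, drop stop words, then map words to base forms."""
    tokens = re.findall(r"[a-z]+", text.lower())           # word segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]              # lemmatization


print(preprocess("The editor crashes on making a new file."))
# ['editor', 'crash', 'make', 'new', 'file']
```

The regular expression keeps only alphabetic runs, so punctuation and digits are discarded during segmentation in this sketch.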

Table 2 shows the word segmentation set T_s obtained from a defect abstract through the above steps.

TABLE 2 Word segmentation, stop-word removal and lemmatization of a defect abstract

(3) Using the word segmentation set T_s and the severity of the defect reports in the data set, an abstract word vector model F_s is trained with the word embedding method FastText, and the defect abstract is represented as a vector using this model, specifically: the vector of each segmented word in the defect abstract is obtained from the abstract word vector model F_s, and these vectors are summed to obtain the defect abstract vector E_s.

The word embedding method FastText is a word vector representation algorithm that incorporates character-string (subword) information and can therefore capture the morphological features of words. FastText uses a hierarchical Softmax classifier and builds a tree structure representing the categories with the Huffman tree algorithm, which reduces the number of prediction targets and the computational complexity. In this embodiment, the parameters of FastText follow the settings of the original paper; specifically, the n-gram window size is 3 and the word vector dimension is 10.

Based on the word segmentation set T_s obtained in step (2) and the severity of the defect reports in the data set, the abstract word vector model F_s is trained. Once F_s has been obtained, the segmented words of a defect abstract are input into the model to obtain their vector representations, and the vectors of all segmented words in the defect abstract are summed to obtain the defect abstract vector E_s.
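The summation that produces E_s can be sketched as below. To stay self-contained, the sketch replaces the trained FastText model F_s with a plain dictionary `word_vectors` (a hypothetical stand-in with made-up values); words missing from the dictionary are simply skipped here, which is a simplification of this sketch rather than a statement about FastText.

```python
DIM = 10  # word vector dimension used in the embodiment


def sum_vectors(tokens, word_vectors, dim=DIM):
    """Element-wise sum of the vectors of all tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # simplification: no vector available in this toy lookup
        total = [a + b for a, b in zip(total, vec)]
    return total


# Made-up vectors standing in for the output of F_s.
word_vectors = {"editor": [0.5] * DIM, "crash": [1.0] * DIM}
e_s = sum_vectors(["editor", "crash"], word_vectors)
print(len(e_s))  # 10
```

The same function applies unchanged to the keyword set T_d and the model F_d when building E_d in step (4).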

(4) Keyword extraction and representation are performed on the defect description field in the data set to obtain the defect description vector E_d.

The defect description field contains more data than the defect abstract. However, a model that predicts severity from the defect description performs slightly worse than one that uses the defect abstract, because noise in the defect description data interferes with the final performance of the model. Therefore, the keyword extraction method TextRank is used to extract keywords from the defect description as supplementary data. The main steps are as follows:

1) Strings in the defect description field in the data set are replaced using regular expressions: content containing a URL is matched and replaced with the string 'URL', console output is matched and replaced with the string 'console', and code fragments are matched and replaced with the string 'code'. Word segmentation, stop-word removal and lemmatization are then performed to obtain the corresponding word segmentation set.

2) Based on the word segmentation set, the keyword set T_d of the defect description is extracted using the keyword extraction method TextRank. TextRank is a graph-based algorithm for representing the relations within a given text. It first builds a vocabulary of segmented words and a graph model from them, then scores the segmented words by running the PageRank algorithm on the graph; the top-k segmented words are taken as keywords. In this way the relations among the segmented words are better captured.

3) Based on the keyword set T_d and the corresponding severity in the data set, a description word vector model F_d is trained with the word embedding method FastText, and the defect description is represented as a vector using this model, specifically: the vector of each keyword in the defect description is obtained from the description word vector model F_d, and these vectors are summed to obtain the defect description vector E_d.

Word segmentation, stop-word removal and lemmatization are the same as in step (2), and the word vector model F_d and the defect description vector E_d are obtained in the same way as in step (3). Table 3 shows how a defect description is processed to obtain the keyword set T_d.
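Sub-steps 1) and 2) can be sketched as follows. The regular expressions are illustrative guesses, since the source does not list the actual patterns; the co-occurrence window, the damping factor 0.85 and the iteration count are likewise assumptions, and this compact PageRank loop is a simplified stand-in for a full TextRank implementation.

```python
import re
from collections import defaultdict

# Illustrative patterns (assumptions): the source does not give the exact regexes.
URL_RE = re.compile(r"https?://\S+")
CONSOLE_RE = re.compile(r"^[ \t]*at[ \t]+[\w.$]+\(.*\)[ \t]*$", re.MULTILINE)  # stack-trace lines
CODE_RE = re.compile(r"^.*[;{}][ \t]*$", re.MULTILINE)                         # code-like lines


def replace_strings(description):
    """Replace URLs, console output and code fragments with placeholder strings."""
    text = URL_RE.sub("URL", description)
    text = CONSOLE_RE.sub("console", text)
    return CODE_RE.sub("code", text)


def textrank_keywords(tokens, top_k=5, window=2, damping=0.85, iterations=50):
    """Score tokens with PageRank on an undirected co-occurrence graph; return top-k."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:   # words co-occurring within the window
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:                            # no edges: fall back to the distinct tokens
        return sorted(set(tokens))[:top_k]
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    ranked = sorted(scores, key=lambda w: (-scores[w], w))  # ties broken alphabetically
    return ranked[:top_k]
```

In a full pipeline the output of `replace_strings` would first pass through word segmentation, stop-word removal and lemmatization, and the resulting token list would be given to `textrank_keywords` to obtain T_d.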

TABLE 3 Defect description processing

(5) The defect abstract vector E_s and the defect description vector E_d are merged into the input vector E_input.

The defect abstract vector E_s obtained in step (3) and the defect description vector E_d obtained in step (4) are concatenated. Since the word vector dimension in this embodiment is 10, the concatenated vector has length 20.
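As a small worked illustration of the merge, concatenating two 10-dimensional vectors gives a 20-dimensional input. The function name and the vector values below are made up for the example.

```python
def build_input_vector(e_s, e_d):
    """Concatenate the defect abstract vector E_s and the description vector E_d."""
    return list(e_s) + list(e_d)


e_s = [0.1] * 10   # made-up values for illustration
e_d = [0.2] * 10
e_input = build_input_vector(e_s, e_d)
print(len(e_input))  # 20
```

Concatenation, unlike summation, keeps the abstract and description components in separate positions, so the classifier can weight them independently.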

(6) Based on the input vector E_input and the severity of the defect reports in the data set, a defect report severity prediction model is trained using the logistic regression classification method.

Logistic regression is a classical classification algorithm in statistics and can be used for binary or multi-class classification problems.

In a multi-class classification problem with K classes, the probability that the input x belongs to class k is

$$P(y = k \mid x) = \frac{\exp(W_k x + b_k)}{\sum_{j=1}^{K} \exp(W_j x + b_j)}$$

In the above formula, x represents the input vector and y represents the classification result, which takes one of K classes. W_k and b_k are the parameters of the logistic regression method for class k. In this experiment K is 2, namely the two categories 'severe' and 'non-severe'.

Using the logistic regression classification method, the defect report severity prediction model is trained on the input vector E_input obtained in step (5) and the severity of the defect reports in the data set.
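As an illustration of how such a classifier can be fitted, the following minimal sketch implements multinomial (softmax) logistic regression with per-class parameters W_k and b_k in plain Python. The stochastic gradient descent loop, the learning rate, the epoch count and the two-dimensional toy inputs are assumptions for illustration; the source does not state how the model is optimized, and in the embodiment the input would be the 20-dimensional E_input.

```python
import math


def softmax_probs(x, weights, biases):
    """Return P(y = k | x) for every class k from per-class weights and biases."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(samples, labels, n_classes=2, lr=0.1, epochs=200):
    """Fit W_k and b_k by stochastic gradient descent on the cross-entropy loss."""
    dim = len(samples[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            probs = softmax_probs(x, weights, biases)
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)   # d(loss) / d(logit_k)
                weights[k] = [w - lr * grad * x_i for w, x_i in zip(weights[k], x)]
                biases[k] -= lr * grad
    return weights, biases


def predict(x, weights, biases):
    """Return the index of the most probable class."""
    probs = softmax_probs(x, weights, biases)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy data: class 1 stands for 'severe', class 0 for 'non-severe' (made-up inputs).
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]
weights, biases = train(samples, labels)
```

With two classes this reduces to ordinary binary logistic regression; the softmax form is used only so that the same code covers the multi-class case.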

(7) A new defect report is input; its defect abstract is processed as in step (2), its defect description is processed as in step (4), the two vectors are merged as in step (5), and the merged vector is input into the defect report severity prediction model obtained in step (6) to obtain the final prediction result.

When a new defect report is input, the invention first applies step (2) to the defect abstract, performing word segmentation, stop-word removal and lemmatization to obtain the corresponding word segmentation set T_s, and uses the abstract word vector model F_s of step (3) to represent the defect abstract as a vector. It then applies step (4) to the defect description: strings are replaced, word segmentation, stop-word removal and lemmatization are performed to obtain the corresponding word segmentation set, the keyword extraction method TextRank extracts the keyword set T_d of the defect description, and the description word vector model F_d of step (4) represents the defect description as a vector. Finally, the two vectors are merged as in step (5) and input into the defect report severity prediction model obtained in step (6) to obtain the final prediction result.

In this embodiment, experiments were performed on the data set of the Eclipse project. The data set is first partitioned, in chronological order, into a training set and a test set at a ratio of 8:2. The performance of the defect report severity prediction model is evaluated with three common metrics: F-Measure, Precision and Recall. The calculation formulas are as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

TABLE 4 Confusion matrix

                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN

The confusion matrix is shown in Table 4, where TP + FP + TN + FN is the total number of samples. F-Measure is the harmonic mean of Recall and Precision; the higher the F-Measure, the more effective the method.
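The evaluation protocol can be sketched as below: a chronological 8:2 split followed by the three metrics computed from the confusion-matrix counts. The function names are hypothetical, and the split assumes the reports are already sorted from oldest to newest, which the source implies but does not spell out.

```python
def chronological_split(reports, train_ratio=0.8):
    """Split time-ordered reports into an older training part and a newer test part."""
    cut = int(len(reports) * train_ratio)
    return reports[:cut], reports[cut:]


def precision_recall_f(tp, fp, fn):
    """Compute Precision, Recall and F-Measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


train_part, test_part = chronological_split(list(range(10)))
print(len(train_part), len(test_part))  # 8 2
```

Splitting by time rather than at random keeps every test report newer than the training reports, which mirrors how the model would be used on incoming defect reports.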

On the Eclipse data set, the experimental results of the invention and of two baselines, the naive Bayes method and the k nearest neighbor method, are shown in Table 5.

TABLE 5 Results of the experiment

Method                       F-Measure    Precision    Recall
Naive Bayes method           0.660        0.652        0.681
k nearest neighbor method    0.656        0.674        0.646
The invention                0.704        0.738        0.687

According to the experimental results, the method achieves a better result on all 3 evaluation metrics. In addition, extracting keywords from the defect description to supplement the defect abstract adds less data than using the full defect description while yielding better model performance.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A defect report severity prediction method based on description keyword extraction is characterized by comprising the following steps:

S1, selecting, from the defect tracking system in which the project is located, defect reports whose status is CLOSED and FIXED and whose severity is Blocker, Critical, Major, Minor or Trivial, downloading the data of the defect reports, wherein the downloaded fields comprise the defect abstract, defect description and severity of each defect report, and forming a data set based on the data;

S2, sequentially performing word segmentation, stop-word removal and lemmatization on the text of the defect abstract field in the data set to obtain the corresponding word segmentation set T_s;

S3, using the word segmentation set T_s of step S2 and the severity of the defect reports in the data set, training an abstract word vector model F_s with the word embedding method FastText, and using this model to represent the defect abstract as a vector, specifically: obtaining the vector of each segmented word in the defect abstract from the abstract word vector model F_s, and summing these vectors to obtain the defect abstract vector E_s;

S4, performing keyword extraction and representation on the defect description field in the data set to obtain the defect description vector E_d;

S5, merging the defect abstract vector E_s and the defect description vector E_d into an input vector E_input;

S6, based on the input vector E_input and the severity of the defect reports in the data set, training a defect report severity prediction model using the logistic regression classification method;

S7, inputting a new defect report, processing its defect abstract as in step S2, processing its defect description as in step S4, merging the two vectors as in step S5, and inputting the merged vector into the defect report severity prediction model obtained in step S6 to obtain the final prediction result.

2. The defect report severity prediction method based on description keyword extraction as claimed in claim 1, wherein in step S4, performing keyword extraction and representation on the defect description field in the data set to obtain the defect description vector E_d specifically comprises the following steps:

S401, replacing strings in the defect description field in the data set using regular expressions, comprising: matching content containing a URL and replacing it with the string 'URL', matching console output and replacing it with the string 'console', and matching code fragments and replacing them with the string 'code'; then performing word segmentation, stop-word removal and lemmatization to obtain the corresponding word segmentation set;

S402, based on the word segmentation set, extracting the keyword set T_d of the defect description using the keyword extraction method TextRank;

S403, based on the keyword set T_d and the corresponding severity in the data set, training a description word vector model F_d with the word embedding method FastText, and using this model to represent the defect description as a vector, specifically: obtaining the vector of each keyword in the defect description from the description word vector model F_d, and summing these vectors to obtain the defect description vector E_d.