CN109656808B - Software defect prediction method based on hybrid active learning strategy - Google Patents


Publication number
CN109656808B
CN109656808B (application CN201811319619.3A)
Authority
CN
China
Prior art keywords
sample
data
entropy
data set
information entropy
Prior art date
Legal status
Active
Application number
CN201811319619.3A
Other languages
Chinese (zh)
Other versions
CN109656808A (en
Inventor
曲豫宾 (Qu Yubin)
李芳 (Li Fang)
Current Assignee
Nantong Textile Vocational Technology College
Original Assignee
Nantong Textile Vocational Technology College
Priority date
Filing date
Publication date
Application filed by Nantong Textile Vocational Technology College filed Critical Nantong Textile Vocational Technology College
Priority to CN201811319619.3A priority Critical patent/CN109656808B/en
Publication of CN109656808A publication Critical patent/CN109656808A/en
Application granted granted Critical
Publication of CN109656808B publication Critical patent/CN109656808B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3604Software analysis for verifying properties of programs
    • G06F11/3608Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Abstract

The invention discloses a software defect prediction method based on a hybrid active learning strategy. The method combines cost-sensitive information entropy with relative entropy in a collaborative active learning scheme: ordinary information entropy serves as the quality measure for candidate samples, samples with high information entropy are labeled manually, and samples with low information entropy are further screened using relative entropy, so that the labeled data set is expanded more effectively. Experiments show that the method improves software defect prediction performance while reducing manual labeling cost.

Description

Software defect prediction method based on hybrid active learning strategy
Technical Field
The invention relates to the technical field of active learning, in particular to a software defect prediction method based on a hybrid active learning strategy.
Background
Defective software modules can cause operational failures in production, inflicting heavy losses on an enterprise and reducing customer satisfaction. Software defect prediction models aim to discover defective modules as early as possible during development; common models include supervised and unsupervised ones.
If a software project has abundant historical labeled data, a supervised machine learning model can be built for within-project defect prediction (WPDP), estimating the probability that a module is defective or the number of defects in a module. In practice, however, if a project is brand new or has little training data, an enterprise must invest considerable time in labeling defective modules. Labeling is also highly specialized work that requires experienced personnel, so building a defect prediction model consumes substantial time and labor and raises development cost.
Active learning addresses the labeling problem with a variety of query strategies: faced with a large pool of unlabeled modules, an enterprise can actively select certain samples for labeling, add them to the labeled data set once manual labeling is complete, and quickly build a software defect prediction model. A selection strategy picks high-quality samples from the defect prediction data set; after manual labeling these samples expand the training set, and other machine learning techniques such as dimensionality reduction and feature selection can be combined to further improve prediction performance.
Commonly used selection strategies include uncertainty-based information entropy. Existing work, however, pays little attention to low-entropy samples, i.e., samples the model is already confident about: they are usually discarded within a single active learning query round, and their information is rarely exploited.
Patent CN201710271035.2 discloses a multi-label active learning method based on a condition-dependent label set, which screens high-information samples as active learning targets by combining sample information entropy and relative entropy. Although that method also lets information entropy and relative entropy work together, it computes relative entropy inside the information entropy processing stage, which hurts the efficiency and effectiveness of the system; moreover, low-entropy samples are still not well exploited.
Disclosure of Invention
In order to solve the problems of high manual labeling cost and low prediction performance, the invention provides a software defect prediction method based on a hybrid active learning strategy.
A software defect prediction method based on a hybrid active learning strategy, characterized in that it adopts a collaborative active learning method combining cost-sensitive information entropy with relative entropy, abbreviated as the UNCERTAINTYKL model. The UNCERTAINTYKL model uses information entropy as the quality measure for samples: samples with high information entropy are selected from the unlabeled sample data for manual labeling, while samples with low information entropy are further analyzed using relative entropy, thereby further expanding the labeled data set.
Preferably, the UNCERTAINTYKL model comprises the following steps:
Step 1: calculate the information entropy of each unlabeled sample via the information entropy formula;
Step 2: select the data sample with the highest information entropy from the unlabeled data via formula (1), submit it to a domain expert for manual labeling, and add it to the labeled data set after labeling;
Step 3: from the samples remaining after step 2, screen the unlabeled sample with the lowest information entropy and evaluate it via relative entropy;
Step 4: preset a relative entropy threshold; if the relative entropy is below the threshold, add the sample to the labeled data set, using the predicted label as its pseudo-label; if the relative entropy is above the threshold, discard the sample.
Preferably, the calculation method of the data sample with the highest information entropy in step 2 is as follows:
x_{u,max} = argmax_x(−Σ_i P_θ(y_i|x) log P_θ(y_i|x))    (1)
where i indexes the unlabeled examples (i = 1, 2, …, u), y_i denotes the label value of the example to be classified, x_{u,max} denotes the data sample with maximum information entropy in the unlabeled data set obtained from formula (1), and P_θ(y_i|x) denotes the predicted probability that x belongs to class y_i under the distribution of the labeled data set.
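As a minimal illustration (not part of the patent's implementation), the selection rule of formula (1) can be sketched in Python with numpy; the function name and the toy probability values below are invented for the example:

```python
import numpy as np

def entropy_query(proba):
    """Pick the index of the unlabeled sample with maximum predictive
    entropy, per formula (1). proba has shape (n_samples, n_classes),
    with row i holding P_theta(y|x) for unlabeled sample i."""
    eps = 1e-12  # avoid log(0) for fully confident predictions
    H = -np.sum(proba * np.log(proba + eps), axis=1)  # per-sample entropy
    return int(np.argmax(H)), H

# toy predictions: the first sample is maximally uncertain
probs = np.array([[0.5, 0.5],
                  [0.9, 0.1],
                  [0.99, 0.01]])
idx, H = entropy_query(probs)  # idx == 0: the 50/50 sample is queried
```

The entropy is highest for the uniform prediction and shrinks as the classifier becomes more certain, which is exactly the ordering the query strategy relies on.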
Preferably, the relative entropy calculation method in step 3 includes the following formula:
P_C(y_i|x) = (1/C) Σ_{c=1}^{C} P_{θ_c}(y_i|x)    (2)
D(P_{θ_c} ‖ P_C) = Σ_i P_{θ_c}(y_i|x) log(P_{θ_c}(y_i|x) / P_C(y_i|x))    (3)
KLD = (1/C) Σ_{c=1}^{C} D(P_{θ_c} ‖ P_C)    (4)
where KLD denotes the mean of the relative entropies computed over all classification models, x_{u,min} denotes the data sample with minimum information entropy in the unlabeled data set obtained from formula (1), and C denotes the number of classifiers on the query committee, whose data set is the dynamically updated D_l; the classification committee C = {θ_1, …, θ_m}, whose classifier members embody different classification strategies, each able to predict the current label for unlabeled data; P_C(y_i|x) denotes the mean of the committee classifiers' probabilities for label y_i, and D(P_{θ_c} ‖ P_C) denotes the relative entropy of classification model θ_c with respect to the consensus of the other models.
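The committee disagreement measure of formulas (2)-(4) can be sketched as follows (an illustrative mockup, not the patent's code; the function name and the toy committee outputs are invented):

```python
import numpy as np

def mean_kld(committee_proba):
    """Mean relative entropy of each committee member's prediction from
    the consensus distribution for one sample, per formulas (2)-(4).
    committee_proba: shape (C, n_classes), one row per classifier."""
    eps = 1e-12
    consensus = committee_proba.mean(axis=0)  # formula (2): P_C(y_i|x)
    # formula (3): D(P_theta_c || P_C) for each committee member
    kld = np.sum(committee_proba * np.log((committee_proba + eps) /
                                          (consensus + eps)), axis=1)
    return float(kld.mean())  # formula (4): mean KLD over the committee

THRESHOLD = 0.1  # empirical threshold from step 4
agree = np.array([[0.8, 0.2], [0.8, 0.2]])     # members agree -> KLD near 0
disagree = np.array([[0.9, 0.1], [0.2, 0.8]])  # members disagree -> large KLD
should_pseudo_label = mean_kld(agree) < THRESHOLD  # True
```

When the committee agrees, the mean KLD falls below the 0.1 threshold and the low-entropy sample would be pseudo-labeled; strong disagreement pushes the KLD above the threshold and the sample is discarded.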
Preferably, the threshold in step 4 is set to the empirical value 0.1; if the mean KLD value satisfies the threshold range, θ_i is used to pseudo-label the sample x_{u,min}.
Preferably, to solve the model, the following staged optimization strategy is adopted; the optimization process is as follows:
A. System initialization: before the system starts running, a portion of samples is drawn at random from the sample pool and handed to domain experts for manual labeling; this set is denoted D_l; the D_l data set is used to train the classification model θ_1, which serves as the basis for subsequent classification of unlabeled data;
B. Active selection of unlabeled samples: use the classification model θ_1 to predict each unlabeled sample, compute each sample's information entropy via the formula, sort the samples, take out the sample x_{u,max} with maximum entropy, have a domain expert label it manually, and add x_{u,max} to the labeled data set D_l;
C. Pseudo-labeling of the most certain sample: take out the sample x_{u,min} with minimum entropy, compute its relative entropy, i.e. KLD, via the formulas, compare the KLD with the threshold, and if the threshold is satisfied, pseudo-label x_{u,min} and add it to the labeled data set D_l;
D. Classification model update: retrain the classification model θ_1 with the labeled data set D_l, and loop until the termination condition is satisfied.
Advantageous effects:
1. Information entropy and relative entropy are fused to work collaboratively in a hybrid active learning query strategy: low-entropy samples are further analyzed using relative entropy, so the information contained in the samples is exploited more fully, improving software defect prediction performance and discovering defective modules faster.
2. With the hybrid active learning query strategy, an enterprise only needs to invest relatively little labeling effort up front to obtain good software defect prediction capability, so cost can be controlled while meeting the enterprise's requirements, and labor is saved.
Drawings
FIG. 1 is a schematic diagram of the algorithm flow of the software defect prediction method with the hybrid active learning strategy according to the present invention
FIG. 2 is a schematic diagram of AUC indexes of the Equinox data set under different active learning query strategies
FIG. 3 is a schematic diagram of AUC indexes of the Eclipse JDT Core data set under different active learning query strategies
FIG. 4 is a schematic diagram of AUC indexes of the Apache Lucene data set under different active learning query strategies
FIG. 5 is a schematic diagram of AUC indexes of the Mylyn data set under different active learning query strategies
FIG. 6 is a schematic diagram of AUC indexes of the Eclipse PDE UI data set under different active learning query strategies
Detailed Description
In this section we describe in detail the cost-sensitive collaborative active learning strategy based on information entropy and relative entropy, abbreviated as the UNCERTAINTYKL model. The learning strategy is applied to the AEEEM data set, a common benchmark in software defect prediction; with the same amount of labeled data, the classification model trained incrementally with the improved selection strategy achieves better classification indexes.
UNCERTAINTYKL active learning strategy
The UNCERTAINTYKL model integrates the idea of co-training into the active learning query strategy and builds the model while minimizing the number of labels required. Following the uncertainty (information entropy) strategy, the sample with the highest information entropy is selected from the unlabeled data and labeled by a domain expert, while the sample with the lowest information entropy is voted on by the query committee according to KLD: if the relative entropy computed by the committee is low (the current empirical threshold is 0.1), the sample is added to the labeled data set with the predicted label as its pseudo-label. Specifically:
x_{u,max} = argmax_x(−Σ_i P_θ(y_i|x) log P_θ(y_i|x))    (1)
Define the data set containing l labeled examples as D_l = {x_1, …, x_l}, and the data set containing u unlabeled examples as D_u = {x_{l+1}, …, x_{l+u}}. Here i denotes the i-th unlabeled example (i = 1, 2, …, u); x_{u,max} denotes the data sample with maximum entropy in the unlabeled data set obtained from (1); P_θ(y_i|x) denotes the predicted probability that x_{u,max} belongs to class y_i under the distribution of the labeled data set; and argmax(−Σ_i P_θ(y_i|x) log P_θ(y_i|x)) selects the sample with maximum entropy from the unlabeled data.
P_C(y_i|x) = (1/C) Σ_{c=1}^{C} P_{θ_c}(y_i|x)    (2)
D(P_{θ_c} ‖ P_C) = Σ_i P_{θ_c}(y_i|x) log(P_{θ_c}(y_i|x) / P_C(y_i|x))    (3)
KLD = (1/C) Σ_{c=1}^{C} D(P_{θ_c} ‖ P_C)    (4)
x_{u,min} denotes the data sample with minimum information entropy in the unlabeled data set obtained from (1); C denotes the number of classifiers on the query committee, whose data set is the dynamically updated D_l. The classification committee C = {θ_1, …, θ_m}; its classifier members embody different classification strategies, each able to predict the current label for unlabeled data. P_C(y_i|x) denotes the mean of the committee classifiers' probabilities for label y_i. D(P_{θ_c} ‖ P_C) denotes the relative entropy of classification model θ_c with respect to the consensus, and KLD denotes the mean of the relative entropies computed over all classification models. If the relative entropy is less than the threshold, θ_i is used to pseudo-label the example to be labeled.
The above algorithm is specified as follows. A staged optimization strategy is used to solve the model; the optimization process is as follows:
A. System initialization
Before the system starts running, a portion of samples is drawn at random from the sample pool and handed to domain experts for manual labeling; this set is denoted D_l. The D_l data set is used to train the classification model θ_1, which serves as the basis for subsequent classification of unlabeled data.
B. Active selection of unlabeled samples
Use the classification model θ_1 to predict each unlabeled sample, compute each sample's information entropy via formula (1), sort the samples, take out the sample x_{u,max} with maximum entropy, have a domain expert label it manually, and add x_{u,max} to the labeled data set D_l.
C. Pseudo-labeling of the most certain sample
Take out the sample x_{u,min} with minimum entropy, compute its relative entropy, i.e. KLD, via formulas (2), (3), and (4), and compare the KLD with the threshold; if the threshold is satisfied, pseudo-label x_{u,min} and add it to the labeled data set D_l.
D. Classification model update
Retrain the classification model θ_1 with the labeled data set D_l, and loop until the termination condition is satisfied.
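The A-D loop above can be sketched as follows. This is an illustrative Python/scikit-learn mockup under stated assumptions, not the patent's implementation: logistic regression stands in for the patent's classifiers, a second differently regularized model stands in for the query committee, and the true label plays the role of the domain expert.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
y_train = y.astype(int).copy()  # training labels (may later hold pseudo-labels)

# A. initialization: random initial labeled pool, train theta_1
labeled = rng.choice(len(X), size=60, replace=False)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
clf = LogisticRegression(max_iter=1000).fit(X[labeled], y_train[labeled])

THRESHOLD, eps = 0.1, 1e-12

for _ in range(20):                                  # fixed query budget
    if len(unlabeled) < 2:
        break
    proba = clf.predict_proba(X[unlabeled])
    H = -np.sum(proba * np.log(proba + eps), axis=1)

    # B. most uncertain sample -> "expert" labeling (true label stands in)
    q = unlabeled[np.argmax(H)]
    labeled = np.append(labeled, q)
    unlabeled = unlabeled[unlabeled != q]

    # C. most certain sample -> pseudo-label if committee KLD < threshold
    clf2 = LogisticRegression(C=0.1, max_iter=1000).fit(X[labeled], y_train[labeled])
    proba = clf.predict_proba(X[unlabeled])
    H = -np.sum(proba * np.log(proba + eps), axis=1)
    c = unlabeled[np.argmin(H)]
    p1, p2 = clf.predict_proba(X[[c]])[0], clf2.predict_proba(X[[c]])[0]
    consensus = (p1 + p2) / 2
    kld = np.mean([np.sum(p * np.log((p + eps) / (consensus + eps)))
                   for p in (p1, p2)])
    if kld < THRESHOLD:
        y_train[c] = clf.predict(X[[c]])[0]          # pseudo-label, not the true label
        labeled = np.append(labeled, c)
        unlabeled = unlabeled[unlabeled != c]

    # D. retrain theta_1 on the enlarged labeled set
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y_train[labeled])
```

Each pass through the loop grows the labeled pool by one expert-labeled sample and, when the committee agrees, one pseudo-labeled sample, mirroring steps B-D.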
The whole procedure can be summarized as Algorithm 1.
Algorithm 1: solving the UNCERTAINTYKL model with the staged strategy
1: Input: initial labeled data set D_l = {x_1, …, x_l}; unlabeled data set D_u = {x_{l+1}, …, x_{l+u}}; labels y_1, …, y_l; maximum number of iterations U_max; KLD threshold threshold
2: Output: set of classification committee members {θ_1, …, θ_m}; set of classification performance values
3: initialize the query committee's classification models with the labeled data
4: while current iteration count < maximum iteration count || not converged do
5:   for i <- 1 to U_max do
6:     take out committee member θ_i and train the model on the labeled data set
7:     for the current x^(i), compute the classification probability P and the corresponding information entropy −Σ_i P_θ(y_i|x) log P_θ(y_i|x)
8:     have a domain expert manually label the maximum entropy sample x^(i) and add it to the labeled data set
9:     for the minimum entropy sample x^(t), compute the relative entropy, i.e. the mean KLD value, via formulas (2), (3), and (4)
10:    D_l = D_l ∪ x^(i); D_u = D_u \ x^(i)
11:    if KLD < threshold: D_l = D_l ∪ x^(t); D_u = D_u \ x^(t)
12:    i = i + 1
13:  end for
14: end while
15: return the committee classifier set and its classification performance
Design of experiments
Evaluation object
The influence of the UNCERTAINTYKL query strategy on active learning in the field of software defect prediction is analyzed and evaluated on the public AEEEM data set, which is widely used as a benchmark for performance comparison in this field. The data set provides 61 metrics, including software development process metrics; all 61 metrics were used for classifier modeling in this experiment. Summary information for the AEEEM data set is shown in Table 1.
Table 1 summary of AEEEM data set
Experimental setup
The experiments use 5×2 cross-validation: in each run the data are randomly stratified-sampled, with half used as training data and half as test data, preventing overlap between training and test data from making the evaluation results non-independent. A fixed proportion of the training data is labeled manually; in the experiments the initial labeled proportion is 30%. A classification model is trained on the initial labeled set, and the remaining 70% serve as unlabeled data selected according to the active learning strategy. The classifier in the experiments is a support vector machine (SVM) trained with an RBF kernel as implemented by libsvm, using default parameters. The UNCERTAINTYKL query strategy uses a random forest classifier (sklearn's RandomForestClassifier) trained with default parameters. In the UNCERTAINTYKL strategy, the unlabeled data are iterated over; in each iteration the sample with the highest uncertainty is selected for manual labeling, while the sample with the lowest uncertainty is further judged using KLD, with the judgment threshold set to 0.1.
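The 5×2 cross-validation protocol described above can be reproduced in outline with scikit-learn. This is a sketch under stated assumptions: synthetic data stands in for an AEEEM project, and the patent's libsvm setup is approximated by sklearn's SVC.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification

# synthetic stand-in for an AEEEM project: 61 metrics per module
X, y = make_classification(n_samples=300, n_features=61, random_state=0)

aucs = []
for rep in range(5):                  # 5 x 2 cross-validation
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=rep)
    for train, test in skf.split(X, y):   # each half serves once as train, once as test
        clf = SVC(kernel="rbf").fit(X[train], y[train])  # RBF-kernel SVM, defaults
        aucs.append(roc_auc_score(y[test], clf.decision_function(X[test])))
```

Five repetitions of a stratified two-fold split yield ten AUC estimates whose mean and standard deviation can be reported as in Table 3.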
Evaluation index
The AEEEM data set suffers from class imbalance, so the AUC (area under the ROC curve) index better reflects the performance of an active learning query strategy; it is one of the most widely used indexes in software defect prediction. The index is based on the ROC (receiver operating characteristic) curve. The confusion matrix of the binary classification model in software defect prediction is shown in Table 2.
TABLE 2 confusion matrix
                          predicted defective    predicted non-defective
actually defective               TP                        FN
actually non-defective           FP                        TN
The ROC curve takes the false positive rate (FPR) as the X-axis and the true positive rate (TPR) as the Y-axis. TPR: the proportion of modules correctly judged defective among all samples that are actually defective. FPR: the proportion of modules wrongly judged defective among all samples that are actually non-defective.
TPR=TP/(TP+FN)
FPR=FP/(TN+FP)
The AUC value equals the area under the ROC curve; it ranges from 0 to 1, and the larger the value, the better the model performance.
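For concreteness, the two rates can be computed directly from the confusion-matrix cells (the counts below are invented for the example):

```python
def rates(tp, fn, fp, tn):
    """TPR and FPR from the Table 2 confusion matrix counts."""
    tpr = tp / (tp + fn)   # defective modules correctly flagged
    fpr = fp / (tn + fp)   # clean modules wrongly flagged
    return tpr, fpr

# hypothetical counts for illustration
tpr, fpr = rates(tp=30, fn=10, fp=5, tn=55)   # tpr = 0.75, fpr = 5/60
```

Sweeping the classifier's decision threshold traces out (FPR, TPR) pairs, and the area under that curve is the AUC.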
Reference method
The invention uses the following three active learning query strategies as baselines for comparison with the UNCERTAINTYKL strategy:
(1) random sampling strategy (random): randomly select a query instance from the unlabeled data, submit it to a domain expert for labeling, and add the labeled instance to the training data set;
(2) uncertainty information entropy sampling strategy (uncertainty): train an SVM classifier on the training data set, predict each unlabeled instance, compute its information entropy, and select the instance with the highest uncertainty for labeling;
(3) query-committee-based active learning strategy (committee): an SVM and a random forest form a query committee that selects the samples to label. In addition, a model trained on all the training data and tested on the test data is included for comparison with the other active learning strategies; its performance can be regarded approximately as the best achievable on the training set.
Results and analysis of the experiments
The results plotted in Figures 2 to 6 are summarized in Tables 3 and 4 below:
Table 3: AUC value comparison (mean ± standard deviation); the best performance under paired t-tests at 95% confidence is marked in bold
Table 4: win/tie/loss comparative analysis of uncertaintiykl model and other models under different labeling proportions
As shown in Tables 3 and 4, Figures 2 to 6 illustrate how the AUC values of the different active learning query strategies change with the number of labeled instances.
Table 3 shows the AUC values at labeled sample proportions of 10%, 20%, 30%, 40%, and 50%. Once the proportion of labeled samples exceeds 50%, the UNCERTAINTYKL strategy has already labeled all samples via pseudo-labeling, so no unlabeled samples remain. Statistical analysis was performed with paired t-tests at 95% confidence, and the best performing model is marked. Table 4 gives a win/tie/loss analysis of the different learning strategies at different labeling proportions, counting comparisons between the UNCERTAINTYKL strategy and the committee, random, and uncertainty strategies.
First, as the unlabeled samples are gradually consumed and labeled samples are added to the labeled data set, the evaluation index maintains a basically rising trend, demonstrating the effectiveness of the active learning strategy. The uncertainty sampling strategy performs well, confirming its role as a baseline in the active learning field. The query-committee-based strategy shows great instability on the AEEEM data set. In most cases UNCERTAINTYKL performs best, improving markedly over the other active learning strategies, with performance gains of up to 13%.
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its technical solutions; any solution that can be realized on the basis of the above embodiments without creative effort shall be considered to fall within the protection scope of this patent.

Claims (4)

1. A software defect prediction method based on a hybrid active learning strategy, characterized in that the method adopts a collaborative active learning approach combining cost-sensitive information entropy with relative entropy, abbreviated as the UNCERTAINTYKL model; the UNCERTAINTYKL model uses information entropy as the quality measure for samples, selects samples with high information entropy from the unlabeled sample data for manual labeling, and further analyzes samples with low information entropy using relative entropy, thereby further expanding the labeled data set;
the UNCERTAINTYKL model comprises the following steps:
step 1: calculating the information entropy of each unlabeled sample via the information entropy formula;
step 2: selecting the data sample with the highest information entropy from the unlabeled data, having a domain expert label it manually, and adding it to the labeled data set after labeling;
step 3: from the samples remaining after step 2, screening the unlabeled sample with the lowest information entropy and evaluating it via relative entropy;
step 4: presetting a relative entropy threshold; if the relative entropy is below the threshold, adding the sample to the labeled data set and using the predicted label as its pseudo-label; if the relative entropy is above the threshold, discarding the sample;
to solve the UNCERTAINTYKL model, the following staged optimization strategy is adopted; the optimization process is as follows:
A. System initialization: before the system starts running, a portion of samples is drawn at random from the sample pool and labeled manually by domain experts; this set is denoted the labeled data set D_l; the labeled data set D_l is used to train the classification model θ_1, which serves as the basis for subsequent classification of unlabeled data;
B. Active selection of unlabeled samples: the classification model θ_1 predicts each unlabeled sample; the information entropy of each unlabeled sample is computed via the formula, the samples are sorted, and the sample x_{u,max} with maximum entropy is taken out, handed to a domain expert for manual labeling, and added to the labeled data set D_l;
C. Pseudo-labeling of the most certain sample: the sample x_{u,min} with minimum entropy is taken out, its relative entropy, i.e. KLD, is computed via the formulas, and the KLD is compared with the threshold; if the threshold is satisfied, x_{u,min} is pseudo-labeled and added to the labeled data set D_l;
D. Classification model update: the classification model θ_1 is retrained with the labeled data set D_l, and the procedure loops until the termination condition is satisfied.
2. The method of claim 1, wherein the data sample with the highest entropy in step 2 is calculated as follows:
x_{u,max} = argmax_x(−Σ_i P_θ(y_i|x) log P_θ(y_i|x))    (1)
where i indexes the unlabeled examples (i = 1, 2, …, u), y_i denotes the label value of the example to be classified, x_{u,max} denotes the data sample with maximum information entropy in the unlabeled data set obtained from formula (1), and P_θ(y_i|x) denotes the predicted probability that x belongs to class y_i under the distribution of the labeled data set.
3. The method of claim 1, wherein the relative entropy calculation in step 3 comprises the following formula:
P_C(y_i|x) = (1/C) Σ_{c=1}^{C} P_{θ_c}(y_i|x)    (2)
D(P_{θ_c} ‖ P_C) = Σ_i P_{θ_c}(y_i|x) log(P_{θ_c}(y_i|x) / P_C(y_i|x))    (3)
KLD = (1/C) Σ_{c=1}^{C} D(P_{θ_c} ‖ P_C)    (4)
where KLD denotes the mean of the relative entropies computed over all classification models, x_{u,min} denotes the data sample with minimum information entropy in the unlabeled data set obtained from formula (1), and C denotes the number of classifiers on the query committee, whose data set is the dynamically updated D_l; the classification committee C = {θ_1, …, θ_m}, whose classifier members embody different classification strategies, each able to predict the current label for unlabeled data; P_C(y_i|x) denotes the mean of the committee classifiers' probabilities for label y_i, and D(P_{θ_c} ‖ P_C) denotes the relative entropy of classification model θ_c with respect to the consensus of the other models.
4. The method of claim 3, wherein the threshold in step 4 is set to the empirical value 0.1; if the mean KLD value satisfies the threshold range, θ_i is used to pseudo-label the sample x_{u,min}.
CN201811319619.3A 2018-11-07 2018-11-07 Software defect prediction method based on hybrid active learning strategy Active CN109656808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811319619.3A CN109656808B (en) 2018-11-07 2018-11-07 Software defect prediction method based on hybrid active learning strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811319619.3A CN109656808B (en) 2018-11-07 2018-11-07 Software defect prediction method based on hybrid active learning strategy

Publications (2)

Publication Number Publication Date
CN109656808A CN109656808A (en) 2019-04-19
CN109656808B true CN109656808B (en) 2022-03-11

Family

ID=66110556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811319619.3A Active CN109656808B (en) 2018-11-07 2018-11-07 Software defect prediction method based on hybrid active learning strategy

Country Status (1)

Country Link
CN (1) CN109656808B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353291B (en) * 2019-12-27 2023-08-01 北京合力亿捷科技股份有限公司 Method and system for calculating optimal annotation set based on complaint work order training text
CN111506504B (en) * 2020-04-13 2023-04-07 扬州大学 Software development process measurement-based software security defect prediction method and device
CN111400617B (en) * 2020-06-02 2020-09-08 四川大学 Social robot detection data set extension method and system based on active learning
CN111914061B (en) * 2020-07-13 2021-04-16 上海乐言科技股份有限公司 Radius-based uncertainty sampling method and system for text classification active learning

Citations (2)

Publication number Priority date Publication date Assignee Title
CN104166706A (en) * 2014-08-08 2014-11-26 苏州大学 Multi-label classifier constructing method based on cost-sensitive active learning
CN104899135A (en) * 2015-05-14 2015-09-09 工业和信息化部电子第五研究所 Software defect prediction method and system

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN104166706A (en) * 2014-08-08 2014-11-26 苏州大学 Multi-label classifier constructing method based on cost-sensitive active learning
CN104899135A (en) * 2015-05-14 2015-09-09 工业和信息化部电子第五研究所 Software defect prediction method and system

Non-Patent Citations (2)

Title
Large Margin Distribution Learning; Yu-Hang Zhou; IEEE; 2016-07-31; pp. 1749-1762 *
Research on a Failure Prediction Model for Combat Systems Based on Small Samples and Its Application; Yang Jie; China Master's Theses Full-text Database, Information Science and Technology; 2018-03-15; pp. 5-56 *

Also Published As

Publication number Publication date
CN109656808A (en) 2019-04-19

Similar Documents

Publication Publication Date Title
CN109656808B (en) Software defect prediction method based on hybrid active learning strategy
CN112069310B (en) Text classification method and system based on active learning strategy
CN108875816A (en) Merge the Active Learning samples selection strategy of Reliability Code and diversity criterion
CN107292097B (en) Chinese medicine principal symptom selection method based on feature group
CN113326731A (en) Cross-domain pedestrian re-identification algorithm based on momentum network guidance
CN105306296A (en) Data filter processing method based on LTE (Long Term Evolution) signaling
CN108596204B (en) Improved SCDAE-based semi-supervised modulation mode classification model method
CN112036476A (en) Data feature selection method and device based on two-classification service and computer equipment
NL2029214A (en) Target re-identification method and system based on non-supervised pyramid similarity learning
CN110796260B (en) Neural network model optimization method based on class expansion learning
CN106951728B (en) Tumor key gene identification method based on particle swarm optimization and scoring criterion
CN106156107B (en) Method for discovering news hotspots
CN110955811B (en) Power data classification method and system based on naive Bayes algorithm
CN112306730B (en) Defect report severity prediction method based on historical item pseudo label generation
CN112306731B (en) Two-stage defect-distinguishing report severity prediction method based on space word vector
CN112199287B (en) Cross-project software defect prediction method based on enhanced hybrid expert model
CN114896402A (en) Text relation extraction method, device, equipment and computer storage medium
CN111382191A (en) Machine learning identification method based on deep learning
Singh et al. Feature selection using classifier in high dimensional data
CN108573059A (en) A kind of time series classification method and device of feature based sampling
CN104063591B (en) One-dimensional range profile identification method for non-library target based unified model
CN117611957B (en) Unsupervised visual representation learning method and system based on unified positive and negative pseudo labels
Luo et al. Class-specific attribute reduction based on neighborhood conditional entropy
CN114781554B (en) Open set identification method and system based on small sample condition
CN116738551B (en) Intelligent processing method for acquired data of BIM model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant