CN109656808A - A software defect prediction method based on a hybrid active learning strategy - Google Patents

A software defect prediction method based on a hybrid active learning strategy

Info

Publication number
CN109656808A
CN109656808A
Authority
CN
China
Prior art keywords
sample
data
information entropy
active learning
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811319619.3A
Other languages
Chinese (zh)
Other versions
CN109656808B (en)
Inventor
曲豫宾 (Qu Yubin)
李芳 (Li Fang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Textile Vocational Technology College
Original Assignee
Nantong Textile Vocational Technology College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong Textile Vocational Technology College filed Critical Nantong Textile Vocational Technology College
Priority to CN201811319619.3A priority Critical patent/CN109656808B/en
Publication of CN109656808A publication Critical patent/CN109656808A/en
Application granted granted Critical
Publication of CN109656808B publication Critical patent/CN109656808B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 — Error detection; Error correction; Monitoring
    • G06F 11/36 — Preventing errors by testing or debugging software
    • G06F 11/3604 — Software analysis for verifying properties of programs
    • G06F 11/3608 — Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/24 — Classification techniques
    • G06F 18/243 — Classification techniques relating to the number of classes
    • G06F 18/24323 — Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a software defect prediction method based on a hybrid active learning strategy, in which a cost-sensitive information-entropy criterion cooperates with relative entropy in an active learning method. The method uses information entropy as the evaluation index for high-value samples: samples with higher information entropy are labeled manually, while samples with low information entropy are further analyzed with relative entropy, expanding the labeled data set more effectively. Experiments show that the invention improves software defect prediction performance, reduces manual labeling cost, and is more efficient.

Description

A software defect prediction method based on a hybrid active learning strategy
Technical field
The present invention relates to the field of active learning techniques, and more particularly to a software defect prediction method based on a hybrid active learning strategy.
Background art
Defective software modules cause operational failures in enterprise production, leading to heavy losses for the enterprise and reduced customer satisfaction. Software defect prediction models are used to find defective modules as early as possible during software development; common models include supervised models and unsupervised models.
If a software project has abundant historical labeled data, a supervised machine learning model can be built to construct a within-project defect prediction model that estimates the probability that a module is defective or the number of defects in a module. In actual software development, if the project is brand new or its training data is scarce, the enterprise must devote considerable time to labeling defective modules. This labeling work is highly specialized and must be carried out by experienced personnel, so building a software defect prediction model takes a long time, requires more manpower, and raises the cost of software development.
Active learning addresses the sample-labeling problem by providing a variety of query strategies: when facing a large pool of modules to be labeled, the enterprise actively selects certain samples for labeling, and after manual labeling these samples are added to the labeled set so that a software defect prediction model can be established quickly. The selection strategy of active learning is used to pick high-value samples from the defect prediction data set; after manual labeling they expand the training set, and other machine learning techniques such as dimensionality reduction and feature selection can further improve defect prediction performance.
Commonly used selection strategies include uncertainty sampling based on information entropy. However, these studies pay little attention to samples with low information entropy, i.e., samples with high certainty: during each active learning query, low-entropy samples are usually discarded, and their exploitation is rarely considered.
Patent CN201710271035.2 discloses a multi-label active learning method based on a conditional label set, which jointly evaluates sample information entropy and relative entropy and selects the most informative samples as active learning targets. Although that method uses information entropy and relative entropy cooperatively, the relative entropy is computed within the information-entropy stage, which can adversely affect the efficiency and effectiveness of the system; moreover, low-entropy samples are still not well exploited.
Summary of the invention
To address the high cost of manual labeling and low prediction performance, the present invention provides a software defect prediction method based on a hybrid active learning strategy.
A software defect prediction method based on a hybrid active learning strategy, characterized in that the method uses a cost-sensitive information-entropy criterion in cooperation with relative entropy in an active learning method, referred to as the UNCERTAINTYKL model. The UNCERTAINTYKL model uses information entropy as the evaluation index for high-value samples: samples with higher information entropy are selected from the unlabeled data and labeled manually, while relative entropy is used to further analyze low-entropy samples and further expand the labeled data set.
Preferably, the UNCERTAINTYKL model comprises the following steps:
Step 1: compute the information entropy of each unlabeled sample with the information-entropy formula;
Step 2: using formula (1), select the unlabeled sample with the highest information entropy, hand it to a domain expert for manual labeling, and add it to the labeled data set once labeling is complete;
Step 3: from the samples remaining after step 2, screen the unlabeled sample with the lowest information entropy and label it using the relative entropy calculation;
Step 4: preset a relative entropy threshold; if the relative entropy is below the threshold, add the sample to the labeled data set and use the predicted label as its pseudo label; if the relative entropy is above the threshold, discard the sample.
Preferably, the sample with the highest information entropy in step 2 is computed as follows:
x_{u,max} = argmax_x ( − Σ_i P_θ(y_i | x) log P_θ(y_i | x) )    (1)
where i indexes the unlabeled samples (i = 1, 2, ..., u), y_i denotes a candidate class label, x_{u,max} denotes the sample in the unlabeled set with the maximum information entropy according to formula (1), and P_θ(y_i | x) denotes the predicted probability, under the distribution of the labeled data set, that sample x belongs to class y_i.
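For illustration, the following minimal Python sketch computes the prediction entropy of formula (1) for every unlabeled sample and picks the highest-entropy one for manual labeling. It assumes a fitted scikit-learn-style classifier exposing predict_proba and a NumPy feature matrix; it is a sketch of the selection step, not the patent's reference implementation.

import numpy as np

def select_max_entropy(model, X_unlabeled):
    # P_theta(y_i | x) for every unlabeled sample and every class
    proba = model.predict_proba(X_unlabeled)
    # Shannon entropy per sample; the small epsilon guards against log(0)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    idx_max = int(np.argmax(entropy))  # index of x_{u,max}, to be labeled by the expert
    return idx_max, entropy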
Preferably, the relative entropy calculation in step 3 uses the following formulas:
KL̄(x) = (1/C) Σ_{c=1..C} D( P_{θ_c} ‖ P_C )    (2)
D( P_{θ_c} ‖ P_C ) = Σ_i P_{θ_c}(y_i | x) log( P_{θ_c}(y_i | x) / P_C(y_i | x) )    (3)
P_C(y_i | x) = (1/C) Σ_{c=1..C} P_{θ_c}(y_i | x)    (4)
where KL̄ denotes the mean of the KLD relative entropies computed over all classification models, x_{u,min} denotes the sample in the unlabeled set with the smallest information entropy according to formula (1), C denotes the number of classifiers on the query committee, the classifiers are trained on the dynamically updated data set D_l, and the classification committee is C = {θ_1, ..., θ_m}; the committee members represent different classification policies and each produces a prediction for the unlabeled data. P_C(y_i | x) denotes the committee's average predicted probability for class label y_i, and D( P_{θ_c} ‖ P_C ) denotes the relative entropy of classification model θ_c with respect to the committee consensus.
Preferably, the threshold in step 4 is set to the empirical value 0.1; if the mean relative entropy KL̄ satisfies the threshold condition, the classifier θ_i is used to assign a pseudo label to the sample x_{u,min}.
Preferably, to solve the model, the following staged optimization strategy is used; the optimization process is as follows:
A. System initialization: before the system starts running, a portion of samples is drawn from the sample pool and handed to a domain expert for manual labeling; this set is denoted D_l. The initial labeled set is obtained by random sampling from the pool. The data set D_l is used for the first training of the classification model θ_1, which serves as the basis for subsequent classification of unlabeled data;
B. Active selection of unlabeled samples: use the classification model θ_1 to predict each unlabeled sample, compute the information entropy of each sample according to the formula, take the sample x_{u,max} with the maximum entropy, hand it to the domain expert for manual labeling, and add x_{u,max} to the labeled data set D_l;
C. Pseudo-labeling of the most certain sample: take the sample x_{u,min} with the minimum information entropy, compute the relative entropy (KLD) according to the formulas, and compare KLD with the threshold; if the threshold condition is satisfied, label x_{u,min} and add it to the labeled data set D_l;
D. Classification model update: retrain the classification model θ_1 on the labeled data set D_l, then repeat the loop until the termination condition is met.
Beneficial effects:
1. Information entropy and relative entropy are fused in a cooperative, hybrid active learning query strategy: by using relative entropy to further analyze low-entropy samples, the information contained in the samples is exploited more fully, improving software defect prediction performance and finding defective software modules more quickly.
2. With the hybrid active learning query strategy, the enterprise only needs a relatively small up-front investment in manual labeling to obtain better defect prediction ability, so that costs can be controlled and manpower saved while still meeting business requirements.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the algorithm of the software defect prediction method based on the hybrid active learning strategy of the present invention.
Fig. 2 shows the AUC of the Equinox data set under different active learning query strategies.
Fig. 3 shows the AUC of the Eclipse JDT Core data set under different active learning query strategies.
Fig. 4 shows the AUC of the Apache Lucene data set under different active learning query strategies.
Fig. 5 shows the AUC of the Mylyn data set under different active learning query strategies.
Fig. 6 shows the AUC of the Eclipse PDE UI data set under different active learning query strategies.
Detailed description of embodiments
This section describes in detail the proposed cost-sensitive active learning strategy that combines information entropy with relative entropy (Cost-Effective Entropy Kullback-Leibler-divergence Active Learning), referred to as the UNCERTAINTYKL model. The strategy is applied to the AEEEM data set commonly used in software defect prediction; by incrementally improving the classification model through the selection strategy, the classifier achieves better classification metrics for the same amount of labeled data.
UNCERTAINTYKL active learning strategy
The UNCERTAINTYKL model of the invention incorporates the idea of co-training into the active learning strategy and completes model construction while minimizing the number of labels. The uncertainty-based active learning strategy selects the sample with the highest information entropy from the unlabeled data for labeling by a domain expert. At the same time, for the unlabeled sample with the lowest information entropy, the query committee votes based on the KL divergence (KLD): if the relative entropy computed by the committee members for that sample is below the currently configured threshold (empirical value 0.1), the sample is added to the labeled data set and the predicted label is used as its pseudo label. The details are as follows.
x_{u,max} = argmax_x ( − Σ_i P_θ(y_i | x) log P_θ(y_i | x) )    (1)
Define the labeled data set containing l samples as D_l = {x_1, ..., x_l} and the unlabeled data set containing u samples as D_u = {x_{l+1}, ..., x_{l+u}}; i indexes the unlabeled samples (i = 1, 2, ..., u); x_{u,max} denotes the sample in the unlabeled set with the maximum information entropy according to (1); P_θ(y_i | x) denotes the predicted probability, under the distribution of the labeled data set, that x_{u,max} belongs to class y_i; and argmax( − Σ_i P_θ(y_i | x) log P_θ(y_i | x) ) selects the unlabeled sample with the largest entropy value.
x_{u,min} denotes the sample in the unlabeled set with the smallest information entropy according to (1); C denotes the number of classifiers on the query committee, whose training set is the dynamically updated D_l. The classification committee is C = {θ_1, ..., θ_m}; its members represent different classification policies and each produces a prediction for the unlabeled data. P_C(y_i | x) denotes the committee's average predicted probability for class label y_i; D( P_{θ_c} ‖ P_C ) denotes the relative entropy of classification model θ_c with respect to the committee consensus; and KL̄ denotes the mean of the relative entropies computed over all classification models according to formulas (2)–(4). If the relative entropy is below the threshold, θ_i is used to pseudo-label the candidate sample.
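A minimal sketch of the committee check in formulas (2)–(4), assuming a list of fitted probabilistic classifiers as the committee and the 0.1 empirical threshold described above; the function names are illustrative, not part of the patent.

import numpy as np

def mean_kl_divergence(committee, x):
    # P_{theta_c}(y_i | x) for each committee member c; x has shape (1, d)
    member_probs = np.stack([clf.predict_proba(x)[0] for clf in committee])
    consensus = member_probs.mean(axis=0)  # P_C(y_i | x), formula (4)
    eps = 1e-12
    kl_each = np.sum(member_probs * np.log((member_probs + eps) / (consensus + eps)), axis=1)  # formula (3)
    return float(kl_each.mean())  # formula (2): average KLD over the committee

def maybe_pseudo_label(committee, x_min, threshold=0.1):
    # pseudo-label the most certain sample only when the committee agrees closely
    if mean_kl_divergence(committee, x_min) <= threshold:
        return int(committee[0].predict(x_min)[0])  # predicted label used as pseudo label
    return None  # relative entropy above the threshold: discard the sample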
The algorithm described above is specified as follows.
To solve the model, a staged optimization strategy is used; the optimization process is as follows:
A. System initialization
Before the system starts running, a portion of samples is drawn from the sample pool and handed to a domain expert for manual labeling; this set is denoted D_l. The initial labeled set is obtained by random sampling from the pool. The data set D_l is used for the first training of the classification model θ_1, which serves as the basis for subsequent classification of unlabeled data.
B. Active selection of unlabeled samples
Use the classification model θ_1 to predict each unlabeled sample, compute the information entropy of each sample according to formula (1), take the sample x_{u,max} with the maximum entropy, hand it to the domain expert for manual labeling, and add x_{u,max} to the labeled data set D_l.
C. Pseudo-labeling of the most certain sample
Take the sample x_{u,min} with the minimum information entropy, compute the relative entropy (KLD) according to formulas (2), (3) and (4), and compare KLD with the threshold; if the threshold condition is satisfied, label x_{u,min} and add it to the labeled data set D_l.
D. Classification model update
Retrain the classification model θ_1 on the labeled data set D_l, then repeat the loop until the termination condition is met.
The entire procedure can be summarized as Algorithm 1.
Algorithm 1: Staged strategy for solving the UNCERTAINTYKL model
1: Input: initial labeled data set D_l = {x_1, ..., x_l}, unlabeled data set D_u = {x_{l+1}, ..., x_{l+u}}, data labels y_1, ..., y_l, maximum number of iterations U_max, KLD threshold threshold
2: Output: classification committee set, classification performance set
3: train the classification models of the query committee on the labeled data
4: while current iteration < maximum iterations and not converged do
5:   for i ← 1 to U_max do
6:     take committee member θ_i and train it on the labeled data set
7:     compute the class probabilities P for the current x^(i) and the corresponding information entropy according to argmax( − Σ_i P_θ(y_i | x) log P_θ(y_i | x) )
8:     hand the sample x^(i) with the maximum information entropy to the domain expert for manual labeling and add it to the labeled data set
9:     for the sample x^(t) with the minimum information entropy, compute the relative entropy, i.e. the average KLD, according to formulas (2), (3) and (4)
10:    D_l = D_l ∪ x^(i); D_u = D_u \ x^(i)
11:    if KLD ≤ threshold: D_l = D_l ∪ x^(t); D_u = D_u \ x^(t)
12:    i = i + 1
13:  end for
14: end while
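A compact Python sketch of Algorithm 1 under stated assumptions: a small committee of differently seeded RandomForestClassifier models stands in for the query committee, oracle() stands in for the domain expert, and the helper functions select_max_entropy and maybe_pseudo_label are the sketches given earlier. It illustrates the control flow only and is not the patent's reference implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def uncertaintykl_loop(X_l, y_l, X_u, oracle, u_max=50, threshold=0.1, members=3):
    def fit_committee():
        # retrain every committee member on the current labeled set D_l
        return [RandomForestClassifier(random_state=s).fit(X_l, y_l) for s in range(members)]

    committee = fit_committee()
    for _ in range(u_max):
        if len(X_u) == 0:  # no unlabeled samples left
            break
        idx_max, entropy = select_max_entropy(committee[0], X_u)
        X_l = np.vstack([X_l, X_u[idx_max]])          # expert labels x_{u,max}
        y_l = np.append(y_l, oracle(X_u[idx_max]))
        idx_min = int(np.argmin(entropy))
        x_min = X_u[idx_min:idx_min + 1]
        pseudo = maybe_pseudo_label(committee, x_min, threshold)
        drop = {idx_max}
        if pseudo is not None:                        # committee agrees: pseudo-label x_{u,min}
            X_l = np.vstack([X_l, x_min])
            y_l = np.append(y_l, pseudo)
            drop.add(idx_min)
        X_u = np.delete(X_u, list(drop), axis=0)
        committee = fit_committee()                   # classification model update
    return committee, X_l, y_l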
Experimental design
Evaluation objects
The present invention analyzes and evaluates the UNCERTAINTYKL query strategy on the public AEEEM data set, which is used to assess the impact of different learning strategies for active learning in software defect prediction. The AEEEM data set is widely used in the software defect prediction field and serves as a benchmark data set for performance comparison. It provides 61 metrics, including software development process measures; in this experiment all 61 metrics are used to build the classifier. Summary information on the AEEEM data set is shown in Table 1.
Table 1. Summary of the AEEEM data set
Experimental setup
The experiments use 5×2-fold cross-validation, with random stratified sampling of the data in every run: half of the data is used as training data and the other half as test data, which prevents overlap between training and test data and keeps the evaluation independent. A certain proportion of the training data is taken out and labeled manually; in this experiment the initial labeling ratio is 30%, and the classification model is trained on this initially labeled set. The remaining 70% of the data serves as unlabeled data, from which samples are selected according to the active learning strategy. The support vector machine (SVM) classifier in the experiments is trained with the RBF kernel and default parameters provided by libsvm. The UNCERTAINTYKL query strategy uses the RandomForestClassifier implemented in sklearn, trained with default parameters. In the UNCERTAINTYKL strategy, the unlabeled data is processed iteratively: each iteration selects the sample with the highest uncertainty for manual labeling and, at the same time, selects the sample with the lowest uncertainty for a further KLD-based judgment, with the judgment threshold set to the empirical value 0.1.
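A minimal sketch of one repetition of this setup, using scikit-learn's SVC in place of the raw libsvm binding; the function name, the seed handling and the probability option are assumptions, while the half/half stratified split and the 30% initial labeling follow the description above.

import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import SVC

def run_one_repetition(X, y, seed=0):
    # one of the 5x2 cross-validation repetitions: a stratified half/half split
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    train_idx, test_idx = next(skf.split(X, y))
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # 30% of the training half is labeled up front; the remaining 70% is the unlabeled pool
    X_l, X_u, y_l, _ = train_test_split(
        X_train, y_train, train_size=0.30, stratify=y_train, random_state=seed)

    # RBF-kernel SVM with default parameters, as in the experiments
    svm = SVC(kernel="rbf", probability=True).fit(X_l, y_l)
    return svm, (X_l, y_l), X_u, (X_test, y_test)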
Evaluation metrics
The AEEEM data set suffers from class imbalance, so the AUC (area under the ROC curve) is used: it reflects the performance of active learning query strategies well and is one of the most commonly used metrics in software defect prediction. The metric is based on the ROC curve, whose full name is the receiver operating characteristic curve. The confusion matrix of a binary classification model in software defect prediction is shown in Table 2.
Table 2. Confusion matrix
The ROC curve plots the false positive rate (FPR) on the X-axis and the true positive rate (TPR) on the Y-axis. TPR is the proportion of truly defective modules that are correctly judged as defective; FPR is the proportion of truly non-defective modules that are wrongly judged as defective.
TPR = TP / (TP + FN)
FPR = FP / (TN + FP)
The AUC is the area under the ROC curve; its value ranges from 0 to 1, and a larger value indicates better model performance.
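For concreteness, a short sketch computing TPR, FPR and AUC with scikit-learn; the 0.5 decision threshold and the convention that label 1 means a defective module are assumptions for illustration.

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def defect_metrics(y_true, y_score, threshold=0.5):
    # binarize the defect scores and build the confusion matrix (0 = clean, 1 = defective)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn)                  # TPR = TP / (TP + FN)
    fpr = fp / (tn + fp)                  # FPR = FP / (TN + FP)
    auc = roc_auc_score(y_true, y_score)  # area under the ROC curve
    return tpr, fpr, auc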
Baseline methods
The present invention uses the following three active learning query strategies as baselines for comparison with the UNCERTAINTYKL active learning strategy:
(1) Random sampling strategy (random): a query instance is selected at random from the unlabeled data, handed to the domain expert for labeling, and added to the training data set;
(2) Uncertainty sampling strategy based on information entropy (uncertainty): an SVM classifier is trained on the training data set, each unlabeled sample is predicted, its information entropy is computed, and the most uncertain instance is selected for labeling;
(3) Query-by-committee active learning strategy (committee): a query committee built from an SVM and a random forest queries the samples to be labeled (a minimal sketch of such a committee is given after this list). In addition, a model trained on all training data and evaluated on the test data is used for comparison with the other active learning strategies; its training performance can be regarded approximately as that of the model trained on the optimal training data.
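A sketch of the committee baseline in (3), assuming binary 0/1 labels and using vote entropy as the disagreement measure; the disagreement measure and the function name are assumptions and may differ from the exact implementation used in the experiments.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def committee_query(X_l, y_l, X_u):
    # the committee: an SVM and a random forest, both trained on the labeled data
    committee = [
        SVC(kernel="rbf", probability=True, random_state=0).fit(X_l, y_l),
        RandomForestClassifier(random_state=0).fit(X_l, y_l),
    ]
    votes = np.stack([clf.predict(X_u) for clf in committee])  # shape (2, u), labels in {0, 1}
    # vote entropy per unlabeled sample: high entropy means the members disagree
    p_defect = votes.mean(axis=0)
    p = np.stack([1.0 - p_defect, p_defect], axis=1)
    vote_entropy = -np.sum(p * np.log(p + 1e-12), axis=1)
    return int(np.argmax(vote_entropy)), committee  # index of the sample to query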
Experimental results and analysis
The analysis of Figs. 2 to 6 is summarized in Tables 3 and 4, as follows:
Table 3. AUC comparison (mean ± standard deviation); the best performance under a paired t-test at the 95% confidence level is marked in bold
Table 4. Win/tie/loss analysis of the UNCERTAINTYKL model against the other models under different labeling ratios
As shown in Tables 3 and 4, Figs. 2 to 6 illustrate how the AUC of the different active learning query strategies changes as the number of labeled instances varies.
Table 3 shows the AUC values when the labeled sample ratio is 10%, 20%, 30%, 40% and 50%. When the labeled ratio exceeds 50%, the UNCERTAINTYKL active learning strategy has already completed the labeling of all samples using the pseudo-labeling method, leaving no unlabeled samples. Statistical analysis with a paired t-test at the 95% confidence level is used to mark the best-performing model. Table 4 gives a win/tie/loss analysis of the learning strategies under different labeling ratios, comparing the UNCERTAINTYKL learning strategy against the committee, random and uncertainty strategies.
First, we observe that as the unlabeled samples gradually decrease and labeled samples are added to the labeled data set, the evaluation metric generally keeps an upward trend, which shows the effectiveness of the active learning strategies. The uncertainty sampling strategy performs well, confirming that it is a reasonable baseline in the active learning field. The query-by-committee strategy shows considerable instability on the AEEEM data set. In most cases UNCERTAINTYKL performs best, with a clear improvement over the other active learning strategies; in most situations the performance improvement reaches 13%.
The above embodiment is a preferred embodiment of the present invention and does not limit the technical solution of the invention; any technical solution that can be realized on the basis of the above embodiment without creative work shall be regarded as falling within the scope of patent protection of the present invention.

Claims (6)

1. A software defect prediction method based on a hybrid active learning strategy, characterized in that the method uses a cost-sensitive information-entropy criterion in cooperation with relative entropy in an active learning method, referred to as the UNCERTAINTYKL model; the UNCERTAINTYKL model uses information entropy as the evaluation index for high-value samples, selects samples with higher information entropy from the unlabeled sample data for manual labeling, and at the same time uses relative entropy to further analyze low-entropy samples, thereby further expanding the labeled data set.
2. The software defect prediction method based on a hybrid active learning strategy according to claim 1, characterized in that the UNCERTAINTYKL model comprises the following steps:
Step 1: compute the information entropy of each unlabeled sample with the information-entropy formula;
Step 2: using calculation formula (1), select the unlabeled sample with the highest information entropy, hand it to a domain expert for manual labeling, and add it to the labeled data set once labeling is complete;
Step 3: from the samples remaining after step 2, screen the unlabeled sample with the lowest information entropy and label it using the relative entropy calculation;
Step 4: preset a relative entropy threshold; if the relative entropy is below the threshold, add the sample to the labeled data set and use the predicted label as its pseudo label; if the relative entropy is above the threshold, discard the sample.
3. The software defect prediction method based on a hybrid active learning strategy according to claim 2, characterized in that the sample with the highest information entropy in step 2 is computed as follows:
x_{u,max} = argmax_x ( − Σ_i P_θ(y_i | x) log P_θ(y_i | x) )    (1)
where i indexes the unlabeled samples (i = 1, 2, ..., u), y_i denotes a candidate class label, x_{u,max} denotes the sample in the unlabeled set with the maximum information entropy according to formula (1), and P_θ(y_i | x) denotes the predicted probability, under the distribution of the labeled data set, that the sample belongs to class y_i.
4. The software defect prediction method based on a hybrid active learning strategy according to claim 2, characterized in that the relative entropy calculation in step 3 uses the following formulas:
KL̄(x) = (1/C) Σ_{c=1..C} D( P_{θ_c} ‖ P_C )    (2)
D( P_{θ_c} ‖ P_C ) = Σ_i P_{θ_c}(y_i | x) log( P_{θ_c}(y_i | x) / P_C(y_i | x) )    (3)
P_C(y_i | x) = (1/C) Σ_{c=1..C} P_{θ_c}(y_i | x)    (4)
where KL̄ denotes the mean of the KLD relative entropies computed over all classification models, x_{u,min} denotes the sample in the unlabeled set with the smallest information entropy according to formula (1), C denotes the number of classifiers on the query committee, the data set of the classifiers is the dynamically updated D_l, and the classification committee is C = {θ_1, ..., θ_m}; the committee members represent different classification policies and each produces a prediction for the unlabeled data; P_C(y_i | x) denotes the committee's average predicted probability for class label y_i; D( P_{θ_c} ‖ P_C ) denotes the relative entropy of classification model θ_c with respect to the committee consensus.
5. The software defect prediction method based on a hybrid active learning strategy according to claim 2, characterized in that the threshold in step 4 is set to the empirical value 0.1; if the value of KL̄ satisfies the threshold condition, the classifier θ_i is used to assign a pseudo label to the sample x_{u,min}.
6. The software defect prediction method based on a hybrid active learning strategy according to claim 1, characterized in that, to solve the model, the following staged optimization strategy is used; the optimization process is as follows:
A. System initialization: before the system starts running, a portion of samples is drawn from the sample pool and handed to a domain expert for manual labeling; this set is denoted D_l; the initial labeled set is obtained by random sampling from the pool; the data set D_l is used for the first training of the classification model θ_1, which serves as the basis for subsequent classification of unlabeled data;
B. Active selection of unlabeled samples: use the classification model θ_1 to predict each unlabeled sample, compute the information entropy of each sample according to the formula, take the sample x_{u,max} with the maximum entropy, hand it to the domain expert for manual labeling, and add x_{u,max} to the labeled data set D_l;
C. Pseudo-labeling of the most certain sample: take the sample x_{u,min} with the minimum information entropy, compute the relative entropy (KLD) according to the formulas, and compare KLD with the threshold; if the threshold condition is satisfied, label x_{u,min} and add it to the labeled data set D_l;
D. Classification model update: retrain the classification model θ_1 on the labeled data set D_l, then repeat the loop until the termination condition is met.
CN201811319619.3A 2018-11-07 2018-11-07 Software defect prediction method based on hybrid active learning strategy Active CN109656808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811319619.3A CN109656808B (en) 2018-11-07 2018-11-07 Software defect prediction method based on hybrid active learning strategy


Publications (2)

Publication Number Publication Date
CN109656808A true CN109656808A (en) 2019-04-19
CN109656808B CN109656808B (en) 2022-03-11

Family

ID=66110556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811319619.3A Active CN109656808B (en) 2018-11-07 2018-11-07 Software defect prediction method based on hybrid active learning strategy

Country Status (1)

Country Link
CN (1) CN109656808B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104166706A (en) * 2014-08-08 2014-11-26 苏州大学 Multi-label classifier constructing method based on cost-sensitive active learning
CN104899135A (en) * 2015-05-14 2015-09-09 工业和信息化部电子第五研究所 Software defect prediction method and system
US20180165554A1 (en) * 2016-12-09 2018-06-14 The Research Foundation For The State University Of New York Semisupervised autoencoder for sentiment analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YU-HANG ZHOU: "Large Margin Distribution Learning", IEEE *
杨杰 (Yang Jie): "Research and Application of a Failure Prediction Model for Combat Systems Based on Small Samples", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353291A (en) * 2019-12-27 2020-06-30 北京合力亿捷科技股份有限公司 Method and system for calculating optimal label set based on complaint work order training text
CN111506504A (en) * 2020-04-13 2020-08-07 扬州大学 Software development process measurement-based software security defect prediction method and device
CN111506504B (en) * 2020-04-13 2023-04-07 扬州大学 Software development process measurement-based software security defect prediction method and device
CN111400617A (en) * 2020-06-02 2020-07-10 四川大学 Social robot detection data set extension method and system based on active learning
CN111914061A (en) * 2020-07-13 2020-11-10 上海乐言信息科技有限公司 Radius-based uncertainty sampling method and system for text classification active learning
CN111914061B (en) * 2020-07-13 2021-04-16 上海乐言科技股份有限公司 Radius-based uncertainty sampling method and system for text classification active learning

Also Published As

Publication number Publication date
CN109656808B (en) 2022-03-11

Similar Documents

Publication Publication Date Title
He et al. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features
CN109656808A (en) A kind of Software Defects Predict Methods based on hybrid active learning strategies
CN110796186A (en) Dry and wet garbage identification and classification method based on improved YOLOv3 network
CN109741332A (en) A kind of image segmentation and mask method of man-machine coordination
CN105975913B (en) Road network extraction method based on adaptive cluster learning
CN109919106B (en) Progressive target fine recognition and description method
NL2029214B1 (en) Target re-indentification method and system based on non-supervised pyramid similarity learning
CN105389583A (en) Image classifier generation method, and image classification method and device
CN104331716A (en) SVM active learning classification algorithm for large-scale training data
CN112836739B (en) Classification model building method based on dynamic joint distribution alignment and application thereof
CN109934203A (en) A kind of cost-sensitive increment type face identification method based on comentropy selection
CN108898225A (en) Data mask method based on man-machine coordination study
CN110263934A (en) A kind of artificial intelligence data mask method and device
CN111353377A (en) Elevator passenger number detection method based on deep learning
CN104680185A (en) Hyperspectral image classification method based on boundary point reclassification
CN109523514A (en) To the batch imaging quality assessment method of Inverse Synthetic Aperture Radar ISAR
CN116738551B (en) Intelligent processing method for acquired data of BIM model
CN116704208B (en) Local interpretable method based on characteristic relation
CN103093239B (en) A kind of merged point to neighborhood information build drawing method
CN109409394A (en) A kind of cop-kmeans method and system based on semi-supervised clustering
Li et al. GADet: A Geometry-Aware X-ray Prohibited Items Detector
CN107909090A (en) Learn semi-supervised music-book on pianoforte difficulty recognition methods based on estimating
CN116894113A (en) Data security classification method and data security management system based on deep learning
CN105160336A (en) Sigmoid function based face recognition method
CN112199287B (en) Cross-project software defect prediction method based on enhanced hybrid expert model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant