CN113204481B - Class imbalance software defect prediction method based on data resampling - Google Patents

Class imbalance software defect prediction method based on data resampling

Info

Publication number
CN113204481B
CN113204481B · Application CN202110428102.3A
Authority
CN
China
Prior art keywords
data
minority
class data
class
minority class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110428102.3A
Other languages
Chinese (zh)
Other versions
CN113204481A (en
Inventor
荆晓远 (Jing Xiaoyuan)
孔晓辉 (Kong Xiaohui)
陈昊文 (Chen Haowen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202110428102.3A
Publication of CN113204481A
Application granted
Publication of CN113204481B
Legal status: Active

Classifications

    • G06F 11/3672 Test management (G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 11/00 Error detection; Error correction; Monitoring > G06F 11/36 Preventing errors by testing or debugging software > G06F 11/3668 Software testing)
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting (G06F 18/00 Pattern recognition > G06F 18/20 Analysing > G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation)
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification (G06F 18/00 Pattern recognition > G06F 18/20 Analysing > G06F 18/24 Classification techniques > G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches > G06F 18/2413 based on distances to training or reference patterns)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a class imbalance software defect prediction method based on data resampling. For each minority class sample, the method computes the Euclidean distances to the other minority class samples and to the majority class samples, screens out the nearest minority class sample and the nearest majority class sample, and derives a distance parameter for the minority class sample from these distances. The minority class samples are then marked according to their distance parameters to obtain their data point types. The K-nearest-neighbor set of each minority class sample is computed, and the numbers of majority class and minority class samples in that set are counted to determine how many new minority class samples to generate. Two classifiers are then selected, the newly generated software defect prediction minority class data undergo confidence evaluation to obtain a training data set, the selected classifiers are trained, and the final prediction result is obtained through weighted voting. The invention can well solve the class imbalance problem in the software defect prediction process.

Description

Class imbalance software defect prediction method based on data resampling
Technical Field
The invention belongs to the field of software defect prediction, and particularly relates to a class imbalance software defect prediction method based on data resampling.
Background
With the development of society and the advance of science and technology, the internet has become deeply embedded in our lives. Everyday activities such as online shopping, travelling, smart home control and ordering in restaurants can all be completed through software, and software usage scenarios permeate nearly every aspect of how we dress, eat, live and travel. During software development, functional requirements keep growing, the number of users served by the software keeps increasing, and development schedules keep being compressed. These pressures make defects easy to introduce during development; when software defects occur, the software cannot provide its normal functions, which causes huge production and economic losses and greatly affects people's normal lives.
However, in a real development environment the amount of data with software defects is far smaller than the amount of data without defects, so a software defect prediction model built on such data is less likely to identify the code modules that actually contain defects. An ideal software defect prediction model needs to be more sensitive to defective data and to predict more accurately whether a code module is defective, so solving the class imbalance problem in software defect prediction becomes very important. To overcome these shortcomings, the invention provides a class imbalance software defect prediction method.
Disclosure of Invention
The main purpose of the invention is to solve the class imbalance problem in software defect prediction. It provides a software defect prediction method for the class imbalance problem and is generally applicable to software defect prediction. To achieve the above object, the invention comprises the following steps:
Step 1: select any minority class datum in the minority class data set and compute its Euclidean distance to each minority class datum in the minority class data set in turn, screening out the minority class datum closest to the selected one; compute the Euclidean distance from the selected minority class datum to each majority class datum in the majority class data set in turn, screening out the majority class datum closest to the selected one; calculate the distance parameter of the selected minority class datum from the shortest Euclidean distance to the minority class data in the minority class data set and the shortest Euclidean distance to the majority class data in the majority class data set. Mark the minority class data in the minority class data set according to their distance parameters and obtain the data point type of each minority class datum. Calculate the K-nearest-neighbor point set of each minority class datum in the minority class data set, divide it into a K-neighbor majority class data set and a K-neighbor minority class data set, count the number of majority class data and the number of minority class data in these sets respectively, and calculate the number of newly generated minority class data of each minority class datum in the minority class data set.
Step 2: select a first classifier and a second classifier respectively, and perform confidence evaluation on the newly generated software defect prediction minority class data to obtain a training data set.
Step 3: using the first classifier and the second classifier selected in step 2 together with the obtained training set S', obtain the final prediction result through weighted voting.
Preferably, the software defect data in step 1 is S = {S_min, S_max};
the minority class data set in step 1 is S_min = {p_1, p_2, …, p_N};
the majority class data set in step 1 is S_max = {d_1, d_2, …, d_K};
where S_min represents the set of minority class data, S_max represents the set of majority class data, p_i represents the i-th minority class datum in the minority class data set, i ∈ [1, N], N represents the number of minority class data in the minority class data set, d_k represents the k-th majority class datum in the majority class data set, k ∈ [1, K], and K represents the number of majority class data in the majority class data set;
the minority class datum closest to the selected minority class datum in step 1 is p_(min_i), i ∈ [1, N], min_i ∈ [1, N], where p_(min_i) represents the minority class datum in the minority class data set that is closest to the selected i-th minority class datum, and N represents the number of minority class data in the minority class data set;
the majority class datum closest to the selected minority class datum in step 1 is d_(max_i), i ∈ [1, N], max_i ∈ [1, K], where d_(max_i) represents the majority class datum in the majority class data set that is closest to the selected i-th minority class datum, and K represents the number of majority class data in the majority class data set;
the shortest Euclidean distance between the selected minority class datum and the minority class data in the minority class data set in step 1 is Dis_min(i) = ‖p_i − p_(min_i)‖;
the shortest Euclidean distance between the selected minority class datum and the majority class data in the majority class data set in step 1 is Dis_max(i) = ‖p_i − d_(max_i)‖;
the distance parameter of the selected minority class datum calculated in step 1 is α_i = Dis_min(i) / Dis_max(i),
where α_i is the distance parameter of the i-th minority class datum in the minority class data set;
step 1 marks the minority class data in the minority class data set according to their distance parameters as follows:
if α_i < 1, the data point type of the i-th minority class datum in the minority class data set is marked as a safe point, and flag_i = 1;
if α_i = 1, the data point type of the i-th minority class datum in the minority class data set is marked as a confusion point, and flag_i = 2;
if α_i > 1, the data point type of the selected i-th minority class datum in the minority class data set is marked as a danger point, and flag_i = 3;
Step 1, calculating a K neighbor point set of each minority class data in the minority class data set:
the K neighbor point set of each minority data in step 1 is divided into a K neighbor point majority data set and a K neighbor point minority data set, and specifically comprises the following steps:
1, the number of the majority class data in the K neighbor point majority class data set is marked as
Figure GDA0003109305290000033
1, recording the number of minority class data in the K neighbor point minority class data set as
Figure GDA0003109305290000034
Step 1, calculating the newly generated minority class data quantity of each minority class data in the minority class data set, specifically:
Figure GDA0003109305290000035
wherein is alphaiDistance parameter, n, for the ith minority class data in the minority class data setiIs a minority of the numberThe number of newly generated minority class data for the ith each minority class data in the data set;
step 1, calculating newly generated software defect prediction data;
step 1, the ith minority class data in the minority class data set generates niNew minority class data, so the newly generated minority class data is used as pnew i,jIs represented by where j ∈ [1, n ]i]
Step 1, the deviation amount of the jth newly generated data of the ith minority class data in the minority class data set from the majority class is marked as epsiloni,j
Wherein the offset epsilon of the jth newly generated data of the ith minority class data in the minority class data set from the majority classi,jThe calculation formula is as follows:
Figure GDA0003109305290000041
wherein,
Figure GDA0003109305290000042
in order to deviate from most of the class degree parameters, a random number with the value of 0-1 is taken,
Figure GDA00031093052900000411
most of its recent classes of data.
in step 1 the offset of the j-th newly generated datum of the i-th minority class datum in the minority class data set toward the minority class is recorded as σ_(i,j);
in step 1 the offset σ_(i,j) of the j-th newly generated datum of the i-th minority class datum is calculated as
σ_(i,j) = [equation image: computed from p_i and its nearest minority class datum p_(min_i)],
where the bias-toward-the-minority-class parameter is a random number taking values in [0, 1.5], and p_(min_i) is its nearest minority class datum;
in step 1 the newly generated software defect prediction minority class datum is recorded as p^new_(i,j);
the j-th newly generated datum of the i-th minority class datum of the newly generated software defect prediction data is calculated as
p^new_(i,j) = p_i + ε_(i,j) + σ_(i,j);
step 1 obtains the newly formed minority class data set, recorded as S_new;
in step 1 the number n_i of defect data newly generated for each minority class point p_i, together with the minority class data p^new generated above, yields the new minority class data set S_new,
where S_new = {p'_1, p'_2, …, p'_(N')},
N' is the number of elements in the newly formed minority class data set S_new; the category of the new data is marked as defect data, and this category mark is the weak label L_w; the i-th datum of the new minority class data set is denoted by the symbol p'_i, and its weak label is L_w(p'_i);
Preferably, step 2 is specifically as follows:
step 2 calculates the influence degree of the first classifier and the influence degree of the second classifier respectively;
in step 2 the newly formed minority class data set S_new is used to train the first classifier H_1, and the data of S_new are brought into H_1 in turn to obtain the predicted class L_p1; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_1 is L_p1(p'_i);
the newly formed minority class data set S_new is used to train the second classifier H_2, and the data of S_new are brought into H_2 in turn to obtain the predicted class L_p2; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_2 is L_p2(p'_i);
the influence degree of the first classifier is
o_1 = [equation image: computed from the number of points whose H_1 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min; the indicator of the first classifier H_1 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise, and the indicator of the second classifier H_2 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise;
the influence degree of the second classifier is
o_2 = [equation image: computed from the number of points whose H_2 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min, and the indicators of H_1 and H_2 are defined in the same way.
Step 2, updating the labels of the minority data according to the influence degree of the first classifier and the influence degree of the second classifier so as to construct updated original software defect data;
step 2, calculating the weak mark
Figure GDA0003109305290000057
Confidence of (2), by the symbol γiAnd (4) showing.
Step 2, weak marking of new minority class data set
Figure GDA0003109305290000058
The judgment is carried out according to the influence degree of the classifier, and the calculation formula is
Figure GDA0003109305290000059
When the confidence degree gammaiBeta, will be this minority class of data
Figure GDA00031093052900000510
Adding training data when gamma isiWhen the beta value is less than or equal to beta, the data is directly deleted, and the data of the minority class is not added into a new training set.
Step 2, newly forming minority class data, namely SnewIs screened again to obtain new minority class data Snew', will SnewAdding original software defect data S to obtain a new training set S';
preferably, step 3 specifically includes the following steps:
after the new training data set S' is obtained, the first classifier H_1 and the second classifier H_2 are trained on it; the trained H_1 and H_2 are applied to the prediction datum v to obtain the prediction result L_1 of the first classifier and the prediction result L_2 of the second classifier; the influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier are then used with the calculation formula L_pre = L_1·o_1 + L_2·o_2 to obtain the prediction result;
in step 3, when the value of L_pre is greater than β, the class of v is predicted to be the minority class;
in step 3, when the value of L_pre is less than or equal to β, the class of v is predicted to be the majority class.
Compared with the prior art, the invention has the following advantages and positive effects:
the invention can well solve the class imbalance problem;
the method adds a screening process for the newly generated minority class data, removing data that deviate from the real distribution and retaining data that reflect the true characteristics of the minority class;
a software defect prediction method capable of handling class imbalance is provided, which can be widely applied to various software defect data sets and solves the class imbalance problem.
Drawings
FIG. 1: flow diagram of the class imbalance software defect prediction method of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawing and specific embodiments; the embodiments are illustrative and are not intended to limit the invention.
The general implementation flow chart of the invention is shown in fig. 1, and the specific implementation is as follows:
Step 1: select any minority class datum in the minority class data set and compute its Euclidean distance to each minority class datum in the minority class data set in turn, screening out the minority class datum closest to the selected one; compute the Euclidean distance from the selected minority class datum to each majority class datum in the majority class data set in turn, screening out the majority class datum closest to the selected one; calculate the distance parameter of the selected minority class datum from the shortest Euclidean distance to the minority class data in the minority class data set and the shortest Euclidean distance to the majority class data in the majority class data set. Mark the minority class data in the minority class data set according to their distance parameters and obtain the data point type of each minority class datum. Calculate the K-nearest-neighbor point set of each minority class datum in the minority class data set, divide it into a K-neighbor majority class data set and a K-neighbor minority class data set, count the number of majority class data and the number of minority class data in these sets respectively, and calculate the number of newly generated minority class data of each minority class datum in the minority class data set.
In step 1 the software defect data are S = {S_min, S_max};
the minority class data set in step 1 is S_min = {p_1, p_2, …, p_N};
the majority class data set in step 1 is S_max = {d_1, d_2, …, d_K};
where S_min represents the set of minority class data, S_max represents the set of majority class data, p_i represents the i-th minority class datum in the minority class data set, i ∈ [1, N], N represents the number of minority class data in the minority class data set, d_k represents the k-th majority class datum in the majority class data set, k ∈ [1, K], and K represents the number of majority class data in the majority class data set;
in step 1 the minority class datum closest to the selected minority class datum is p_(min_i), i ∈ [1, N], min_i ∈ [1, N], where p_(min_i) represents the minority class datum in the minority class data set that is closest to the selected i-th minority class datum, and N represents the number of minority class data in the minority class data set;
the majority class datum closest to the selected minority class datum in step 1 is d_(max_i), i ∈ [1, N], max_i ∈ [1, K], where d_(max_i) represents the majority class datum in the majority class data set that is closest to the selected i-th minority class datum, and K represents the number of majority class data in the majority class data set;
the shortest Euclidean distance between the selected minority class datum and the minority class data in the minority class data set in step 1 is Dis_min(i) = ‖p_i − p_(min_i)‖;
the shortest Euclidean distance between the selected minority class datum and the majority class data in the majority class data set in step 1 is Dis_max(i) = ‖p_i − d_(max_i)‖;
the distance parameter of the selected minority class datum calculated in step 1 is α_i = Dis_min(i) / Dis_max(i),
where α_i is the distance parameter of the i-th minority class datum in the minority class data set;
step 1 marks the minority class data in the minority class data set according to their distance parameters as follows:
if α_i < 1, the data point type of the i-th minority class datum in the minority class data set is marked as a safe point, and flag_i = 1;
if α_i = 1, the data point type of the i-th minority class datum in the minority class data set is marked as a confusion point, and flag_i = 2;
if α_i > 1, the data point type of the selected i-th minority class datum in the minority class data set is marked as a danger point, and flag_i = 3;
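To make the marking rule above concrete, the following Python sketch computes the nearest-neighbour distances and the distance parameter for each minority sample and assigns the safe/confusion/danger flags. It assumes that α_i is the ratio of the nearest-minority distance to the nearest-majority distance (the exact formula appears only as an image in the original), and all function and variable names are illustrative.

```python
import numpy as np

def mark_minority_points(S_min, S_max):
    """Compute alpha_i and the point-type flag for every minority sample.
    Flags: 1 = safe point, 2 = confusion point, 3 = danger point."""
    S_min = np.asarray(S_min, dtype=float)
    S_max = np.asarray(S_max, dtype=float)
    alphas, flags = [], []
    for i, p in enumerate(S_min):
        d_to_min = np.linalg.norm(S_min - p, axis=1)
        d_to_min[i] = np.inf                                # exclude the point itself
        dis_min = d_to_min.min()                            # nearest minority neighbour
        dis_max = np.linalg.norm(S_max - p, axis=1).min()   # nearest majority neighbour
        alpha = dis_min / dis_max                           # assumed form of the distance parameter
        alphas.append(alpha)
        if alpha < 1:
            flags.append(1)
        elif alpha == 1:
            flags.append(2)
        else:
            flags.append(3)
    return np.array(alphas), np.array(flags)
```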
Step 1 calculates the K-nearest-neighbor point set of each minority class datum in the minority class data set, where K is set to 5;
the K-nearest-neighbor point set of each minority class datum in step 1 is divided into a K-neighbor majority class data set and a K-neighbor minority class data set, specifically:
in step 1 the number of majority class data in the K-neighbor majority class data set is recorded as k_i^max;
in step 1 the number of minority class data in the K-neighbor minority class data set is recorded as k_i^min;
step 1 calculates the number of newly generated minority class data of each minority class datum in the minority class data set, specifically:
n_i = [equation image: a function of α_i, k_i^max and k_i^min],
where α_i is the distance parameter of the i-th minority class datum in the minority class data set, and n_i is the number of newly generated minority class data for the i-th minority class datum in the minority class data set;
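The counting step can be sketched as below with scikit-learn's NearestNeighbors (K = 5 as in this embodiment). Because the formula for n_i is only given as an equation image, the combination rule used here, more synthetic samples for points with a larger α_i and more majority-class neighbours, is an assumption; all names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def new_sample_counts(S_min, S_max, alphas, K=5):
    """Return (k_max, k_min, n_i) for every minority sample p_i."""
    S_min = np.asarray(S_min, dtype=float)
    S_max = np.asarray(S_max, dtype=float)
    X = np.vstack([S_min, S_max])
    labels = np.array([1] * len(S_min) + [0] * len(S_max))  # 1 = minority, 0 = majority
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)          # +1 because the query point is in X
    counts = []
    for i, p in enumerate(S_min):
        _, idx = nn.kneighbors(p.reshape(1, -1))
        neigh = idx[0][1:]                                   # drop the point itself
        k_max = int(np.sum(labels[neigh] == 0))              # majority-class neighbours
        k_min = K - k_max                                    # minority-class neighbours
        n_i = int(round(alphas[i] * k_max))                  # assumed combination rule
        counts.append((k_max, k_min, n_i))
    return counts
```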
Step 1 calculates the newly generated software defect prediction data;
in step 1 the i-th minority class datum in the minority class data set generates n_i new minority class data, so the newly generated minority class data are denoted p^new_(i,j), where j ∈ [1, n_i];
in step 1 the offset of the j-th newly generated datum of the i-th minority class datum in the minority class data set away from the majority class is recorded as ε_(i,j),
where the offset ε_(i,j) of the j-th newly generated datum of the i-th minority class datum away from the majority class is calculated as
ε_(i,j) = [equation image: computed from p_i and its nearest majority class datum d_(max_i)],
where the degree-of-deviation-from-the-majority-class parameter is a random number taking values in [0, 1], and d_(max_i) is its nearest majority class datum;
in step 1 the offset of the j-th newly generated datum of the i-th minority class datum in the minority class data set toward the minority class is recorded as σ_(i,j);
in step 1 the offset σ_(i,j) of the j-th newly generated datum of the i-th minority class datum is calculated as
σ_(i,j) = [equation image: computed from p_i and its nearest minority class datum p_(min_i)],
where the bias-toward-the-minority-class parameter is a random number taking values in [0, 1.5], and p_(min_i) is its nearest minority class datum;
In step 1 the newly generated software defect prediction minority class datum is recorded as p^new_(i,j);
the j-th newly generated datum of the i-th minority class datum of the newly generated software defect prediction data is calculated as
p^new_(i,j) = p_i + ε_(i,j) + σ_(i,j);
step 1 obtains the newly formed minority class data set, recorded as S_new;
in step 1 the number n_i of defect data newly generated for each minority class point p_i, together with the minority class data p^new generated above, yields the new minority class data set S_new,
where S_new = {p'_1, p'_2, …, p'_(N')},
N' is the number of elements in the newly formed minority class data set S_new; the category of the new data is marked as defect data, and this category mark is the weak label L_w; the i-th datum of the new minority class data set is denoted by the symbol p'_i, and its weak label is L_w(p'_i).
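The generation step can then be sketched as follows. Since the formulas for ε_(i,j) and σ_(i,j) are given only as images, the directions used here, a random fraction (0 to 1) of the vector from the nearest majority neighbour to p_i plus a random fraction (0 to 1.5) of the vector from p_i to its nearest minority neighbour, are assumptions consistent with the surrounding description; the names are illustrative.

```python
import numpy as np

def generate_synthetic(S_min, S_max, counts, seed=0):
    """Create n_i synthetic samples around each minority point p_i as
    p_new = p_i + eps + sigma."""
    rng = np.random.default_rng(seed)
    S_min = np.asarray(S_min, dtype=float)
    S_max = np.asarray(S_max, dtype=float)
    S_new = []
    for i, p in enumerate(S_min):
        d_to_min = np.linalg.norm(S_min - p, axis=1)
        d_to_min[i] = np.inf
        p_near_min = S_min[int(d_to_min.argmin())]                            # nearest minority neighbour
        d_near_max = S_max[int(np.linalg.norm(S_max - p, axis=1).argmin())]   # nearest majority neighbour
        n_i = counts[i][2]
        for _ in range(n_i):
            eps = rng.uniform(0, 1) * (p - d_near_max)       # offset away from the majority class
            sigma = rng.uniform(0, 1.5) * (p_near_min - p)   # offset toward the minority class
            S_new.append(p + eps + sigma)
    return np.array(S_new)
```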
Step 2, respectively selecting a first classifier and a second classifier, and performing confidence evaluation on the newly generated software defect prediction minority class data to obtain a training data set;
the step 2 is specifically as follows:
step 2, respectively calculating the influence degree of the first classifier and the influence degree of the second classifier;
In step 2 the newly formed minority class data set S_new is used to train the first classifier H_1, and the data of S_new are brought into H_1 in turn to obtain the predicted class L_p1; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_1 is L_p1(p'_i);
the newly formed minority class data set S_new is used to train the second classifier H_2, and the data of S_new are brought into H_2 in turn to obtain the predicted class L_p2; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_2 is L_p2(p'_i).
The influence degree of the first classifier is
o_1 = [equation image: computed from the number of points whose H_1 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min; the indicator of the first classifier H_1 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise, and the indicator of the second classifier H_2 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise.
The influence degree of the second classifier is
o_2 = [equation image: computed from the number of points whose H_2 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min, and the indicators of H_1 and H_2 are defined in the same way.
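A sketch of the influence-degree computation is given below. The patent does not fix the two base classifiers, so a decision tree and naive Bayes are used here as stand-ins; the classifiers are fitted on the original labelled data together with the weakly labelled synthetic set (fitting them on the single-class S_new alone would be degenerate), and the agreement counts are normalised by their sum. These choices, and the names, are assumptions since the influence-degree formulas appear only as images.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

def influence_degrees(X_orig, y_orig, S_new, L_w):
    """Fit H1 and H2 and measure how often each one agrees with the weak labels
    L_w of the synthetic minority samples S_new."""
    X_fit = np.vstack([X_orig, S_new])
    y_fit = np.concatenate([y_orig, L_w])
    H1 = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
    H2 = GaussianNB().fit(X_fit, y_fit)
    agree1 = H1.predict(S_new) == L_w     # per-sample agreement of H1 with the weak label
    agree2 = H2.predict(S_new) == L_w     # per-sample agreement of H2 with the weak label
    total = max(agree1.sum() + agree2.sum(), 1)
    o1 = agree1.sum() / total
    o2 = agree2.sum() / total
    return H1, H2, o1, o2, agree1, agree2
```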
Step 2, updating the labels of the minority data according to the influence degree of the first classifier and the influence degree of the second classifier so as to construct updated original software defect data;
step 2, calculating the weak mark
Figure GDA0003109305290000101
Confidence of (2), by the symbol γiAnd (4) showing.
Step 2, weak marking of new minority class data set
Figure GDA0003109305290000102
The judgment is carried out according to the influence degree of the classifier, and the calculation formula is
Figure GDA0003109305290000103
When the confidence degree gammaiBeta is 0.5, the minority class data
Figure GDA0003109305290000104
Adding training data when gamma isiWhen the beta is less than or equal to 0.5, the data is directly deleted, and the data of the minority class is not added into the new training set.
Step 2, newly forming minority class data, namely SnewIs screened again to obtain new minority class data Snew', will SnewAdding original software defect data S to obtain a new training set S';
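The screening of S_new can be sketched as below. The confidence formula is only given as an image, so γ_i is taken here as the influence-weighted agreement of the two classifiers with the weak label, mirroring the weighted vote used in step 3; this is an assumption, and the names are illustrative.

```python
import numpy as np

def screen_synthetic(S_new, agree1, agree2, o1, o2, beta=0.5):
    """Keep only the synthetic samples whose weak-label confidence exceeds beta."""
    gamma = o1 * agree1.astype(float) + o2 * agree2.astype(float)  # assumed confidence formula
    keep = gamma > beta
    return np.asarray(S_new)[keep], gamma
```

The retained samples S_new' are then stacked with the original software defect data S to form the enlarged training set S'.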
step 3, obtaining a final prediction result through weighted voting by using the first classifier, the second classifier and the obtained training set S' selected in the step 2;
the step 3 specifically comprises the following steps:
After the new training data set S' is obtained, the first classifier H_1 and the second classifier H_2 are trained on it; the trained H_1 and H_2 are applied to the prediction datum v to obtain the prediction result L_1 of the first classifier and the prediction result L_2 of the second classifier; the influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier are then used with the calculation formula L_pre = L_1·o_1 + L_2·o_2 to obtain the prediction result;
in step 3, when the value of L_pre is greater than 0.5, the class of v is predicted to be the minority class;
in step 3, when the value of L_pre is less than or equal to 0.5, the class of v is predicted to be the majority class.
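The final weighted vote can be sketched as follows, assuming the classes are encoded as 1 for the minority (defective) class and 0 for the majority class, and that H1 and H2 have been re-fitted on the enlarged training set S'; the names are illustrative.

```python
import numpy as np

def predict_defect(H1, H2, o1, o2, X_test, beta=0.5):
    """Weighted vote L_pre = L1*o1 + L2*o2; predict the minority class when L_pre > beta."""
    L1 = H1.predict(X_test)
    L2 = H2.predict(X_test)
    L_pre = L1 * o1 + L2 * o2
    return (L_pre > beta).astype(int)   # 1 = minority / defective, 0 = majority
```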
In this embodiment the method is compared with the mainstream SMOTE+SVM, SMOTE+decision tree, SMOTE+k-nearest-neighbor and SMOTE+naive Bayes methods, and the precision, F-measure, balance and AUC indexes are compared. Among all the compared methods, the method of the invention achieves the highest accuracy, and its recognition accuracy reaches the advanced level of the field.
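For reference, the indexes mentioned above can be computed as in the sketch below; the 'balance' definition used here is the one commonly used in software defect prediction, balance = 1 − sqrt(((0 − pf)² + (1 − pd)²) / 2), which is an assumption since the embodiment does not define it.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """Compute precision, F-measure, balance and AUC for a defect predictor."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pd_ = recall_score(y_true, y_pred)                 # probability of detection
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    pf = fp / max(fp + tn, 1)                          # probability of false alarm
    balance = 1 - np.sqrt(((0 - pf) ** 2 + (1 - pd_) ** 2) / 2)
    return {
        "precision": precision_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred),
        "balance": balance,
        "auc": roc_auc_score(y_true, y_score),
    }
```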
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (2)

1. A class imbalance software defect prediction method based on data resampling, characterized by comprising the following steps:
step 1, selecting any minority class datum in the minority class data set and computing its Euclidean distance to each minority class datum in the minority class data set in turn, screening out the minority class datum in the minority class data set closest to the selected one; computing the Euclidean distance from the selected minority class datum to each majority class datum in the majority class data set in turn, screening out the majority class datum in the majority class data set closest to the selected one; calculating the distance parameter of the selected minority class datum from the shortest Euclidean distance to the minority class data in the minority class data set and the shortest Euclidean distance to the majority class data in the majority class data set; marking the minority class data in the minority class data set according to their distance parameters, and obtaining the data point type of each minority class datum; calculating the K-nearest-neighbor point set of each minority class datum in the minority class data set, dividing it into a K-neighbor majority class data set and a K-neighbor minority class data set, counting the number of majority class data in the K-neighbor majority class data set and the number of minority class data in the K-neighbor minority class data set respectively, and calculating the number of newly generated minority class data of each minority class datum in the minority class data set;
step 2, selecting a first classifier and a second classifier respectively, and performing confidence evaluation on the newly generated software defect prediction minority class data to obtain a training data set;
step 3, obtaining a final prediction result through weighted voting by using the first classifier and the second classifier selected in step 2 and the obtained training set S';
the software defect data is: s ═ Smin,Smax};
Step 1, the minority class data set is as follows:
Figure FDA0003463770520000011
step 1, most of data sets are as follows:
Figure FDA0003463770520000012
wherein S isminRepresenting a collection of minority classes of data, denoted by SmaxRepresenting sets of data of most classes, piRepresenting the ith minority class data in the minority class data set, i belongs to [1, N ]]N denotes the number of minority class data in the minority class data set, dkRepresenting the kth majority class data in the majority class data set, K ∈ [1, K ∈]K represents the number of majority class data in the majority class data set;
the minority class datum closest to the selected minority class datum in step 1 is p_(min_i), i ∈ [1, N], min_i ∈ [1, N], where p_(min_i) represents the minority class datum in the minority class data set that is closest to the selected i-th minority class datum, and N represents the number of minority class data in the minority class data set;
the majority class datum closest to the selected minority class datum in step 1 is d_(max_i), i ∈ [1, N], max_i ∈ [1, K], where d_(max_i) represents the majority class datum in the majority class data set that is closest to the selected i-th minority class datum, and K represents the number of majority class data in the majority class data set;
the shortest Euclidean distance between the minority class datum selected in step 1 and the minority class data in the minority class data set is Dis_min(i) = ‖p_i − p_(min_i)‖;
the shortest Euclidean distance between the minority class datum selected in step 1 and the majority class data in the majority class data set is Dis_max(i) = ‖p_i − d_(max_i)‖;
the distance parameter of the selected minority class datum calculated in step 1 is α_i = Dis_min(i) / Dis_max(i),
where α_i is the distance parameter of the i-th minority class datum in the minority class data set;
step 1 marks the minority class data in the minority class data set according to their distance parameters as follows:
if α_i < 1, the data point type of the i-th minority class datum in the minority class data set is marked as a safe point, and flag_i = 1;
if α_i = 1, the data point type of the i-th minority class datum in the minority class data set is marked as a confusion point, and flag_i = 2;
if α_i > 1, the data point type of the selected i-th minority class datum in the minority class data set is marked as a danger point, and flag_i = 3;
step 1 calculates the K-nearest-neighbor point set of each minority class datum in the minority class data set;
step 1 divides the K-nearest-neighbor point set of each minority class datum into a K-neighbor majority class data set and a K-neighbor minority class data set, specifically:
in step 1 the number of majority class data in the K-neighbor majority class data set is recorded as k_i^max;
in step 1 the number of minority class data in the K-neighbor minority class data set is recorded as k_i^min;
step 1 calculates the number of newly generated minority class data of each minority class datum in the minority class data set, specifically:
n_i = [equation image: a function of α_i, k_i^max and k_i^min],
where α_i is the distance parameter of the i-th minority class datum in the minority class data set, and n_i is the number of newly generated minority class data for the i-th minority class datum in the minority class data set;
step 1 calculates the newly generated software defect prediction data;
in step 1 the i-th minority class datum in the minority class data set generates n_i new minority class data, so the newly generated minority class data are denoted p^new_(i,j), where j ∈ [1, n_i];
in step 1 the offset of the j-th newly generated datum of the i-th minority class datum in the minority class data set away from the majority class is recorded as ε_(i,j),
where the offset ε_(i,j) of the j-th newly generated datum of the i-th minority class datum in the minority class data set away from the majority class is calculated as
ε_(i,j) = [equation image: computed from p_i and its nearest majority class datum d_(max_i)],
where the degree-of-deviation-from-the-majority-class parameter is a random number taking values in [0, 1], and d_(max_i) is its nearest majority class datum;
in step 1 the offset of the j-th newly generated datum of the i-th minority class datum in the minority class data set toward the minority class is recorded as σ_(i,j);
in step 1 the offset σ_(i,j) of the j-th newly generated datum of the i-th minority class datum in the minority class data set is calculated as
σ_(i,j) = [equation image: computed from p_i and its nearest minority class datum p_(min_i)],
where the bias-toward-the-minority-class parameter is a random number taking values in [0, 1.5], and p_(min_i) is its nearest minority class datum;
in step 1 the newly generated software defect prediction minority class datum is recorded as p^new_(i,j);
the j-th newly generated datum of the i-th minority class datum of the newly generated software defect prediction data is calculated as
p^new_(i,j) = p_i + ε_(i,j) + σ_(i,j);
step 1 obtains the newly generated minority class data set, recorded as S_new;
in step 1 the number n_i of defect data newly generated for each minority class point p_i, together with the minority class data p^new generated above, yields the newly generated minority class data set S_new,
where S_new = {p'_1, p'_2, …, p'_(N')},
N' is the number of elements in the new minority class data set S_new; the category of the new data is marked as defect data, and this category mark is the weak label L_w; the i-th datum of the new minority class data set is denoted by the symbol p'_i, and its weak label is L_w(p'_i);
step 2 is specifically as follows:
step 2 calculates the influence degree of the first classifier and the influence degree of the second classifier respectively;
in step 2 the newly formed minority class data set S_new is used to train the first classifier H_1, and the data of S_new are brought into H_1 in turn to obtain the predicted class L_p1; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_1 is L_p1(p'_i);
the newly formed minority class data set S_new is used to train the second classifier H_2, and the data of S_new are brought into H_2 in turn to obtain the predicted class L_p2; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_2 is L_p2(p'_i);
the influence degree of the first classifier is
o_1 = [equation image: computed from the number of points whose H_1 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min; the indicator of the first classifier H_1 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise, and the indicator of the second classifier H_2 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise;
the influence degree of the second classifier is
o_2 = [equation image: computed from the number of points whose H_2 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min, and the indicators of H_1 and H_2 are defined in the same way;
step 2 updates the labels of the minority class data according to the influence degree of the first classifier and the influence degree of the second classifier, so as to construct updated original software defect data;
step 2 calculates the confidence of the weak label L_w(p'_i), denoted by the symbol γ_i;
in step 2 the weak label of the new minority class data set is judged according to the influence degrees of the classifiers, with the calculation formula
γ_i = [equation image: combines the influence degrees o_1 and o_2 with the agreement of H_1 and H_2 with the weak label];
when the confidence γ_i > β, the minority class datum p'_i is added to the training data; when γ_i ≤ β, the datum is deleted directly and this minority class datum is not added to the new training set;
in step 2 the newly formed minority class data S_new are screened again in this way to obtain the newly generated minority class data S_new', and S_new' is added to the original software defect data S to obtain the new training set S'.
2. The method of claim 1, wherein step 3 specifically comprises the following steps:
after the new training data set S' is obtained, the first classifier H_1 and the second classifier H_2 are trained on it; the trained H_1 and H_2 are applied to the prediction datum v to obtain the prediction result L_1 of the first classifier and the prediction result L_2 of the second classifier; the influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier are then used with the calculation formula L_pre = L_1·o_1 + L_2·o_2 to obtain the prediction result;
in step 3, when the value of L_pre is greater than β, the class of the prediction datum v is the minority class;
in step 3, when the value of L_pre is less than or equal to β, the class of the prediction datum v is the majority class.
CN202110428102.3A 2021-04-21 2021-04-21 Class imbalance software defect prediction method based on data resampling Active CN113204481B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110428102.3A CN113204481B (en) 2021-04-21 2021-04-21 Class imbalance software defect prediction method based on data resampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110428102.3A CN113204481B (en) 2021-04-21 2021-04-21 Class imbalance software defect prediction method based on data resampling

Publications (2)

Publication Number Publication Date
CN113204481A CN113204481A (en) 2021-08-03
CN113204481B 2022-03-04

Family

ID=77027498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110428102.3A Active CN113204481B (en) 2021-04-21 2021-04-21 Class imbalance software defect prediction method based on data resampling

Country Status (1)

Country Link
CN (1) CN113204481B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012116208A2 (en) * 2011-02-23 2012-08-30 New York University Apparatus, method, and computer-accessible medium for explaining classifications of documents
CN103810101B (en) * 2014-02-19 2019-02-19 北京理工大学 A kind of Software Defects Predict Methods and software defect forecasting system
US10430315B2 (en) * 2017-10-04 2019-10-01 Blackberry Limited Classifying warning messages generated by software developer tools

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677564A (en) * 2016-01-04 2016-06-15 中国石油大学(华东) Adaboost software defect unbalanced data classification method based on improvement
CN107391452A (en) * 2017-07-06 2017-11-24 武汉大学 A kind of software defect estimated number method based on data lack sampling and integrated study
CN108563556A (en) * 2018-01-10 2018-09-21 江苏工程职业技术学院 Software defect prediction optimization method based on differential evolution algorithm
CN110471856A (en) * 2019-08-21 2019-11-19 大连海事大学 A kind of Software Defects Predict Methods based on data nonbalance
CN110533116A (en) * 2019-09-04 2019-12-03 大连大学 Based on the adaptive set of Euclidean distance at unbalanced data classification method
CN110674865A (en) * 2019-09-20 2020-01-10 燕山大学 Rule learning classifier integration method oriented to software defect class distribution unbalance
CN110942153A (en) * 2019-11-11 2020-03-31 西北工业大学 Data resampling method based on repeated editing nearest neighbor and clustering oversampling
CN111090579A (en) * 2019-11-14 2020-05-01 北京航空航天大学 Software defect prediction method based on Pearson correlation weighting association classification rule
CN111522736A (en) * 2020-03-26 2020-08-11 中南大学 Software defect prediction method and device, electronic equipment and computer storage medium
CN111767216A (en) * 2020-06-23 2020-10-13 江苏工程职业技术学院 Cross-version depth defect prediction method capable of relieving class overlap problem
CN112465040A (en) * 2020-12-01 2021-03-09 杭州电子科技大学 Software defect prediction method based on class imbalance learning algorithm

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SMOTE: Synthetic Minority Over-sampling Technique; Nitesh V. Chawla et al.; Journal of Artificial Intelligence Research; 2002-06-02 *
Class-imbalanced sparse reconstruction metric learning for software defect prediction (类不平衡稀疏重构度量学习软件缺陷预测); Shi Zuoting; Computer Technology and Development (计算机技术与发展); 2018-06-10 *
Machine learning classification strategies for imbalanced data sets (面向不平衡数据集的机器学习分类策略); Xu Lingling, Chi Dongxiang; Computer Engineering and Applications (计算机工程与应用); 2020-11-20 *

Also Published As

Publication number Publication date
CN113204481A (en) 2021-08-03

Similar Documents

Publication Publication Date Title
CN112069415A (en) Interest point recommendation method based on heterogeneous attribute network characterization learning
CN108171209A (en) A kind of face age estimation method that metric learning is carried out based on convolutional neural networks
CN109101938B (en) Multi-label age estimation method based on convolutional neural network
CN109947963A (en) A kind of multiple dimensioned Hash search method based on deep learning
CN105760888B (en) A kind of neighborhood rough set integrated learning approach based on hierarchical cluster attribute
CN107704512A (en) Financial product based on social data recommends method, electronic installation and medium
CN107480723B (en) Texture Recognition based on partial binary threshold learning network
CN105955951A (en) Message filtering method and device
CN112115993B (en) Zero sample and small sample evidence photo anomaly detection method based on meta-learning
CN107808375A (en) Merge the rice disease image detecting method of a variety of context deep learning models
CN113159149B (en) Method and device for identifying enterprise office address
CN113032613B (en) Three-dimensional model retrieval method based on interactive attention convolution neural network
CN111950195B (en) Project progress prediction method based on portrait system and depth regression model
US11682039B2 (en) Determining a target group based on product-specific affinity attributes and corresponding weights
CN111144466B (en) Image sample self-adaptive depth measurement learning method
CN111144462A (en) Unknown individual identification method and device for radar signals
CN113204481B (en) Class imbalance software defect prediction method based on data resampling
CN110232397A (en) A kind of multi-tag classification method of combination supporting vector machine and projection matrix
CN116992155B (en) User long tail recommendation method and system utilizing NMF with different liveness
CN113591016A (en) Landslide labeling contour generation method based on multi-user cooperation
CN113420797A (en) Online learning image attribute identification method and system
CN111783688B (en) Remote sensing image scene classification method based on convolutional neural network
CN108965585B (en) User identity recognition method based on smart phone sensor
CN116245259A (en) Photovoltaic power generation prediction method and device based on depth feature selection and electronic equipment
CN114372181B (en) Equipment production intelligent planning method based on multi-mode data

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant