CN113204481B - Class imbalance software defect prediction method based on data resampling - Google Patents

Class imbalance software defect prediction method based on data resampling

Info

Publication number
CN113204481B
CN113204481B · Application CN202110428102.3A
Authority
CN
China
Prior art keywords
data
minority
class data
class
minority class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110428102.3A
Other languages
Chinese (zh)
Other versions
CN113204481A (en
Inventor
荆晓远 (Jing Xiaoyuan)
孔晓辉 (Kong Xiaohui)
陈昊文 (Chen Haowen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202110428102.3A
Publication of CN113204481A
Application granted
Publication of CN113204481B
Legal status: Active

Classifications

    • G06F 11/3672 Test management (G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 11/00 Error detection; Error correction; Monitoring > G06F 11/36 Preventing errors by testing or debugging software > G06F 11/3668 Software testing)
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting (G06F 18/00 Pattern recognition > G06F 18/20 Analysing > G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation)
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification (G06F 18/00 Pattern recognition > G06F 18/20 Analysing > G06F 18/24 Classification techniques > G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches > G06F 18/2413 based on distances to training or reference patterns)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a class imbalance software defect prediction method based on data resampling. For each minority class sample, the method computes the Euclidean distances to the other minority class samples and to the majority class samples, screens out the nearest minority class sample and the nearest majority class sample, and derives a distance parameter for the minority class sample from these distances. The minority class samples are then marked according to their distance parameters to obtain their data point types. The K-nearest-neighbor set of each minority class sample is computed, and the numbers of majority class and minority class samples in that set are counted to determine how many new minority class samples to generate. Two classifiers are then selected, the newly generated software defect prediction minority class data undergo confidence evaluation to obtain a training data set, the selected classifiers are trained, and the final prediction result is obtained through weighted voting. The invention can well solve the class imbalance problem in the software defect prediction process.

Description

Class imbalance software defect prediction method based on data resampling
Technical Field
The invention belongs to the field of software defect prediction, and particularly relates to a class imbalance software defect prediction method based on data resampling.
Background
With the development of society and the advance of science and technology, the internet has become deeply embedded in our lives. Everyday activities such as online shopping, travelling, smart home control and ordering in restaurants can all be completed through software, and software usage scenarios permeate nearly every aspect of how we dress, eat, live and travel. During software development, functional requirements keep growing, the number of users served by the software keeps increasing, and development schedules keep being compressed. These pressures make defects easy to introduce during development; when software defects occur, the software cannot provide its normal functions, which causes huge production and economic losses and greatly affects people's normal lives.
However, in a real development environment the amount of data with software defects is far smaller than the amount of data without defects, so a software defect prediction model built on such data is less likely to identify the code modules that actually contain defects. An ideal software defect prediction model needs to be more sensitive to defective data and to predict more accurately whether a code module is defective, so solving the class imbalance problem in software defect prediction becomes very important. To overcome these shortcomings, the invention provides a class imbalance software defect prediction method.
Disclosure of Invention
The main purpose of the invention is to solve the class imbalance problem in software defect prediction. It provides a software defect prediction method for the class imbalance problem and is generally applicable to software defect prediction. To achieve the above object, the invention comprises the following steps:
Step 1: select any minority class datum in the minority class data set and compute its Euclidean distance to each minority class datum in the minority class data set in turn, screening out the minority class datum closest to the selected one; compute the Euclidean distance from the selected minority class datum to each majority class datum in the majority class data set in turn, screening out the majority class datum closest to the selected one; calculate the distance parameter of the selected minority class datum from the shortest Euclidean distance to the minority class data in the minority class data set and the shortest Euclidean distance to the majority class data in the majority class data set. Mark the minority class data in the minority class data set according to their distance parameters and obtain the data point type of each minority class datum. Calculate the K-nearest-neighbor point set of each minority class datum in the minority class data set, divide it into a K-neighbor majority class data set and a K-neighbor minority class data set, count the number of majority class data and the number of minority class data in these sets respectively, and calculate the number of newly generated minority class data of each minority class datum in the minority class data set.
Step 2: select a first classifier and a second classifier respectively, and perform confidence evaluation on the newly generated software defect prediction minority class data to obtain a training data set.
Step 3: using the first classifier and the second classifier selected in step 2 together with the obtained training set S', obtain the final prediction result through weighted voting.
Preferably, the software defect data in step 1 is S = {S_min, S_max};
the minority class data set in step 1 is S_min = {p_1, p_2, …, p_N};
the majority class data set in step 1 is S_max = {d_1, d_2, …, d_K};
where S_min represents the set of minority class data, S_max represents the set of majority class data, p_i represents the i-th minority class datum in the minority class data set, i ∈ [1, N], N represents the number of minority class data in the minority class data set, d_k represents the k-th majority class datum in the majority class data set, k ∈ [1, K], and K represents the number of majority class data in the majority class data set;
the minority class datum closest to the selected minority class datum in step 1 is p_(min_i), i ∈ [1, N], min_i ∈ [1, N], where p_(min_i) represents the minority class datum in the minority class data set that is closest to the selected i-th minority class datum, and N represents the number of minority class data in the minority class data set;
the majority class datum closest to the selected minority class datum in step 1 is d_(max_i), i ∈ [1, N], max_i ∈ [1, K], where d_(max_i) represents the majority class datum in the majority class data set that is closest to the selected i-th minority class datum, and K represents the number of majority class data in the majority class data set;
the shortest Euclidean distance between the selected minority class datum and the minority class data in the minority class data set in step 1 is Dis_min(i) = ‖p_i − p_(min_i)‖;
the shortest Euclidean distance between the selected minority class datum and the majority class data in the majority class data set in step 1 is Dis_max(i) = ‖p_i − d_(max_i)‖;
the distance parameter of the selected minority class datum calculated in step 1 is α_i = Dis_min(i) / Dis_max(i),
where α_i is the distance parameter of the i-th minority class datum in the minority class data set;
step 1 marks the minority class data in the minority class data set according to their distance parameters as follows:
if α_i < 1, the data point type of the i-th minority class datum in the minority class data set is marked as a safe point, and flag_i = 1;
if α_i = 1, the data point type of the i-th minority class datum in the minority class data set is marked as a confusion point, and flag_i = 2;
if α_i > 1, the data point type of the selected i-th minority class datum in the minority class data set is marked as a danger point, and flag_i = 3;
Step 1, calculating a K neighbor point set of each minority class data in the minority class data set:
the K neighbor point set of each minority data in step 1 is divided into a K neighbor point majority data set and a K neighbor point minority data set, and specifically comprises the following steps:
1, the number of the majority class data in the K neighbor point majority class data set is marked as
Figure GDA0003109305290000033
1, recording the number of minority class data in the K neighbor point minority class data set as
Figure GDA0003109305290000034
Step 1, calculating the newly generated minority class data quantity of each minority class data in the minority class data set, specifically:
Figure GDA0003109305290000035
wherein is alphaiDistance parameter, n, for the ith minority class data in the minority class data setiIs a minority of the numberThe number of newly generated minority class data for the ith each minority class data in the data set;
step 1, calculating newly generated software defect prediction data;
step 1, the ith minority class data in the minority class data set generates niNew minority class data, so the newly generated minority class data is used as pnew i,jIs represented by where j ∈ [1, n ]i]
Step 1, the deviation amount of the jth newly generated data of the ith minority class data in the minority class data set from the majority class is marked as epsiloni,j
Wherein the offset epsilon of the jth newly generated data of the ith minority class data in the minority class data set from the majority classi,jThe calculation formula is as follows:
Figure GDA0003109305290000041
wherein,
Figure GDA0003109305290000042
in order to deviate from most of the class degree parameters, a random number with the value of 0-1 is taken,
Figure GDA00031093052900000411
most of its recent classes of data.
in step 1 the offset of the j-th newly generated datum of the i-th minority class datum in the minority class data set toward the minority class is recorded as σ_(i,j);
in step 1 the offset σ_(i,j) of the j-th newly generated datum of the i-th minority class datum is calculated as
σ_(i,j) = [equation image: computed from p_i and its nearest minority class datum p_(min_i)],
where the bias-toward-the-minority-class parameter is a random number taking values in [0, 1.5], and p_(min_i) is its nearest minority class datum;
in step 1 the newly generated software defect prediction minority class datum is recorded as p^new_(i,j);
the j-th newly generated datum of the i-th minority class datum of the newly generated software defect prediction data is calculated as
p^new_(i,j) = p_i + ε_(i,j) + σ_(i,j);
step 1 obtains the newly formed minority class data set, recorded as S_new;
in step 1 the number n_i of defect data newly generated for each minority class point p_i, together with the minority class data p^new generated above, yields the new minority class data set S_new,
where S_new = {p'_1, p'_2, …, p'_(N')},
N' is the number of elements in the newly formed minority class data set S_new; the category of the new data is marked as defect data, and this category mark is the weak label L_w; the i-th datum of the new minority class data set is denoted by the symbol p'_i, and its weak label is L_w(p'_i);
Preferably, step 2 is specifically as follows:
step 2 calculates the influence degree of the first classifier and the influence degree of the second classifier respectively;
in step 2 the newly formed minority class data set S_new is used to train the first classifier H_1, and the data of S_new are brought into H_1 in turn to obtain the predicted class L_p1; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_1 is L_p1(p'_i);
the newly formed minority class data set S_new is used to train the second classifier H_2, and the data of S_new are brought into H_2 in turn to obtain the predicted class L_p2; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_2 is L_p2(p'_i);
the influence degree of the first classifier is
o_1 = [equation image: computed from the number of points whose H_1 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min; the indicator of the first classifier H_1 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise, and the indicator of the second classifier H_2 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise;
the influence degree of the second classifier is
o_2 = [equation image: computed from the number of points whose H_2 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min, and the indicators of H_1 and H_2 are defined in the same way.
Step 2, updating the labels of the minority data according to the influence degree of the first classifier and the influence degree of the second classifier so as to construct updated original software defect data;
step 2, calculating the weak mark
Figure GDA0003109305290000057
Confidence of (2), by the symbol γiAnd (4) showing.
Step 2, weak marking of new minority class data set
Figure GDA0003109305290000058
The judgment is carried out according to the influence degree of the classifier, and the calculation formula is
Figure GDA0003109305290000059
When the confidence degree gammaiBeta, will be this minority class of data
Figure GDA00031093052900000510
Adding training data when gamma isiWhen the beta value is less than or equal to beta, the data is directly deleted, and the data of the minority class is not added into a new training set.
Step 2, newly forming minority class data, namely SnewIs screened again to obtain new minority class data Snew', will SnewAdding original software defect data S to obtain a new training set S';
preferably, step 3 specifically includes the following steps:
after the new training data set S' is obtained, the first classifier H_1 and the second classifier H_2 are trained on it; the trained H_1 and H_2 are applied to the prediction datum v to obtain the prediction result L_1 of the first classifier and the prediction result L_2 of the second classifier; the influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier are then used with the calculation formula L_pre = L_1·o_1 + L_2·o_2 to obtain the prediction result;
in step 3, when the value of L_pre is greater than β, the class of v is predicted to be the minority class;
in step 3, when the value of L_pre is less than or equal to β, the class of v is predicted to be the majority class.
Compared with the prior art, the invention has the following advantages and positive effects:
the invention can well solve the class imbalance problem;
the method adds a screening process for the newly generated minority class data, removing data that deviate from the real distribution and retaining data that reflect the true characteristics of the minority class;
a software defect prediction method capable of handling class imbalance is provided, which can be widely applied to various software defect data sets and solves the class imbalance problem.
Drawings
FIG. 1: flow diagram of the class imbalance software defect prediction method of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawing and specific embodiments; the embodiments are illustrative and are not intended to limit the invention.
The general implementation flow chart of the invention is shown in fig. 1, and the specific implementation is as follows:
Step 1: select any minority class datum in the minority class data set and compute its Euclidean distance to each minority class datum in the minority class data set in turn, screening out the minority class datum closest to the selected one; compute the Euclidean distance from the selected minority class datum to each majority class datum in the majority class data set in turn, screening out the majority class datum closest to the selected one; calculate the distance parameter of the selected minority class datum from the shortest Euclidean distance to the minority class data in the minority class data set and the shortest Euclidean distance to the majority class data in the majority class data set. Mark the minority class data in the minority class data set according to their distance parameters and obtain the data point type of each minority class datum. Calculate the K-nearest-neighbor point set of each minority class datum in the minority class data set, divide it into a K-neighbor majority class data set and a K-neighbor minority class data set, count the number of majority class data and the number of minority class data in these sets respectively, and calculate the number of newly generated minority class data of each minority class datum in the minority class data set.
In step 1 the software defect data are S = {S_min, S_max};
the minority class data set in step 1 is S_min = {p_1, p_2, …, p_N};
the majority class data set in step 1 is S_max = {d_1, d_2, …, d_K};
where S_min represents the set of minority class data, S_max represents the set of majority class data, p_i represents the i-th minority class datum in the minority class data set, i ∈ [1, N], N represents the number of minority class data in the minority class data set, d_k represents the k-th majority class datum in the majority class data set, k ∈ [1, K], and K represents the number of majority class data in the majority class data set;
in step 1 the minority class datum closest to the selected minority class datum is p_(min_i), i ∈ [1, N], min_i ∈ [1, N], where p_(min_i) represents the minority class datum in the minority class data set that is closest to the selected i-th minority class datum, and N represents the number of minority class data in the minority class data set;
the majority class datum closest to the selected minority class datum in step 1 is d_(max_i), i ∈ [1, N], max_i ∈ [1, K], where d_(max_i) represents the majority class datum in the majority class data set that is closest to the selected i-th minority class datum, and K represents the number of majority class data in the majority class data set;
the shortest Euclidean distance between the selected minority class datum and the minority class data in the minority class data set in step 1 is Dis_min(i) = ‖p_i − p_(min_i)‖;
the shortest Euclidean distance between the selected minority class datum and the majority class data in the majority class data set in step 1 is Dis_max(i) = ‖p_i − d_(max_i)‖;
the distance parameter of the selected minority class datum calculated in step 1 is α_i = Dis_min(i) / Dis_max(i),
where α_i is the distance parameter of the i-th minority class datum in the minority class data set;
step 1 marks the minority class data in the minority class data set according to their distance parameters as follows:
if α_i < 1, the data point type of the i-th minority class datum in the minority class data set is marked as a safe point, and flag_i = 1;
if α_i = 1, the data point type of the i-th minority class datum in the minority class data set is marked as a confusion point, and flag_i = 2;
if α_i > 1, the data point type of the selected i-th minority class datum in the minority class data set is marked as a danger point, and flag_i = 3;
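To make the marking rule above concrete, the following Python sketch computes the nearest-neighbour distances and the distance parameter for each minority sample and assigns the safe/confusion/danger flags. It assumes that α_i is the ratio of the nearest-minority distance to the nearest-majority distance (the exact formula appears only as an image in the original), and all function and variable names are illustrative.

```python
import numpy as np

def mark_minority_points(S_min, S_max):
    """Compute alpha_i and the point-type flag for every minority sample.
    Flags: 1 = safe point, 2 = confusion point, 3 = danger point."""
    S_min = np.asarray(S_min, dtype=float)
    S_max = np.asarray(S_max, dtype=float)
    alphas, flags = [], []
    for i, p in enumerate(S_min):
        d_to_min = np.linalg.norm(S_min - p, axis=1)
        d_to_min[i] = np.inf                                # exclude the point itself
        dis_min = d_to_min.min()                            # nearest minority neighbour
        dis_max = np.linalg.norm(S_max - p, axis=1).min()   # nearest majority neighbour
        alpha = dis_min / dis_max                           # assumed form of the distance parameter
        alphas.append(alpha)
        if alpha < 1:
            flags.append(1)
        elif alpha == 1:
            flags.append(2)
        else:
            flags.append(3)
    return np.array(alphas), np.array(flags)
```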
Step 1 calculates the K-nearest-neighbor point set of each minority class datum in the minority class data set, where K is set to 5;
the K-nearest-neighbor point set of each minority class datum in step 1 is divided into a K-neighbor majority class data set and a K-neighbor minority class data set, specifically:
in step 1 the number of majority class data in the K-neighbor majority class data set is recorded as k_i^max;
in step 1 the number of minority class data in the K-neighbor minority class data set is recorded as k_i^min;
step 1 calculates the number of newly generated minority class data of each minority class datum in the minority class data set, specifically:
n_i = [equation image: a function of α_i, k_i^max and k_i^min],
where α_i is the distance parameter of the i-th minority class datum in the minority class data set, and n_i is the number of newly generated minority class data for the i-th minority class datum in the minority class data set;
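The counting step can be sketched as below with scikit-learn's NearestNeighbors (K = 5 as in this embodiment). Because the formula for n_i is only given as an equation image, the combination rule used here, more synthetic samples for points with a larger α_i and more majority-class neighbours, is an assumption; all names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def new_sample_counts(S_min, S_max, alphas, K=5):
    """Return (k_max, k_min, n_i) for every minority sample p_i."""
    S_min = np.asarray(S_min, dtype=float)
    S_max = np.asarray(S_max, dtype=float)
    X = np.vstack([S_min, S_max])
    labels = np.array([1] * len(S_min) + [0] * len(S_max))  # 1 = minority, 0 = majority
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)          # +1 because the query point is in X
    counts = []
    for i, p in enumerate(S_min):
        _, idx = nn.kneighbors(p.reshape(1, -1))
        neigh = idx[0][1:]                                   # drop the point itself
        k_max = int(np.sum(labels[neigh] == 0))              # majority-class neighbours
        k_min = K - k_max                                    # minority-class neighbours
        n_i = int(round(alphas[i] * k_max))                  # assumed combination rule
        counts.append((k_max, k_min, n_i))
    return counts
```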
Step 1 calculates the newly generated software defect prediction data;
in step 1 the i-th minority class datum in the minority class data set generates n_i new minority class data, so the newly generated minority class data are denoted p^new_(i,j), where j ∈ [1, n_i];
in step 1 the offset of the j-th newly generated datum of the i-th minority class datum in the minority class data set away from the majority class is recorded as ε_(i,j),
where the offset ε_(i,j) of the j-th newly generated datum of the i-th minority class datum away from the majority class is calculated as
ε_(i,j) = [equation image: computed from p_i and its nearest majority class datum d_(max_i)],
where the degree-of-deviation-from-the-majority-class parameter is a random number taking values in [0, 1], and d_(max_i) is its nearest majority class datum;
in step 1 the offset of the j-th newly generated datum of the i-th minority class datum in the minority class data set toward the minority class is recorded as σ_(i,j);
in step 1 the offset σ_(i,j) of the j-th newly generated datum of the i-th minority class datum is calculated as
σ_(i,j) = [equation image: computed from p_i and its nearest minority class datum p_(min_i)],
where the bias-toward-the-minority-class parameter is a random number taking values in [0, 1.5], and p_(min_i) is its nearest minority class datum;
In step 1 the newly generated software defect prediction minority class datum is recorded as p^new_(i,j);
the j-th newly generated datum of the i-th minority class datum of the newly generated software defect prediction data is calculated as
p^new_(i,j) = p_i + ε_(i,j) + σ_(i,j);
step 1 obtains the newly formed minority class data set, recorded as S_new;
in step 1 the number n_i of defect data newly generated for each minority class point p_i, together with the minority class data p^new generated above, yields the new minority class data set S_new,
where S_new = {p'_1, p'_2, …, p'_(N')},
N' is the number of elements in the newly formed minority class data set S_new; the category of the new data is marked as defect data, and this category mark is the weak label L_w; the i-th datum of the new minority class data set is denoted by the symbol p'_i, and its weak label is L_w(p'_i).
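The generation step can then be sketched as follows. Since the formulas for ε_(i,j) and σ_(i,j) are given only as images, the directions used here, a random fraction (0 to 1) of the vector from the nearest majority neighbour to p_i plus a random fraction (0 to 1.5) of the vector from p_i to its nearest minority neighbour, are assumptions consistent with the surrounding description; the names are illustrative.

```python
import numpy as np

def generate_synthetic(S_min, S_max, counts, seed=0):
    """Create n_i synthetic samples around each minority point p_i as
    p_new = p_i + eps + sigma."""
    rng = np.random.default_rng(seed)
    S_min = np.asarray(S_min, dtype=float)
    S_max = np.asarray(S_max, dtype=float)
    S_new = []
    for i, p in enumerate(S_min):
        d_to_min = np.linalg.norm(S_min - p, axis=1)
        d_to_min[i] = np.inf
        p_near_min = S_min[int(d_to_min.argmin())]                            # nearest minority neighbour
        d_near_max = S_max[int(np.linalg.norm(S_max - p, axis=1).argmin())]   # nearest majority neighbour
        n_i = counts[i][2]
        for _ in range(n_i):
            eps = rng.uniform(0, 1) * (p - d_near_max)       # offset away from the majority class
            sigma = rng.uniform(0, 1.5) * (p_near_min - p)   # offset toward the minority class
            S_new.append(p + eps + sigma)
    return np.array(S_new)
```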
Step 2, respectively selecting a first classifier and a second classifier, and performing confidence evaluation on the newly generated software defect prediction minority class data to obtain a training data set;
the step 2 is specifically as follows:
step 2, respectively calculating the influence degree of the first classifier and the influence degree of the second classifier;
In step 2 the newly formed minority class data set S_new is used to train the first classifier H_1, and the data of S_new are brought into H_1 in turn to obtain the predicted class L_p1; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_1 is L_p1(p'_i);
the newly formed minority class data set S_new is used to train the second classifier H_2, and the data of S_new are brought into H_2 in turn to obtain the predicted class L_p2; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_2 is L_p2(p'_i).
The influence degree of the first classifier is
o_1 = [equation image: computed from the number of points whose H_1 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min; the indicator of the first classifier H_1 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise, and the indicator of the second classifier H_2 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise.
The influence degree of the second classifier is
o_2 = [equation image: computed from the number of points whose H_2 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min, and the indicators of H_1 and H_2 are defined in the same way.
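A sketch of the influence-degree computation is given below. The patent does not fix the two base classifiers, so a decision tree and naive Bayes are used here as stand-ins; the classifiers are fitted on the original labelled data together with the weakly labelled synthetic set (fitting them on the single-class S_new alone would be degenerate), and the agreement counts are normalised by their sum. These choices, and the names, are assumptions since the influence-degree formulas appear only as images.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

def influence_degrees(X_orig, y_orig, S_new, L_w):
    """Fit H1 and H2 and measure how often each one agrees with the weak labels
    L_w of the synthetic minority samples S_new."""
    X_fit = np.vstack([X_orig, S_new])
    y_fit = np.concatenate([y_orig, L_w])
    H1 = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
    H2 = GaussianNB().fit(X_fit, y_fit)
    agree1 = H1.predict(S_new) == L_w     # per-sample agreement of H1 with the weak label
    agree2 = H2.predict(S_new) == L_w     # per-sample agreement of H2 with the weak label
    total = max(agree1.sum() + agree2.sum(), 1)
    o1 = agree1.sum() / total
    o2 = agree2.sum() / total
    return H1, H2, o1, o2, agree1, agree2
```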
Step 2, updating the labels of the minority data according to the influence degree of the first classifier and the influence degree of the second classifier so as to construct updated original software defect data;
step 2, calculating the weak mark
Figure GDA0003109305290000101
Confidence of (2), by the symbol γiAnd (4) showing.
Step 2, weak marking of new minority class data set
Figure GDA0003109305290000102
The judgment is carried out according to the influence degree of the classifier, and the calculation formula is
Figure GDA0003109305290000103
When the confidence degree gammaiBeta is 0.5, the minority class data
Figure GDA0003109305290000104
Adding training data when gamma isiWhen the beta is less than or equal to 0.5, the data is directly deleted, and the data of the minority class is not added into the new training set.
Step 2, newly forming minority class data, namely SnewIs screened again to obtain new minority class data Snew', will SnewAdding original software defect data S to obtain a new training set S';
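The screening of S_new can be sketched as below. The confidence formula is only given as an image, so γ_i is taken here as the influence-weighted agreement of the two classifiers with the weak label, mirroring the weighted vote used in step 3; this is an assumption, and the names are illustrative.

```python
import numpy as np

def screen_synthetic(S_new, agree1, agree2, o1, o2, beta=0.5):
    """Keep only the synthetic samples whose weak-label confidence exceeds beta."""
    gamma = o1 * agree1.astype(float) + o2 * agree2.astype(float)  # assumed confidence formula
    keep = gamma > beta
    return np.asarray(S_new)[keep], gamma
```

The retained samples S_new' are then stacked with the original software defect data S to form the enlarged training set S'.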
step 3, obtaining a final prediction result through weighted voting by using the first classifier, the second classifier and the obtained training set S' selected in the step 2;
the step 3 specifically comprises the following steps:
After the new training data set S' is obtained, the first classifier H_1 and the second classifier H_2 are trained on it; the trained H_1 and H_2 are applied to the prediction datum v to obtain the prediction result L_1 of the first classifier and the prediction result L_2 of the second classifier; the influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier are then used with the calculation formula L_pre = L_1·o_1 + L_2·o_2 to obtain the prediction result;
in step 3, when the value of L_pre is greater than 0.5, the class of v is predicted to be the minority class;
in step 3, when the value of L_pre is less than or equal to 0.5, the class of v is predicted to be the majority class.
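The final weighted vote can be sketched as follows, assuming the classes are encoded as 1 for the minority (defective) class and 0 for the majority class, and that H1 and H2 have been re-fitted on the enlarged training set S'; the names are illustrative.

```python
import numpy as np

def predict_defect(H1, H2, o1, o2, X_test, beta=0.5):
    """Weighted vote L_pre = L1*o1 + L2*o2; predict the minority class when L_pre > beta."""
    L1 = H1.predict(X_test)
    L2 = H2.predict(X_test)
    L_pre = L1 * o1 + L2 * o2
    return (L_pre > beta).astype(int)   # 1 = minority / defective, 0 = majority
```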
In this embodiment the method is compared with the mainstream SMOTE+SVM, SMOTE+decision tree, SMOTE+k-nearest-neighbor and SMOTE+naive Bayes methods, and the precision, F-measure, balance and AUC indexes are compared. Among all the compared methods, the method of the invention achieves the highest accuracy, and its recognition accuracy reaches the advanced level of the field.
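For reference, the indexes mentioned above can be computed as in the sketch below; the 'balance' definition used here is the one commonly used in software defect prediction, balance = 1 − sqrt(((0 − pf)² + (1 − pd)²) / 2), which is an assumption since the embodiment does not define it.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """Compute precision, F-measure, balance and AUC for a defect predictor."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pd_ = recall_score(y_true, y_pred)                 # probability of detection
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    pf = fp / max(fp + tn, 1)                          # probability of false alarm
    balance = 1 - np.sqrt(((0 - pf) ** 2 + (1 - pd_) ** 2) / 2)
    return {
        "precision": precision_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred),
        "balance": balance,
        "auc": roc_auc_score(y_true, y_score),
    }
```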
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (2)

1. A class imbalance software defect prediction method based on data resampling, characterized by comprising the following steps:
step 1, selecting any minority class datum in the minority class data set and computing its Euclidean distance to each minority class datum in the minority class data set in turn, screening out the minority class datum in the minority class data set closest to the selected one; computing the Euclidean distance from the selected minority class datum to each majority class datum in the majority class data set in turn, screening out the majority class datum in the majority class data set closest to the selected one; calculating the distance parameter of the selected minority class datum from the shortest Euclidean distance to the minority class data in the minority class data set and the shortest Euclidean distance to the majority class data in the majority class data set; marking the minority class data in the minority class data set according to their distance parameters, and obtaining the data point type of each minority class datum; calculating the K-nearest-neighbor point set of each minority class datum in the minority class data set, dividing it into a K-neighbor majority class data set and a K-neighbor minority class data set, counting the number of majority class data in the K-neighbor majority class data set and the number of minority class data in the K-neighbor minority class data set respectively, and calculating the number of newly generated minority class data of each minority class datum in the minority class data set;
step 2, selecting a first classifier and a second classifier respectively, and performing confidence evaluation on the newly generated software defect prediction minority class data to obtain a training data set;
step 3, obtaining a final prediction result through weighted voting by using the first classifier and the second classifier selected in step 2 and the obtained training set S';
the software defect data is: s ═ Smin,Smax};
Step 1, the minority class data set is as follows:
Figure FDA0003463770520000011
step 1, most of data sets are as follows:
Figure FDA0003463770520000012
wherein S isminRepresenting a collection of minority classes of data, denoted by SmaxRepresenting sets of data of most classes, piRepresenting the ith minority class data in the minority class data set, i belongs to [1, N ]]N denotes the number of minority class data in the minority class data set, dkRepresenting the kth majority class data in the majority class data set, K ∈ [1, K ∈]K represents the number of majority class data in the majority class data set;
the minority class datum closest to the selected minority class datum in step 1 is p_(min_i), i ∈ [1, N], min_i ∈ [1, N], where p_(min_i) represents the minority class datum in the minority class data set that is closest to the selected i-th minority class datum, and N represents the number of minority class data in the minority class data set;
the majority class datum closest to the selected minority class datum in step 1 is d_(max_i), i ∈ [1, N], max_i ∈ [1, K], where d_(max_i) represents the majority class datum in the majority class data set that is closest to the selected i-th minority class datum, and K represents the number of majority class data in the majority class data set;
the shortest Euclidean distance between the minority class datum selected in step 1 and the minority class data in the minority class data set is Dis_min(i) = ‖p_i − p_(min_i)‖;
the shortest Euclidean distance between the minority class datum selected in step 1 and the majority class data in the majority class data set is Dis_max(i) = ‖p_i − d_(max_i)‖;
the distance parameter of the selected minority class datum calculated in step 1 is α_i = Dis_min(i) / Dis_max(i),
where α_i is the distance parameter of the i-th minority class datum in the minority class data set;
step 1 marks the minority class data in the minority class data set according to their distance parameters as follows:
if α_i < 1, the data point type of the i-th minority class datum in the minority class data set is marked as a safe point, and flag_i = 1;
if α_i = 1, the data point type of the i-th minority class datum in the minority class data set is marked as a confusion point, and flag_i = 2;
if α_i > 1, the data point type of the selected i-th minority class datum in the minority class data set is marked as a danger point, and flag_i = 3;
step 1 calculates the K-nearest-neighbor point set of each minority class datum in the minority class data set;
step 1 divides the K-nearest-neighbor point set of each minority class datum into a K-neighbor majority class data set and a K-neighbor minority class data set, specifically:
in step 1 the number of majority class data in the K-neighbor majority class data set is recorded as k_i^max;
in step 1 the number of minority class data in the K-neighbor minority class data set is recorded as k_i^min;
step 1 calculates the number of newly generated minority class data of each minority class datum in the minority class data set, specifically:
n_i = [equation image: a function of α_i, k_i^max and k_i^min],
where α_i is the distance parameter of the i-th minority class datum in the minority class data set, and n_i is the number of newly generated minority class data for the i-th minority class datum in the minority class data set;
step 1 calculates the newly generated software defect prediction data;
in step 1 the i-th minority class datum in the minority class data set generates n_i new minority class data, so the newly generated minority class data are denoted p^new_(i,j), where j ∈ [1, n_i];
in step 1 the offset of the j-th newly generated datum of the i-th minority class datum in the minority class data set away from the majority class is recorded as ε_(i,j),
where the offset ε_(i,j) of the j-th newly generated datum of the i-th minority class datum in the minority class data set away from the majority class is calculated as
ε_(i,j) = [equation image: computed from p_i and its nearest majority class datum d_(max_i)],
where the degree-of-deviation-from-the-majority-class parameter is a random number taking values in [0, 1], and d_(max_i) is its nearest majority class datum;
in step 1 the offset of the j-th newly generated datum of the i-th minority class datum in the minority class data set toward the minority class is recorded as σ_(i,j);
in step 1 the offset σ_(i,j) of the j-th newly generated datum of the i-th minority class datum in the minority class data set is calculated as
σ_(i,j) = [equation image: computed from p_i and its nearest minority class datum p_(min_i)],
where the bias-toward-the-minority-class parameter is a random number taking values in [0, 1.5], and p_(min_i) is its nearest minority class datum;
in step 1 the newly generated software defect prediction minority class datum is recorded as p^new_(i,j);
the j-th newly generated datum of the i-th minority class datum of the newly generated software defect prediction data is calculated as
p^new_(i,j) = p_i + ε_(i,j) + σ_(i,j);
step 1 obtains the newly generated minority class data set, recorded as S_new;
in step 1 the number n_i of defect data newly generated for each minority class point p_i, together with the minority class data p^new generated above, yields the newly generated minority class data set S_new,
where S_new = {p'_1, p'_2, …, p'_(N')},
N' is the number of elements in the new minority class data set S_new; the category of the new data is marked as defect data, and this category mark is the weak label L_w; the i-th datum of the new minority class data set is denoted by the symbol p'_i, and its weak label is L_w(p'_i);
step 2 is specifically as follows:
step 2 calculates the influence degree of the first classifier and the influence degree of the second classifier respectively;
in step 2 the newly formed minority class data set S_new is used to train the first classifier H_1, and the data of S_new are brought into H_1 in turn to obtain the predicted class L_p1; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_1 is L_p1(p'_i);
the newly formed minority class data set S_new is used to train the second classifier H_2, and the data of S_new are brought into H_2 in turn to obtain the predicted class L_p2; for the i-th point p'_i of S_new, its weak label is L_w(p'_i) and the class predicted by H_2 is L_p2(p'_i);
the influence degree of the first classifier is
o_1 = [equation image: computed from the number of points whose H_1 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min; the indicator of the first classifier H_1 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise, and the indicator of the second classifier H_2 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise;
the influence degree of the second classifier is
o_2 = [equation image: computed from the number of points whose H_2 prediction agrees with the weak label],
where N is the number of elements of the minority class data set S_min, and the indicators of H_1 and H_2 are defined in the same way;
step 2 updates the labels of the minority class data according to the influence degree of the first classifier and the influence degree of the second classifier, so as to construct updated original software defect data;
step 2 calculates the confidence of the weak label L_w(p'_i), denoted by the symbol γ_i;
in step 2 the weak label of the new minority class data set is judged according to the influence degrees of the classifiers, with the calculation formula
γ_i = [equation image: combines the influence degrees o_1 and o_2 with the agreement of H_1 and H_2 with the weak label];
when the confidence γ_i > β, the minority class datum p'_i is added to the training data; when γ_i ≤ β, the datum is deleted directly and this minority class datum is not added to the new training set;
in step 2 the newly formed minority class data S_new are screened again in this way to obtain the newly generated minority class data S_new', and S_new' is added to the original software defect data S to obtain the new training set S'.
2. The method of claim 1, wherein step 3 specifically comprises the following steps:
after the new training data set S' is obtained, the first classifier H_1 and the second classifier H_2 are trained on it; the trained H_1 and H_2 are applied to the prediction datum v to obtain the prediction result L_1 of the first classifier and the prediction result L_2 of the second classifier; the influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier are then used with the calculation formula L_pre = L_1·o_1 + L_2·o_2 to obtain the prediction result;
in step 3, when the value of L_pre is greater than β, the class of the prediction datum v is the minority class;
in step 3, when the value of L_pre is less than or equal to β, the class of the prediction datum v is the majority class.
CN202110428102.3A 2021-04-21 2021-04-21 Class imbalance software defect prediction method based on data resampling Active CN113204481B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110428102.3A CN113204481B (en) 2021-04-21 2021-04-21 Class imbalance software defect prediction method based on data resampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110428102.3A CN113204481B (en) 2021-04-21 2021-04-21 Class imbalance software defect prediction method based on data resampling

Publications (2)

Publication Number Publication Date
CN113204481A CN113204481A (en) 2021-08-03
CN113204481B 2022-03-04

Family

ID=77027498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110428102.3A Active CN113204481B (en) 2021-04-21 2021-04-21 Class imbalance software defect prediction method based on data resampling

Country Status (1)

Country Link
CN (1) CN113204481B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012116208A2 (en) * 2011-02-23 2012-08-30 New York University Apparatus, method, and computer-accessible medium for explaining classifications of documents
CN103810101B (en) * 2014-02-19 2019-02-19 北京理工大学 A kind of Software Defects Predict Methods and software defect forecasting system
US10430315B2 (en) * 2017-10-04 2019-10-01 Blackberry Limited Classifying warning messages generated by software developer tools

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677564A (en) * 2016-01-04 2016-06-15 中国石油大学(华东) Adaboost software defect unbalanced data classification method based on improvement
CN107391452A (en) * 2017-07-06 2017-11-24 武汉大学 A kind of software defect estimated number method based on data lack sampling and integrated study
CN108563556A (en) * 2018-01-10 2018-09-21 江苏工程职业技术学院 Software defect prediction optimization method based on differential evolution algorithm
CN110471856A (en) * 2019-08-21 2019-11-19 大连海事大学 A kind of Software Defects Predict Methods based on data nonbalance
CN110533116A (en) * 2019-09-04 2019-12-03 大连大学 Based on the adaptive set of Euclidean distance at unbalanced data classification method
CN110674865A (en) * 2019-09-20 2020-01-10 燕山大学 Rule learning classifier integration method oriented to software defect class distribution unbalance
CN110942153A (en) * 2019-11-11 2020-03-31 西北工业大学 Data resampling method based on repeated editing nearest neighbor and clustering oversampling
CN111090579A (en) * 2019-11-14 2020-05-01 北京航空航天大学 Software defect prediction method based on Pearson correlation weighting association classification rule
CN111522736A (en) * 2020-03-26 2020-08-11 中南大学 Software defect prediction method and device, electronic equipment and computer storage medium
CN111767216A (en) * 2020-06-23 2020-10-13 江苏工程职业技术学院 Cross-version depth defect prediction method capable of relieving class overlap problem
CN112465040A (en) * 2020-12-01 2021-03-09 杭州电子科技大学 Software defect prediction method based on class imbalance learning algorithm

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SMOTE: Synthetic Minority Over-sampling Technique; Nitesh V. Chawla et al.; Journal of Artificial Intelligence Research; 2002-06-02 *
Class-imbalanced sparse reconstruction metric learning for software defect prediction (类不平衡稀疏重构度量学习软件缺陷预测); Shi Zuoting; Computer Technology and Development (计算机技术与发展); 2018-06-10 *
Machine learning classification strategies for imbalanced data sets (面向不平衡数据集的机器学习分类策略); Xu Lingling, Chi Dongxiang; Computer Engineering and Applications (计算机工程与应用); 2020-11-20 *

Also Published As

Publication number Publication date
CN113204481A (en) 2021-08-03

Similar Documents

Publication Publication Date Title
CN112069415A (en) Interest point recommendation method based on heterogeneous attribute network characterization learning
CN108171209A (en) A kind of face age estimation method that metric learning is carried out based on convolutional neural networks
CN109101938B (en) Multi-label age estimation method based on convolutional neural network
CN109947963A (en) A kind of multiple dimensioned Hash search method based on deep learning
CN105760888B (en) A kind of neighborhood rough set integrated learning approach based on hierarchical cluster attribute
CN107704512A (en) Financial product based on social data recommends method, electronic installation and medium
CN107480723B (en) Texture Recognition based on partial binary threshold learning network
CN105955951A (en) Message filtering method and device
CN112115993B (en) Zero sample and small sample evidence photo anomaly detection method based on meta-learning
CN107808375A (en) Merge the rice disease image detecting method of a variety of context deep learning models
CN113159149B (en) Method and device for identifying enterprise office address
CN113032613B (en) Three-dimensional model retrieval method based on interactive attention convolution neural network
CN111950195B (en) Project progress prediction method based on portrait system and depth regression model
US11682039B2 (en) Determining a target group based on product-specific affinity attributes and corresponding weights
CN111144466B (en) Image sample self-adaptive depth measurement learning method
CN111144462A (en) Unknown individual identification method and device for radar signals
CN113204481B (en) Class imbalance software defect prediction method based on data resampling
CN110232397A (en) A kind of multi-tag classification method of combination supporting vector machine and projection matrix
CN116992155B (en) User long tail recommendation method and system utilizing NMF with different liveness
CN113591016A (en) Landslide labeling contour generation method based on multi-user cooperation
CN113420797A (en) Online learning image attribute identification method and system
CN111783688B (en) Remote sensing image scene classification method based on convolutional neural network
CN108965585B (en) User identity recognition method based on smart phone sensor
CN116245259A (en) Photovoltaic power generation prediction method and device based on depth feature selection and electronic equipment
CN114372181B (en) Equipment production intelligent planning method based on multi-mode data

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant