CN113204481B - Class imbalance software defect prediction method based on data resampling - Google Patents
- Publication number: CN113204481B (application CN202110428102.3A)
- Authority: CN (China)
- Prior art keywords: data, minority, class data, class, minority class
- Prior art date: 2021-04-21
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
Abstract
The invention provides a class imbalance software defect prediction method based on data resampling. The method calculates the Euclidean distances between each minority class element and the other minority class elements and the majority class elements, screens out the minority class data and majority class data closest to each minority class data, and obtains a distance parameter for each minority class data from these distances; the minority class data in the minority class data set are marked according to the distance parameters to obtain their data point types; the K-nearest-neighbour point set of each minority class element is calculated, and the numbers of majority class data and minority class data in the K-nearest-neighbour point set are counted to obtain the number of minority class data to be newly generated; two classifiers are selected, a confidence evaluation is performed on the newly generated software defect prediction minority class data to obtain a training data set, the selected classifiers are trained, and the final prediction result is obtained through weighted voting. The invention effectively solves the class imbalance problem in the software defect prediction process.
Description
Technical Field
The invention belongs to the field of software defect prediction, and particularly relates to a class imbalance software defect prediction method based on data resampling.
Background
With the development of society and the advancement of science and technology, the Internet has become deeply integrated into every aspect of our lives. Daily activities such as online shopping, travelling by car, smart home control and ordering in restaurants can all be completed through software, and software usage scenarios permeate clothing, food, housing, travel and more. During software development, functional requirements keep growing, the number of users served by software keeps increasing, and development time keeps being compressed; these pressures make it easy for defects to be introduced during development. When software defects occur, the software cannot provide its normal functions, causing huge production and economic losses and severely affecting people's daily lives.
However, in a real development environment, the amount of data with software defects is far smaller than the amount of data without defects, so a software defect prediction model built on such data is less likely to find the code modules that actually contain defects. An ideal software defect prediction model needs to be more sensitive to defective data and to predict more accurately whether a code module contains defects, which makes the class imbalance problem in software defect prediction very important to solve. To overcome these shortcomings, the invention provides a class imbalance software defect prediction method.
Disclosure of Invention
The invention mainly aims to solve the class imbalance problem in software defect prediction, provides a software defect prediction method for the class imbalance problem, and is generally applicable to software defect prediction. In order to achieve the above object, the present invention comprises the following steps:
Step 1, select each minority class data in the minority class data set in turn and calculate its Euclidean distance to every other minority class data in the minority class data set, screening out the minority class data closest to the selected minority class data; likewise calculate its Euclidean distance to every majority class data in the majority class data set, screening out the majority class data closest to the selected minority class data; calculate the distance parameter of the selected minority class data from the shortest Euclidean distance between the selected minority class data and the minority class data set and the shortest Euclidean distance between the selected minority class data and the majority class data set; mark the minority class data in the minority class data set according to their distance parameters and obtain the data point type of each minority class data; calculate the K-nearest-neighbour point set of each minority class data in the minority class data set, divide the K-nearest-neighbour point set of each minority class data into a K-nearest-neighbour majority class data set and a K-nearest-neighbour minority class data set, count the number of majority class data in the K-nearest-neighbour majority class data set and the number of minority class data in the K-nearest-neighbour minority class data set respectively, and calculate the number of minority class data to be newly generated for each minority class data in the minority class data set;
step 2, respectively selecting a first classifier and a second classifier, and performing confidence evaluation on the newly generated software defect prediction minority class data to obtain a training data set;
Step 3, obtain the final prediction result through weighted voting, using the first classifier and the second classifier selected in step 2 together with the obtained training set S';
Preferably, the software defect data in step 1 are: S = {S_min, S_max};
where S_min represents the set of minority class data and S_max represents the set of majority class data; p_i represents the i-th minority class data in the minority class data set, i ∈ [1, N], and N denotes the number of minority class data in the minority class data set; d_k represents the k-th majority class data in the majority class data set, k ∈ [1, K], and K denotes the number of majority class data in the majority class data set;
The minority class data closest to the selected minority class data in step 1 is p_min_i, with min_i = argmin_{j ∈ [1, N], j ≠ i} ||p_i − p_j||_2, i ∈ [1, N], min_i ∈ [1, N];
where p_min_i represents the minority class data in the minority class data set closest to the selected i-th minority class data, and N represents the number of minority class data in the minority class data set;
The majority class data closest to the selected minority class data in step 1 is d_max_i, with max_i = argmin_{k ∈ [1, K]} ||p_i − d_k||_2, i ∈ [1, N], max_i ∈ [1, K];
where d_max_i represents the majority class data in the majority class data set closest to the selected i-th minority class data, and K represents the number of majority class data in the majority class data set;
The shortest Euclidean distance between the selected minority class data and the minority class data set in step 1 is: Dmin_i = ||p_i − p_min_i||_2;
The shortest Euclidean distance between the selected minority class data and the majority class data set in step 1 is: Dmax_i = ||p_i − d_max_i||_2;
The distance parameter of the selected minority class data calculated in step 1 is: α_i = Dmin_i / Dmax_i;
where α_i is the distance parameter of the i-th minority class data in the minority class data set;
Step 1, the minority class data in the minority class data set are marked according to their distance parameters as follows:
If α_i < 1, the data point type of the selected i-th minority class data in the minority class data set is marked as a safety point, and flag_i = 1;
If α_i = 1, the data point type of the selected i-th minority class data in the minority class data set is marked as a confusion point, and flag_i = 2;
If α_i > 1, the data point type of the selected i-th minority class data in the minority class data set is marked as a danger point, and flag_i = 3;
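As an illustration only, the following Python sketch computes the distance parameter and marks the point types. It assumes α_i is the ratio Dmin_i / Dmax_i as reconstructed above, and the function name mark_minority_points is introduced here for the sketch rather than taken from the patent.

```python
# Minimal sketch, assuming alpha_i = (distance to nearest minority neighbour) /
# (distance to nearest majority neighbour), which is consistent with the
# safety / confusion / danger marking rules above.
import numpy as np

def mark_minority_points(S_min: np.ndarray, S_max: np.ndarray):
    """Return (alpha, flags) for every minority sample (assumes no duplicate points)."""
    n = len(S_min)
    alpha = np.empty(n)
    flags = np.empty(n, dtype=int)
    for i, p in enumerate(S_min):
        # nearest minority neighbour, excluding the point itself
        d_min = np.sort(np.linalg.norm(S_min - p, axis=1))[1]
        # nearest majority neighbour
        d_max = np.min(np.linalg.norm(S_max - p, axis=1))
        alpha[i] = d_min / d_max
        if alpha[i] < 1:
            flags[i] = 1   # safety point
        elif alpha[i] == 1:
            flags[i] = 2   # confusion point
        else:
            flags[i] = 3   # danger point
    return alpha, flags
```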
Step 1, calculate the K-nearest-neighbour point set of each minority class data in the minority class data set, i.e. the K data in S closest to p_i in Euclidean distance;
Step 1, the K-nearest-neighbour point set of each minority class data is divided into a K-nearest-neighbour majority class data set and a K-nearest-neighbour minority class data set;
Step 1, the number of majority class data in the K-nearest-neighbour majority class data set is recorded as k_i^max, and the number of minority class data in the K-nearest-neighbour minority class data set is recorded as k_i^min;
Step 1, calculate the number of minority class data to be newly generated for each minority class data in the minority class data set from its distance parameter and the composition of its K-nearest-neighbour point set;
where α_i is the distance parameter of the i-th minority class data in the minority class data set and n_i is the number of minority class data newly generated for the i-th minority class data in the minority class data set;
Step 1, calculate the newly generated software defect prediction data;
Step 1, the i-th minority class data in the minority class data set generates n_i new minority class data, so the newly generated minority class data are denoted p_new_i,j, where j ∈ [1, n_i];
Step 1, the offset of the j-th newly generated data of the i-th minority class data in the minority class data set with respect to the majority class is recorded as ε_i,j;
the offset ε_i,j with respect to the majority class is calculated from a deviation-from-majority degree parameter, which takes a random value between 0 and 1, and the majority class data nearest to the i-th minority class data;
Step 1, the offset of the j-th newly generated data of the i-th minority class data in the minority class data set biased towards the minority class is recorded as σ_i,j;
the offset σ_i,j biased towards the minority class is calculated from a biased-to-minority degree parameter, which takes a random value between 0 and 1.5, and the minority class data nearest to the i-th minority class data;
Step 1, the newly generated software defect prediction minority class data are recorded as p_new_i,j;
The calculation formula for the j-th newly generated data of the i-th minority class data of the newly generated software defect prediction data is:
p_new_i,j = p_i + ε_i,j + σ_i,j
Step 1, a newly formed minority class data set is obtained and recorded as S_new;
Step 1, for each minority class point p_i, its n_i new defect data are generated according to the generation rule for p_new above, and in this way the new minority class data set S_new is obtained;
where N' is the number of elements contained in the newly formed minority class data set S_new; the category of the new data is marked as defect data and recorded as the weak label L_w; the i-th element of the new minority class data set is denoted p'_i, and its label is L_w;
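For illustration, the sketch below generates the new minority class samples as p_new = p_i + ε + σ. The exact formulas for ε_i,j and σ_i,j are not reproduced from the patent; here they are assumed to be random fractions (in [0, 1] and [0, 1.5] respectively, matching the parameter ranges above) of the vectors from p_i towards its nearest majority and nearest minority neighbours, and generate_new_minority is a name introduced for this sketch.

```python
# Minimal sketch, assuming SMOTE-like offsets towards the nearest majority and
# nearest minority neighbours; counts[i] = n_i, the number of new samples for
# the i-th minority point.
import numpy as np

def generate_new_minority(S_min, S_max, counts, seed=None):
    rng = np.random.default_rng(seed)
    S_new = []
    for i, p in enumerate(S_min):
        # nearest majority neighbour d_max_i
        d_nearest_maj = S_max[np.argmin(np.linalg.norm(S_max - p, axis=1))]
        # nearest minority neighbour p_min_i (exclude the point itself)
        dists = np.linalg.norm(S_min - p, axis=1)
        dists[i] = np.inf
        p_nearest_min = S_min[np.argmin(dists)]
        for _ in range(int(counts[i])):
            eps = rng.uniform(0.0, 1.0) * (d_nearest_maj - p)    # assumed form of eps_ij
            sigma = rng.uniform(0.0, 1.5) * (p_nearest_min - p)  # assumed form of sigma_ij
            S_new.append(p + eps + sigma)
    return np.asarray(S_new)
```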
Preferably, the step 2 is specifically as follows:
step 2, respectively calculating the influence degree of the first classifier and the influence degree of the second classifier;
Step 2, train the first classifier H_1 with the newly formed minority class data set S_new; bring the elements of S_new into the first classifier H_1 in turn to obtain the predicted classes L_p1; for the i-th point p'_i in S_new, its weak label is L_w and the class predicted by H_1 is L_p1,i;
Train the second classifier H_2 with the newly formed minority class data set S_new; bring the elements of S_new into the second classifier H_2 in turn to obtain the predicted classes L_p2; for the i-th point p'_i in S_new, its weak label is L_w and the class predicted by H_2 is L_p2,i;
The influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier are each determined from how often the classifier's predicted class agrees with the weak label L_w;
where N is the number of elements of the minority class data set S_min; the indicator for the first classifier H_1 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise, and the indicator for the second classifier H_2 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise;
Step 2, update the labels of the minority class data according to the influence degree of the first classifier and the influence degree of the second classifier so as to construct updated original software defect data;
Step 2, the weak label L_w of each element of the new minority class data set is judged according to the influence degrees of the classifiers, giving a confidence degree γ_i for the i-th new minority class data; when the confidence degree γ_i > β, this minority class data is added to the training data, and when γ_i ≤ β, the data is deleted directly and this minority class data is not added to the new training set.
Step 2, the newly formed minority class data S_new are screened again to obtain new minority class data S_new'; S_new' is added to the original software defect data S to obtain a new training set S';
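A minimal sketch of step 2 is given below, under stated assumptions: the two classifiers (here logistic regression and a decision tree, chosen arbitrarily) are fitted on the original labelled data rather than on S_new alone so that the sketch runs on two classes; the influence degrees o_1 and o_2 are taken as agreement rates with the weak label normalised to sum to 1; and the confidence is computed as γ_i = o_1·I(H_1 agrees) + o_2·I(H_2 agrees). screen_new_minority is a name introduced for this sketch.

```python
# Minimal sketch of the confidence screening; X, y are the original software
# defect data (y = 1 for defective / minority, 0 for clean / majority).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def screen_new_minority(S_new, X, y, beta=0.5):
    """Keep only the new minority samples whose weighted confidence exceeds beta."""
    weak_label = 1                                               # weak label L_w: defective
    h1 = LogisticRegression(max_iter=1000).fit(X, y)             # first classifier H1
    h2 = DecisionTreeClassifier(random_state=0).fit(X, y)        # second classifier H2
    agree1 = (h1.predict(S_new) == weak_label).astype(float)     # indicator for H1
    agree2 = (h2.predict(S_new) == weak_label).astype(float)     # indicator for H2
    total = agree1.sum() + agree2.sum()
    o1 = agree1.sum() / total if total else 0.5                  # influence degree o1 (assumed normalisation)
    o2 = agree2.sum() / total if total else 0.5                  # influence degree o2
    gamma = o1 * agree1 + o2 * agree2                            # confidence gamma_i per new sample
    kept = S_new[gamma > beta]                                   # S_new': samples that survive screening
    return kept, (h1, o1), (h2, o2)
```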
preferably, the step 3 specifically includes the following steps:
After the new training data set S' is obtained, the first classifier H_1 and the second classifier H_2 are trained on it; the trained first classifier H_1 and second classifier H_2 are used to predict the data v, giving the prediction result L_1 of the first classifier and the prediction result L_2 of the second classifier; the influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier continue to be used, and the prediction result is obtained with the calculation formula L_pre = L_1 × o_1 + L_2 × o_2;
In step 3, when the value of L_pre is greater than β, the class of v is predicted to be the minority class;
In step 3, when the value of L_pre is less than or equal to β, the class of v is predicted to be the majority class;
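The weighted vote of step 3 can be sketched as follows; predict_weighted is an illustrative name, the labels are assumed to be encoded as 1 for the minority (defective) class and 0 for the majority class, and β defaults to 0.5 as in the embodiment.

```python
# Minimal sketch of the step 3 weighted vote L_pre = L1*o1 + L2*o2, thresholded
# at beta; returns 1 for the minority (defective) class, 0 for the majority class.
import numpy as np

def predict_weighted(h1, o1, h2, o2, v, beta=0.5):
    v = np.asarray(v).reshape(1, -1)   # single sample of software metrics
    l1 = h1.predict(v)[0]              # prediction L1 of the first classifier
    l2 = h2.predict(v)[0]              # prediction L2 of the second classifier
    l_pre = l1 * o1 + l2 * o2          # weighted vote
    return 1 if l_pre > beta else 0
```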
Compared with the prior art, the advantages and positive effects of the invention are as follows:
the invention can well solve the class imbalance problem.
The method adds a screening process for newly generated minority class data, removes data deviating from the actual data, and retains data capable of showing true characteristics of the minority class.
The invention provides a software defect prediction method capable of solving class imbalance, which can be widely applied to various software defect data and addresses the class imbalance problem.
Drawings
FIG. 1 is a flow chart of the class imbalance software defect prediction method of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described below with reference to the accompanying drawings and specific embodiments; the embodiments are given by way of illustration and not by way of limitation.
The general implementation flow of the invention is shown in FIG. 1, and the specific implementation is as follows:
Step 1, select each minority class data in the minority class data set in turn and calculate its Euclidean distance to every other minority class data in the minority class data set, screening out the minority class data closest to the selected minority class data; likewise calculate its Euclidean distance to every majority class data in the majority class data set, screening out the majority class data closest to the selected minority class data; calculate the distance parameter of the selected minority class data from the shortest Euclidean distance between the selected minority class data and the minority class data set and the shortest Euclidean distance between the selected minority class data and the majority class data set; mark the minority class data in the minority class data set according to their distance parameters and obtain the data point type of each minority class data; calculate the K-nearest-neighbour point set of each minority class data in the minority class data set, divide the K-nearest-neighbour point set of each minority class data into a K-nearest-neighbour majority class data set and a K-nearest-neighbour minority class data set, count the number of majority class data in the K-nearest-neighbour majority class data set and the number of minority class data in the K-nearest-neighbour minority class data set respectively, and calculate the number of minority class data to be newly generated for each minority class data in the minority class data set.
Step 1, the software defect data are: S = {S_min, S_max};
where S_min represents the set of minority class data and S_max represents the set of majority class data; p_i represents the i-th minority class data in the minority class data set, i ∈ [1, N], and N denotes the number of minority class data in the minority class data set; d_k represents the k-th majority class data in the majority class data set, k ∈ [1, K], and K denotes the number of majority class data in the majority class data set;
In step 1, the minority class data closest to the selected minority class data is p_min_i, with min_i = argmin_{j ∈ [1, N], j ≠ i} ||p_i − p_j||_2, i ∈ [1, N], min_i ∈ [1, N];
where p_min_i represents the minority class data in the minority class data set closest to the selected i-th minority class data, and N represents the number of minority class data in the minority class data set;
The majority class data closest to the selected minority class data in step 1 is d_max_i, with max_i = argmin_{k ∈ [1, K]} ||p_i − d_k||_2, i ∈ [1, N], max_i ∈ [1, K];
where d_max_i represents the majority class data in the majority class data set closest to the selected i-th minority class data, and K represents the number of majority class data in the majority class data set;
The shortest Euclidean distance between the selected minority class data and the minority class data set in step 1 is: Dmin_i = ||p_i − p_min_i||_2;
The shortest Euclidean distance between the selected minority class data and the majority class data set in step 1 is: Dmax_i = ||p_i − d_max_i||_2;
The distance parameter of the selected minority class data calculated in step 1 is: α_i = Dmin_i / Dmax_i;
where α_i is the distance parameter of the i-th minority class data in the minority class data set;
Step 1, the minority class data in the minority class data set are marked according to their distance parameters as follows:
If α_i < 1, the data point type of the selected i-th minority class data in the minority class data set is marked as a safety point, and flag_i = 1;
If α_i = 1, the data point type of the selected i-th minority class data in the minority class data set is marked as a confusion point, and flag_i = 2;
If α_i > 1, the data point type of the selected i-th minority class data in the minority class data set is marked as a danger point, and flag_i = 3;
Step 1, calculate the K-nearest-neighbour point set of each minority class data in the minority class data set, i.e. the K data in S closest to p_i in Euclidean distance, where K is set to 5;
Step 1, the K-nearest-neighbour point set of each minority class data is divided into a K-nearest-neighbour majority class data set and a K-nearest-neighbour minority class data set;
Step 1, the number of majority class data in the K-nearest-neighbour majority class data set is recorded as k_i^max, and the number of minority class data in the K-nearest-neighbour minority class data set is recorded as k_i^min;
Step 1, calculate the number of minority class data to be newly generated for each minority class data in the minority class data set from its distance parameter and the composition of its K-nearest-neighbour point set;
where α_i is the distance parameter of the i-th minority class data in the minority class data set and n_i is the number of minority class data newly generated for the i-th minority class data in the minority class data set;
Step 1, calculate the newly generated software defect prediction data;
Step 1, the i-th minority class data in the minority class data set generates n_i new minority class data, so the newly generated minority class data are denoted p_new_i,j, where j ∈ [1, n_i];
Step 1, the offset of the j-th newly generated data of the i-th minority class data in the minority class data set with respect to the majority class is recorded as ε_i,j;
the offset ε_i,j with respect to the majority class is calculated from a deviation-from-majority degree parameter, which takes a random value between 0 and 1, and the majority class data nearest to the i-th minority class data;
Step 1, the offset of the j-th newly generated data of the i-th minority class data in the minority class data set biased towards the minority class is recorded as σ_i,j;
the offset σ_i,j biased towards the minority class is calculated from a biased-to-minority degree parameter, which takes a random value between 0 and 1.5, and the minority class data nearest to the i-th minority class data;
Step 1, the newly generated software defect prediction minority class data are recorded as p_new_i,j;
The calculation formula for the j-th newly generated data of the i-th minority class data of the newly generated software defect prediction data is:
p_new_i,j = p_i + ε_i,j + σ_i,j
Step 1, a newly formed minority class data set is obtained and recorded as S_new;
Step 1, for each minority class point p_i, its n_i new defect data are generated according to the generation rule for p_new above, and in this way the new minority class data set S_new is obtained;
where N' is the number of elements contained in the newly formed minority class data set S_new; the category of the new data is marked as defect data and recorded as the weak label L_w; the i-th element of the new minority class data set is denoted p'_i, and its label is L_w;
Step 2, respectively selecting a first classifier and a second classifier, and performing confidence evaluation on the newly generated software defect prediction minority class data to obtain a training data set;
the step 2 is specifically as follows:
step 2, respectively calculating the influence degree of the first classifier and the influence degree of the second classifier;
Step 2, train the first classifier H_1 with the newly formed minority class data set S_new; bring the elements of S_new into the first classifier H_1 in turn to obtain the predicted classes L_p1; for the i-th point p'_i in S_new, its weak label is L_w and the class predicted by H_1 is L_p1,i;
Train the second classifier H_2 with the newly formed minority class data set S_new; bring the elements of S_new into the second classifier H_2 in turn to obtain the predicted classes L_p2; for the i-th point p'_i in S_new, its weak label is L_w and the class predicted by H_2 is L_p2,i;
The influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier are each determined from how often the classifier's predicted class agrees with the weak label L_w;
where N is the number of elements of the minority class data set S_min; the indicator for the first classifier H_1 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise, and the indicator for the second classifier H_2 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise;
Step 2, update the labels of the minority class data according to the influence degree of the first classifier and the influence degree of the second classifier so as to construct updated original software defect data;
Step 2, the weak label L_w of each element of the new minority class data set is judged according to the influence degrees of the classifiers, giving a confidence degree γ_i for the i-th new minority class data; when the confidence degree γ_i > β, where β is set to 0.5, this minority class data is added to the training data, and when γ_i ≤ 0.5, the data is deleted directly and this minority class data is not added to the new training set.
Step 2, the newly formed minority class data S_new are screened again to obtain new minority class data S_new'; S_new' is added to the original software defect data S to obtain a new training set S';
Step 3, obtain the final prediction result through weighted voting, using the first classifier and the second classifier selected in step 2 together with the obtained training set S';
the step 3 specifically comprises the following steps:
After the new training data set S' is obtained, the first classifier H_1 and the second classifier H_2 are trained on it; the trained first classifier H_1 and second classifier H_2 are used to predict the data v, giving the prediction result L_1 of the first classifier and the prediction result L_2 of the second classifier; the influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier continue to be used, and the prediction result is obtained with the calculation formula L_pre = L_1 × o_1 + L_2 × o_2;
In step 3, when the value of L_pre is greater than 0.5, the class of v is predicted to be the minority class;
In step 3, when the value of L_pre is less than or equal to 0.5, the class of v is predicted to be the majority class.
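As a usage illustration of the sketches given earlier, the following hedged example strings the steps together on one labelled data set with β = 0.5 as in this embodiment; load_defect_data is a hypothetical loader, and the per-point generation quota replacing n_i is a simplifying assumption chosen to roughly balance the two classes, since the exact K-nearest-neighbour-based formula for n_i is not reproduced here.

```python
# End-to-end usage sketch; X holds software metric vectors, y holds labels
# (1 = defective / minority class, 0 = clean / majority class).
import numpy as np

X, y = load_defect_data()                      # hypothetical loader for a defect data set
S_min, S_max = X[y == 1], X[y == 0]

alpha, flags = mark_minority_points(S_min, S_max)
quota = max(1, len(S_max) // len(S_min) - 1)   # assumed n_i: roughly balance the classes
counts = np.full(len(S_min), quota)
S_new = generate_new_minority(S_min, S_max, counts, seed=0)

kept, (h1, o1), (h2, o2) = screen_new_minority(S_new, X, y, beta=0.5)

# retrain both classifiers on the new training set S' = S plus the kept samples
X_prime = np.vstack([X, kept])
y_prime = np.concatenate([y, np.ones(len(kept))])
h1.fit(X_prime, y_prime)
h2.fit(X_prime, y_prime)

print(predict_weighted(h1, o1, h2, o2, X[0], beta=0.5))
```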
In this embodiment, the method is compared with the mainstream SMOTE+SVM, SMOTE+decision tree, SMOTE+k-nearest-neighbour and SMOTE+naive Bayes methods, and the comparison results on the precision, F-measure, balance and AUC indexes are reported. Among all the compared methods, the accuracy of the proposed method is the highest, and its recognition accuracy reaches the advanced level in this field.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (2)
1. A class imbalance software defect prediction method based on data resampling, characterized in that:
Step 1, select each minority class data in the minority class data set in turn and calculate its Euclidean distance to every other minority class data in the minority class data set, screening out the minority class data closest to the selected minority class data; likewise calculate its Euclidean distance to every majority class data in the majority class data set, screening out the majority class data closest to the selected minority class data; calculate the distance parameter of the selected minority class data from the shortest Euclidean distance between the selected minority class data and the minority class data set and the shortest Euclidean distance between the selected minority class data and the majority class data set; mark the minority class data in the minority class data set according to their distance parameters and obtain the data point type of each minority class data; calculate the K-nearest-neighbour point set of each minority class data in the minority class data set, divide the K-nearest-neighbour point set of each minority class data into a K-nearest-neighbour majority class data set and a K-nearest-neighbour minority class data set, count the number of majority class data in the K-nearest-neighbour majority class data set and the number of minority class data in the K-nearest-neighbour minority class data set respectively, and calculate the number of minority class data to be newly generated for each minority class data in the minority class data set;
step 2, respectively selecting a first classifier and a second classifier, and performing confidence evaluation on the newly generated software defect prediction minority class data to obtain a training data set;
Step 3, obtain the final prediction result through weighted voting, using the first classifier and the second classifier selected in step 2 together with the obtained training set S';
the software defect data are: S = {S_min, S_max};
where S_min represents the set of minority class data and S_max represents the set of majority class data; p_i represents the i-th minority class data in the minority class data set, i ∈ [1, N], and N denotes the number of minority class data in the minority class data set; d_k represents the k-th majority class data in the majority class data set, k ∈ [1, K], and K denotes the number of majority class data in the majority class data set;
where p_min_i represents the minority class data in the minority class data set closest to the selected i-th minority class data, min_i ∈ [1, N], and N represents the number of minority class data in the minority class data set;
where d_max_i represents the majority class data in the majority class data set closest to the selected i-th minority class data, max_i ∈ [1, K], and K represents the number of majority class data in the majority class data set;
the shortest Euclidean distance between the minority class data selected in step 1 and the minority class data set is: Dmin_i = ||p_i − p_min_i||_2;
the shortest Euclidean distance between the minority class data selected in step 1 and the majority class data set is: Dmax_i = ||p_i − d_max_i||_2;
the distance parameter of the selected minority class data calculated in step 1 is: α_i = Dmin_i / Dmax_i;
where α_i is the distance parameter of the i-th minority class data in the minority class data set;
Step 1, the minority class data in the minority class data set are marked according to their distance parameters as follows:
If α_i < 1, the data point type of the selected i-th minority class data in the minority class data set is marked as a safety point, and flag_i = 1;
If α_i = 1, the data point type of the selected i-th minority class data in the minority class data set is marked as a confusion point, and flag_i = 2;
If α_i > 1, the data point type of the selected i-th minority class data in the minority class data set is marked as a danger point, and flag_i = 3;
Step 1, calculate the K-nearest-neighbour point set of each minority class data in the minority class data set, i.e. the K data in S closest to p_i in Euclidean distance;
Step 1, the K-nearest-neighbour point set of each minority class data is divided into a K-nearest-neighbour majority class data set and a K-nearest-neighbour minority class data set;
in step 1, the number of majority class data in the K-nearest-neighbour majority class data set is recorded as k_i^max;
in step 1, the number of minority class data in the K-nearest-neighbour minority class data set is recorded as k_i^min;
Step 1, calculate the number of minority class data to be newly generated for each minority class data in the minority class data set from its distance parameter and the composition of its K-nearest-neighbour point set;
where α_i is the distance parameter of the i-th minority class data in the minority class data set and n_i is the number of minority class data newly generated for the i-th minority class data in the minority class data set;
Step 1, calculate the newly generated software defect prediction data;
Step 1, the i-th minority class data in the minority class data set generates n_i new minority class data, so the newly generated minority class data are denoted p_new_i,j, where j ∈ [1, n_i];
Step 1, the offset of the j-th newly generated data of the i-th minority class data in the minority class data set with respect to the majority class is recorded as ε_i,j;
the offset ε_i,j with respect to the majority class is calculated from a deviation-from-majority degree parameter, which takes a random value between 0 and 1, and the majority class data nearest to the i-th minority class data;
Step 1, the offset of the j-th newly generated data of the i-th minority class data in the minority class data set biased towards the minority class is recorded as σ_i,j;
the offset σ_i,j biased towards the minority class is calculated from a biased-to-minority degree parameter, which takes a random value between 0 and 1.5, and the minority class data nearest to the i-th minority class data;
Step 1, the newly generated software defect prediction minority class data are recorded as p_new_i,j;
the calculation formula for the j-th newly generated data of the i-th minority class data of the newly generated software defect prediction data is:
p_new_i,j = p_i + ε_i,j + σ_i,j
Step 1, a newly generated minority class data set is obtained and recorded as S_new;
Step 1, for each minority class point p_i, its n_i new defect data are generated according to the generation rule for p_new above, and in this way the newly generated minority class data set S_new is obtained;
where N' is the number of elements contained in the new minority class data set S_new; the category of the new data is marked as defect data and recorded as the weak label L_w; the i-th element of the new minority class data set is denoted p'_i, and its label is L_w;
The step 2 is specifically as follows:
step 2, respectively calculating the influence degree of the first classifier and the influence degree of the second classifier;
Step 2, train the first classifier H_1 with the newly formed minority class data set S_new; bring the elements of S_new into the first classifier H_1 in turn to obtain the predicted classes L_p1; for the i-th point p'_i in S_new, its weak label is L_w and the class predicted by H_1 is L_p1,i;
Train the second classifier H_2 with the newly formed minority class data set S_new; bring the elements of S_new into the second classifier H_2 in turn to obtain the predicted classes L_p2; for the i-th point p'_i in S_new, its weak label is L_w and the class predicted by H_2 is L_p2,i;
the influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier are each determined from how often the classifier's predicted class agrees with the weak label L_w;
where N is the number of elements of the minority class data set S_min; the indicator for the first classifier H_1 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise, and the indicator for the second classifier H_2 takes the value 1 when its predicted class is the same as the weak label L_w and 0 otherwise;
Step 2, update the labels of the minority class data according to the influence degree of the first classifier and the influence degree of the second classifier so as to construct updated original software defect data;
Step 2, the weak label L_w of each element of the new minority class data set is judged according to the influence degrees of the classifiers, giving a confidence degree γ_i for the i-th new minority class data; when the confidence degree γ_i > β, this minority class data is added to the training data, and when γ_i ≤ β, the data is deleted directly and this minority class data is not added to the new training set;
Step 2, the newly formed minority class data S_new are screened again to obtain newly generated minority class data S_new'; S_new' is added to the original software defect data S to obtain a new training set S'.
2. The method of claim 1,
the step 3 specifically comprises the following steps:
after the new training data set S' is obtained, the first classifier H_1 and the second classifier H_2 are trained on it; the trained first classifier H_1 and second classifier H_2 are used to predict the data v, giving the prediction result L_1 of the first classifier and the prediction result L_2 of the second classifier; the influence degree o_1 of the first classifier and the influence degree o_2 of the second classifier continue to be used, and the prediction result is obtained with the calculation formula L_pre = L_1 × o_1 + L_2 × o_2;
in step 3, when the value of L_pre is greater than β, the class of the prediction data v is the minority class;
in step 3, when the value of L_pre is less than or equal to β, the class of the prediction data v is the majority class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110428102.3A | 2021-04-21 | 2021-04-21 | Class imbalance software defect prediction method based on data resampling
Publications (2)
Publication Number | Publication Date |
---|---|
CN113204481A CN113204481A (en) | 2021-08-03 |
CN113204481B true CN113204481B (en) | 2022-03-04 |
Family
ID=77027498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110428102.3A | Class imbalance software defect prediction method based on data resampling | 2021-04-21 | 2021-04-21
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113204481B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105677564A (en) * | 2016-01-04 | 2016-06-15 | 中国石油大学(华东) | Adaboost software defect unbalanced data classification method based on improvement |
CN107391452A (en) * | 2017-07-06 | 2017-11-24 | 武汉大学 | A kind of software defect estimated number method based on data lack sampling and integrated study |
CN108563556A (en) * | 2018-01-10 | 2018-09-21 | 江苏工程职业技术学院 | Software defect prediction optimization method based on differential evolution algorithm |
CN110471856A (en) * | 2019-08-21 | 2019-11-19 | 大连海事大学 | A kind of Software Defects Predict Methods based on data nonbalance |
CN110533116A (en) * | 2019-09-04 | 2019-12-03 | 大连大学 | Based on the adaptive set of Euclidean distance at unbalanced data classification method |
CN110674865A (en) * | 2019-09-20 | 2020-01-10 | 燕山大学 | Rule learning classifier integration method oriented to software defect class distribution unbalance |
CN110942153A (en) * | 2019-11-11 | 2020-03-31 | 西北工业大学 | Data resampling method based on repeated editing nearest neighbor and clustering oversampling |
CN111090579A (en) * | 2019-11-14 | 2020-05-01 | 北京航空航天大学 | Software defect prediction method based on Pearson correlation weighting association classification rule |
CN111522736A (en) * | 2020-03-26 | 2020-08-11 | 中南大学 | Software defect prediction method and device, electronic equipment and computer storage medium |
CN111767216A (en) * | 2020-06-23 | 2020-10-13 | 江苏工程职业技术学院 | Cross-version depth defect prediction method capable of relieving class overlap problem |
CN112465040A (en) * | 2020-12-01 | 2021-03-09 | 杭州电子科技大学 | Software defect prediction method based on class imbalance learning algorithm |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012116208A2 (en) * | 2011-02-23 | 2012-08-30 | New York University | Apparatus, method, and computer-accessible medium for explaining classifications of documents |
CN103810101B (en) * | 2014-02-19 | 2019-02-19 | 北京理工大学 | A kind of Software Defects Predict Methods and software defect forecasting system |
US10430315B2 (en) * | 2017-10-04 | 2019-10-01 | Blackberry Limited | Classifying warning messages generated by software developer tools |
Non-Patent Citations (3)
Title |
---|
SMOTE: Synthetic Minority Over-sampling Technique; Nitesh V. Chawla; Journal of Artificial Intelligence Research; 2002-06-02 *
Class-imbalanced sparse reconstruction metric learning for software defect prediction; Shi Zuoting; Computer Technology and Development; 2018-06-10 *
Machine learning classification strategies for imbalanced data sets; Xu Lingling, Chi Dongxiang; Computer Engineering and Applications; 2020-11-20 *
Also Published As
Publication number | Publication date |
---|---|
CN113204481A (en) | 2021-08-03 |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant