CN113434401B - Software defect prediction method based on sample distribution characteristics and SPY algorithm - Google Patents


Info

Publication number
CN113434401B
CN113434401B (application CN202110703322.2A)
Authority
CN
China
Prior art keywords
samples
sample
minority
boundary
spy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110703322.2A
Other languages
Chinese (zh)
Other versions
CN113434401A (en)
Inventor
陈滨
俞坚强
方景龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110703322.2A priority Critical patent/CN113434401B/en
Publication of CN113434401A publication Critical patent/CN113434401A/en
Application granted granted Critical
Publication of CN113434401B publication Critical patent/CN113434401B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/36 Preventing errors by testing or debugging software
    • G06F 11/3668 Software testing
    • G06F 11/3672 Test management
    • G06F 11/3688 Test management for test execution, e.g. scheduling of test suites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a software defect prediction method based on sample distribution characteristics and the SPY algorithm. A formula for determining the boundary k value is derived from the distribution characteristics of the software defect data set, so that suitable minority-class boundary samples can be selected for each data set. In addition, the SPY algorithm is combined with a boundary sampling algorithm: the SPY algorithm is refined using edge samples, some majority-class samples in the minority-class boundary region are converted into SPY samples, these SPY samples are given a smaller training weight, and the boundary sampling algorithm is applied in the original minority-class boundary region. The SPY samples guide the classification of minority-class samples on the boundary so that boundary-region minority samples are classified correctly, while their reduced training weight limits their influence on the classification of majority-class samples, yielding a better overall classification result.

Description

Software defect prediction method based on sample distribution characteristics and SPY algorithm
Technical Field
The invention relates to a software defect prediction method, and in particular to a software defect prediction method based on sample distribution characteristics and the SPY algorithm. It discloses a class-imbalance handling method for within-project software defect prediction, aiming to balance the software defect data set and improve the classification performance of the model, ultimately helping testers find defective files and allocate testing resources more effectively, thereby reducing the cost of software testing.
Background
Traditional classification algorithms achieve good results on data sets with balanced class distributions. In practical applications, however, data are usually imbalanced, as in financial fraud detection, medical diagnosis, and software failure prediction. In these scenarios the data fall into two broad classes: most samples belong to the majority class and the remainder to the minority class. When a traditional classifier is applied to imbalanced data, its predictions tend toward the majority class and the recognition rate for minority-class samples is low. Yet in practice the minority-class samples are usually the more valuable ones, so the classification of imbalanced data is of high research value.
Existing approaches to the class-imbalance classification problem fall into two main categories. 1) Data sampling. The original data set is balanced by adding minority-class samples or removing majority-class samples: methods that add minority-class samples are called oversampling, methods that remove majority-class samples are called undersampling, and hybrid methods combine the two. These methods balance the data directly at the level of sample counts, but they change the distribution of the original data. 2) Algorithm-level methods, mainly cost-sensitive learning and ensemble learning. Because misclassifying a majority-class sample and misclassifying a minority-class sample incur different costs, cost-sensitive learning sets different misclassification penalty factors for the two classes and raises the penalty for the minority class, balancing the classifier's tendency so that minority-class samples are classified correctly as far as possible. Cost-sensitive learning does not change the distribution of the original samples, but the penalty factors for the two classes must be determined. Ensemble learning combines several weak classifiers, each weighted according to its classification performance, into one strong classifier. In the SPY algorithm, majority-class samples surrounding some minority-class samples are treated as SPY samples and their labels are modified, which balances the data set. However, because many SPY samples are needed to balance the data set, the correct classification of majority-class samples is affected.
The invention therefore combines sample edge sampling with the SPY algorithm, optimizes the way SPY samples are selected, and adds control of the training weight of SPY samples, thereby improving overall prediction performance.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a software defect prediction method based on sample distribution characteristics and an SPY algorithm.
The invention proposes a formula for determining the boundary k value based on the distribution characteristics of the software defect data set, so that suitable minority-class boundary samples can be selected for each data set. It further combines the SPY algorithm with a boundary sampling algorithm into a new boundary oversampling method, BSGSMOTE, which optimizes the SPY algorithm through edge samples: some majority-class samples in the minority-class boundary region are converted into SPY samples, these SPY samples are given a smaller training weight, and the boundary sampling algorithm is applied in the original minority-class boundary region. The SPY samples guide the classification of minority-class samples on the boundary so that they are classified correctly, while their reduced training weight limits their influence on the classification of majority-class samples, yielding a better overall classification result.
The main body of the invention comprises the following steps:
step 1) Extract sample characteristics from the software defect data set: the sample imbalance ratio, the average distance between same-class samples, and the sample variance.
a) Calculating the sample imbalance ratio
Count the ratio of the number of majority-class samples to the number of minority-class samples in the data set. The imbalance ratio is computed as:
imbalance = num_N / num_P
where num_N is the number of majority-class samples and num_P is the number of minority-class samples.
b) Calculating the average distance between same-class samples
The average distance between same-class samples describes their proximity. Taking the minority class as an example, for each minority-class sample P_i in the minority sample set S_p, compute the distances d_1, d_2, …, d_k from P_i to its k nearest same-class neighbours, and average them to obtain the average distance dp_i from P_i to its neighbours:
dp_i = Avg(d_1, d_2, …, d_k).
Computing every dp_i yields dp = [dp_1, dp_2, …, dp_numP]; the mean of dp is taken as the average-distance measure dp_average between minority-class samples. The average distance dn_average between majority-class samples is obtained in the same way.
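As an illustration, the dp_average measure described above can be sketched in Python (a minimal NumPy sketch; the function name and the brute-force distance computation are ours, not from the patent):

```python
import numpy as np

def avg_same_class_distance(X, k):
    """dp_average: for each sample, average the distances to its k nearest
    same-class neighbours (dp_i), then average the dp_i over the class."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances within the class (brute force).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    dp = []
    for i in range(len(X)):
        nearest = np.sort(dist[i])[1:k + 1]   # drop the zero self-distance
        dp.append(nearest.mean())             # dp_i = Avg(d_1 ... d_k)
    return float(np.mean(dp))                 # dp_average
```

Applied once to the minority set this gives dp_average, and applied to the majority set it gives dn_average.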
c) Calculating the sample variance
The variance of a set of samples describes its degree of dispersion: it is the average of the squared differences between each sample value and the overall sample mean. For samples of the same class, the larger the variance, the more dispersed the distribution; conversely, the smaller the variance, the more concentrated the distribution. The data set contains two classes of samples, majority and minority, whose variances are obtained as S_N and S_P respectively. The variance is computed as:
σ² = (1/N) · Σ (X − μ)²
where σ² is the overall variance, X ranges over the individual samples, μ is the mean of all samples, and N is the number of samples.
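The population variance defined above is straightforward to compute; a minimal sketch (function name ours):

```python
import numpy as np

def population_variance(samples):
    """sigma^2 = (1/N) * sum((X - mu)^2): the average squared deviation
    from the overall mean, used as the dispersion measure per class."""
    x = np.asarray(samples, dtype=float)
    return float(((x - x.mean()) ** 2).mean())
```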
Step 2) Compute the adaptive boundary k value and select suitable boundary samples with the k-nearest-neighbour algorithm, in preparation for resampling the boundary samples.
a) Adaptive boundary k value calculation
Different data sets have different distribution characteristics, so the value of k must be adapted to the distribution of the data set. Based on the overall distribution, two formulas for k are proposed, one from the perspective of distance and one from the perspective of variance. To prevent the boundary k from being too large or too small, k is constrained to the range [5, 15].
Starting from the distances between individual samples and combining them with the overall imbalance ratio gives the first formula:
[formula for k₁; rendered as an image in the original, it combines the imbalance ratio with dp_average and dn_average]
s.t. k₁ ∈ [5, 15]
where imbalance is the sample imbalance ratio, dp_average is the average distance between minority-class samples, and dn_average is the average distance between majority-class samples.
Starting from the variances of the two sample populations and combining them with the imbalance ratio gives the second formula:
[formula for k₂; rendered as an image in the original, it combines the imbalance ratio with S_P and S_N]
s.t. k₂ ∈ [5, 15]
where imbalance is the sample imbalance ratio, S_P is the overall variance of the minority-class samples, and S_N is the overall variance of the majority-class samples.
b) Boundary sample selection
Using the k-nearest-neighbour algorithm with the obtained boundary k value, find the k nearest samples around each minority-class sample. If, among these k neighbours, the majority-class samples outnumber the minority-class samples and the number of minority-class neighbours is not zero, the sample is selected as a minority-class boundary sample.
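The boundary-selection rule above can be sketched as follows (a minimal illustration with brute-force neighbour search; the function name and index convention are ours):

```python
import numpy as np

def minority_boundary_samples(X_min, X_maj, k):
    """Indices of minority samples whose k-neighbourhood holds more
    majority than minority samples, with at least one minority neighbour."""
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    X_all = np.vstack([X_min, X_maj])
    n_min = len(X_min)                         # minority samples come first
    boundary = []
    for i in range(n_min):
        dist = np.linalg.norm(X_all - X_min[i], axis=1)
        neigh = [j for j in np.argsort(dist) if j != i][:k]  # exclude self
        min_cnt = sum(1 for j in neigh if j < n_min)
        maj_cnt = k - min_cnt
        if maj_cnt > min_cnt and min_cnt > 0:  # majority-dominated, not pure noise
            boundary.append(i)
    return boundary
```

Samples with zero minority neighbours are treated as noise and excluded, which matches the "not 0" condition above.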
Step 3) Select SPY samples around the minority-class samples to help the two classes in the boundary region be classified better, thereby improving the overall software defect prediction performance.
Boundary SPY samples are selected using the adaptive boundary k value obtained in step 2. SPY samples are majority-class samples close to the minority-class boundary; they are found by analysing the neighbourhood of each minority-class sample. Specifically, k-nearest-neighbour analysis is performed on each minority-class sample to obtain its surrounding neighbours. If the minority-class samples among these neighbours outnumber the majority-class samples, the sample lies in a relatively safe area, and the majority-class samples among its neighbours are taken as SPY samples.
Step 4) Oversample among the minority-class boundary samples to balance the data set.
The minority-class samples are oversampled by linear interpolation. The invention uses the k-nearest-neighbour algorithm with k = 5 to obtain the minority-class boundary neighbour samples on the boundary. Random linear interpolation between two minority-class samples ensures that the newly generated minority samples lie in the minority-class boundary region, while the randomness makes the new samples more diverse.
Step 5) Set training weights separately for the SPY samples and the other samples to reduce the influence of the SPY samples on the majority class, improving the overall result.
A training weight is set for the SPY samples. Because a SPY sample is essentially a majority-class sample whose label has been set to the minority-class label, it would otherwise distort the classification of the majority-class samples around it. Reducing the training weight of the SPY samples limits their influence on the decision boundary while still letting them guide the classification of the minority-class samples. In the invention, the training weight of SPY samples is set to 0.5 and the weight of all other samples to 1.
Step 6) Train and predict on the data set with machine learning models.
The class-balanced software defect data set is fed into a training model to obtain a trained model; logistic regression, decision tree, k-nearest-neighbour, and Bayesian models are used. After training, the test-set samples are preprocessed and input into the model to obtain the predicted labels.
The invention has the beneficial effects that:
1. The technique adaptively determines the k value from the distribution characteristics of the original samples and uses the k-nearest-neighbour algorithm to find suitable minority-class boundary samples, in preparation for generating new samples in the minority-class boundary region.
2. The technique combines the SPY algorithm with a boundary sampling algorithm. By converting the majority-class samples in a designated area into SPY samples, it guides the classification decisions for minority-class samples, and by adding training-weight control it allows more minority-class samples to be classified correctly, raising the probability of identifying defective samples.
Drawings
FIG. 1 is a definition diagram of a few class boundary samples.
FIG. 2 is a graph defining the average distance between samples of the same type.
Fig. 3 is a definition diagram of SPY samples.
FIG. 4 is an overall flow diagram of an algorithmic model.
Detailed Description
The invention is described in detail below with reference to the accompanying figures, using a software defect prediction data set. The overall flow of the invention is shown in FIG. 4; the specific steps are as follows:
Step 1. Apply five-fold cross-validation to the original software defect data set, taking 80% as the training set and the remaining 20% as the test set. Extract the characteristics of the training-set samples to obtain the sample imbalance ratio, the average distance between same-class samples, and the sample variance.
1) Obtaining the sample imbalance ratio
The ratio of the number of majority-class samples to the number of minority-class samples in the software defect data set is computed as:
imbalance = num_N / num_P
where num_N is the number of majority-class samples and num_P is the number of minority-class samples; see FIG. 1.
2) Obtaining the average distance between same-class samples
As shown in FIG. 2, the average distance between same-class samples describes their proximity. Taking the minority class as an example, for each minority-class sample P_i in the minority sample set S_p, compute the distances d_1, d_2, …, d_k from P_i to its k nearest same-class neighbours, and average them to obtain the average distance dp_i from P_i to its neighbours:
dp_i = Avg(d_1, d_2, …, d_k);
Computing every dp_i yields dp = [dp_1, dp_2, …, dp_numP]; the mean of dp is taken as the average-distance measure dp_average between minority-class samples. The average distance dn_average between majority-class samples is obtained in the same way.
3) Obtaining the sample variance
The variance of a set of samples describes its degree of dispersion: it is the average of the squared differences between each sample value and the overall sample mean. For samples of the same class, the larger the variance, the more dispersed the distribution; conversely, the smaller the variance, the more concentrated the distribution. The data set contains two classes of samples, majority and minority, whose variances are obtained separately using:
σ² = (1/N) · Σ (X − μ)²
where σ² is the overall variance, X ranges over the individual samples, μ is the mean of all samples, and N is the number of samples.
Step 2. Substitute the three characteristics obtained in step 1 into the two proposed boundary k value formulas to obtain two k values:
From the perspective of the distances between individual samples, combined with the sample imbalance ratio:
[formula for k₁; rendered as an image in the original, it combines the imbalance ratio with dp_average and dn_average]
where imbalance is the sample imbalance ratio, dp_average is the average distance between minority-class samples, and dn_average is the average distance between majority-class samples.
From the perspective of the variance of the sample populations, combined with the sample imbalance ratio:
[formula for k₂; rendered as an image in the original, it combines the imbalance ratio with S_P and S_N]
where imbalance is the sample imbalance ratio, S_P is the overall variance of the minority-class samples, and S_N is the overall variance of the majority-class samples.
The better of the two boundary k values computed in step 2 is selected as the k-nearest-neighbour parameter; this patent uses k₂. For each minority-class sample, the k nearest samples are found with the k-nearest-neighbour algorithm; if among them the majority-class samples outnumber the minority-class samples and the number of minority-class samples is not zero, the minority-class sample under analysis is taken as a minority-class boundary sample.
Step 3. Perform k-nearest-neighbour analysis on the minority-class samples to obtain the neighbours around each one. For each minority-class sample, if the minority-class samples among its neighbours outnumber the majority-class samples, the sample lies in a relatively safe area, and the majority-class samples among its neighbours are selected as SPY samples.
Step 4. Perform linear-interpolation sampling on the minority-class samples in the boundary area. The invention uses the k-nearest-neighbour algorithm with k = 5 to obtain the minority-class boundary neighbour samples on the boundary. Random linear interpolation between two minority-class samples ensures that the newly generated minority samples lie in the minority-class boundary region, while the randomness makes the new samples more diverse. The linear interpolation formula is:
n_i = (p_i − p_j) · δ + p_j
where p_i and p_j are two minority-class boundary samples and δ is a random coefficient.
and 5, setting the SPY sample labels as a minority class sample label, modifying the training weight of the SPY sample labels to be 0.5, and setting the training weights of other samples to be 1.
Step 6. Train classical classification models, such as logistic regression, naive Bayes, support vector machine, and decision tree models, on the training set. Then normalize the sample data of the test set, feed the normalized data into the trained models for classification prediction, and compute the evaluation metrics Recall, F1, AUC, and G-Mean.
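Of these metrics, G-Mean is the least standard: it is the geometric mean of the per-class recalls. A sketch (function name ours):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """G-Mean = sqrt(TPR * TNR): geometric mean of the recall on the
    positive (minority) class and the recall on the negative class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tpr = float(np.mean(y_pred[y_true == 1] == 1))  # minority recall
    tnr = float(np.mean(y_pred[y_true == 0] == 0))  # majority recall
    return float(np.sqrt(tpr * tnr))
```

Unlike accuracy, G-Mean drops to zero whenever either class is missed entirely, which is why it is favoured for imbalanced data.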
Analysis of the final data distribution: the method mainly generates new samples located in the boundary region of the minority class. While picking the minority-class boundary samples, noise samples can be found and removed by the k-nearest-neighbour algorithm. At the same time, some majority-class samples are treated as SPY samples; without adding too many minority-class samples, the SPY samples move the decision surface between the two classes toward the minority-class region so that minority-class samples can be classified correctly. By controlling the training weights of the two kinds of samples, the influence of the SPY samples on the classification of majority-class samples is reduced, and the overall classification performance is ultimately improved.

Claims (4)

1. A software defect prediction method based on sample distribution characteristics and the SPY algorithm, characterized by comprising the following steps:
step 1) extracting sample characteristics from a software defect data set: obtaining the sample imbalance ratio, the average distance between same-class samples, and the sample variance;
step 2) computing an adaptive boundary k value and selecting boundary samples with the k-nearest-neighbour algorithm;
a) Adaptive boundary k value calculation
Different data sets have different distribution characteristics, so the value of k must be adapted to the distribution of the data set; based on the overall distribution, two formulas for k are proposed, one from the perspective of distance and one from the perspective of variance; to prevent the boundary k from being too large or too small, k is constrained to the range [5, 15];
starting from the distances between individual samples and combining them with the overall imbalance ratio gives the following formula:
[formula for k₁; rendered as an image in the original, it combines the imbalance ratio with dp_average and dn_average]
s.t. k₁ ∈ [5, 15]
where imbalance is the sample imbalance ratio, dp_average is the average distance between minority-class samples, and dn_average is the average distance between majority-class samples;
starting from the variances of the two sample populations and combining them with the imbalance ratio gives the second formula:
[formula for k₂; rendered as an image in the original, it combines the imbalance ratio with S_P and S_N]
s.t. k₂ ∈ [5, 15]
where imbalance is the sample imbalance ratio, S_P is the overall variance of the minority-class samples, and S_N is the overall variance of the majority-class samples;
b) Boundary sample selection
Using the k-nearest-neighbour algorithm with the obtained boundary k value, find the k nearest samples around each minority-class sample; if among these k neighbours the majority-class samples outnumber the minority-class samples and the number of minority-class neighbours is not zero, the minority-class sample is selected as a minority-class boundary sample;
step 3) performing k-nearest-neighbour analysis on the minority-class samples to obtain the neighbours around each one; for each minority-class sample, if the minority-class samples among its neighbours outnumber the majority-class samples, the sample lies in a relatively safe area, and the majority-class samples among its neighbours are taken as SPY samples; the SPY samples selected around the minority-class samples guide the minority-class samples in the boundary region to be classified better, improving the overall software defect prediction performance;
step 4) oversampling among the minority-class boundary samples to balance the data set;
step 5) setting training weights separately for the SPY samples and the other samples;
the training weight of the SPY samples is set to 0.5 and the weight of all other samples to 1; this weight control lets the SPY samples guide the correct classification of the minority-class boundary samples while reducing their influence on the classification of the majority-class samples in the boundary area, improving the overall classification and prediction result;
and 6) training and predicting the data set by using a machine learning model.
2. The method for predicting software defects based on sample distribution characteristics and SPY algorithm according to claim 1, wherein the step 1 of extracting sample characteristics based on the software defect data set specifically comprises the following steps:
1) Obtaining the sample imbalance ratio
The ratio of the number of majority-class samples to the number of minority-class samples in the software defect data set is computed as:
imbalance = num_N / num_P
where num_N is the number of majority-class samples and num_P is the number of minority-class samples;
2) Obtaining the average distance between same-class samples
The average distance between same-class samples describes their proximity; taking the minority class as an example, for each minority-class sample P_i in the minority sample set, compute the distances d_1, d_2, …, d_k from P_i to its k nearest same-class neighbours, and average them to obtain the average distance dp_i from P_i to its neighbours:
dp_i = Avg(d_1, d_2, …, d_k);
computing every dp_i yields dp = [dp_1, dp_2, …, dp_numP]; the mean of dp is taken as the average-distance measure dp_average between minority-class samples; similarly, the average distance dn_average between majority-class samples is obtained;
3) Obtaining the sample variance
The variance of a set of samples describes its degree of dispersion: it is the average of the squared differences between each sample value and the overall sample mean; for samples of the same class, the larger the variance, the more dispersed the distribution, and conversely the smaller the variance, the more concentrated the distribution; the data set contains a majority class and a minority class, and the variance of each class is computed as:
σ² = (1/N) · Σ (X − μ)²
where σ² is the overall variance, X ranges over the individual samples, μ is the mean of all samples, and N is the number of samples.
3. The method for predicting software defects based on sample distribution characteristics and the SPY algorithm according to claim 1, wherein the oversampling among the boundary minority-class samples in step 4 is performed as follows:
the minority-class samples are oversampled by linear interpolation; the minority-class boundary neighbour samples on the boundary are obtained with the k-nearest-neighbour algorithm with k = 5; random linear interpolation between two minority-class samples ensures that the newly generated minority samples lie in the minority-class boundary region, while the randomness makes the new samples more diverse.
4. The method for predicting software defects based on sample distribution characteristics and the SPY algorithm according to claim 1, wherein the training and prediction on the data set with machine learning models in step 6 is performed as follows:
the obtained class-balanced software defect data set is used to train classical models such as logistic regression, decision tree, k-nearest-neighbour, and Bayesian models, yielding trained models; after training, the test-set samples are preprocessed and input into the model to obtain the predicted labels.
CN202110703322.2A 2021-06-24 2021-06-24 Software defect prediction method based on sample distribution characteristics and SPY algorithm Active CN113434401B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110703322.2A CN113434401B (en) 2021-06-24 2021-06-24 Software defect prediction method based on sample distribution characteristics and SPY algorithm

Publications (2)

Publication Number Publication Date
CN113434401A CN113434401A (en) 2021-09-24
CN113434401B true CN113434401B (en) 2022-10-28

Family

ID=77753851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110703322.2A Active CN113434401B (en) 2021-06-24 2021-06-24 Software defect prediction method based on sample distribution characteristics and SPY algorithm

Country Status (1)

Country Link
CN (1) CN113434401B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114490386A (en) * 2022-01-26 2022-05-13 安徽大学 Software defect prediction method and system based on information entropy oversampling
CN114860297B (en) * 2022-03-25 2024-09-13 上海师范大学 SMOTE (short message analysis) improvement-based Bayes-LightGBM software defect prediction method

Citations (6)

Publication number Priority date Publication date Assignee Title
CN112016756A (en) * 2020-08-31 2020-12-01 北京深演智能科技股份有限公司 Data prediction method and device
CN112465153A (en) * 2019-12-23 2021-03-09 北京邮电大学 Disk fault prediction method based on unbalanced integrated binary classification
CN112883855A (en) * 2021-02-04 2021-06-01 东北林业大学 Electroencephalogram signal emotion recognition based on CNN + data enhancement algorithm Borderline-SMOTE
CN112932497A (en) * 2021-03-10 2021-06-11 中山大学 Unbalanced single-lead electrocardiogram data classification method and system
CN112966778A (en) * 2021-03-29 2021-06-15 上海冰鉴信息科技有限公司 Data processing method and device for unbalanced sample data
CN112990286A (en) * 2021-03-08 2021-06-18 中电积至(海南)信息技术有限公司 Malicious traffic detection method in data imbalance scene

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
CN105760889A (en) * 2016-03-01 2016-07-13 中国科学技术大学 Efficient imbalanced data set classification method
CN107944460A (en) * 2016-10-12 2018-04-20 甘肃农业大学 One kind is applied to class imbalance sorting technique in bioinformatics
CN110019770A (en) * 2017-07-24 2019-07-16 华为技术有限公司 The method and apparatus of train classification models
US11444957B2 (en) * 2018-07-31 2022-09-13 Fortinet, Inc. Automated feature extraction and artificial intelligence (AI) based detection and classification of malware
US20200143274A1 (en) * 2018-11-06 2020-05-07 Kira Inc. System and method for applying artificial intelligence techniques to respond to multiple choice questions
CN109871862A (en) * 2018-12-28 2019-06-11 北京航天测控技术有限公司 A kind of failure prediction method based on synthesis minority class over-sampling and deep learning
CN110532542B (en) * 2019-07-15 2021-07-13 西安交通大学 Invoice false invoice identification method and system based on positive case and unmarked learning
CN112633337A (en) * 2020-12-14 2021-04-09 哈尔滨理工大学 Unbalanced data processing method based on clustering and boundary points

Similar Documents

Publication Publication Date Title
CN113434401B (en) Software defect prediction method based on sample distribution characteristics and SPY algorithm
CN111062425B (en) Unbalanced data set processing method based on C-K-SMOTE algorithm
CN108304316B (en) Software defect prediction method based on collaborative migration
CN112633337A (en) Unbalanced data processing method based on clustering and boundary points
AU2019293020B2 (en) Display control device, display control method, and display control program
CN109145960A (en) Based on the data characteristics selection method and system for improving particle swarm algorithm
CN111259924A (en) Boundary synthesis, mixed sampling, anomaly detection algorithm and data classification method
CN109472302A (en) A kind of support vector machine ensembles learning method based on AdaBoost
Patel et al. Multi-Classifier Analysis of Leukemia Gene Expression From Curated Microarray Database (CuMiDa)
CN115910362A (en) Atopic dermatitis characteristic prediction method based on enhanced particle swarm optimization
CN117423451B (en) Intelligent molecular diagnosis method and system based on big data analysis
CN110276395A (en) Unbalanced data classification method based on regularization dynamic integrity
CN113936185A (en) Software defect data self-adaptive oversampling method based on local density information
CN113269200A (en) Unbalanced data oversampling method based on minority sample spatial distribution
Gao et al. An ensemble classifier learning approach to ROC optimization
CN106018325B (en) A method of evaluation gasoline property modeling and forecasting credible result degree
Liang et al. ASE: Anomaly Scoring Based Ensemble Learning for Imbalanced Datasets
CN116313111A (en) Breast cancer risk prediction method, system, medium and equipment based on combined model
Bhowan et al. Differentiating between individual class performance in genetic programming fitness for classification with unbalanced data
CN111274119B (en) Variation test data generation method based on multi-population coevolution
CN114511002A (en) Fault diagnosis method and system for small sample data
CN113792141A (en) Feature selection method based on covariance measurement factor
CN113392908A (en) Unbalanced data oversampling algorithm based on boundary density
WO2019190732A1 (en) Apparatus and method for identification of primary immune resistance in cancer patients
CN111252166B (en) Bulldozer control assembly process control method and device based on K-means clustering algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant