CN113434401B - Software defect prediction method based on sample distribution characteristics and SPY algorithm - Google Patents
- Publication number: CN113434401B
- Application number: CN202110703322.2A
- Authority: CN (China)
- Prior art keywords: samples, sample, boundary, minority, spy
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3668—Testing of software
- G06F11/3672—Test management
- G06F11/3688—Test management for test execution, e.g. scheduling of test suites
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a software defect prediction method based on sample distribution characteristics and the SPY algorithm. Based on the distribution characteristics of software defect data sets, the invention proposes a formula for determining the boundary k value, which is used to select appropriate minority-class boundary samples for each data set. In addition, the invention combines the SPY algorithm with a boundary sampling algorithm and uses the boundary samples to optimize the SPY algorithm: some majority-class samples in the minority-class boundary region are set as SPY samples and given a smaller training weight, while boundary sampling is applied inside the original minority-class boundary region. The SPY samples guide the classification of minority-class samples on the boundary, helping ensure that minority-class samples in the boundary region are classified correctly. At the same time, the smaller weight assigned to SPY samples reduces their influence on the classification of majority-class samples, so that a better overall classification result is achieved.
Description
Technical field
The invention relates to software defect prediction methods, and in particular to a software defect prediction method based on sample distribution characteristics and the SPY algorithm. It is a class-imbalance handling method for within-project software defect prediction. The method balances the software defect data set and improves the classification performance of the model, which helps testers find defective files and allocate test resources more effectively and thereby reduces the cost of software testing.
Background
For data sets with a balanced class distribution, traditional classification algorithms achieve good results. In real applications, however, data is usually imbalanced, for example in financial fraud detection, medical diagnosis, and software failure detection. In these scenarios the data falls into two classes: most samples belong to the majority class and the rest belong to the minority class. When a traditional classification algorithm is applied to imbalanced data, its results are biased toward the majority class and its recognition rate for minority-class samples is low, even though the minority-class samples are the ones with more practical value. The classification of imbalanced data is therefore of considerable research value.
Existing methods for classifying class-imbalanced data work at two levels.
1) Data sampling. The original data set is balanced by adding minority-class data or removing majority-class data. Adding minority-class samples is called oversampling, and removing majority-class samples is called undersampling. Hybrid sampling methods combine oversampling and undersampling. These methods balance the data directly in terms of sample counts, but they change the distribution of the original data.
2) Classification algorithms, mainly cost-sensitive learning and ensemble learning. Misclassifying a majority-class sample and misclassifying a minority-class sample carry different costs. Cost-sensitive learning assigns different misclassification penalty factors to the two classes and raises the penalty for minority-class samples, which offsets the classifier's bias so that minority-class samples are classified correctly as often as possible. Cost-sensitive learning does not change the distribution of the original samples, but the penalty factors for the two classes must be determined. Ensemble learning combines several weak classifiers, assigns each a weight according to its classification performance, and integrates them into one strong classifier.
In the SPY algorithm, the majority-class samples around some minority-class samples are treated as SPY samples and their labels are changed in order to balance the data set. However, balancing the data set requires a relatively large number of SPY samples, which affects the correct classification of majority-class samples. The invention therefore combines boundary sampling with the SPY algorithm, optimizes the way SPY samples are selected, and adds training-weight control for them, thereby improving overall prediction performance.
Summary of the invention
To address the deficiencies of the prior art, the invention proposes a software defect prediction method based on sample distribution characteristics and the SPY algorithm.
Based on the distribution characteristics of software defect data sets, the invention proposes a formula for determining the boundary k value, which is used to select appropriate minority-class boundary samples for each data set. In addition, the invention combines the SPY algorithm with a boundary sampling algorithm, uses the boundary samples to optimize the SPY algorithm, and proposes a new boundary oversampling method, BSGSMOTE. The method sets some majority-class samples in the minority-class boundary region as SPY samples, gives the SPY samples a smaller training weight, and applies boundary sampling inside the original minority-class boundary region. The SPY samples guide the classification of minority-class samples on the boundary, helping ensure that minority-class samples in the boundary region are classified correctly. At the same time, the smaller weight assigned to SPY samples reduces their influence on the classification of majority-class samples, so that a better overall classification result is achieved.
The method comprises the following steps:
Step 1) Extract sample features from the software defect data set: the sample imbalance ratio, the average distance between samples of the same class, and the sample variance.
a) Compute the sample imbalance ratio
The imbalance ratio is the ratio of the number of majority-class samples to the number of minority-class samples in the data set:
$imbalance = num_N / num_P$
where $num_N$ is the number of majority-class samples and $num_P$ is the number of minority-class samples.
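The ratio above can be sketched in a few lines of Python. The function name `imbalance_ratio` and the convention that label `1` marks the minority (defective) class are assumptions made for illustration only.

```python
def imbalance_ratio(labels, minority_label=1):
    """Return num_N / num_P for a list of binary class labels."""
    num_p = sum(1 for y in labels if y == minority_label)  # minority count
    num_n = len(labels) - num_p                            # majority count
    if num_p == 0:
        raise ValueError("no minority-class samples in the data set")
    return num_n / num_p


# Toy data: 6 majority-class (0) and 2 minority-class (1) samples.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
ratio = imbalance_ratio(labels)
```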
b) Compute the average distance between samples of the same class
The average distance between samples of the same class describes how close the samples are to one another. Taking the minority class as an example: for each minority-class sample $P_i$ in the minority-class set $S_p$, compute the distances from $P_i$ to its k nearest neighbors of the same class, giving $d_1, d_2, \dots, d_k$. The mean of these k distances is the average distance $dp_i$ from $P_i$ to its neighbors:
$dp_i = \mathrm{Avg}(d_1, d_2, \dots, d_k)$
Computing $dp_i$ for every minority-class sample gives $dp = [dp_1, dp_2, \dots, dp_{num_P}]$, and the mean of $dp$ is taken as the average distance $dp_{average}$ between minority-class samples. The average distance $dn_{average}$ between majority-class samples is obtained in the same way.
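A minimal sketch of this feature, assuming Euclidean distance (the text does not name a metric) and at least two samples per class. The function name `avg_knn_distance` is hypothetical.

```python
import math


def avg_knn_distance(samples, k):
    """Mean, over all samples of one class, of the mean distance
    to the k nearest neighbors within that same class."""
    per_sample = []
    for i, p in enumerate(samples):
        # Distances from p to every other sample of the class, ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(samples) if j != i)
        nearest = dists[:k]
        per_sample.append(sum(nearest) / len(nearest))  # dp_i
    return sum(per_sample) / len(per_sample)            # dp_average


# Four minority-class samples on the corners of a unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dp_average = avg_knn_distance(minority, k=2)
```

Calling the same function on the majority-class samples would give $dn_{average}$.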
c) Compute the sample variance
The variance describes the dispersion of the samples. It is the average of the squared differences between each sample value and the mean of all samples. Within one class, a larger variance means the samples are more widely spread, and a smaller variance means they are more concentrated. The data set contains two classes, the majority class and the minority class, and the variances of the two classes are denoted $S_1$ and $S_2$. The variance is computed as:
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$
where $\sigma^2$ is the population variance, $X_i$ is a single sample, $\mu$ is the mean of the samples, and $N$ is the number of samples.
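The variance feature can be sketched as below. The text defines the variance for a single value per sample; extending it to multi-feature samples as the mean squared Euclidean distance to the class mean is an assumption of this sketch, as are the names `class_variance`, `s_p`, and `s_n`.

```python
def class_variance(samples):
    """Population variance of one class: the mean squared Euclidean
    distance between each sample and the class mean."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[d] for s in samples) / n for d in range(dim)]
    squared = (sum((s[d] - mean[d]) ** 2 for d in range(dim)) for s in samples)
    return sum(squared) / n


minority = [(0.0, 0.0), (2.0, 0.0)]
majority = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
s_p = class_variance(minority)  # variance of the minority class
s_n = class_variance(majority)  # variance of the majority class
```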
Step 2) Compute the adaptive boundary k value and use the k-nearest-neighbor algorithm to select suitable boundary samples in preparation for resampling them.
a) Adaptive boundary k value calculation
Different data sets have different distribution characteristics, so the value of k must be adapted to the distribution of each data set. Two formulas for k are proposed from the distribution of the whole data set, one based on distance and one based on variance. To keep the boundary k from becoming too large or too small, its value is constrained to the range 5 to 15.
The first formula starts from the distances between individual samples and combines them with the overall imbalance ratio of the data set:
[Formula for $k_1$ not reproduced in this text.]
s.t. $k_1 \in [5, 15]$
where $imbalance$ is the sample imbalance ratio, $dp_{average}$ is the average distance between minority-class samples, and $dn_{average}$ is the average distance between majority-class samples.
The second formula starts from the variances of the two classes and combines them with the overall imbalance ratio of the data set:
[Formula for $k_2$ not reproduced in this text.]
s.t. $k_2 \in [5, 15]$
where $imbalance$ is the sample imbalance ratio, $S_P$ is the population variance of the minority-class samples, and $S_N$ is the population variance of the majority-class samples.
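Only the range constraint on k is stated in this text, so the sketch below covers just that part: it takes a raw k estimate from either formula and restricts it to [5, 15]. Rounding to the nearest integer and the name `clamp_k` are assumptions made for illustration.

```python
def clamp_k(raw_k, low=5, high=15):
    """Round a raw k estimate and restrict it to the range [low, high]."""
    return max(low, min(high, int(round(raw_k))))


k_small = clamp_k(2.3)   # below the range, raised to the lower bound
k_mid = clamp_k(9.6)     # inside the range, only rounded
k_large = clamp_k(40)    # above the range, lowered to the upper bound
```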
b) Boundary sample selection
Using the boundary k value obtained above, the k-nearest-neighbor algorithm finds the k nearest neighbors of each minority-class sample. If, among these k neighbors, the majority-class samples outnumber the minority-class samples and the number of minority-class neighbors is not 0, the sample is selected as a minority-class boundary sample.
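The selection rule can be sketched as follows, assuming Euclidean distance and label `1` for the minority class. The helper names `k_nearest` and `boundary_minority` are hypothetical.

```python
import math


def k_nearest(index, points, k):
    """Indices of the k points closest to points[index], excluding itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return others[:k]


def boundary_minority(points, labels, k, minority_label=1):
    """Indices of minority samples whose k neighbors are mostly,
    but not entirely, majority-class samples."""
    selected = []
    for i, y in enumerate(labels):
        if y != minority_label:
            continue
        neigh = k_nearest(i, points, k)
        n_min = sum(1 for j in neigh if labels[j] == minority_label)
        n_maj = len(neigh) - n_min
        if n_maj > n_min and n_min > 0:
            selected.append(i)
    return selected


# One-dimensional toy data; k=3 is used only to keep the example small.
points = [(0.0,), (1.0,), (2.2,), (3.0,), (3.3,), (4.0,),
          (8.0,), (7.5,), (8.4,), (9.0,)]
labels = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
boundary = boundary_minority(points, labels, k=3)
```

In this toy data the sample at 2.2 has two majority neighbors and one minority neighbor, so it is a boundary sample. The isolated minority sample at 8.0 has no minority neighbors and is excluded, which matches the requirement that the minority-neighbor count be non-zero.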
Step 3) Select SPY samples around the minority-class samples to help the two classes in the boundary region be classified better, thereby improving overall software defect prediction.
Boundary SPY samples are selected using the adaptive boundary k value obtained in step 2. SPY samples are majority-class samples that lie close to the boundary of the minority class, and they can be found by examining the neighbors of the minority-class samples. A nearest-neighbor analysis is carried out on the minority-class boundary, and suitable majority-class samples are selected as SPY samples. Specifically, a k-nearest-neighbor analysis is performed on the minority-class samples to obtain the neighbors of each one. For each minority-class sample, if its neighbors include more minority-class samples than majority-class samples, the sample lies in a relatively safe region, and the majority-class samples among those neighbors can be treated as SPY samples.
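A minimal sketch of the SPY selection rule, assuming Euclidean distance and label `1` for the minority class. The helper names `k_nearest` and `spy_samples` are hypothetical.

```python
import math


def k_nearest(index, points, k):
    """Indices of the k points closest to points[index], excluding itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return others[:k]


def spy_samples(points, labels, k, minority_label=1):
    """Indices of majority samples that neighbor a 'safe' minority sample,
    i.e. one whose k neighbors contain more minority than majority samples."""
    spies = set()
    for i, y in enumerate(labels):
        if y != minority_label:
            continue
        neigh = k_nearest(i, points, k)
        maj = [j for j in neigh if labels[j] != minority_label]
        n_min = len(neigh) - len(maj)
        if n_min > len(maj):
            spies.update(maj)
    return sorted(spies)


# One-dimensional toy data; k=3 is used only to keep the example small.
points = [(0.0,), (1.0,), (2.2,), (3.0,), (3.3,), (4.0,),
          (8.0,), (7.5,), (8.4,), (9.0,)]
labels = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
spies = spy_samples(points, labels, k=3)
```

Here the minority samples at 0.0 and 1.0 each have two minority neighbors and one majority neighbor (the sample at 3.0), so that majority sample becomes the only SPY sample.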
Step 4) Oversample the boundary minority-class samples to balance the data set.
Minority-class samples are oversampled by linear interpolation. The invention uses the k-nearest-neighbor algorithm with k = 5 to obtain the neighboring samples of the minority-class samples on the boundary, and performs random linear interpolation between two minority-class samples. The newly generated minority-class samples therefore fall in the minority-class boundary region, and the randomness makes them more diverse.
Step 5) Set separate training weights for SPY samples and for all other samples to reduce the influence of SPY samples on majority-class samples and so improve the overall result.
SPY samples are by nature majority-class samples, so relabeling them with the minority-class label inevitably affects the classification of the surrounding majority-class samples. Lowering the training weight of the SPY samples reduces their influence on the decision boundary while still letting them guide the classification of minority-class samples. In the invention, the training weight of SPY samples is set to 0.5 and the weight of all other samples is set to 1.
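The relabeling and weighting can be sketched as below. The function name `apply_spy` and the use of label `1` for the minority class are assumptions; the weights 0.5 and 1 come from the text.

```python
def apply_spy(labels, spy_indices, minority_label=1, spy_weight=0.5):
    """Relabel SPY samples as minority class and build per-sample weights."""
    new_labels = list(labels)
    weights = [1.0] * len(labels)      # every non-SPY sample keeps weight 1
    for i in spy_indices:
        new_labels[i] = minority_label  # SPY sample takes the minority label
        weights[i] = spy_weight         # and a reduced training weight
    return new_labels, weights


labels = [0, 0, 1, 1]
new_labels, weights = apply_spy(labels, spy_indices=[1])
```

The resulting weight list has one entry per training sample, so it can be passed to any classifier whose training routine accepts per-sample weights.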
Step 6) Train a machine learning model on the data set and use it for prediction.
The class-balanced software defect data set is used to train the model. The models used are logistic regression, decision tree, k-nearest neighbors, and Bayesian models. After training, the test-set samples are preprocessed and fed into the model to obtain the predicted labels.
Beneficial effects of the invention:
1. The technique determines the value of k adaptively from the distribution characteristics of the original samples and uses the k-nearest-neighbor algorithm to find suitable minority-class boundary samples, in preparation for generating new samples in the minority-class boundary region.
2. The technique combines the SPY algorithm with a boundary sampling algorithm. Majority-class samples in a designated region are set as SPY samples to guide decisions on minority-class samples, and their training weight is controlled, so that more minority-class samples are classified correctly and defective samples are more likely to be identified.
Description of the drawings
Figure 1: Definition of minority-class boundary samples.
Figure 2: Definition of the average distance between samples of the same class.
Figure 3: Definition of SPY samples.
Figure 4: Overall flowchart of the algorithm.
Detailed description
The invention is described in detail below with reference to the drawings and a software defect prediction data set. The overall flow is shown in Figure 4, and the specific steps are as follows:
Step 1. Perform five-fold cross-validation on the original software defect data set, so that in each fold 80% of the data serves as the training set and the remaining 20% as the test set. Extract three features from the training samples: the sample imbalance ratio, the average distance between samples of the same class, and the sample variance.
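The five-fold split can be sketched with the standard library alone. The text does not say whether the folds are stratified by class, so this sketch uses a plain shuffled split. The function name `five_fold_splits` and the fixed seed are assumptions.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs; each fold holds out
    one fifth of the samples for testing and trains on the rest."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, test


splits = list(five_fold_splits(10))
```

For a strongly imbalanced data set, a stratified split that keeps the class ratio in every fold may be preferable, but that choice is not specified in the text.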
1) Obtain the sample imbalance ratio
The imbalance ratio is the ratio of the number of majority-class samples to the number of minority-class samples in the software defect data set:
$imbalance = num_N / num_P$
where $num_N$ is the number of majority-class samples and $num_P$ is the number of minority-class samples (see Figure 1).
2) Obtain the average distance between samples of the same class
As shown in Figure 2, the average distance between samples of the same class describes how close the samples are to one another. Taking the minority class as an example: for each minority-class sample $P_i$ in the minority-class set $S_p$, compute the distances from $P_i$ to its k nearest neighbors of the same class, giving $d_1, d_2, \dots, d_k$. The mean of these k distances is the average distance $dp_i$ from $P_i$ to its neighbors:
$dp_i = \mathrm{Avg}(d_1, d_2, \dots, d_k)$
Computing $dp_i$ for every minority-class sample gives $dp = [dp_1, dp_2, \dots, dp_{num_P}]$, and the mean of $dp$ is taken as the average distance $dp_{average}$ between minority-class samples. The average distance $dn_{average}$ between majority-class samples is obtained in the same way.
3) Obtain the sample variance
The variance describes the dispersion of the samples. It is the average of the squared differences between each sample value and the mean of all samples. Within one class, a larger variance means the samples are more widely spread, and a smaller variance means they are more concentrated. The data set contains two classes, the majority class and the minority class, and the variances of the two classes are denoted $S_1$ and $S_2$. The variance is computed as:
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$
where $\sigma^2$ is the population variance, $X_i$ is a single sample, $\mu$ is the mean of the samples, and $N$ is the number of samples.
Step 2. Substitute the three features obtained in step 1 into the two proposed boundary k formulas to obtain two k values.
The first formula starts from the distances between individual samples and combines them with the imbalance ratio:
[Formula for $k_1$ not reproduced in this text.]
where $imbalance$ is the sample imbalance ratio, $dp_{average}$ is the average distance between minority-class samples, and $dn_{average}$ is the average distance between majority-class samples.
The second formula starts from the variance of each class and combines it with the imbalance ratio:
[Formula for $k_2$ not reproduced in this text.]
where $imbalance$ is the sample imbalance ratio, $S_P$ is the population variance of the minority-class samples, and $S_N$ is the population variance of the majority-class samples.
Of the two boundary k values computed in this step, the better one is chosen as the k-nearest-neighbor parameter; this patent uses $k_2$. The k-nearest-neighbor algorithm is then applied to find the k nearest neighbors of each minority-class sample. If, among the k neighbors, the majority-class samples outnumber the minority-class samples and the number of minority-class neighbors is not 0, the minority-class sample under analysis is taken as a minority-class boundary sample.
Step 3. Perform a k-nearest-neighbor analysis on the minority-class samples to obtain the neighbors of each one. For each minority-class sample, if its neighbors include more minority-class samples than majority-class samples, the sample lies in a relatively safe region, and the majority-class samples among those neighbors are selected as SPY samples.
Step 4. Oversample the minority-class samples in the boundary region by linear interpolation. The invention uses the k-nearest-neighbor algorithm with k = 5 to obtain the neighboring samples of the minority-class samples on the boundary, and performs random linear interpolation between two minority-class samples. The newly generated minority-class samples therefore fall in the minority-class boundary region, and the randomness makes them more diverse. The linear interpolation formula is:
$n_i = (p_i - p_j) \cdot \delta + p_j$
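The interpolation formula can be sketched as below. The text does not define $\delta$; treating it as a random number drawn uniformly from [0, 1) is an assumption, as are the function names, the fixed seed, and the way the neighbor $p_j$ is picked at random from the k nearest minority-class neighbors of $p_i$.

```python
import math
import random


def interpolate(p_i, p_j, delta):
    """n = (p_i - p_j) * delta + p_j, applied to each feature."""
    return tuple((a - b) * delta + b for a, b in zip(p_i, p_j))


def oversample(boundary_points, minority_points, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples between boundary samples
    and their k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p_i = rng.choice(boundary_points)
        # k nearest minority-class neighbors of p_i, excluding p_i itself.
        neigh = sorted((q for q in minority_points if q != p_i),
                       key=lambda q: math.dist(p_i, q))[:k]
        p_j = rng.choice(neigh)
        synthetic.append(interpolate(p_i, p_j, rng.random()))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
boundary = [(1.0, 0.0), (0.0, 1.0)]
new_points = oversample(boundary, minority, n_new=4)
```

With $\delta$ in [0, 1), every synthetic point lies on the segment between $p_j$ and $p_i$, so it stays inside the region spanned by the existing minority-class samples.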
Step 5. Set the label of each SPY sample to the minority-class label and change its training weight to 0.5. Set the training weight of all other samples to 1.
Step 6. Train classic classification models on the training set, such as logistic regression, naive Bayes, support vector machine, and decision tree models. Then normalize the sample data in the test set, feed the normalized data into the trained model for classification, and compute the evaluation metrics Recall, F1, AUC, and G-Mean.
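Three of the four metrics can be computed directly from predicted labels, as sketched below. The text does not define G-Mean; the sketch uses the common definition, the square root of recall times specificity, which is an assumption here. AUC is omitted because it needs predicted scores, not labels. The function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Recall, F1, and G-Mean for binary labels; positive = defective class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {"recall": recall, "f1": f1, "g_mean": g_mean}


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```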
Analysis of the final data distribution: the invention generates new samples mainly in the boundary region of the minority class, so the new samples lie in that region. While the minority-class boundary samples are being selected, noise samples can be found and removed with the k-nearest-neighbor algorithm. At the same time, some majority-class samples are treated as SPY samples. Without adding too many new minority-class samples, the SPY samples move the decision surface between the two classes toward the minority-class region, which helps ensure that minority-class samples are classified correctly. Controlling the training weights of the two kinds of samples also reduces the influence of SPY samples on the classification of majority-class samples, which improves overall classification performance.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110703322.2A CN113434401B (en) | 2021-06-24 | 2021-06-24 | Software defect prediction method based on sample distribution characteristics and SPY algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110703322.2A CN113434401B (en) | 2021-06-24 | 2021-06-24 | Software defect prediction method based on sample distribution characteristics and SPY algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113434401A (en) | 2021-09-24 |
CN113434401B (en) | 2022-10-28 |
Family
ID=77753851
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110703322.2A Active CN113434401B (en) | 2021-06-24 | 2021-06-24 | Software defect prediction method based on sample distribution characteristics and SPY algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113434401B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114138632B (en) * | 2021-11-12 | 2025-06-03 | 杭州电子科技大学 | A real-time software defect prediction method based on data class imbalance distribution |
CN114490386B (en) * | 2022-01-26 | 2025-02-14 | 安徽大学 | A software defect prediction method and system based on information entropy oversampling |
CN114860297B (en) * | 2022-03-25 | 2024-09-13 | 上海师范大学 | SMOTE (short message analysis) improvement-based Bayes-LightGBM software defect prediction method |
CN114881166B (en) * | 2022-05-24 | 2025-06-27 | 江苏大学 | A charging pile fault detection method and oversampling algorithm |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112016756A (en) * | 2020-08-31 | 2020-12-01 | 北京深演智能科技股份有限公司 | Data prediction method and device |
CN112465153A (en) * | 2019-12-23 | 2021-03-09 | 北京邮电大学 | A Disk Failure Prediction Method Based on Imbalanced Ensemble Binary Classification |
CN112883855A (en) * | 2021-02-04 | 2021-06-01 | 东北林业大学 | Electroencephalogram signal emotion recognition based on CNN + data enhancement algorithm Borderline-SMOTE |
CN112932497A (en) * | 2021-03-10 | 2021-06-11 | 中山大学 | Unbalanced single-lead electrocardiogram data classification method and system |
CN112966778A (en) * | 2021-03-29 | 2021-06-15 | 上海冰鉴信息科技有限公司 | Data processing method and device for unbalanced sample data |
CN112990286A (en) * | 2021-03-08 | 2021-06-18 | 中电积至(海南)信息技术有限公司 | Malicious traffic detection method in data imbalance scene |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105760889A (en) * | 2016-03-01 | 2016-07-13 | 中国科学技术大学 | Efficient imbalanced data set classification method |
CN107944460A (en) * | 2016-10-12 | 2018-04-20 | 甘肃农业大学 | A class-imbalance classification method applied to bioinformatics |
CN110019770A (en) * | 2017-07-24 | 2019-07-16 | 华为技术有限公司 | Method and apparatus for training classification models |
US11444957B2 (en) * | 2018-07-31 | 2022-09-13 | Fortinet, Inc. | Automated feature extraction and artificial intelligence (AI) based detection and classification of malware |
US20200143274A1 (en) * | 2018-11-06 | 2020-05-07 | Kira Inc. | System and method for applying artificial intelligence techniques to respond to multiple choice questions |
CN109871862A (en) * | 2018-12-28 | 2019-06-11 | 北京航天测控技术有限公司 | A failure prediction method based on synthetic minority over-sampling and deep learning |
CN110532542B (en) * | 2019-07-15 | 2021-07-13 | 西安交通大学 | A method and system for identifying false invoices based on positive examples and unlabeled learning |
CN112633337A (en) * | 2020-12-14 | 2021-04-09 | 哈尔滨理工大学 | Unbalanced data processing method based on clustering and boundary points |
Also Published As
Publication number | Publication date |
---|---|
CN113434401A (en) | 2021-09-24 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN113434401B (en) | Software defect prediction method based on sample distribution characteristics and SPY algorithm | |
CN112633601B (en) | Method, device, equipment and computer medium for predicting disease event occurrence probability | |
CN111062425B (en) | Unbalanced data set processing method based on C-K-SMOTE algorithm | |
CN108304316B (en) | Software defect prediction method based on collaborative migration | |
CN112633337A (en) | Unbalanced data processing method based on clustering and boundary points | |
CN108877947B (en) | Deep sample learning method based on iterative mean clustering | |
CN112200392B (en) | Service prediction method and device | |
CN112557034A (en) | Bearing fault diagnosis method based on PCA_CNNS | |
CN111815209A (en) | Data dimension reduction method and device applied to risk control model | |
CN112634022A (en) | Credit risk assessment method and system based on unbalanced data processing | |
CN112466461A (en) | Medical image intelligent diagnosis method based on multi-network integration | |
CN111639688B (en) | Local interpretation method of Internet of things intelligent model based on linear kernel SVM | |
CN118364346A (en) | A Classification Method for Imbalanced Data Based on Mixed Sampling | |
Budiman et al. | Optimization of classification results by minimizing class imbalance on decision tree algorithm | |
TWI613545B (en) | Analysis method and analysis system of drawing processing program | |
CN113792141A (en) | Feature selection method based on covariance measure factor | |
Škvorc et al. | Analyzing the generalizability of automated algorithm selection: a case study for numerical optimization | |
Hou et al. | A new imbalanced data oversampling method based on Bootstrap method and Wasserstein Generative Adversarial Network | |
Pilát et al. | Improving many-objective optimizers with aggregate meta-models | |
CN112215290B (en) | Fisher score-based Q learning auxiliary data analysis method and Fisher score-based Q learning auxiliary data analysis system | |
CN114612255B (en) | Insurance pricing method based on electronic medical record data feature selection | |
Cheng et al. | OAGAN: an oversampling approach for imbalanced data problems | |
CN119888350A (en) | Multi-model unbalanced node classification method and device | |
CN119948504A (en) | System and method for training machine learning based models | |
CN118246593A (en) | A method for predicting scenic spot prosperity index based on intelligent judgment algorithm of power big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||