CN113434401B

CN113434401B - Software defect prediction method based on sample distribution characteristics and SPY algorithm

Info

Publication number: CN113434401B
Application number: CN202110703322.2A
Authority: CN
Inventors: 陈滨; 俞坚强; 方景龙
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2021-06-24
Filing date: 2021-06-24
Publication date: 2022-10-28
Anticipated expiration: 2041-06-24
Also published as: CN113434401A

Abstract

The invention discloses a software defect prediction method based on sample distribution characteristics and the SPY algorithm. Based on the distribution characteristics of software defect data sets, the invention proposes a formula for determining the boundary k value and uses it to select appropriate minority class boundary samples for each data set. In addition, the invention combines the SPY algorithm with a boundary sampling algorithm and optimizes the SPY algorithm through edge samples: some majority class samples in the minority class boundary region are set as SPY samples and given a smaller training sample weight, while boundary sampling is applied inside the original minority class boundary region. The SPY samples guide the classification of the minority class samples on the boundary, helping to ensure that minority class samples in the boundary region are correctly classified. At the same time, setting a smaller sample weight for the SPY samples reduces their influence on the classification of majority class samples, and a better classification effect is finally achieved.

Description

Software Defect Prediction Method Based on Sample Distribution Characteristics and SPY Algorithm

Technical Field

The present invention relates to software defect prediction methods, and in particular to a software defect prediction method based on sample distribution characteristics and the SPY algorithm. The invention is a class imbalance handling method for within-project software defect prediction. It is intended to balance software defect data sets and improve the classification performance of the model, ultimately helping testers find defective files and allocate test resources more effectively, thereby reducing the cost of software testing.

Background

For data sets with a balanced class distribution, traditional classification algorithms achieve good classification results. In real application scenarios, however, the data distribution is usually imbalanced, for example in financial fraud detection, medical diagnosis, and software fault detection. In these scenarios the data fall into two classes: most samples belong to the majority class and the rest belong to the minority class. When a traditional classification algorithm classifies imbalanced data, its results are biased toward the majority class and its recognition rate for minority class samples is low. In practice, however, the minority class samples are the more valuable ones. The classification of imbalanced data therefore has high research value.

Existing methods for the class imbalance classification problem take two main approaches.

1) Data sampling. The original data set is balanced by adding minority class data or removing majority class data. Methods that add minority class samples are called oversampling methods, and methods that remove majority class samples are called undersampling methods. There are also hybrid sampling methods that combine oversampling and undersampling. These methods balance the data directly in terms of quantity, but they change the distribution of the original data.

2) Classification algorithms, which mainly comprise cost-sensitive learning and ensemble learning. The costs of misclassifying majority class and minority class samples differ. Cost-sensitive learning sets different misclassification penalty factors for the two classes and raises the penalty factor for minority class samples, which balances the classification tendency of the classifier so that minority class samples are classified correctly as far as possible. Cost-sensitive learning does not change the distribution of the original samples, but the penalty factors for the two classes must be determined. Ensemble learning combines several weak classifiers, assigns each a weight according to its classification performance, and integrates them into one strong classifier.

In the SPY algorithm, the majority class samples around some minority class samples are treated as SPY samples and their labels are modified in order to balance the data set. However, balancing the data set requires a relatively large number of SPY samples, which affects the correct classification of majority class samples. The present invention therefore combines sample edge sampling with the SPY algorithm, optimizes the way SPY samples are selected, and adds training weight control for them, thereby improving the overall prediction performance.

Summary of the Invention

To address the deficiencies of the prior art, the invention proposes a software defect prediction method based on sample distribution characteristics and the SPY algorithm.

Based on the distribution characteristics of software defect data sets, the invention proposes a formula for determining the boundary k value and uses it to select appropriate minority class boundary samples for each data set. In addition, the invention combines the SPY algorithm with a boundary sampling algorithm, optimizes the SPY algorithm through edge samples, and proposes a new boundary oversampling method, BSGSMOTE. The method sets some of the majority class samples in the minority class boundary region as SPY samples, gives the SPY samples a smaller training sample weight, and applies boundary sampling inside the original minority class boundary region. The SPY samples guide the classification of the minority class samples on the boundary, helping to ensure that minority class samples in the boundary region are correctly classified. At the same time, setting a smaller sample weight for the SPY samples reduces their influence on the classification of majority class samples, and a better classification effect is finally achieved.

The invention mainly comprises the following steps:

Step 1) Extract sample features from the software defect data set, including the sample imbalance, the average distance between samples of the same class, and the sample variance.

a) Calculate the sample imbalance

The sample imbalance is the ratio of the number of majority class samples to the number of minority class samples in the data set. It is calculated as:

imbalance = num_N / num_P

where num_N is the number of majority class samples and num_P is the number of minority class samples.
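The ratio can be sketched in a few lines of Python. This is only an illustration: the function name `imbalance_ratio` and the convention that the label 1 marks the minority (defective) class are assumptions, not part of the patent text.

```python
def imbalance_ratio(labels, minority_label=1):
    """Return num_N / num_P for a list of class labels.

    Assumption: every label other than minority_label is majority class.
    """
    num_p = sum(1 for y in labels if y == minority_label)
    num_n = len(labels) - num_p
    if num_p == 0:
        raise ValueError("the data set contains no minority class samples")
    return num_n / num_p


# Toy labels: six non-defective modules (0) and two defective ones (1).
labels = [0, 0, 0, 0, 0, 0, 1, 1]
print(imbalance_ratio(labels))
```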

b) Calculate the average distance between samples of the same class

The average distance between samples of the same class describes how close the samples are to one another. Taking the minority class as an example, for each minority class sample P_i in the minority class sample set S_p, calculate the distances from P_i to its k nearest neighbors of the same class, giving d₁, d₂, …, d_k. Averaging these k distances gives the average distance dp_i from sample P_i to its neighboring samples:

dp_i = Avg(d₁, d₂, …, d_k)

Computing every dp_i gives dp = [dp₁, dp₂, …, dp_num_P]. The mean of dp is taken as the average distance measure between minority class samples, dp_average. The average distance between majority class samples, dn_average, is obtained in the same way.
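As a rough sketch of this computation, the following pure-Python function averages each sample's distance to its k nearest same-class neighbors and then averages the result over the class. Euclidean distance and the name `mean_knn_distance` are assumptions made for illustration; the patent text does not name a distance metric.

```python
import math


def mean_knn_distance(samples, k):
    """Average over all samples of the mean distance to the k nearest
    neighbors taken from the same list (dp_average or dn_average).

    Assumption: Euclidean distance between feature tuples.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples of the class")
    k = min(k, len(samples) - 1)  # a sample cannot be its own neighbor
    per_sample = []
    for i, p in enumerate(samples):
        dists = sorted(math.dist(p, q) for j, q in enumerate(samples) if j != i)
        per_sample.append(sum(dists[:k]) / k)  # dp_i = Avg(d_1, ..., d_k)
    return sum(per_sample) / len(per_sample)


minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
dp_average = mean_knn_distance(minority, k=2)
print(dp_average)
```

Calling the same function on the majority class samples gives dn_average.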

c) Calculate the sample variance

The variance of the samples describes their degree of dispersion. It is obtained by averaging the squared differences between each sample value and the mean of all samples. For samples of the same class, the larger the variance, the more dispersed the distribution; conversely, the smaller the variance, the more concentrated the distribution. The overall data set contains two classes of samples, the majority class and the minority class, and their variances are denoted S₁ and S₂. The variance is calculated as:

σ² = Σ(X − μ)² / N

where σ² is the population variance, X is a single sample, μ is the mean of all samples, and N is the number of samples in the data set.
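A minimal sketch of the variance feature follows. The scalar case matches the description directly (mean squared deviation from the mean). How a single variance is derived from multi-feature samples is not stated in the text, so `class_variance` below averages the per-feature variances; that choice, like both function names, is an assumption.

```python
def population_variance(values):
    """Mean of squared deviations from the mean (divides by N, not N - 1)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def class_variance(samples):
    """Variance summary for one class of multi-feature samples.

    Assumption: the per-feature population variances are averaged.
    """
    n_features = len(samples[0])
    per_feature = [
        population_variance([s[f] for s in samples]) for f in range(n_features)
    ]
    return sum(per_feature) / n_features


print(population_variance([1, 2, 3, 4]))
print(class_variance([(1.0, 0.0), (3.0, 0.0)]))
```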

Step 2) Calculate the adaptive boundary k value and select appropriate boundary samples with the k-nearest neighbor algorithm, in preparation for resampling the boundary samples.

a) Adaptive boundary k value calculation

Distribution characteristics differ between data sets, so the value of k needs to be adjusted adaptively to the distribution characteristics of each data set. Based on the distribution of the overall data set, two formulas for the k value are proposed, one from the perspective of distance and one from the perspective of variance. To prevent the boundary k from being too large or too small, k is constrained to the range 5 to 15.

Starting from the distances between individual samples and combining them with the overall imbalance ratio of the samples, the following formula is obtained:

s.t. k₁ ∈ [5, 15]

where imbalance is the sample imbalance ratio, dp_average is the average distance between minority class samples, and dn_average is the average distance between majority class samples.

Starting from the variances of the two sample populations and combining them with the overall imbalance ratio of the samples, a second formula for the k value is obtained:

s.t. k₂ ∈ [5, 15]

where imbalance is the sample imbalance ratio, S_P is the population variance of the minority class samples, and S_N is the population variance of the majority class samples.
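The two k formulas themselves are not reproduced in this text, so only the range constraint can be illustrated. The sketch below clamps an already computed raw value to [5, 15]; the function name `clamp_boundary_k` and the rounding to the nearest integer are assumptions.

```python
def clamp_boundary_k(raw_k, low=5, high=15):
    """Restrict a raw k value to the integer range [low, high].

    raw_k stands for the output of either k formula (not reproduced here).
    Assumption: the raw value is rounded to the nearest integer first.
    """
    return max(low, min(high, int(round(raw_k))))


for raw in (2.3, 9.6, 40.0):
    print(raw, "->", clamp_boundary_k(raw))
```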

b) Boundary sample selection

Using the boundary k value obtained above, the k-nearest neighbor algorithm finds the k nearest neighbors of each minority class sample. If, among these k neighbors, the majority class samples outnumber the minority class samples and the number of minority class neighbors is not 0, the sample is selected as a minority class boundary sample.
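The selection rule can be sketched as follows. The function name `select_boundary_minority`, the use of Euclidean distance, and the toy one-dimensional data are illustrative assumptions; k = 3 is used only to keep the example small, whereas the method constrains k to [5, 15].

```python
import math


def select_boundary_minority(samples, labels, k, minority_label=1):
    """Indices of minority samples whose k nearest neighbors contain more
    majority than minority samples, but at least one minority sample."""
    boundary = []
    for i, (p, y) in enumerate(zip(samples, labels)):
        if y != minority_label:
            continue
        others = [j for j in range(len(samples)) if j != i]
        neighbors = sorted(others, key=lambda j: math.dist(p, samples[j]))[:k]
        n_min = sum(1 for j in neighbors if labels[j] == minority_label)
        n_maj = len(neighbors) - n_min
        if n_maj > n_min and n_min != 0:
            boundary.append(i)
    return boundary


# Toy 1-D data: 0 = majority class, 1 = minority class.
samples = [(-3.0,), (-2.0,), (-1.0,), (0.0,), (1.0,), (2.0,), (2.5,), (3.0,),
           (9.0,), (9.5,), (10.0,), (10.5,), (20.0,), (21.0,), (-5.0,)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1]
print(select_boundary_minority(samples, labels, k=3))  # [6, 7]
```

In this toy data the minority sample at -5.0 has only majority class neighbors, so it is not selected; the samples at 9.0, 9.5 and 10.0 are surrounded mostly by their own class and are not selected either.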

Step 3) Select SPY samples around the minority class samples to help the two classes of samples in the boundary region be classified better, thereby improving the overall software defect prediction performance.

Boundary SPY samples are selected using the adaptive boundary k value obtained in step 2. SPY samples are majority class samples that lie close to the boundary of the minority class samples; they can be found by examining the neighbors of the minority class samples. Nearest neighbor analysis is performed on the minority class boundary, and suitable majority class samples are selected as SPY samples. Specifically, k-nearest neighbor analysis is performed on the minority class samples to obtain the neighbors around each of them. For each minority class sample, if its neighbors contain more minority class samples than majority class samples, the sample lies in a relatively safe region, and the majority class samples among those neighbors can be treated as SPY samples.
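The SPY selection rule stated above can be sketched in the same style. The function name `select_spy_samples`, Euclidean distance, and the toy data are assumptions; k = 3 is again used only for brevity.

```python
import math


def select_spy_samples(samples, labels, k, minority_label=1):
    """Indices of majority samples that appear among the k nearest neighbors
    of a minority sample whose neighborhood is mostly minority class."""
    spy = set()
    for i, (p, y) in enumerate(zip(samples, labels)):
        if y != minority_label:
            continue
        others = [j for j in range(len(samples)) if j != i]
        neighbors = sorted(others, key=lambda j: math.dist(p, samples[j]))[:k]
        minority_nb = [j for j in neighbors if labels[j] == minority_label]
        majority_nb = [j for j in neighbors if labels[j] != minority_label]
        if len(minority_nb) > len(majority_nb):
            spy.update(majority_nb)  # majority neighbors become SPY samples
    return sorted(spy)


# Toy 1-D data: 0 = majority class, 1 = minority class.
samples = [(-3.0,), (-2.0,), (-1.0,), (0.0,), (1.0,), (2.0,), (2.5,), (3.0,),
           (9.0,), (9.5,), (10.0,), (10.5,), (20.0,), (21.0,), (-5.0,)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1]
print(select_spy_samples(samples, labels, k=3))  # [11]
```

Here the majority sample at 10.5 sits next to a cluster of minority samples and is the only one picked as a SPY sample.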

Step 4) Perform oversampling among the boundary minority class samples to balance the data set.

Minority class samples are oversampled by linear interpolation. The invention uses the k-nearest neighbor algorithm with k = 5 to obtain the boundary neighbor samples of the minority class on the boundary, and performs random linear interpolation between two minority class samples, so that the newly generated minority class samples are distributed in the minority class boundary region, while the randomness makes the newly generated samples more diverse.

Step 5) Set training weights separately for the SPY samples and the other samples, to reduce the influence of the SPY samples on the majority class samples and thereby improve the overall effect.

SPY samples are by nature majority class samples, so setting their labels to the minority class label inevitably affects the classification decisions for the surrounding majority class samples. Reducing the training weight of the SPY samples reduces their influence on the decision boundary while still letting them guide the classification decisions for minority class samples. In the invention, the training weight of the SPY samples is set to 0.5 and the weight of all other samples is set to 1.

Step 6) Use machine learning models to train on the data set and make predictions.

The resulting class-balanced software defect data set is used to train the models; the models used are a logistic regression model, a decision tree model, a k-nearest neighbor model, and a Bayesian model. After training, the test set samples are preprocessed and input into the model to obtain the labels predicted by the model.

Beneficial effects of the invention:

1. The technique adaptively determines the value of k from the distribution characteristics of the original samples and uses the k-nearest neighbor algorithm to find suitable minority class boundary samples, in preparation for generating new samples in the minority class boundary region.

2. The technique combines the SPY algorithm with a boundary sampling algorithm. Majority class samples in a specified region are set as SPY samples to guide the decisions for minority class samples, and training weight control is added, so that more minority class samples are correctly classified and the probability of identifying defective samples increases.

Description of the Drawings

Figure 1: Definition of minority class boundary samples.

Figure 2: Definition of the average distance between samples of the same class.

Figure 3: Definition of SPY samples.

Figure 4: Overall flowchart of the algorithm model.

Detailed Description

The invention is described in detail below with reference to the drawings and a software defect prediction data set. The overall flow of the invention is shown in Figure 4, and the specific steps are as follows:

Step 1. Apply five-fold cross-validation to the original software defect data set, so that 80% of the data serves as the training set and the remaining 20% as the test set. Perform feature extraction on the samples in the training set to obtain three features: the sample imbalance, the average distance between samples of the same class, and the sample variance.
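A five-fold split of this kind can be sketched with the standard library alone. The generator name `k_fold_indices`, the fixed shuffle seed, and the plain (non-stratified) split are assumptions; the text does not say whether the folds are stratified by class.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Assumption: a plain shuffled split; with n_folds=5 each test fold holds
    about 20% of the samples and the training part about 80%.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(10):
    print(len(train), len(test))
```

Feature extraction and resampling would then be applied to the training indices of each fold only, leaving the test fold untouched.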

1) Obtain the sample imbalance

The sample imbalance is the ratio of the number of majority class samples to the number of minority class samples in the software defect data set. It is calculated as:

imbalance = num_N / num_P

where num_N is the number of majority class samples and num_P is the number of minority class samples, as shown in Figure 1.

2) Obtain the average distance between samples of the same class

As shown in Figure 2, the average distance between samples of the same class describes how close the samples are to one another. Taking the minority class as an example, for each minority class sample P_i in the minority class sample set S_p, calculate the distances from P_i to its k nearest neighbors of the same class, giving d₁, d₂, …, d_k. Averaging these k distances gives the average distance dp_i from sample P_i to its neighboring samples:

dp_i = Avg(d₁, d₂, …, d_k)

Computing every dp_i gives dp = [dp₁, dp₂, …, dp_num_P]. The mean of dp is taken as the average distance measure between minority class samples, dp_average. The average distance between majority class samples, dn_average, is obtained in the same way.

3) Obtain the sample variance

The variance of the samples describes their degree of dispersion. It is obtained by averaging the squared differences between each sample value and the mean of all samples. For samples of the same class, the larger the variance, the more dispersed the distribution; conversely, the smaller the variance, the more concentrated the distribution. The overall data set contains two classes of samples, the majority class and the minority class, and their variances are denoted S₁ and S₂. The variance is calculated as:

σ² = Σ(X − μ)² / N

where σ² is the population variance, X is a single sample, μ is the mean of all samples, and N is the number of samples in the data set.

Step 2. Substitute the three features obtained in step 1 into the two proposed boundary k value formulas to obtain two k values. The two formulas are as follows.

Starting from the distances between individual samples, combined with the imbalance ratio of the samples:

Starting from the variances of the sample populations, combined with the imbalance ratio of the samples:

Of the boundary k values calculated in step 2, the better one is selected as the parameter of the k-nearest neighbor algorithm; this patent uses k₂. The k-nearest neighbor algorithm finds the k nearest neighbors of each minority class sample. If, among these k neighbors, the majority class samples outnumber the minority class samples and the number of minority class samples is not 0, the minority class sample under analysis is taken as a minority class boundary sample.

Step 3. Perform k-nearest neighbor analysis on the minority class samples to obtain the neighbors around each of them. For each minority class sample, if its neighbors contain more minority class samples than majority class samples, the sample lies in a relatively safe region, and the majority class samples among those neighbors are selected as SPY samples.

Step 4. Perform linear interpolation sampling on the minority class samples in the boundary region. The invention uses the k-nearest neighbor algorithm with k = 5 to obtain the boundary neighbor samples of the minority class on the boundary, and performs random linear interpolation between two minority class samples, so that the newly generated minority class samples are distributed in the minority class boundary region, while the randomness makes the newly generated samples more diverse. The linear interpolation formula is:

n_i = (p_i − p_j) · δ + p_j
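The interpolation formula can be sketched as below. The text does not define δ, so the sketch assumes it is drawn uniformly from [0, 1), as in SMOTE-style oversampling. The function names, the choice of p_j among the k nearest minority neighbors of a boundary sample p_i, and the fixed seed are likewise assumptions.

```python
import math
import random


def interpolate(p_i, p_j, delta):
    """n = (p_i - p_j) * delta + p_j, applied feature by feature."""
    return tuple((a - b) * delta + b for a, b in zip(p_i, p_j))


def oversample_boundary(boundary, minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples near the boundary.

    Assumptions: p_i is a random boundary sample, p_j a random one of its
    k nearest minority neighbors, and delta is uniform in [0, 1).
    """
    rng = random.Random(seed)
    new_samples = []
    for _ in range(n_new):
        p_i = rng.choice(boundary)
        candidates = [q for q in minority if q != p_i]
        neighbors = sorted(candidates, key=lambda q: math.dist(p_i, q))[:k]
        p_j = rng.choice(neighbors)
        new_samples.append(interpolate(p_i, p_j, rng.random()))
    return new_samples


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
boundary = minority[:2]
print(oversample_boundary(boundary, minority, n_new=3, k=2))
```

Each synthetic sample lies on the line segment between p_i and p_j, so it stays inside the region spanned by existing minority samples.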

Step 5. Set the labels of the SPY samples to the minority class label and change their training weight to 0.5; set the training weight of all other samples to 1.
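Relabeling and weighting can be sketched as a single pass over the labels. The function name `relabel_and_weight` and the label convention (1 = minority class) are assumptions; the weights 0.5 and 1 come from the text.

```python
def relabel_and_weight(labels, spy_indices, minority_label=1, spy_weight=0.5):
    """Return (new_labels, weights): SPY samples receive the minority label
    and weight spy_weight; every other sample keeps its label and weight 1.0."""
    spy = set(spy_indices)
    new_labels = [minority_label if i in spy else y for i, y in enumerate(labels)]
    weights = [spy_weight if i in spy else 1.0 for i in range(len(labels))]
    return new_labels, weights


labels = [0, 0, 1, 0]
new_labels, weights = relabel_and_weight(labels, spy_indices=[3])
print(new_labels)  # [0, 0, 1, 1]
print(weights)     # [1.0, 1.0, 1.0, 0.5]
```

The weight list is in the form that many classifier libraries accept as per-sample training weights; for example, scikit-learn's `LogisticRegression.fit` takes a `sample_weight` argument.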

Step 6. Use the training set to train classic classification models such as a logistic regression model, a naive Bayes model, a support vector machine model, and a decision tree model. Then normalize the sample data of the test set, feed the normalized data into the trained model for classification prediction, and compute the evaluation metrics Recall, F1, AUC, and G-Mean.
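Three of the evaluation metrics can be sketched directly from the confusion counts. The definitions used here are the usual ones (G-Mean as the geometric mean of recall and specificity), which the text does not spell out, so they are stated as assumptions; AUC is omitted because it needs predicted scores rather than labels. The function name is hypothetical.

```python
import math


def recall_f1_gmean(y_true, y_pred, positive=1):
    """Recall, F1 and G-Mean for the positive (defective) class.

    Assumption: G-Mean = sqrt(recall * specificity).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1, math.sqrt(recall * specificity)


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(recall_f1_gmean(y_true, y_pred))
```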

Analysis of the final data distribution: the invention generates new samples mainly in the boundary region of the minority class samples, so the new samples lie in the boundary region. While selecting the boundary samples of the minority class, noise samples can be found and removed with the k-nearest neighbor algorithm. At the same time, some of the majority class samples are treated as SPY samples. Without adding too many new minority class samples, the SPY samples move the decision surface between the two classes toward the minority class region, helping to ensure that minority class samples are correctly classified. Controlling the training weights of the two kinds of samples also reduces the influence of the SPY samples on the classification of majority class samples, which ultimately improves the overall classification performance.

Claims

1. A software defect prediction method based on sample distribution characteristics and the SPY algorithm, characterized by comprising the following steps:

step 1) extracting sample features based on a software defect data set, including obtaining the sample imbalance, obtaining the average distance between samples of the same class, and obtaining the sample variance;

step 2) calculating an adaptive boundary k value, and selecting boundary samples according to the k-nearest neighbor algorithm;

a) Adaptive boundary k value calculation

The distribution characteristics of different data sets differ, and the value of k needs to be adjusted adaptively according to the distribution characteristics of the data set; based on the distribution of the overall data set, two formulas for the k value are proposed, from the perspectives of distance and variance; to prevent the boundary k from being too large or too small, the k value is constrained to the range 5 to 15;

starting from the distances between individual samples and combining them with the overall imbalance ratio of the samples, the following formula is obtained:

s.t. k₁ ∈ [5, 15]

where imbalance is the sample imbalance ratio, dp_average is the average distance between minority class samples, and dn_average is the average distance between majority class samples;

starting from the variances of the two sample populations and combining them with the overall imbalance ratio of the samples, another formula for the k value is obtained:

s.t. k₂ ∈ [5, 15]

where imbalance is the sample imbalance ratio, S_P is the population variance of the minority class samples, and S_N is the population variance of the majority class samples;

b) Boundary sample selection

According to the obtained boundary k value, the k nearest neighbor samples around each minority class sample are calculated using the k-nearest neighbor algorithm; if, among these k neighbor samples, the number of majority class samples is greater than the number of minority class samples and the number of neighboring minority class samples is not 0, the sample is selected as a minority class boundary sample;

step 3) performing k-nearest neighbor analysis on the minority class samples to obtain the neighbor samples around the minority class samples; for each minority class sample, if the number of minority class samples among its neighbor samples is greater than the number of majority class samples, the sample is in a relatively safe region, and the majority class samples among these neighbor samples are regarded as SPY samples; the SPY samples around the minority class samples are selected and used to guide the minority class samples in the boundary region to be better classified, so as to improve the overall software defect prediction performance;

step 4) performing oversampling among the boundary minority class samples to balance the data set;

step 5) setting training weights separately for the SPY samples and the other samples;

the training weight of the SPY samples is set to 0.5, and the weights of the other samples are set to 1; through this weight control, the SPY samples guide the correct classification of the minority class samples on the boundary while their influence on the classification of the majority class samples in the boundary region is reduced, thereby improving the overall classification prediction effect;

step 6) training on and predicting the data set using a machine learning model.

2. The software defect prediction method based on sample distribution characteristics and the SPY algorithm according to claim 1, wherein extracting sample features based on the software defect data set in step 1 specifically comprises the following:

1) Obtaining the sample imbalance

The sample imbalance is the ratio of the number of majority class samples to the number of minority class samples in the software defect data set, calculated by the following formula:

imbalance = num_N / num_P;

where num_N is the number of majority class samples and num_P is the number of minority class samples;

2) Obtaining the average distance between samples of the same class

The average distance between samples of the same class describes how close the samples are to one another; taking the minority class samples as an example, for each minority class sample P_i in the minority class sample set, the distances from P_i to the surrounding k nearest neighbor samples of the same class are calculated, giving the distances d₁, d₂, …, d_k; the k distances are averaged to obtain the average distance dp_i from sample P_i to its surrounding neighbor samples, by the formula:

dp_i = Avg(d₁, d₂, …, d_k);

each dp_i is calculated, finally giving dp = [dp₁, dp₂, …, dp_num_P]; the mean of dp is taken as the average distance measure dp_average between minority class samples; the average distance dn_average between majority class samples is obtained in the same way;

3) Obtaining the sample variance

The variance of the samples describes their degree of dispersion; the sample variance is obtained by averaging the squared differences between each sample value and the mean of all samples; for samples of the same class, the larger the variance, the more dispersed the distribution of the samples; conversely, the smaller the variance, the more concentrated the distribution of the samples; the overall data set comprises majority class samples and minority class samples, and the variances of the two classes of samples are calculated separately; the sample variance is calculated as:

σ² = Σ(X − μ)² / N

where σ² is the population variance, X is a single sample, μ is the mean of all samples, and N is the number of samples in the data set.

3. The software defect prediction method based on sample distribution characteristics and the SPY algorithm according to claim 1, wherein the oversampling among the boundary minority class samples in step 4 is specifically as follows:

the minority class samples are oversampled by linear interpolation; the boundary neighbor samples of the minority class on the boundary are obtained using the k-nearest neighbor algorithm with k = 5; random linear interpolation is performed between two minority class samples, so that the newly generated minority class samples are distributed in the minority class boundary region, while the randomness makes the newly generated samples more diverse.

4. The software defect prediction method based on sample distribution characteristics and the SPY algorithm according to claim 1, wherein the training on and prediction of the data set using a machine learning model in step 6 is specifically as follows:

the obtained class-balanced software defect data set is used to train classic classification models such as a logistic regression model, a decision tree model, a k-nearest neighbor model, and a Bayesian model, giving a trained model; after model training is finished, the test set samples are preprocessed and input into the model to obtain the labels predicted by the model.