CN103810101A - Software defect prediction method and system

Info

Publication number: CN103810101A
Application number: CN201410056779.9A
Authority: CN
Inventors: 胡昌振; 单纯; 陈博洋; 马锐; 王勇
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2014-02-19
Filing date: 2014-02-19
Publication date: 2014-05-21
Anticipated expiration: 2034-02-19
Also published as: CN103810101B

Abstract

The invention provides a software defect prediction method and a software defect prediction system to address the low accuracy of existing software defect prediction. The system comprises a dimensionality reduction processing unit, an SVM training unit, and a defect prediction unit. The method comprises: Step 1, performing dimensionality reduction on a first training data set using the locally linear embedding (LLE) algorithm, obtaining the low-dimensional vector to which each sample point in the first training data set is mapped in a low-dimensional space, and forming a second training data set from these low-dimensional vectors; Step 2, training a support vector machine (SVM) classifier on the second training data set to obtain the optimal classification hyperplane function of the SVM classifier, and thereby a trained SVM classifier; Step 3, performing defect prediction on the software to be predicted using the trained SVM classifier.

Description

A software defect prediction method and software defect prediction system

Technical field

The invention relates to the field of software security, and in particular to a software defect prediction method and a software defect prediction system.

Background art

Software defect prediction technology emerged in the 1970s. Its main role is to guide quality assurance work and to provide a high-value reference for balancing software costs. Software defect prediction is divided mainly into dynamic prediction and static prediction; current research concentrates on static prediction, and the present invention belongs to the distribution prediction techniques within static prediction. The support vector machine (SVM) is a machine learning method developed on the basis of statistical learning theory, with many distinctive advantages in small-sample, nonlinear, and high-dimensional pattern recognition. Existing software defect prediction mainly uses the SVM to build a prediction model for predicting software defects. Patents related to software defect prediction mainly include: a defect prediction method and system based on requirement changes (publication number CN200910080742), and a software defect priority prediction method based on an improved support vector machine (publication number CN201210057888).

The prior art approach has two parts: dimensionality reduction of the data set and optimization of the SVM parameters. The prior art proposes various solutions to these two problems and has achieved some results, but the dimensionality reduction methods it selects have limitations: the result after dimensionality reduction cannot guarantee the integrity of the original data, nor does it best reflect the intrinsic dimensionality. Because software defect prediction is itself an operation on a data set, preserving data integrity is important for the accuracy of the prediction results.

Summary of the invention

The invention provides a software defect prediction method and a software defect prediction system to address the low accuracy of existing software defect prediction.

A software defect prediction method, comprising the following steps:

Step 1. Perform dimensionality reduction on a first training data set using the locally linear embedding (LLE) algorithm, obtain the low-dimensional vector to which each sample point in the first training data set is mapped in a low-dimensional space, and form a second training data set from these low-dimensional vectors;

Step 2. Train a support vector machine (SVM) classifier on the second training data set to obtain the optimal classification hyperplane function of the SVM classifier, and thereby a trained SVM classifier;

Step 3. Perform defect prediction on the software to be predicted using the trained SVM classifier.

In step 1, the second training data set composed of the low-dimensional vectors is obtained as follows:

1.1 Let the first training data set be {X_1, X_2, ..., X_N}, X_i ∈ R^D, where X_i is a vector in D-dimensional space;

1.2 Compute the K nearest neighbor points of each sample point X_i in the first training data set;

1.3 Using the K neighbor points of each sample point, compute the local reconstruction weight matrix W according to Formula 1;

$\begin{cases} \min \epsilon(W) = \sum_{i=1}^{N} \left\| X_i - \sum_{j=1}^{K} w_{ij} X_{ij} \right\|^2 \\ \text{s.t.} \ \sum_{j=1}^{K} w_{ij} = 1 \end{cases}$ Formula 1

where N is the number of sample points, X_ij is the j-th neighbor point of X_i, and w_ij is the coefficient with which the i-th sample point X_i is represented by its j-th neighbor point; the coefficients with which all sample points X_i in the first training data set are represented by their neighbor points form the local reconstruction weight matrix W;

1.4 From the local reconstruction weight matrix W and the neighbor points of each sample point, compute the low-dimensional vector corresponding to each sample point according to Formula 2;

$\begin{cases} \min \epsilon(Y) = \sum_{i=1}^{N} \left\| Y_i - \sum_{j=1}^{K} w_{ij} Y_{ij} \right\|^2 = \min \operatorname{tr}(Y M Y^T) \\ \text{s.t.} \ Y Y^T = I \end{cases}$ Formula 2

where I is the identity matrix and M = (I - W)^T (I - W).
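For a single sample point, the weight problem of Formula 1 has a closed-form solution. The derivation below is the standard LLE result rather than text from this patent; the local Gram matrix G and the all-ones vector 1 are notation introduced here, the K weights of each point are taken to sum to 1, and G is assumed to be invertible.

$G_{jk} = (X_i - X_{ij})^T (X_i - X_{ik}), \quad j, k = 1, \ldots, K$

$\left\| X_i - \sum_{j=1}^{K} w_{ij} X_{ij} \right\|^2 = \left\| \sum_{j=1}^{K} w_{ij} (X_i - X_{ij}) \right\|^2 = w_i^T G w_i$

$w_i = \frac{G^{-1} \mathbf{1}}{\mathbf{1}^T G^{-1} \mathbf{1}}$

The second line uses the sum-to-one constraint to move X_i inside the sum; the third follows from minimizing w_i^T G w_i under that constraint with a Lagrange multiplier.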

In step 2, the trained SVM classifier is obtained as follows:

Solve for the optimal classification hyperplane function of the SVM classifier according to Formula 3:

$\begin{cases} \min \frac{1}{2} \left\| \omega \right\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \ y_i (\omega^T \varphi(x_i) + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, n \end{cases}$ Formula 3

where ω is a d-dimensional vector orthogonal to the classification hyperplane, b is the bias term, C is the penalty coefficient, ξ_i are the slack variables, and φ(x) is the mapping induced by the kernel function used by the SVM classifier.

The kernel function is the radial basis kernel function, of the form:

$K(x, x_i) = \exp\left( -\frac{\left\| x - x_i \right\|^2}{\sigma^2} \right)$ Formula 4

where σ is the width parameter of the radial basis kernel function.
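As a minimal sketch, Formula 4 can be evaluated directly. The function name `rbf_kernel` and the sample points are illustrative choices, not part of the patent.

```python
import math

def rbf_kernel(x, x_i, sigma):
    """Formula 4: K(x, x_i) = exp(-||x - x_i||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-sq_dist / sigma ** 2)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], sigma=0.5)  # identical points
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0], sigma=1.0)   # squared distance is 25
```

Identical points give a kernel value of 1, and the value decays toward 0 as the points move apart. Libraries that parameterize the kernel as exp(-γ‖x - x_i‖²), as scikit-learn's `SVC` does, match Formula 4 when γ = 1/σ².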

In obtaining the optimal classification hyperplane function of the SVM classifier, the grid search method and the ten-fold cross-validation method are used to optimize the SVM parameter C and the kernel parameter σ: the pair of values of C and σ giving the highest SVM classification accuracy is found and used to determine the optimal classification hyperplane function.

This optimization includes: assigning values to C and σ by grid search, that is, forming every combination of the values in the range of C with the values in the range of σ, and searching over all these combinations.

This optimization further includes: for each selected pair of parameters C and σ, obtaining the classification accuracy under that pair by ten-fold cross-validation, and taking the pair with the highest classification accuracy as the best parameter values. Ten-fold cross-validation means dividing the second training data set into 10 subsets; one subset serves as the test set and the remaining 9 as the training set, giving one classification accuracy for the selected parameter pair. This is repeated 10 times, with a different subset as the test set each time, giving 10 classification accuracies under that pair. The mean of these 10 accuracies is the index for evaluating each parameter pair. The means for all selected pairs are then compared, and the pair C, σ with the highest mean is taken as the best parameter values.
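The combined grid search and ten-fold cross-validation could be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the patent's implementation: the data are synthetic stand-ins for the second training data set, the parameter values are a small illustrative grid, and the label encoding (1 = defective) is assumed. `GridSearchCV` with `cv=10` uses stratified folds for classifiers, which is slightly stricter than the plain ten-way split described above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the reduced (second) training data set: two Gaussian blobs.
X = np.vstack([rng.normal(0.0, 1.0, (60, 3)), rng.normal(2.5, 1.0, (60, 3))])
y = np.array([0] * 60 + [1] * 60)  # assumed encoding: 1 = defective module

sigmas = [0.5, 1.0, 2.0, 4.0]      # illustrative values, not the patent's range
param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],
    "gamma": [1.0 / s ** 2 for s in sigmas],  # gamma = 1/sigma^2 matches Formula 4
}
# Every (C, gamma) pair is scored by its mean accuracy over 10 folds.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=10)
search.fit(X, y)

best_C = search.best_params_["C"]
best_sigma = (1.0 / search.best_params_["gamma"]) ** 0.5
best_mean_accuracy = search.best_score_
```

After fitting, `search.best_estimator_` is an SVM retrained on the full data set with the selected pair, which corresponds to the trained classifier of step 2.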

Software defect prediction according to the optimal classification hyperplane function is performed as follows:

First, reduce the dimensionality of the data set of the software to be predicted using the LLE algorithm;

Second, input the reduced data set into the trained SVM classifier for classification. If the input data fall into the defect-free region determined by the optimal classification hyperplane function, the software module corresponding to the data is determined to contain no defect and is marked accordingly in the output of the SVM classifier; if the input data fall into the defective region determined by the optimal classification hyperplane function, the software module corresponding to the data is determined to contain a defect and is marked accordingly in the output of the SVM classifier.

A software defect prediction system, comprising: a dimensionality reduction processing unit, an SVM training unit, and a defect prediction unit;

the dimensionality reduction processing unit is configured to perform dimensionality reduction on the first training data set using the locally linear embedding (LLE) algorithm, obtain the low-dimensional vector to which each sample point in the first training data set is mapped in a low-dimensional space, and form a second training data set from these low-dimensional vectors;

the SVM training unit is configured to train the SVM classifier on the second training data set to obtain the optimal classification hyperplane function of the SVM classifier, and thereby a trained SVM classifier;

the defect prediction unit is configured to perform defect prediction on the software to be predicted using the trained SVM classifier.

Beneficial effects of the invention:

In the software defect prediction method and system provided by the invention, the locally linear embedding algorithm is first used to reduce the dimensionality of the training data set. This preserves the geometric structure of the sample points after reduction, so that the reduced data reflect the characteristics of the original data set more completely. Second, the grid search algorithm is used to search over the SVM parameter C and the kernel parameter σ, and ten-fold cross-validation identifies the pair of C and σ values giving the highest SVM classification accuracy. This pair is taken as the optimal parameters and determines the optimal classification hyperplane function of the SVM, which is then used for software defect prediction, thereby improving prediction accuracy.

Description of the drawings

Fig. 1 is a block diagram of a software defect prediction method provided by one embodiment of the invention;

Fig. 2 is a flowchart of a software defect prediction method provided by another embodiment of the invention;

Fig. 3 is a block diagram of a software defect prediction system provided by another embodiment of the invention.

Detailed description of the embodiments

To make the purpose, technical solution, and advantages of the invention clearer, embodiments of the invention are described in further detail below with reference to the drawings.

The technical idea of the invention addresses the limitation of existing dimensionality reduction methods, namely that the result after reduction cannot guarantee the integrity of the data and does not best reflect the intrinsic dimensionality. The embodiments of the invention use the locally linear embedding (LLE) algorithm to reduce the dimensionality of the software defect data set. The algorithm starts from the spatial structure of the sample data and preserves the geometric structure of the data samples after reduction, so that the reduced data reflect the characteristics of the original data set more completely. Because software defect prediction is itself an operation on a data set, reflecting the characteristics of the original data more completely is extremely important for improving the accuracy of the prediction results.

One embodiment of the invention provides a software defect prediction method. Fig. 1 is a block diagram of this method. Referring to Fig. 1, the method includes:

Step S100: Perform dimensionality reduction on the first training data set using the locally linear embedding (LLE) algorithm, obtain the low-dimensional vector to which each sample point in the first training data set is mapped in a low-dimensional space, and form a second training data set from these low-dimensional vectors;

Step S110: Train the SVM classifier on the second training data set to obtain the optimal classification hyperplane function of the SVM classifier, and thereby a trained SVM classifier;

Step S120: Perform defect prediction on the software to be predicted according to the optimal classification hyperplane function.

In this embodiment, performing dimensionality reduction on the first training data set using the LLE algorithm, obtaining the low-dimensional vector to which each sample point is mapped, and forming the second training data set from these low-dimensional vectors includes:

Let the first training data set be {X_1, X_2, ..., X_N}, X_i ∈ R^D, where X_i is a vector in D-dimensional space;

Compute the K nearest neighbor points of each sample point X_i in the first training data set;

Using the K neighbor points of each sample point, compute the local reconstruction weight matrix W according to Formula 1;

From the local reconstruction weight matrix W and the neighbor points, compute the low-dimensional vector corresponding to each sample point according to Formula 2.

In this embodiment, training the SVM classifier on the second training data set to obtain the optimal classification hyperplane function of the SVM classifier includes solving:

$\begin{cases} \min \frac{1}{2} \left\| \omega \right\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \ y_i (\omega^T \varphi(x_i) + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, n \end{cases}$ Formula 3

where ω is a d-dimensional vector orthogonal to the classification hyperplane, b is the bias term, C is the penalty coefficient, ξ_i are the slack variables, and φ(x) is the mapping induced by the kernel function used by the SVM classifier.

In this embodiment, the kernel function is the radial basis kernel function, of the form:

$K(x, x_i) = \exp\left( -\frac{\left\| x - x_i \right\|^2}{\sigma^2} \right)$ Formula 4

Fig. 2 is a flowchart of a software defect prediction method provided by another embodiment of the invention. Referring to Fig. 2, this embodiment can be divided into three parts. The first part, dimensionality reduction of the training data set, comprises steps S200 and S210; the second part comprises step S220; the third part comprises step S230.

Step S200: Obtain the first training data set used for software defect prediction;

Step S210: Reduce the dimensionality of the first training data set using the LLE algorithm. The data set used in this embodiment is the NASA MDP software defect data set, which is widely used in software defect prediction research and can be downloaded from the Internet. It contains 13 sub-data sets, each recording the metric attributes and a flag for every module of an actual NASA software project, where the flag indicates whether the module has a defect. Once the first training data set is obtained, its dimensionality is reduced. Specifically, the dimensionality reduction can be divided into the following steps:

1) Let the first training data set be {X_1, X_2, ..., X_N}, X_i ∈ R^D, where R denotes the space and D the dimension.

2) Determine the distance between each sample point and every other sample point using d_ij = ||X_i - X_j||; then select, for each sample point, the K points at the shortest distance as its neighbor points;
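The neighbor selection of step 2) can be sketched with NumPy as below. This is a brute-force illustration on hypothetical toy points; the function name and data are not from the patent, and the full N × N distance matrix is only practical for modest N.

```python
import numpy as np

def k_nearest_neighbors(X, K):
    """Return an (N, K) array: row i holds the indices of the K points closest to X[i]."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # d_ij = ||X_i - X_j||
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :K]

# Four toy points on a line, chosen so that no two distances tie.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
neighbors = k_nearest_neighbors(X, K=2)
```

Each row of `neighbors` lists a point's neighbors from nearest to farthest, which is the input the weight computation of the next step needs.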

3) From the neighbor points of each sample point X_i, compute the local reconstruction weight matrix W that minimizes the reconstruction error of the sample points, that is, solve the optimization problem:

$\begin{cases} \min \epsilon(W) = \sum_{i=1}^{N} \left\| X_i - \sum_{j=1}^{K} w_{ij} X_{ij} \right\|^2 \\ \text{s.t.} \ \sum_{j=1}^{K} w_{ij} = 1 \end{cases}$ Formula 1

where N is the number of sample points and w_ij is the coefficient with which the i-th sample point is represented by its j-th neighbor point; w_ij is also a weight, representing the contribution of the j-th neighbor point to the i-th sample point. Concretely, reducing the dimensionality of a data set with the LLE algorithm means representing every sample point in the data set by its K nearest neighbor points. Each sample point then has K coefficients, one per neighbor point; each coefficient is a single number, and the K coefficients of a sample point form a coefficient vector. The coefficient vectors of all sample points in the data set form the weight matrix W.
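A minimal NumPy sketch of the weight computation in step 3) follows. It solves the problem one sample point at a time through the local Gram matrix, which is the usual LLE approach rather than something the patent spells out. The regularization term `reg` is an assumption added here so that the linear system stays solvable when the neighbors are linearly dependent; the toy points and neighbor lists are hypothetical.

```python
import numpy as np

def reconstruction_weights(X, neighbors, reg=1e-3):
    """Solve Formula 1 point by point; returns the N x N weight matrix W."""
    N, K = neighbors.shape
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]           # neighbors centered on X_i, shape (K, D)
        G = Z @ Z.T                          # local Gram matrix, shape (K, K)
        # Assumed regularization: keeps G invertible for degenerate neighborhoods.
        G = G + reg * max(np.trace(G), 1e-12) * np.eye(K)
        w = np.linalg.solve(G, np.ones(K))
        W[i, neighbors[i]] = w / w.sum()     # enforce that the K weights sum to 1
    return W

X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
neighbors = np.array([[2, 3], [2, 3], [0, 1], [2, 0]])  # hypothetical K = 2 lists
W = reconstruction_weights(X, neighbors)
```

Point 2 lies exactly midway between points 0 and 1, so its two weights come out as 0.5 each; entries of W for non-neighbors stay 0.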

4) Then fix the local reconstruction weight matrix W obtained in the previous step and solve for the low-dimensional vector Y_i corresponding to each sample point X_i by minimizing the objective function:

$\begin{cases} \min \epsilon(Y) = \sum_{i=1}^{N} \left\| Y_i - \sum_{j=1}^{K} w_{ij} Y_{ij} \right\|^2 = \min \operatorname{tr}(Y M Y^T) \\ \text{s.t.} \ Y Y^T = I \end{cases}$ Formula 2

where I is an identity matrix and M = (I - W)^T (I - W). The output is given by the eigenvectors of M corresponding to its 2nd through (d+1)-th smallest eigenvalues. Here d is the dimensionality of the sample points after reduction: the d eigenvectors, each with N components, together give the d coordinates of every low-dimensional vector Y_i.
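The eigenvector step of 4) can be sketched as follows, assuming the eigenvectors are ordered by ascending eigenvalue and the first one (the constant vector, with eigenvalue 0) is discarded. The ring-shaped toy weight matrix is a hypothetical input chosen because its embedding is easy to check; the returned array holds one Y_i per row, that is, the transpose of the Y that appears in the constraint Y Y^T = I.

```python
import numpy as np

def embed(W, d):
    """Minimize Formula 2 for fixed W; returns an (N, d) array whose row i is Y_i."""
    N = W.shape[0]
    I = np.eye(N)
    M = (I - W).T @ (I - W)
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return eigvecs[:, 1:d + 1]            # skip the first (constant) eigenvector

# Toy weights: N points on a ring, each rebuilt as the mean of its two ring neighbors.
N = 12
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = 0.5
    W[i, (i + 1) % N] = 0.5
Y = embed(W, d=2)
```

For this ring the two retained eigenvectors span the first cosine and sine modes, so the 12 embedded points lie on a circle in the plane, which reflects the neighborhood structure encoded in W.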

After these four steps, the low-dimensional vector to which each sample point in the first training data set is mapped has been obtained, and the second training data set composed of these low-dimensional vectors is then used to train the SVM classifier.

The second part uses the reduced data set to train the SVM classifier:

Step S220: Input the reduced data set into the SVM classifier, optimize the parameters by combining the grid search method with the ten-fold cross-validation method, and train the SVM classifier.

Training the SVM classifier on the second training data set to obtain the optimal classification hyperplane function of the SVM classifier comprises the following process:

The SVM classifier is trained on the reduced second training data set; training amounts to solving for the optimal classification hyperplane of the SVM.

Specifically, training the SVM can be converted into a convex quadratic programming problem:

$\begin{cases} \min \frac{1}{2} \left\| \omega \right\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \ y_i (\omega^T \varphi(x_i) + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, n \end{cases}$ Formula 3

where ω is a d-dimensional vector orthogonal to the classification hyperplane, b is the bias term, C is the penalty coefficient, ξ_i are the slack variables, and φ(x) is the mapping induced by the chosen kernel function. The penalty coefficient C determines how much weight is given to the loss caused by outliers: for a fixed sum of the slack variables of all outliers, the larger C is, the greater the loss added to the objective function, which signals a stronger reluctance to give up those outliers. In the extreme case C is set to infinity, so that a single outlying point makes the objective value infinite and the problem has no solution; the problem then degenerates into the hard-margin problem. The value of the slack variable ξ_i indicates how far the corresponding point lies from its group: the larger the value, the farther the point. The kernel function maps data from the low-dimensional space into a high-dimensional space, so that a linearly inseparable problem becomes linearly separable.
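The effect of the penalty coefficient C can be seen by evaluating the Formula 3 objective for a fixed hyperplane. The sketch below assumes a linear mapping φ(x) = x and uses hypothetical toy data; for a given ω and b it takes each slack variable at its smallest feasible value, ξ_i = max(0, 1 - y_i(ω^T x_i + b)).

```python
import numpy as np

def soft_margin_objective(omega, b, X, y, C):
    """Value of the Formula 3 objective, using the smallest feasible slack per point."""
    margins = y * (X @ omega + b)           # y_i (omega^T x_i + b), with phi(x) = x
    slack = np.maximum(0.0, 1.0 - margins)  # smallest xi_i satisfying both constraints
    return 0.5 * omega @ omega + C * slack.sum(), slack

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, -1.0])             # the third point sits on the wrong side
omega, b = np.array([1.0, 0.0]), 0.0

low_C, slack = soft_margin_objective(omega, b, X, y, C=0.1)
high_C, _ = soft_margin_objective(omega, b, X, y, C=100.0)
```

Only the misplaced third point has nonzero slack (1.5). With C = 0.1 it adds 0.15 to the objective, while with C = 100 it adds 150, so a large C pushes the optimizer to move the hyperplane rather than tolerate the outlier.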

Because the radial basis kernel function has a wide convergence range, this embodiment uses it as the kernel function of the SVM classifier. The kernel function has the form:

$K(x, x_i) = \exp\left( -\frac{\left\| x - x_i \right\|^2}{\sigma^2} \right)$ Formula 4

Introducing Lagrange multipliers and simplifying the above quadratic programming problem by the standard Lagrangian duality principle yields a sign discriminant function:

$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \lambda_i y_i K(x_i, x) + b \right)$ Formula 5
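Formula 5 can be evaluated directly once the multipliers are known. The sketch below uses the radial basis kernel with width σ; the two support vectors, their multipliers λ_i, and the bias are hypothetical values standing in for a trained model.

```python
import numpy as np

def decision(x, support_vectors, lambdas, labels, b, sigma):
    """Formula 5: sign(sum_i lambda_i * y_i * K(x_i, x) + b) with the RBF kernel."""
    sq_dist = ((support_vectors - x) ** 2).sum(axis=1)
    k = np.exp(-sq_dist / sigma ** 2)
    return int(np.sign((lambdas * labels * k).sum() + b))

# Hypothetical trained model: two support vectors, one per class.
support_vectors = np.array([[0.0, 0.0], [2.0, 0.0]])
lambdas = np.array([1.0, 1.0])
labels = np.array([1.0, -1.0])

near_positive = decision(np.array([0.1, 0.0]), support_vectors, lambdas, labels,
                         b=0.0, sigma=1.0)
near_negative = decision(np.array([1.9, 0.0]), support_vectors, lambdas, labels,
                         b=0.0, sigma=1.0)
```

Only points with λ_i > 0 (the support vectors) contribute to the sum, so prediction cost depends on the number of support vectors rather than on the size of the training set.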

To determine the penalty coefficient C of the SVM and the parameter σ of the radial basis kernel function, this embodiment uses the grid search method together with the ten-fold cross-validation method to optimize C and σ, finding the pair of values giving the highest SVM classification accuracy and thereby determining the optimal classification hyperplane function of the SVM classifier.

Specifically, in this embodiment the grid search method is used to determine the optimal values of C and σ: the two parameters are gridded over preset ranges and every grid point is visited. The range of C is set to [2^-10, 2^7] and the range of σ to [2^-10, 2^3], with a step of 0.1 for both parameters; every combination of the values in the range of C with the values in the range of σ is formed and searched.
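The size of this grid can be made concrete with a short sketch. The text does not say whether the step of 0.1 applies to the parameter value or to the exponent; the code below assumes the exponent, which is a common convention for SVM grids but is an assumption here.

```python
from itertools import product

# Assumption: the 0.1 step applies to the exponent, so C = 2**e for e = -10.0, -9.9, ..., 7.0.
c_exponents = [i / 10 for i in range(-100, 71)]      # -10.0 ... 7.0
sigma_exponents = [i / 10 for i in range(-100, 31)]  # -10.0 ... 3.0

# Every (C, sigma) pair that the grid search would visit.
grid = [(2.0 ** ec, 2.0 ** es) for ec, es in product(c_exponents, sigma_exponents)]
n_combinations = len(grid)
```

Under this reading there are 171 values of C and 131 values of σ, or 22,401 pairs; with ten-fold cross-validation each pair requires 10 training runs.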

In this embodiment, for each selected pair of parameters C and σ, the classification accuracy under that pair is obtained by ten-fold cross-validation, and the pair with the highest classification accuracy is taken as the best parameter values. Ten-fold cross-validation is carried out as follows: the second training data set is divided into 10 subsets; one subset serves as the test set and the remaining 9 as the training set, giving one classification accuracy for the selected parameter pair. This is repeated 10 times, with a different subset as the test set each time, giving 10 classification accuracies under that pair. The mean of these 10 accuracies is the index for evaluating each parameter pair. The means for all selected pairs are then compared, and the pair C, σ with the highest mean is taken as the best parameter values.
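The ten-fold procedure can be sketched by hand as follows. Two things are assumptions of this sketch: the samples are shuffled before being split, which the text does not specify, and a nearest-centroid rule stands in for the SVM so that the example needs only NumPy. The data are synthetic.

```python
import numpy as np

def ten_fold_accuracy(X, y, fit_predict, n_folds=10, seed=0):
    """Mean accuracy over n_folds train/test splits for one parameter setting."""
    order = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(order, n_folds)
    accuracies = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        predicted = fit_predict(X[train], y[train], X[test])
        accuracies.append(np.mean(predicted == y[test]))
    return float(np.mean(accuracies))

def nearest_centroid(X_train, y_train, X_test):
    """Stand-in classifier (not an SVM): assign each test point to the closer class mean."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

X = np.vstack([np.zeros((50, 2)), np.ones((50, 2)) * 4.0])
X += np.random.default_rng(1).normal(0.0, 0.5, X.shape)
y = np.array([0] * 50 + [1] * 50)
mean_accuracy = ten_fold_accuracy(X, y, nearest_centroid)
```

In the method described here, `fit_predict` would train an SVM with one candidate pair (C, σ), and the returned mean would be compared across all pairs in the grid.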

Once the optimal values of C and σ are found, the optimal classification hyperplane function of the SVM classifier is determined, which yields the trained SVM classifier.

The third part uses the trained SVM classifier to predict defects in the software under test.

Step S230: Use the trained SVM classifier to perform software defect prediction;

Specifically, in this embodiment the data set of the software to be predicted is first reduced in dimensionality using the LLE algorithm, and the reduced data are then input into the trained SVM classifier. If the input data fall into the defect-free region determined by the optimal classification hyperplane function, the software module corresponding to the data is determined to contain no defect and is marked accordingly in the output of the SVM classifier; if the input data fall into the defective region determined by the optimal classification hyperplane function, the software module corresponding to the data is determined to contain a defect and is marked accordingly in the output of the SVM classifier.

In this embodiment, when the output of the SVM classifier is displayed, a software module that has a defect is marked with the letter Y, and a software module that has no defect is marked with the letter N.
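The Y/N marking could be produced from the classifier's outputs as in this small sketch. The module names are hypothetical, and which predicted label denotes a defective module (here +1) is an assumption, since the text does not fix the encoding.

```python
def mark_modules(module_names, predictions, defective_label=1):
    """Pair each module with 'Y' (defective) or 'N' (not defective)."""
    return {name: ("Y" if p == defective_label else "N")
            for name, p in zip(module_names, predictions)}

# Hypothetical modules and predicted labels from a trained classifier.
marks = mark_modules(["parser.c", "lexer.c", "io.c"], [1, -1, 1])
```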

由此，本发明实施例提供的软件缺陷预测方法采用局部线性嵌入算法对训练数据集进行降维处理，保证降维后数据集中样本点的几何结构不变，使得降维后的数据能更完全地反映出原始数据集的各种特征。 Therefore, the software defect prediction method provided by the embodiment of the present invention uses a local linear embedding algorithm to perform dimension reduction processing on the training data set, ensuring that the geometric structure of the sample points in the data set remains unchanged after dimension reduction, so that the data after dimension reduction can be more complete. accurately reflect the various characteristics of the original data set. the

Another embodiment of the present invention further provides a software defect prediction system. FIG. 3 is a block diagram of this system. Referring to FIG. 3, the system 300 includes a dimensionality reduction processing unit 310, an SVM training unit 320, and a defect prediction unit 330;

The dimensionality reduction processing unit 310 is configured to reduce the dimensionality of a first training data set using the locally linear embedding (LLE) algorithm, obtaining for each sample point in the first training data set a low-dimensional vector in the low-dimensional space, and thereby a second training data set composed of these low-dimensional vectors;

The SVM training unit 320 is configured to train a support vector machine (SVM) classifier on the second training data set, obtaining the optimal classification hyperplane function of the SVM classifier and thereby the trained SVM classifier;

The defect prediction unit 330 is configured to perform defect prediction on the software to be predicted using the trained SVM classifier.

In one embodiment of the present invention, reducing the dimensionality of the first training data set using the LLE algorithm, obtaining the low-dimensional vector to which each sample point in the first training data set is mapped, and obtaining the second training data set composed of these low-dimensional vectors includes:

Using the K nearest neighbors of each sample point to compute the local reconstruction weight matrix W according to Formula 1;

where N is the number of sample points and w_ij is the coefficient with which the i-th sample point X_i is represented by its j-th neighbor. The coefficients with which all sample points X_i in the first training data set are represented by their neighbors together form the local reconstruction weight matrix W;
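The weight computation of Formula 1 can be illustrated for a single sample point. The sketch below is a minimal pure-Python version under stated assumptions: it solves the local Gram system and normalizes each point's K weights to sum to one, which is the usual LLE reading of the constraint; the regularization term `reg` is a common numerical safeguard and is not part of the patent text. The helper names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def reconstruction_weights(x, neighbors, reg=1e-3):
    """Weights w_j that best reconstruct x from its K neighbors, summing to 1.

    reg is an assumed regularization factor (scaled by the Gram trace) that
    keeps the local system solvable when neighbors are nearly collinear.
    """
    k = len(neighbors)
    diffs = [[xi - ni for xi, ni in zip(x, nb)] for nb in neighbors]
    gram = [[sum(p * q for p, q in zip(diffs[a], diffs[b])) for b in range(k)]
            for a in range(k)]
    trace = sum(gram[a][a] for a in range(k))
    eps = reg * trace if trace > 0 else reg
    for a in range(k):
        gram[a][a] += eps
    w = solve_linear(gram, [1.0] * k)
    total = sum(w)
    return [wi / total for wi in w]


# A point midway between its two neighbors is reconstructed with equal weights.
weights = reconstruction_weights([0.5, 0.0], [[0.0, 0.0], [1.0, 0.0]])
print(weights)
```

Repeating this for every sample point and placing the K weights of point i in row i (zeros elsewhere) gives the N-by-N matrix W used in the embedding step.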

In one embodiment of the present invention, training the support vector machine (SVM) classifier on the second training data set to obtain the optimal classification hyperplane function of the SVM classifier includes solving for the optimal classification hyperplane function according to Formula 3.

In one embodiment of the present invention, the kernel function is a radial basis kernel function of the form K(x, x_i) = exp(−|x − x_i|² / σ²) (Formula 4), where σ is the width parameter of the radial basis kernel function.
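The radial basis kernel of Formula 4 can be written as a short function. This is a sketch for illustration only; the function name is hypothetical, and the check that σ is positive is an added safeguard rather than something stated in the text.

```python
import math


def rbf_kernel(x, x_i, sigma):
    """Radial basis kernel K(x, x_i) = exp(-|x - x_i|^2 / sigma^2)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")  # added safeguard
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-squared_distance / sigma ** 2)


# Identical points give 1.0; the value decays as the points move apart.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], sigma=0.5))
print(rbf_kernel([0.0, 0.0], [1.0, 0.0], sigma=1.0))
```

Libraries that parameterize this kernel as exp(−γ·|x − x_i|²) correspond to γ = 1/σ² under the form used here.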

In one embodiment of the present invention, the SVM training unit is further configured to optimize the parameter C of the SVM classifier and the kernel function parameter σ using a grid search combined with ten-fold cross-validation, finding the pair of values of C and σ that gives the highest SVM classification accuracy, so as to determine the optimal classification hyperplane function of the SVM classifier.

In one embodiment of the present invention, optimizing the parameter C of the SVM classifier and the kernel function parameter σ using a grid search combined with ten-fold cross-validation includes:

Choosing values for the parameters C and σ by grid search, where the value range of C is set to [2^-10, 2^7], the value range of σ is set to [2^-10, 2^3], and the step size for both parameters is 0.1; all combinations of the values in the range of C with the values in the range of σ are generated and searched.
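The candidate grid can be sketched as below. The text does not say whether the step of 0.1 applies to the parameter value or to the exponent of 2; this sketch assumes the exponent, which is the common convention for SVM grid searches. The function and variable names are hypothetical.

```python
from itertools import product


def exponent_grid(low, high, step=0.1):
    """Values 2**e for e = low, low + step, ..., high (step on the exponent).

    Applying the step to the exponent is an assumption; the source only
    states the ranges and a step size of 0.1.
    """
    count = round((high - low) / step) + 1
    return [2.0 ** (low + k * step) for k in range(count)]


c_values = exponent_grid(-10, 7)       # C in [2^-10, 2^7]
sigma_values = exponent_grid(-10, 3)   # sigma in [2^-10, 2^3]
grid = list(product(c_values, sigma_values))
print(len(c_values), len(sigma_values), len(grid))
```

Under this assumption the grid holds 171 × 131 = 22,401 (C, σ) pairs, and each pair is evaluated with ten training runs during ten-fold cross-validation.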

In one embodiment of the present invention, optimizing the parameter C of the SVM classifier and the kernel function parameter σ using a grid search combined with ten-fold cross-validation further includes:

For each selected pair of parameters C and σ, obtaining the classification accuracy under that pair of values by ten-fold cross-validation, and taking the pair with the highest classification accuracy as the best parameter values. Ten-fold cross-validation here means that the second data set is divided into 10 subsets; one subset serves as the test set and the remaining 9 subsets serve as the training set, giving one classification accuracy for the selected pair of parameters. This is repeated 10 times, giving 10 classification accuracies for that pair, and their mean is used as the measure of how good the pair is. The mean accuracies of all selected pairs are then compared, and the pair C, σ with the highest mean is taken as the best parameter values.
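The ten-fold procedure can be sketched independently of any SVM library. In the sketch below, `train_and_score` is a hypothetical callback that would train an SVM with the given parameters on the training part and return its accuracy on the test part. The interleaved way of forming the ten subsets is also an assumption, since the text does not say how the data set is divided.

```python
def ten_fold_indices(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) pairs; each subset is the test set once."""
    folds = [list(range(f, n_samples, n_folds)) for f in range(n_folds)]
    for f in range(n_folds):
        test_idx = folds[f]
        train_idx = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train_idx, test_idx


def mean_cv_accuracy(samples, labels, params, train_and_score, n_folds=10):
    """Average the per-fold accuracies obtained for one parameter setting."""
    accuracies = []
    for train_idx, test_idx in ten_fold_indices(len(samples), n_folds):
        accuracy = train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx],
            params,
        )
        accuracies.append(accuracy)
    return sum(accuracies) / len(accuracies)


def select_best(samples, labels, candidates, train_and_score):
    """Return the candidate parameters with the highest mean accuracy."""
    return max(
        candidates,
        key=lambda p: mean_cv_accuracy(samples, labels, p, train_and_score),
    )
```

Passing the list of (C, σ) pairs as `candidates` reproduces the selection rule described above: the pair with the highest mean accuracy over the ten folds is kept.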

In one embodiment of the present invention, performing software defect prediction using the trained SVM classifier includes:

Reducing the dimensionality of the data set of the software to be predicted using the LLE algorithm;

Inputting the reduced data set into the trained SVM classifier and making a judgment: if the input data falls into the defect-free region determined by the optimal classification hyperplane function, the software module corresponding to that data is determined to contain no defect and is marked accordingly in the output of the SVM classifier; if the input data falls into the defective region determined by the optimal classification hyperplane function, the software module corresponding to that data is determined to contain a defect and is marked accordingly in the output of the SVM classifier.

It should be emphasized that the process by which the software defect prediction system of this embodiment predicts defects can be summarized as building a prediction model based on the LLE algorithm and the SVM classifier. Building the model mainly involves two modules: the first is dimensionality reduction and the second is defect prediction. In the dimensionality reduction module, the training set used for the SVM classifier is reduced in dimensionality. Likewise, in practical use, the test data set of the software under test is reduced with LLE, and the prediction is then made from the reduced data set and the SVM optimal classification hyperplane function obtained. This ensures that the reduced data set reflects the characteristics of the original data more fully, which improves the accuracy of software defect prediction.

The software defect prediction system provided by this embodiment of the present invention corresponds to the software defect prediction method described above. For details of its use, refer to the relevant content of the foregoing method embodiment, which is not repeated here.

In summary, the software defect prediction method and software defect prediction system provided by the embodiments of the present invention use the locally linear embedding algorithm to reduce the dimensionality of the training data set, so that the reduced data reflects the features of the original data set more completely, and use the optimal classification hyperplane function of the SVM to predict software defects, thereby improving the accuracy of software defect prediction.

The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within its scope of protection.

Claims

1. A software defect prediction method, characterized in that it comprises the following steps:

Step 1, reducing the dimensionality of a first training data set using the locally linear embedding (LLE) algorithm, obtaining for each sample point in the first training data set the low-dimensional vector to which it is mapped in the low-dimensional space, and obtaining a second training data set composed of these low-dimensional vectors;

Step 2, training a support vector machine (SVM) classifier on the second training data set, obtaining the optimal classification hyperplane function of the SVM classifier, and thereby obtaining the trained SVM classifier;

Step 3, performing defect prediction on the software to be predicted using the trained SVM classifier.

2. The software defect prediction method as claimed in claim 1, characterized in that, in step 1, the second training data set composed of the low-dimensional vectors is obtained as follows:

1.1 Let the first training data set be {X_1, X_2, ..., X_N}, X_i ∈ R^D, where X_i is a vector in D-dimensional space;

1.2 Compute the K nearest neighbors of each sample point X_i in the first training data set;

1.3 Use the K nearest neighbors of each sample point to compute the local reconstruction weight matrix W according to Formula 1;

\begin{cases} \min \epsilon(w) = \sum_{i=1}^{N} \left\| x_i - \sum_{j=1}^{K} w_{ij} x_{ij} \right\|^2 \\ \text{s.t.} \ \sum_{i=1}^{N} w_{ij} = 1 \end{cases}

Formula 1

where N is the number of sample points and w_ij is the coefficient with which the i-th sample point X_i is represented by its j-th neighbor; the coefficients with which all sample points X_i in the first training data set are represented by their neighbors form the local reconstruction weight matrix W;

1.4 Compute the low-dimensional vector corresponding to each sample point from the local reconstruction weight matrix W and the neighbors of the sample point according to Formula 2;

\begin{cases} \min \epsilon(Y) = \sum_{i=1}^{N} \left\| Y_i - \sum_{j=1}^{K} w_{ij} Y_{ij} \right\|^2 = \min \left( Y M Y^T \right) \\ \text{s.t.} \ Y Y^T = I \end{cases}

Formula 2

where I is the identity matrix and M = (I − W)^T (I − W).

3. The software defect prediction method as claimed in claim 1 or 2, characterized in that the trained SVM classifier in step 2 is obtained as follows:

Solve for the optimal classification hyperplane function of the SVM classifier according to Formula 3

\begin{cases} \min \ \frac{1}{2} \| \omega \|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \ y_i \left( \omega^T \varphi(x_i) + b \right) \ge 1 - \xi_i, \quad i = 1, \ldots, n \end{cases}, \quad \xi_i \ge 0

Formula 3

where ω is a d-dimensional vector orthogonal to the classification hyperplane, b is the bias term, C is the penalty coefficient, ξ_i is the slack variable, and φ(x) is the kernel function used by the SVM classifier.

4. The software defect prediction method as claimed in claim 3, characterized in that the kernel function is a radial basis kernel function of the form:

K(x, x_i) = \exp \left\{ - \frac{| x - x_i |^2}{\sigma^2} \right\}

Formula 4

where σ is the width parameter of the radial basis kernel function.

5. The software defect prediction method as claimed in claim 1, 2 or 4, characterized in that, in obtaining the optimal classification hyperplane function of the SVM classifier, a grid search and ten-fold cross-validation are used to optimize the parameter C of the SVM classifier and the kernel function parameter σ, finding the pair of values of C and σ that gives the highest SVM classification accuracy, so as to determine the optimal classification hyperplane function of the SVM classifier.

6. The software defect prediction method as claimed in claim 5, characterized in that optimizing the parameter C of the SVM classifier and the kernel function parameter σ using the grid search and ten-fold cross-validation comprises: choosing values for the parameters C and σ by grid search; and generating and searching all combinations of the values in the range of C with the values in the range of σ.

7. The software defect prediction method as claimed in claim 5, characterized in that optimizing the parameter C of the SVM classifier and the kernel function parameter σ using the grid search and ten-fold cross-validation comprises: for each selected pair of parameters C and σ, obtaining the classification accuracy under that pair of values by ten-fold cross-validation, and taking the pair with the highest classification accuracy as the best parameter values; wherein ten-fold cross-validation means that the second data set is divided into 10 subsets, one subset serving as the test set and the remaining 9 subsets as the training set, giving one classification accuracy for the selected pair of parameters; this is repeated 10 times, giving 10 classification accuracies for that pair, whose mean is used as the measure of how good the pair is; the mean accuracies of all selected pairs are then compared, and the pair C, σ with the highest mean is taken as the best parameter values.

8. The software defect prediction method as claimed in claim 1, 2, 4, 6 or 7, characterized in that software defect prediction according to the optimal classification hyperplane function is performed as follows:

First, reduce the dimensionality of the data set of the software to be predicted using the LLE algorithm;

Second, input the reduced data set into the trained SVM classifier and make a judgment: if the input data falls into the defect-free region determined by the optimal classification hyperplane function, determine that the software module corresponding to that data contains no defect and mark it accordingly in the output of the SVM classifier; if the input data falls into the defective region determined by the optimal classification hyperplane function, determine that the software module corresponding to that data contains a defect and mark it accordingly in the output of the SVM classifier.

9. A software defect prediction system, comprising: a dimensionality reduction processing unit, an SVM training unit and a defect prediction unit;

The dimensionality reduction processing unit is configured to reduce the dimensionality of a first training data set using the locally linear embedding (LLE) algorithm, obtaining for each sample point in the first training data set the low-dimensional vector to which it is mapped in the low-dimensional space, and obtaining a second training data set composed of these low-dimensional vectors;

The SVM training unit is configured to train a support vector machine (SVM) classifier on the second training data set, obtaining the optimal classification hyperplane function of the SVM classifier and thereby the trained SVM classifier;

The defect prediction unit is configured to perform defect prediction on the software to be predicted using the trained SVM classifier.

10. The software defect prediction system as claimed in claim 9, characterized in that the second training data set composed of the low-dimensional vectors is obtained as follows:

\begin{cases} \min \epsilon(w) = \sum_{i=1}^{N} \left\| x_i - \sum_{j=1}^{K} w_{ij} x_{ij} \right\|^2 \\ \text{s.t.} \ \sum_{i=1}^{N} w_{ij} = 1 \end{cases}

Formula 1

\begin{cases} \min \epsilon(Y) = \sum_{i=1}^{N} \left\| Y_i - \sum_{j=1}^{K} w_{ij} Y_{ij} \right\|^2 = \min \left( Y M Y^T \right) \\ \text{s.t.} \ Y Y^T = I \end{cases}

Formula 2

where I is the identity matrix and M = (I − W)^T (I − W).

11. The software defect prediction system as claimed in claim 9 or 10, characterized in that the trained SVM classifier is obtained as follows:

\begin{cases} \min \ \frac{1}{2} \| \omega \|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \ y_i \left( \omega^T \varphi(x_i) + b \right) \ge 1 - \xi_i, \quad i = 1, \ldots, n \end{cases}, \quad \xi_i \ge 0

Formula 3

12. The software defect prediction system as claimed in claim 11, characterized in that the kernel function is a radial basis kernel function of the form:

K(x, x_i) = \exp \left\{ - \frac{| x - x_i |^2}{\sigma^2} \right\}

Formula 4

where σ is the width parameter of the radial basis kernel function.

13. The software defect prediction system as claimed in claim 9, 10 or 12, characterized in that, in obtaining the optimal classification hyperplane function of the SVM classifier, a grid search and ten-fold cross-validation are used to optimize the parameter C of the SVM classifier and the kernel function parameter σ, finding the pair of values of C and σ that gives the highest SVM classification accuracy, so as to determine the optimal classification hyperplane function of the SVM classifier.

14. The software defect prediction system as claimed in claim 13, characterized in that optimizing the parameter C of the SVM classifier and the kernel function parameter σ using the grid search and ten-fold cross-validation comprises: choosing values for the parameters C and σ by grid search; and generating and searching all combinations of the values in the range of C with the values in the range of σ.

15. The software defect prediction system as claimed in claim 13, characterized in that optimizing the parameter C of the SVM classifier and the kernel function parameter σ using the grid search and ten-fold cross-validation comprises: for each selected pair of parameters C and σ, obtaining the classification accuracy under that pair of values by ten-fold cross-validation, and taking the pair with the highest classification accuracy as the best parameter values; wherein ten-fold cross-validation means that the second data set is divided into 10 subsets, one subset serving as the test set and the remaining 9 subsets as the training set, giving one classification accuracy for the selected pair of parameters; this is repeated 10 times, giving 10 classification accuracies for that pair, whose mean is used as the measure of how good the pair is; the mean accuracies of all selected pairs are then compared, and the pair C, σ with the highest mean is taken as the best parameter values.

16. The software defect prediction system as claimed in claim 9, 10, 12, 14 or 15, characterized in that software defect prediction according to the optimal classification hyperplane function is performed as follows: