CN115994310A

CN115994310A - Software defect prediction method based on clustering integration

Info

Publication number: CN115994310A
Application number: CN202211224284.3A
Authority: CN
Inventors: 李志强; 谢娟英; 祁超
Original assignee: Shaanxi Normal University
Current assignee: Shaanxi Normal University
Priority date: 2022-10-08
Filing date: 2022-10-08
Publication date: 2023-04-21

Abstract

The invention provides a software defect prediction method based on clustering integration, belonging to the technical field of software defect prediction. The method comprises: randomly sampling, with replacement, N software entities x from given project data, and randomly extracting m metric elements from the N sampled software entities to form a data set X_*, where N is the total number of software entities in the project data; constructing an unsupervised software defect prediction model on the data set X_* using a clustering algorithm; labeling the prediction results as defective or non-defective to obtain a predicted label vector p for the sampled software entities; resampling the data set X_* and generating the predicted label vector p multiple times, and, for each sampled software entity x^i, calculating its integrated prediction result P; if P is greater than 0.5, the sampled software entity x^i is defective, and if P is less than or equal to 0.5, the sampled software entity x^i is non-defective. The method can be used for predicting software defects.

Description

Software defect prediction method based on clustering integration

Technical Field

The invention belongs to the technical field of software defect prediction, and particularly relates to a clustering integration-based software defect prediction method.

Background

Software defect prediction is one of the key technologies in software quality assurance. Its aim is to discover defects in software economically and efficiently, and thereby ensure the quality and reliability of software products. Defects inevitably arise during software development, and discovering and repairing them as early as possible is extremely important for the later development and maintenance of the software. Software defect prediction therefore has broad application prospects in identifying defects in software products.

Unsupervised software defect prediction does not require the data of the software project under test to be labeled in advance; instead, it identifies software defects automatically by learning and mining the latent structure of the data. It can be used when the target project is newly developed and has no, or very little, historical defect data. It is therefore of significant research value for software defect prediction problems in which historical defect data is absent or sparse.

In recent years, researchers have proposed several clustering-based unsupervised software defect prediction methods. However, most of these methods rely on the prediction result of a single clustering, which tends to have low accuracy and poor robustness, making accurate prediction of software defects difficult.

In summary, existing clustering-based unsupervised software defect prediction methods rely on the prediction result of a single clustering, have low accuracy and poor robustness, and struggle to predict software defects accurately.

Disclosure of Invention

In order to overcome the defects in the prior art, the invention provides a software defect prediction method based on clustering integration.

In order to achieve the above object, the present invention provides the following technical solutions:

A software defect prediction method based on clustering integration comprises the following steps:

randomly sampling, with replacement, N software entities x from given project data, and randomly extracting m metric elements from the N sampled software entities to form a data set X_*, where N is the total number of software entities in the project data;

based on the data set X_*, constructing an unsupervised software defect prediction model using a clustering algorithm;

performing defect prediction on the software entities with the unsupervised software defect prediction model, and labeling the prediction results as defective or non-defective to obtain a predicted label vector p of the software entities;

removing repeatedly sampled (duplicate) software entities from the predicted label vector p;

resampling the data set X_* and generating the predicted label vector p multiple times, and, for each sampled software entity x^i, calculating the average value P(x^i) of its predicted labels p(x^i) and taking P(x^i) as its integrated prediction result;

if P(x^i) is greater than 0.5, the sampled software entity x^i is defective; if P(x^i) is less than or equal to 0.5, the sampled software entity x^i is non-defective.

Further, the project data is the data of a software project under test, and a software entity is an instance (module) extracted from program code or the development process, which may be a method, a class, a file, a package or a code change.

Further, the method also comprises: normalizing the given project data using the z-score method; the normalization is calculated as follows:

$$x_i^{*} = \frac{x_i - \mu_x}{\sigma_x}$$

where x_i is the original value of the i-th metric element of the software entity x, x_i^* is the normalized value of x_i, μ_x is the mean of the software entity x, and σ_x is the standard deviation of the software entity x.

Further, the clustering algorithm is spectral clustering, which proceeds as follows:

constructing an adjacency matrix W (formula not available in this text);

calculating the Laplacian matrix L:

$$L = D - W$$

where D is the degree matrix, a diagonal matrix whose diagonal elements are $d_{ii} = \sum_j w_{ij}$;

normalizing the Laplacian matrix L to obtain L_sym:

$$L_{sym} = D^{-1/2} L D^{-1/2}$$

performing eigenvalue decomposition on L_sym, selecting the eigenvector v corresponding to the second-smallest eigenvalue, and normalizing v to obtain the unsupervised software defect prediction model.

Further, labeling the prediction results as defective or non-defective comprises:

dividing v into two clusters, labeling the software entities with v > 0 as the defective class and the software entities with v ≤ 0 as the non-defective class; if the total metric element value of the software entities with v > 0 is smaller than that of the software entities with v ≤ 0, the software entities with v > 0 are instead labeled as the non-defective class and the remaining entities as the defective class.

Further, the integrated prediction result P(x^i) is calculated as follows:

$$P(x^i) = \frac{1}{T_i} \sum_{j=1}^{T_i} p_j(x^i)$$

where x^i is the sampled software entity under test, T_i is the number of times the software entity x^i was actually sampled, and p_j(x^i) is the label predicted for the sampled software entity x^i the j-th time.

Further, randomly sampling N software entities x with replacement comprises: drawing one entity at a time, with replacement, N times from the given project data X = {x^1, x^2, …, x^N} ∈ R^{M×N}, where M is the number of software metric elements of the project data X and N is the number of software entities in X.

The software defect prediction method based on clustering integration has the following beneficial effects:

The invention combines ensemble learning with unsupervised clustering to design an unsupervised software defect prediction method based on clustering integration. The method identifies software defects automatically without requiring the data to be labeled in advance, so it can be used in software projects that have no, or insufficient, historical defect data. By integrating multiple base clustering results, the method identifies defects more effectively and more robustly, which helps ensure the quality and reliability of software products. It thereby addresses the problem that prior-art clustering-based unsupervised software defect prediction methods rely on the prediction result of a single clustering, have low accuracy and poor robustness, and struggle to predict software defects accurately.

Drawings

In order to more clearly illustrate the embodiments of the present invention and the design thereof, the drawings required for the embodiments will be briefly described below. The drawings in the following description are only some of the embodiments of the present invention and other drawings may be made by those skilled in the art without the exercise of inventive faculty.

FIG. 1 is a schematic structural diagram of a software defect prediction method based on clustering integration according to an embodiment of the present invention;

FIG. 2 is a graph comparing the experimental results of the method of the present invention with those of 5 other methods on 5 projects.

Detailed Description

The present invention will be described in detail below with reference to the drawings and the embodiments, so that those skilled in the art can better understand the technical scheme of the present invention and can implement the same. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.

Examples:

the invention provides a software defect prediction method based on clustering integration, which is shown in figure 1 and comprises the following steps:

S1, data preprocessing: normalizing the given project data to eliminate the influence of differing scales (dimensions) among the metric elements of a software entity. The project data is the data of a software project under test, and a software entity is an instance (module) extracted from program code or the development process, which can be a method, a class, a file, a package or a code change. Specifically, the normalization uses the z-score method, calculated as follows:

$$x_i^{*} = \frac{x_i - \mu_x}{\sigma_x}$$

where x_i is the original value of the i-th metric element of the software entity x, x_i^* is the normalized value of x_i, μ_x is the mean of the software entity x, and σ_x is the standard deviation of the software entity x.
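The z-score step of S1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `zscore` and `normalize_metrics` are hypothetical, and two choices are assumptions not fixed by the text, namely that each metric element (column) is normalized across all entities and that the population standard deviation is used.

```python
from statistics import mean, pstdev


def zscore(values):
    """Return a z-score normalized copy of a list of numbers."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A constant vector carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def normalize_metrics(X):
    """Normalize every metric column of X.

    Assumption: X is a list of entities (rows), each a list of M metric
    values, and each metric is normalized across the entities.
    """
    columns = [zscore(list(col)) for col in zip(*X)]
    return [list(row) for row in zip(*columns)]


# Three entities, two metrics on very different scales.
X = [[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]]
X_norm = normalize_metrics(X)
```

After normalization both metric columns have mean 0 and unit spread, so neither dominates a later similarity computation merely because of its scale.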

S2, randomly sampling, with replacement, N software entities x from the given project data, and randomly extracting m metric elements from the N software entities to form a data set X_*;

Specifically, randomly sampling N software entities x with replacement comprises: drawing one entity at a time, with replacement, N times from the given project data X = {x^1, x^2, …, x^N} ∈ R^{M×N}, where M is the number of software metric elements of the project data X and N is the number of software entities in X.

Specifically, randomly sampling m software metric elements comprises: extracting m (m ≤ M) metric elements from the sampled software entities to generate the new data set X_*.
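One sampling round of S2 might look like the following Python sketch. The helper name `sample_round` is hypothetical. The default m = log₂(M) follows the experimental section; rounding it up and drawing the m metric elements without replacement are assumptions of this sketch.

```python
import math
import random


def sample_round(X, rng, m=None):
    """Draw one resampled data set X_* from X (rows = entities, columns = metrics).

    Returns (entity_idx, metric_idx, X_star).
    """
    n_entities = len(X)
    n_metrics = len(X[0])
    if m is None:
        # Assumption: log2(M) is rounded up and kept at least 1.
        m = max(1, math.ceil(math.log2(n_metrics)))
    # N draws with replacement: the same entity may appear several times.
    entity_idx = [rng.randrange(n_entities) for _ in range(n_entities)]
    # m distinct metric elements (assumption: drawn without replacement).
    metric_idx = sorted(rng.sample(range(n_metrics), m))
    X_star = [[X[i][j] for j in metric_idx] for i in entity_idx]
    return entity_idx, metric_idx, X_star


rng = random.Random(0)
X = [[float(i + j) for j in range(8)] for i in range(5)]  # 5 entities, 8 metrics
entity_idx, metric_idx, X_star = sample_round(X, rng)
```

Returning `entity_idx` alongside `X_star` keeps track of which original entity each row came from, which is needed later to de-duplicate labels and to skip entities that were not drawn.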

S3, based on the data X_*, obtaining a clustering result using a clustering algorithm, wherein the clustering algorithm is spectral clustering, which proceeds as follows:

constructing an adjacency matrix W (formula not available in this text);

calculating the Laplacian matrix L:

$$L = D - W$$

where D is the degree matrix, a diagonal matrix whose diagonal elements are $d_{ii} = \sum_j w_{ij}$;

normalizing the Laplacian matrix L to obtain L_sym:

$$L_{sym} = D^{-1/2} L D^{-1/2}$$

performing eigenvalue decomposition on L_sym, selecting the eigenvector v corresponding to the second-smallest eigenvalue, and normalizing v to obtain the clustering result;
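The spectral clustering step of S3 can be sketched with NumPy as below. Note one significant assumption: the text does not give the formula for the adjacency matrix W, so this sketch substitutes a dot-product similarity with negative values clipped to zero and no self-loops, purely for illustration. The function name `spectral_vector` is hypothetical, and normalizing v to unit length is one plausible reading of "normalizing the eigenvector".

```python
import numpy as np


def spectral_vector(X_star):
    """Return the unit-length eigenvector of L_sym for the second-smallest eigenvalue.

    X_star: array-like of shape (n_entities, m_metrics), n_entities >= 2.
    """
    X_star = np.asarray(X_star, dtype=float)
    # ASSUMPTION (not given in the source): dot-product similarity,
    # negative similarities clipped to zero, zero diagonal.
    W = X_star @ X_star.T
    W[W < 0] = 0.0
    np.fill_diagonal(W, 0.0)

    d = W.sum(axis=1)        # degrees: d_ii = sum_j w_ij
    L = np.diag(d) - W       # L = D - W
    # D^(-1/2), guarding isolated entities (degree 0) against division by zero.
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]

    # eigh returns the eigenvalues of a symmetric matrix in ascending order.
    _, eigvecs = np.linalg.eigh(L_sym)
    v = eigvecs[:, 1]        # eigenvector of the second-smallest eigenvalue
    return v / np.linalg.norm(v)


# Two loose groups of entities: high on metric 0 versus high on metric 1.
X_star = [[3.0, 0.3], [3.1, 0.2], [2.9, 0.4],
          [0.3, 3.0], [0.2, 3.1], [0.4, 2.9]]
v = spectral_vector(X_star)
groups = v > 0
```

The overall sign of an eigenvector is arbitrary, so only the split into two groups is meaningful here; deciding which group is the defective one is the job of the labeling step.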

S4, labeling the clustering result to obtain a predicted label vector p; the labeling process is as follows:

Prior studies have shown that the metric values of defective software entities are generally higher than those of non-defective software entities. On this basis, the predicted label vector p of the data X_* under test is calculated as follows:

$$p = (v > 0)$$

The above formula divides v into two clusters, labeling the software entities with v > 0 as the defective class and the software entities with v ≤ 0 as the non-defective class. If the total metric value of the software entities with v > 0 is smaller than the total metric value of the software entities with v ≤ 0, the software entities with v > 0 are instead labeled as the non-defective class and the rest as the defective class.
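The labeling rule of S4 can be illustrated with a short Python sketch. The function name `label_clusters` is hypothetical; the comparison uses the total metric value of each cluster, following the wording of the text literally (comparing per-entity averages instead would be a possible variation, not something the text states).

```python
def label_clusters(v, X_star):
    """Turn the spectral vector v into a 0/1 label vector p (1 = defective)."""
    positive = [vi > 0 for vi in v]
    row_sums = [sum(row) for row in X_star]
    total_pos = sum(s for s, pos in zip(row_sums, positive) if pos)
    total_neg = sum(s for s, pos in zip(row_sums, positive) if not pos)
    # Heuristic from the text: defective entities tend to have higher metric
    # values, so the cluster with the larger total is labeled defective.
    if total_pos < total_neg:
        positive = [not pos for pos in positive]
    return [1 if pos else 0 for pos in positive]


v = [0.5, 0.4, -0.3, -0.6]
X_star = [[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [6.0, 4.0]]
p = label_clusters(v, X_star)  # the v > 0 cluster has the smaller total, so labels flip
```

The flip makes the result independent of the arbitrary sign of the eigenvector v.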

Because sampling is performed with replacement, the predicted label vector p contains repeatedly sampled software entities; these are de-duplicated before the final integrated prediction. Software entities that were not sampled are given a special mark and are not used in the final integrated prediction.
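As a side note on why some entities go unsampled in a given round, the following is a standard bootstrap calculation rather than part of the source text. It assumes each of the N draws picks every entity with equal probability 1/N, independently, and uses T for the number of rounds.

```latex
\Pr\big[x^i \text{ not drawn in one round}\big]
  = \left(1 - \frac{1}{N}\right)^{N}
  \xrightarrow{\,N \to \infty\,} e^{-1} \approx 0.368

\Pr\big[x^i \text{ not drawn in any of } T \text{ rounds}\big]
  = \left(1 - \frac{1}{N}\right)^{NT} \approx e^{-T}
```

Under these assumptions roughly 63% of the distinct entities receive a label in each round, and the chance that an entity is never sampled shrinks quickly as the number of rounds grows.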

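S5, repeating steps S2 to S4 T times to obtain T predicted label vectors, and fusing the T predicted label vectors to obtain the integrated prediction result P, calculated as follows.

For the i-th software entity under test x^i:

$$P(x^i) = \frac{1}{T_i} \sum_{j=1}^{T_i} p_j(x^i)$$

where x^i is the sampled software entity under test, T_i is the number of times the software entity x^i was actually sampled, and p_j(x^i) is the label predicted for the sampled software entity x^i the j-th time.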
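The fusion in S5, together with the 0.5 decision threshold stated earlier, can be sketched as follows. The function name `integrate` and the dictionary representation of each round are hypothetical; returning `None` for an entity that was never sampled is an assumption of this sketch, since the text only says such entities are not used.

```python
def integrate(round_labels, n_entities, threshold=0.5):
    """Fuse per-round labels into P(x^i) and a final 0/1 decision.

    round_labels: list of dicts {entity index: 0 or 1}, one dict per round,
    holding only the (de-duplicated) entities sampled in that round.
    """
    P, decisions = [], []
    for i in range(n_entities):
        votes = [labels[i] for labels in round_labels if i in labels]
        if not votes:
            # Never sampled in any round: no prediction available
            # (assumption: reported as None rather than forced to a class).
            P.append(None)
            decisions.append(None)
            continue
        score = sum(votes) / len(votes)  # average over the rounds that saw x^i
        P.append(score)
        decisions.append(1 if score > threshold else 0)
    return P, decisions


rounds = [{0: 1, 1: 0}, {0: 1, 2: 0}, {0: 0, 1: 1, 2: 0}]
P, decisions = integrate(rounds, n_entities=4)
```

In this toy run entity 0 is labeled defective in two of three rounds and ends up defective, while entity 1 scores exactly 0.5 and is therefore treated as non-defective, matching the "less than or equal to 0.5" rule.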

In order to fully prove the feasibility of the software defect prediction method based on clustering integration, a verification experiment is carried out;

Experimental environment: Windows 10, MATLAB R2020b.

Experimental data: 5 public data sets are used as experimental data; detailed information is shown in Table 1:

Table 1. Data sets used in the experiments

The specific operation process comprises the following steps:

Input: project data under test X = {x^1, x^2, …, x^N} ∈ R^{M×N};

Output: whether each software entity in the project is defective, where 1 indicates defective and 0 indicates non-defective.

Step one, performing z-score normalization on the project data X under test;

Step two, for the supervised classifier methods, carrying out experiments with 50 repetitions of 2-fold cross-validation, generating 100 prediction results in total; for the unsupervised clustering methods, clustering is performed directly on each fold of the data, likewise generating 100 prediction results in total;

Step three, randomly sampling, with replacement, N software entities from the project data X under test, and randomly sampling m = log₂(M) metric elements from these N software entities to generate a new data set X_*;

Step four, clustering X_* with the spectral clustering algorithm to construct a prediction model, and labeling the results as defective or non-defective to obtain a predicted label vector;

Step five, repeating steps three and four multiple times, and fusing the resulting predicted label vectors to obtain the integrated prediction result.
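The evaluation protocol of 50 repetitions of 2-fold cross-validation can be sketched as below to show where the 100 results per method come from. The generator name `repeated_two_fold` is hypothetical, and plain (unstratified) random splitting is an assumption, since the text does not say how the folds are formed.

```python
import random


def repeated_two_fold(n_entities, repeats=50, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` x 2-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_entities))
        rng.shuffle(idx)
        half = n_entities // 2
        fold_a, fold_b = idx[:half], idx[half:]
        # Each fold serves once as the test set.
        yield fold_b, fold_a
        yield fold_a, fold_b


splits = list(repeated_two_fold(10))  # 50 repetitions x 2 folds = 100 splits
```

A supervised classifier would be trained on `train_idx` and scored on `test_idx`, whereas an unsupervised method would simply be run on the entities in `test_idx`, so both kinds of method yield 100 scores on comparable data.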

Performance is evaluated with the commonly used area under the receiver operating characteristic curve (AUC). AUC ranges from 0 to 1; a larger value indicates better performance, and a value of 0.5 corresponds to random guessing.
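For reference, AUC can be computed directly from its pairwise interpretation: the fraction of (defective, non-defective) pairs in which the defective entity receives the higher score, with ties counted as one half. The sketch below uses a hypothetical function name `auc` and is meant only to make the metric concrete; it is not taken from the experiments.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 0/1 ground truth (1 = defective); scores: e.g. the values P(x^i).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one defective and one non-defective entity")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


value = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly: 0.75
```

A scorer that assigns the same value to every entity wins half of each tied pair and so obtains 0.5, which is the random-guess baseline mentioned above.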

Under the same experimental environment, the method of the invention (ESC) is compared with three common supervised classifier methods, namely logistic regression (LR), naive Bayes (NB) and random forest (RF), and with two unsupervised clustering methods, namely K-means and spectral clustering (SC); the experimental results are compared with respect to the AUC index on 5 software projects.

The experimental results of all methods on the 5 projects are given in FIG. 2, which shows the AUC values of 100 runs in terms of the minimum, first quartile, median, third quartile and maximum. The boxes span the range from the first to the third quartile, the vertical lines mark the minimum and maximum values, the dots represent the median, and the plus signs represent outliers. As can be seen from the figure, the method of the invention achieves AUC results similar to those of the supervised LR, NB and RF methods, both on individual projects and across all projects; compared with the unsupervised K-means and SC methods, it achieves superior AUC results. These results show that the method can be applied to unsupervised software defect prediction, and verify its feasibility and effectiveness.

The above embodiments are merely preferred embodiments of the present invention, the protection scope of the present invention is not limited thereto, and any simple changes or equivalent substitutions of technical solutions that can be obviously obtained by those skilled in the art within the technical scope of the present invention disclosed in the present invention belong to the protection scope of the present invention.

Claims

1. The software defect prediction method based on cluster integration is characterized by comprising the following steps of:

randomly sampling, with replacement, N software entities x from given project data, and randomly extracting m metric elements from the N sampled software entities to form a data set X_*, where N is the total number of software entities in the project data;

removing repeatedly sampled (duplicate) software entities from the predicted label vector p;

resampling the data set X_* and generating the predicted label vector p multiple times, and, for each sampled software entity x^i, calculating the average value P(x^i) of its predicted labels p(x^i) and taking P(x^i) as its integrated prediction result;

2. The method for predicting software defects based on clustering as recited in claim 1, wherein,

the project data is the data of a software project under test, and a software entity is an instance (module) extracted from program code or the development process, which is a method, a class, a file, a package or a code change.

3. The cluster integration-based software defect prediction method according to claim 2, further comprising: normalizing the given project data using the z-score method;

the normalization is calculated as follows:

$$x_i^{*} = \frac{x_i - \mu_x}{\sigma_x}$$

where x_i is the original value of the i-th metric element of the software entity x, x_i^* is the normalized value of x_i, μ_x is the mean of the software entity x, and σ_x is the standard deviation of the software entity x.

4. The software defect prediction method based on clustering integration according to claim 1, wherein the clustering algorithm is spectral clustering, which proceeds as follows:

constructing an adjacency matrix W (formula not available in this text);

calculating the Laplacian matrix L:

$$L = D - W$$

where D is the degree matrix, a diagonal matrix whose diagonal elements are $d_{ii} = \sum_j w_{ij}$;

normalizing the Laplacian matrix L to obtain L_sym:

$$L_{sym} = D^{-1/2} L D^{-1/2}$$

performing eigenvalue decomposition on L_sym, selecting the eigenvector v corresponding to the second-smallest eigenvalue, and normalizing v to obtain the unsupervised software defect prediction model.

5. The method for predicting software defects based on clustering as set forth in claim 4, wherein said marking defective and non-defective prediction results comprises:

6. The cluster-integration-based software defect prediction method according to claim 1, wherein the integrated prediction result P(x^i) is calculated as follows:

$$P(x^i) = \frac{1}{T_i} \sum_{j=1}^{T_i} p_j(x^i)$$

where x^i is the sampled software entity under test, T_i is the number of times the software entity x^i was actually sampled, and p_j(x^i) is the label predicted for the sampled software entity x^i the j-th time.

7. The method for predicting software defects based on clustering integration as set forth in claim 1, wherein randomly sampling N software entities x with replacement comprises: drawing one entity at a time, with replacement, N times from the given project data X = {x^1, x^2, …, x^N} ∈ R^{M×N}, where M is the number of software metric elements of the project data X and N is the number of software entities in X.