CN115994310A - Software defect prediction method based on clustering integration - Google Patents

Software defect prediction method based on clustering integration

Publication number
CN115994310A
Authority
CN
China
Prior art keywords
software
software entity
entity
defects
defect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211224284.3A
Other languages
Chinese (zh)
Inventor
李志强
谢娟英
祁超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University
Priority to CN202211224284.3A
Publication of CN115994310A
Legal status: Pending

Abstract

The invention provides a software defect prediction method based on clustering integration, belonging to the technical field of software defect prediction. The method comprises the following steps: randomly sampling, with replacement, N software entities x from given project data, and randomly extracting m metric elements from the N sampled software entities to form a data set X*, where N is the total number of software entities in the project data; constructing an unsupervised software defect prediction model from the data set X* using a clustering algorithm; marking the prediction results as defective or defect-free to obtain a predictive label vector p for the sampled software entities; repeating the sampling of X* and the generation of the predictive label vector p multiple times, and, for each sampled software entity x_i, calculating its integrated prediction result P(x_i); if P(x_i) is greater than 0.5, the software entity x_i is judged to be defective, and if P(x_i) is less than or equal to 0.5, the software entity x_i is judged to be defect-free. The method can be used to predict software defects.

Description

Software defect prediction method based on clustering integration
Technical Field
The invention belongs to the technical field of software defect prediction, and particularly relates to a clustering integration-based software defect prediction method.
Background
Software defect prediction is one of the key technologies in software quality assurance. Its aim is to discover defects in software by economical and efficient means and thereby ensure the quality and reliability of software products. Defects inevitably arise during software development, and finding and repairing them as early as possible is extremely important for later development and maintenance. Software defect prediction therefore has broad application prospects in identifying defects in software products.
Unsupervised software defect prediction does not require the data of the software project under test to be labeled in advance; instead, it identifies software defects automatically by learning and mining the latent structure of the data. When the target project is newly developed and has no, or very little, historical defect data, unsupervised defect prediction can be used. It is therefore of considerable research significance for solving the defect prediction problem when historical defect data is absent or sparse.
In recent years, researchers have proposed several clustering-based unsupervised software defect prediction methods. However, most of these methods rely on the prediction result of a single clustering, which tends to have low accuracy and poor robustness, making accurate prediction of software defects difficult.
In summary, existing clustering-based unsupervised software defect prediction methods rely on the prediction result of a single clustering, have low accuracy and poor robustness, and struggle to predict software defects accurately.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a software defect prediction method based on clustering integration.
In order to achieve the above object, the present invention provides the following technical solutions:
a software defect prediction method based on clustering integration, comprising the following steps:
randomly sampling, with replacement, N software entities x from given project data, and randomly extracting m metric elements from the N sampled software entities to form a data set X*, wherein N is the total number of software entities in the project data;
constructing an unsupervised software defect prediction model from the data set X* using a clustering algorithm;
performing defect prediction on the software entities according to the unsupervised software defect prediction model, and marking the prediction results as defective or defect-free to obtain a predictive label vector p covering the defective and defect-free software entities;
removing repeatedly sampled software entities from the predictive label vector p;
repeating the sampling of the data set X* and the generation of the predictive label vector p multiple times, and, for each sampled software entity x_i, calculating the average value P(x_i) of its predicted labels p(x_i) and taking P(x_i) as its integrated prediction result;
if P(x_i) is greater than 0.5, the software entity x_i is judged to be defective; if P(x_i) is less than or equal to 0.5, the software entity x_i is judged to be defect-free.
Further, the project data is the data of the software project under test, and a software entity is an instance (module) extracted from program code or from the development process, namely a method, a class, a file, a package or a code change.
Further, the method further comprises: normalizing the given project data using the z-score method, computed as follows:
$$x_i^* = \frac{x_i - \mu_x}{\sigma_x}$$
wherein x_i is the original value of the i-th metric element of software entity x, x_i* is the normalized value of x_i, μ_x is the mean of software entity x, and σ_x is the standard deviation of software entity x.
Further, the clustering algorithm is spectral clustering, which proceeds as follows:
constructing an adjacency matrix W:
[Equation image in the original publication: definition of the adjacency matrix W]
calculating the Laplacian matrix L:
$$L = D - W$$
wherein D is the degree matrix, a diagonal matrix with diagonal elements
$$D_{ii} = \sum_{j} W_{ij}$$
normalizing the Laplacian matrix L to obtain L_sym:
$$L_{sym} = D^{-1/2} L D^{-1/2}$$
performing eigenvalue decomposition on L_sym, selecting the eigenvector v corresponding to the second-smallest eigenvalue, and normalizing v to obtain the unsupervised software defect prediction model.
Further, marking the prediction results as defective or defect-free comprises:
dividing v into two clusters, marking the software entities with v > 0 as the defective class and the software entities with v ≤ 0 as the defect-free class; if the total metric element value of the software entities with v > 0 is smaller than that of the software entities with v ≤ 0, the labels are swapped, so that the software entities with v > 0 are marked as the defect-free class and the rest as the defective class.
Further, the integrated prediction result P(x_i) is calculated as follows:
$$P(x_i) = \frac{1}{T_i} \sum_{j=1}^{T_i} p_j(x_i)$$
wherein x_i is the sampled software entity under test, T_i is the number of times software entity x_i was actually sampled, and p_j(x_i) is the label predicted for x_i in its j-th prediction.
Further, randomly sampling N software entities x with replacement comprises: drawing one entity at a time, with replacement, N times from the given project data X = {x_1, x_2, …, x_N} ∈ R^(M×N), wherein M is the number of software metric elements of the project data X and N is the number of software entities in X.
The software defect prediction method based on clustering integration has the following beneficial effects:
the invention combines ensemble learning with unsupervised clustering to provide an unsupervised software defect prediction method based on clustering integration. The method identifies software defects automatically without requiring data to be labeled in advance, so it can be used in software projects that have no, or insufficient, historical defect data. By integrating multiple base clustering results, the method identifies defects more effectively and more robustly than a single clustering, which helps ensure the quality and reliability of software products. It thereby addresses the problem that prior clustering-based unsupervised software defect prediction methods rely on the prediction result of a single clustering, have low accuracy and poor robustness, and struggle to predict software defects accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention and the design thereof, the drawings required for the embodiments will be briefly described below. The drawings in the following description are only some of the embodiments of the present invention and other drawings may be made by those skilled in the art without the exercise of inventive faculty.
FIG. 1 is a schematic structural diagram of a software defect prediction method based on clustering integration according to an embodiment of the present invention;
FIG. 2 is a graph comparing the experimental results of the method of the present invention with those of 5 other methods on 5 projects.
Detailed Description
The present invention will be described in detail below with reference to the drawings and the embodiments, so that those skilled in the art can better understand the technical scheme of the present invention and can implement the same. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Examples:
the invention provides a software defect prediction method based on clustering integration, which is shown in figure 1 and comprises the following steps:
S1, data preprocessing: normalizing the given project data to eliminate the effect of differing scales among the metric elements of a software entity. The project data is the data of the software project under test, and a software entity is an instance (module) extracted from program code or from the development process, which may be a method, a class, a file, a package or a code change. Specifically, the normalization uses the z-score method, computed as follows:
$$x_i^* = \frac{x_i - \mu_x}{\sigma_x}$$
wherein x_i is the original value of the i-th metric element of software entity x, x_i* is the normalized value of x_i, μ_x is the mean of software entity x, and σ_x is the standard deviation of software entity x.
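As an illustration of step S1, the following minimal sketch applies the z-score formula to the metric values of one software entity. It follows the wording above, in which the mean and standard deviation are taken over the entity's own metric values. The function name `z_score`, the use of the population standard deviation, and the handling of a zero standard deviation are assumptions made for this example and are not specified in the text.

```python
from statistics import fmean, pstdev

def z_score(entity):
    """Normalize one entity's metric values to zero mean and unit variance."""
    mu = fmean(entity)
    sigma = pstdev(entity)  # population standard deviation (an assumption)
    if sigma == 0:
        # All metric values are equal: return zeros instead of dividing by zero
        # (an illustrative choice, not taken from the source).
        return [0.0 for _ in entity]
    return [(value - mu) / sigma for value in entity]

# Hypothetical entity described by three metric values.
normalized = z_score([2.0, 4.0, 6.0])
```

After normalization the values of an entity are centered on zero, so metric elements measured on very different scales contribute comparably to the later similarity computation.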
S2, randomly sampling, with replacement, N software entities x from the given project data, and randomly extracting m metric elements from the N sampled software entities to form a data set X*.
Specifically, randomly sampling N software entities x with replacement comprises: drawing one entity at a time, with replacement, N times from the given project data X = {x_1, x_2, …, x_N} ∈ R^(M×N), wherein M is the number of software metric elements of the project data X and N is the number of software entities in X.
Specifically, randomly sampling m software metric elements comprises: extracting m (m ≤ M) metric elements from the sampled software entities to generate a new data set X* ∈ R^(m×N).
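Step S2 can be sketched as follows. This is only an illustration under stated assumptions: the data is stored with one row per software entity (the transpose of the M×N layout used in the text), the m metric elements are drawn without repetition, and the names `sample_dataset`, `entity_idx` and `metric_idx` are hypothetical.

```python
import random

def sample_dataset(X, m, rng):
    """Draw N entities with replacement, then keep m randomly chosen metrics.

    X is a list of N entities, each a list of M metric values.
    Returns the sampled entity indices, the chosen metric indices, and X*.
    """
    n_entities = len(X)
    n_metrics = len(X[0])
    # N draws with replacement: some entities repeat, others are never drawn.
    entity_idx = [rng.randrange(n_entities) for _ in range(n_entities)]
    # m distinct metric elements out of M (m <= M); distinctness is an assumption.
    metric_idx = sorted(rng.sample(range(n_metrics), m))
    X_star = [[X[i][k] for k in metric_idx] for i in entity_idx]
    return entity_idx, metric_idx, X_star

rng = random.Random(0)  # fixed seed only so the example is repeatable
X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
entity_idx, metric_idx, X_star = sample_dataset(X, m=2, rng=rng)
```

Keeping `entity_idx` alongside X* is what later allows repeated draws to be de-duplicated and unsampled entities to be excluded from a round.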
S3, obtaining a clustering result from the data X* using a clustering algorithm, wherein the clustering algorithm is spectral clustering, which proceeds as follows:
constructing an adjacency matrix W:
[Equation image in the original publication: definition of the adjacency matrix W]
calculating the Laplacian matrix L:
$$L = D - W$$
wherein D is the degree matrix, a diagonal matrix with diagonal elements
$$D_{ii} = \sum_{j} W_{ij}$$
normalizing the Laplacian matrix L to obtain L_sym:
$$L_{sym} = D^{-1/2} L D^{-1/2}$$
performing eigenvalue decomposition on L_sym, selecting the eigenvector v corresponding to the second-smallest eigenvalue, and normalizing v to obtain the clustering result;
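The spectral clustering step can be sketched with NumPy as below. The definition of the adjacency matrix W is not recoverable from this text, so the Gaussian similarity used here, the parameter `gamma`, and the function name `spectral_bipartition` are assumptions for illustration only. The text also does not say how v is normalized; this sketch simply rescales it to unit length.

```python
import numpy as np

def spectral_bipartition(X_star, gamma=1.0):
    """Return the unit-length eigenvector of L_sym for the second-smallest eigenvalue.

    X_star has shape (N, m): one row per sampled entity.
    """
    X_star = np.asarray(X_star, dtype=float)
    # Pairwise squared Euclidean distances between entities.
    sq_dist = ((X_star[:, None, :] - X_star[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq_dist)      # assumed Gaussian similarity
    np.fill_diagonal(W, 0.0)          # no self-loops
    degree = W.sum(axis=1)            # D_ii = sum_j W_ij
    L = np.diag(degree) - W           # L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # L_sym = D^{-1/2} L D^{-1/2}, written with broadcasting.
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    v = eigvecs[:, 1]                 # second-smallest eigenvalue
    return v / np.linalg.norm(v)

# Two hypothetical groups of entities that are well separated in metric space.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [2.0, 2.0], [2.1, 2.0], [2.0, 2.1]]
v = spectral_bipartition(points)
```

The sign pattern of v splits the entities into two clusters. Which sign corresponds to the defective class is not decided at this point; that is the purpose of the labeling step that follows.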
S4, labeling the clustering result to obtain a predictive label vector p, as follows:
prior studies have shown that the metric values of defective software entities are generally higher than those of defect-free software entities; on this basis, the predictive label vector p of the data under test X* is calculated as:
$$p = (v > 0)$$
The above formula divides v into two clusters, marking the software entities with v > 0 as the defective class and those with v ≤ 0 as the defect-free class. If, however, the total metric value of the software entities with v > 0 is smaller than that of the software entities with v ≤ 0, the labels are swapped: the software entities with v > 0 are marked as the defect-free class and the rest as the defective class.
Because sampling is performed with replacement, the predictive label vector p may contain the same software entity more than once; such repeats are de-duplicated before the final integrated prediction. Software entities that were not sampled are given a special mark, and that mark is not used in the final integrated prediction.
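The labeling rule of step S4 can be sketched as follows. The text speaks of the "total metric value" of each cluster; this sketch compares the average per-entity metric sum of the two clusters so that cluster size alone does not decide the outcome, which is an interpretation rather than something the text states. The function name `label_clusters` is hypothetical.

```python
from statistics import fmean

def label_clusters(v, X_star):
    """Turn eigenvector entries into 0/1 labels, where 1 means defective."""
    positive = [value > 0 for value in v]            # p = (v > 0)
    row_sums = [sum(row) for row in X_star]          # metric total per entity
    pos = [s for s, flag in zip(row_sums, positive) if flag]
    neg = [s for s, flag in zip(row_sums, positive) if not flag]
    # Swap the labels when the v > 0 cluster has the lower metric values.
    flip = bool(pos) and bool(neg) and fmean(pos) < fmean(neg)
    return [int(flag != flip) for flag in positive]

# Hypothetical round: the v > 0 cluster has the smaller metric values,
# so it ends up labeled defect-free (0) and the other entity defective (1).
labels = label_clusters([0.5, 0.4, -0.6], [[1, 1], [2, 1], [9, 8]])
```

The swap is needed because the sign of an eigenvector is arbitrary: the same clustering can come back with all signs reversed, and only the metric values indicate which cluster is more likely to be defective.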
S5, repeating steps S2 to S4 T times to obtain T predictive label vectors, and fusing the T predictive label vectors to obtain the integrated prediction result P, as follows:
for the i-th software entity under test x_i,
$$P(x_i) = \frac{1}{T_i} \sum_{j=1}^{T_i} p_j(x_i)$$
wherein x_i is the sampled software entity under test, T_i is the number of times software entity x_i was actually sampled, and p_j(x_i) is the label predicted for x_i in its j-th prediction.
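The fusion in step S5, together with the 0.5 decision threshold, can be sketched as below. Each round is represented as a pair of sampled entity indices and their predicted labels; this representation and the names `fuse_predictions`, `votes` and `counts` are assumptions made for the example. Entities never drawn in any round receive `None` here, which is likewise an illustrative choice.

```python
def fuse_predictions(n_entities, rounds):
    """Average each entity's labels over the rounds in which it was drawn.

    rounds is a list of (entity_idx, labels) pairs, one pair per round.
    Returns the averaged scores P(x_i) and the thresholded 0/1 decisions.
    """
    votes = [0] * n_entities    # sum of predicted labels per entity
    counts = [0] * n_entities   # rounds in which the entity was actually drawn
    for entity_idx, labels in rounds:
        seen = {}
        for i, label in zip(entity_idx, labels):
            seen.setdefault(i, label)   # de-duplicate repeated draws in a round
        for i, label in seen.items():
            votes[i] += label
            counts[i] += 1
    scores = [votes[i] / counts[i] if counts[i] else None
              for i in range(n_entities)]
    # Defective only if the average is strictly greater than 0.5.
    decisions = [None if s is None else int(s > 0.5) for s in scores]
    return scores, decisions

# Three hypothetical rounds over four entities; entity 3 is never drawn.
rounds = [([0, 0, 2], [1, 1, 0]),
          ([1, 0, 2], [0, 1, 1]),
          ([0, 1, 1], [0, 1, 1])]
scores, decisions = fuse_predictions(4, rounds)
```

In this example entity 0 is labeled defective in two of its three rounds, so its score of 2/3 exceeds the threshold, while entities 1 and 2 sit exactly at 0.5 and are therefore judged defect-free.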
To demonstrate the feasibility of the software defect prediction method based on clustering integration, a verification experiment was carried out.
Experimental environment: Windows 10, MATLAB 2020b.
Experimental data: the experiment uses 5 public data sets, detailed in Table 1.
Table 1. Data sets used in the experiment
[Table 1 is provided as an image in the original publication.]
The specific operation process comprises the following steps:
Input: project data under test X = {x_1, x_2, …, x_N} ∈ R^(M×N).
Output: whether each software entity in the project is defective, where 1 indicates defective and 0 indicates defect-free.
Step one: performing z-score normalization on the project data X under test;
Step two: for the supervised classifier methods, running 50 repetitions of 2-fold cross-validation, which produces 100 prediction results in total; for the unsupervised clustering methods, clustering each fold of data directly, which likewise produces 100 prediction results in total;
Step three: randomly sampling, with replacement, N software entities from the project data X under test, and randomly sampling m = log_2(M) metric elements from them to generate a new data set X*;
Step four: clustering X* with the spectral clustering algorithm to construct a prediction model, and marking the results as defective or defect-free to obtain a predictive label vector;
Step five: repeating steps three and four multiple times, and fusing the resulting predictive label vectors to obtain the integrated prediction result.
Performance is evaluated with a commonly used index, the area under the receiver operating characteristic curve (AUC). AUC ranges from 0 to 1; larger values indicate better performance, and a value of 0.5 corresponds to random guessing.
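For reference, AUC can be computed directly from its ranking interpretation: it is the fraction of (defective, defect-free) pairs in which the defective entity receives the higher score, with ties counted as one half. The sketch below uses that definition. It is not the evaluation code used in the experiment, and the function name `auc` and the sample values are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one entity of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # defective entity ranked above defect-free one
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and integrated prediction scores.
value = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

Here three of the four (defective, defect-free) pairs are ordered correctly, giving 0.75. The pairwise loop is quadratic in the number of entities, which is acceptable for a small illustration but not for large projects.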
Under the same experimental environment, the method of the invention (ESC) was compared, in terms of AUC on 5 software projects, with three common supervised classifier methods, namely logistic regression (LR), naive Bayes (NB) and random forest (RF), and with two unsupervised clustering methods, namely K-means and spectral clustering (SC).
FIG. 2 gives the experimental results of all methods on the 5 projects. It shows the AUC values of 100 runs, including the minimum, first quartile, median, third quartile and maximum. The vertical bars span the first to third quartiles, the vertical lines extend to the minimum and maximum values, the dots mark the median, and the plus signs mark outliers. As the figure shows, the method of the invention achieves AUC results similar to those of the supervised LR, NB and RF methods, both on individual projects and across all projects; compared with the unsupervised K-means and SC methods, it achieves better AUC results. These results show that the method can be applied to unsupervised software defect prediction and verify its feasibility and effectiveness.
The above embodiments are merely preferred embodiments of the present invention, the protection scope of the present invention is not limited thereto, and any simple changes or equivalent substitutions of technical solutions that can be obviously obtained by those skilled in the art within the technical scope of the present invention disclosed in the present invention belong to the protection scope of the present invention.

Claims (7)

1. A software defect prediction method based on clustering integration, characterized by comprising the following steps:
randomly sampling, with replacement, N software entities x from given project data, and randomly extracting m metric elements from the N sampled software entities to form a data set X*, wherein N is the total number of software entities in the project data;
constructing an unsupervised software defect prediction model from the data set X* using a clustering algorithm;
performing defect prediction on the software entities according to the unsupervised software defect prediction model, and marking the prediction results as defective or defect-free to obtain a predictive label vector p covering the defective and defect-free software entities;
removing repeatedly sampled software entities from the predictive label vector p;
repeating the sampling of the data set X* and the generation of the predictive label vector p multiple times, and, for each sampled software entity x_i, calculating the average value P(x_i) of its predicted labels p(x_i) and taking P(x_i) as its integrated prediction result;
if P(x_i) is greater than 0.5, the software entity x_i is judged to be defective; if P(x_i) is less than or equal to 0.5, the software entity x_i is judged to be defect-free.
2. The software defect prediction method based on clustering integration according to claim 1, wherein
the project data is the data of the software project under test, and a software entity is an instance (module) extracted from program code or from the development process, namely a method, a class, a file, a package or a code change.
3. The software defect prediction method based on clustering integration according to claim 2, further comprising: normalizing the given project data using the z-score method;
the normalization is computed as follows:
$$x_i^* = \frac{x_i - \mu_x}{\sigma_x}$$
wherein x_i is the original value of the i-th metric element of software entity x, x_i* is the normalized value of x_i, μ_x is the mean of software entity x, and σ_x is the standard deviation of software entity x.
4. The software defect prediction method based on clustering integration according to claim 1, wherein the clustering algorithm is spectral clustering, which proceeds as follows:
constructing an adjacency matrix W:
[Equation image in the original publication: definition of the adjacency matrix W]
calculating the Laplacian matrix L:
$$L = D - W$$
wherein D is the degree matrix, a diagonal matrix with diagonal elements
$$D_{ii} = \sum_{j} W_{ij}$$
normalizing the Laplacian matrix L to obtain L_sym:
$$L_{sym} = D^{-1/2} L D^{-1/2}$$
performing eigenvalue decomposition on L_sym, selecting the eigenvector v corresponding to the second-smallest eigenvalue, and normalizing v to obtain the unsupervised software defect prediction model.
5. The software defect prediction method based on clustering integration according to claim 4, wherein marking the prediction results as defective or defect-free comprises:
dividing v into two clusters, marking the software entities with v > 0 as the defective class and the software entities with v ≤ 0 as the defect-free class; if the total metric element value of the software entities with v > 0 is smaller than that of the software entities with v ≤ 0, the labels are swapped, so that the software entities with v > 0 are marked as the defect-free class and the rest as the defective class.
6. The software defect prediction method based on clustering integration according to claim 1, wherein the integrated prediction result P(x_i) is calculated as follows:
$$P(x_i) = \frac{1}{T_i} \sum_{j=1}^{T_i} p_j(x_i)$$
wherein x_i is the sampled software entity under test, T_i is the number of times software entity x_i was actually sampled, and p_j(x_i) is the label predicted for x_i in its j-th prediction.
7. The software defect prediction method based on clustering integration according to claim 1, wherein randomly sampling N software entities x with replacement comprises: drawing one entity at a time, with replacement, N times from the given project data X = {x_1, x_2, …, x_N} ∈ R^(M×N), wherein M is the number of software metric elements of the project data X and N is the number of software entities in X.
CN202211224284.3A 2022-10-08 2022-10-08 Software defect prediction method based on clustering integration Pending CN115994310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211224284.3A CN115994310A (en) 2022-10-08 2022-10-08 Software defect prediction method based on clustering integration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211224284.3A CN115994310A (en) 2022-10-08 2022-10-08 Software defect prediction method based on clustering integration

Publications (1)

Publication Number Publication Date
CN115994310A true CN115994310A (en) 2023-04-21

Family

ID=85989416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211224284.3A Pending CN115994310A (en) 2022-10-08 2022-10-08 Software defect prediction method based on clustering integration

Country Status (1)

Country Link
CN (1) CN115994310A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination