CN112148605B

CN112148605B - Software defect prediction method based on spectral clustering and semi-supervised learning

Info

Publication number: CN112148605B
Application number: CN202010999235.1A
Authority: CN
Inventors: 陆璐; 周璇
Original assignee: Meizhou Institute Of Technology South China University Of Technology; South China University of Technology SCUT
Current assignee: Meizhou Institute Of Technology South China University Of Technology; South China University of Technology SCUT
Priority date: 2020-09-22
Filing date: 2020-09-22
Publication date: 2022-05-20
Anticipated expiration: 2040-09-22
Also published as: CN112148605A

Abstract

The invention discloses a software defect prediction method based on spectral clustering and semi-supervised learning, comprising the following steps: 1) acquiring original data and preprocessing it to obtain a feature matrix; 2) determining whether the data in the feature matrix are labeled: unlabeled data are clustered by spectral clustering, the resulting clusters are labeled using a heuristic rule of software defect prediction to obtain pseudo labels, and the method proceeds to step 3); labeled data proceed directly to step 3); 3) calculating a feature deviation score from the data distribution and performing feature selection, wherein the original labeled data are given a greater weight than the pseudo-labeled data; 4) clustering and labeling again on the new feature matrix to obtain the prediction result. The method reduces the influence of irrelevant and redundant features on the model, makes use of the project's original labeled data, can effectively improve the accuracy of software defect prediction, and broadens the applicability of the model.

Description

Software defect prediction method based on spectral clustering and semi-supervised learning

Technical Field

The invention relates to the field of software defect prediction, in particular to a software defect prediction method based on spectral clustering and semi-supervised learning.

Background

Software defect prediction is the process of predicting whether a software entity is defective. As software continues to grow in scale, defect prediction is receiving more and more attention as a technique that can reduce the burden on software testers and optimize the allocation of developers and testers. It has been found that the cost of finding and repairing defects after development is much higher than the cost of finding and repairing them during development. It is therefore important to introduce software defect prediction early in the software lifecycle. However, software defect prediction is still rarely applied in industry. This is mainly because most research in the field relies on supervised learning, whereas in practice labeled data are often scarce, and collecting them is a time-consuming, labor-intensive, and error-prone task.

To solve this problem, cross-project software defect prediction has been proposed. In cross-project software defect prediction, a model is trained on projects with sufficient historical data and then used to make predictions for new projects. The main problem with this approach is heterogeneity between projects: on the one hand, different projects may collect different attributes; on the other hand, even if a source project and a target project with the same attributes are selected, their attribute distributions still differ, so a classifier trained on the source project is not necessarily suitable for the target project. Unsupervised learning is also used to address the shortage of labeled data; it mainly consists of two steps: 1) clustering the data; 2) labeling the clusters to decide whether each cluster is defective. However, the data of an actual project usually consist of a small amount of labeled data and a large amount of unlabeled data, and both schemes assume that the target project has no historical labeled data at all, so part of the available information is lost and model performance suffers.

Furthermore, existing research focuses mainly on constructing a usable model and pays little attention to feature selection. However, using a large number of irrelevant or redundant features not only wastes computational resources but also degrades the performance of the prediction model.

Disclosure of Invention

The main purpose of the invention is to overcome the shortcomings of the prior art and provide a software defect prediction method based on spectral clustering and semi-supervised learning. The method uses spectral clustering with an added feature selection step, which reduces the influence of irrelevant and redundant features on the model. It also suits the situation commonly found in actual production, where only a small amount of labeled data and a large amount of unlabeled data are available, and it makes use of the project's original labeled data. It can thereby effectively improve the accuracy of software defect prediction and broaden the applicability of the model.

The purpose of the invention is realized by the following technical scheme:

A software defect prediction method based on spectral clustering and semi-supervised learning comprises the following steps:

1) acquiring original data from a database and preprocessing it to obtain a feature matrix;

2) determining whether the data in the feature matrix are labeled:

clustering the unlabeled data by spectral clustering, labeling the resulting clusters using a heuristic rule of software defect prediction to obtain pseudo labels, and then proceeding to step 3);

for the labeled data, proceeding directly to step 3);

3) calculating a feature deviation score from the data distribution and performing feature selection, wherein the original labeled data are given a greater weight than the pseudo-labeled data;

4) clustering and labeling again on the new feature matrix to obtain the prediction result.

In step 1), the database is a repository storing sample data for a plurality of software entities; a software entity is the smallest unit in which data are represented in a given database, and includes classes, functions, and files; the sample data comprise unlabeled data and labeled data, where the unlabeled data contain only the attribute data of a software entity and the labeled data additionally contain a label indicating whether the software entity is defective; the attributes are the features of a software entity recorded in the database, including the number of lines of code, the number of methods in a class, and the bytecode size of a class.

In step 1), the data preprocessing comprises data standardization and missing-data handling; the feature matrix is the matrix whose rows are software entities and whose columns are attributes.

In step 2), the unlabeled data are clustered by spectral clustering as follows:

computing an adjacency matrix W whose entries are the weights of the edges between software entities;

computing the degree matrix D of the feature matrix;

computing the Laplacian matrix L as L = D - W;

performing eigendecomposition of the Laplacian matrix L to obtain its eigenvectors;

following the normalized cut algorithm, splitting the data into two clusters according to whether each entry of the eigenvector associated with the second-smallest eigenvalue is greater than 0.
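The clustering steps above can be written compactly as follows. This is a sketch under stated assumptions: the edge-weight formula is the one given in the detailed description, while the symbols x_{ik} (value of attribute k for entity i), v_2 (eigenvector of the second-smallest eigenvalue) and c_i (cluster indicator) are notation introduced here for illustration.

```latex
\begin{aligned}
W_{ij} &= \begin{cases} \exp\!\Big(-\sum_{k}(x_{ik}-x_{jk})^{2}\Big), & i \neq j \\ 0, & i = j \end{cases} \\
D_{ii} &= \sum_{j} W_{ij}, \qquad D_{ij} = 0 \ (i \neq j) \\
L &= D - W \\
L\,v_{2} &= \lambda_{2}\,v_{2}, \qquad 0 = \lambda_{1} \le \lambda_{2} \le \dots \le \lambda_{n} \\
c_{i} &= \begin{cases} 1, & (v_{2})_{i} > 0 \\ 0, & \text{otherwise} \end{cases}
\end{aligned}
```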

In step 2), the heuristic rule is: the more complex a software entity is, the more likely it is to be defective; the pseudo labels are the labels obtained by clustering and labeling on all attributes extracted from the database, specifically as follows:

calculating the row-average attribute value of each of the two clusters, that is, dividing the sum of all attribute values in a cluster by the number of rows in that cluster;

if the attributes are positively correlated with software complexity, the cluster with the larger row-average attribute value is labeled as the defective cluster; if the attributes are negatively correlated with software complexity, the cluster with the smaller row-average attribute value is labeled as the defective cluster.

In step 3), the feature deviation score of a feature is the ratio of the number of software entities that violate the heuristic rule of software defect prediction on that feature to the total number of software entities; a violation of the heuristic rule covers the following two cases:

a) the complexity indicated by the feature exceeds the threshold, but the corresponding software entity is determined to be non-defective;

b) the complexity indicated by the feature does not exceed the threshold, but the corresponding software entity is determined to be defective.

Step 3) is specifically as follows:

calculating the proportion of defect-free software entities among all software entities according to the distribution of the label data, and using this defect-free proportion as the threshold percentile;

obtaining, for each feature, the value at the threshold percentile; if an entity's value for a feature is greater than that value but its label is non-defective, or is less than that value but its label is defective, the feature's deviation count is incremented by one; the deviation count of a feature over all software entities divided by the total number of software entities gives the feature deviation score of that feature;

if the deviation score of a feature is greater than the defect-free proportion, the feature is only weakly correlated with the label and is discarded; if the deviation score is not greater than the defect-free proportion, the feature is more strongly correlated with the label and is retained.
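One way to write the selection rule above as formulas is given below. It is an interpretation rather than the patent's own notation: y_i = 1 denotes a defective entity and y_i = 0 a defect-free one, w_i is the weight of entity i (larger for originally labeled entities than for pseudo-labeled ones), and applying the weights only in the defect-free proportion p is an assumption.

```latex
\begin{aligned}
p &= \frac{\sum_{i} w_{i}\,[y_{i} = 0]}{\sum_{i} w_{i}} \\
t_{j} &= \text{value at the } 100p\text{-th percentile of feature } j \\
s_{j} &= \frac{1}{n}\sum_{i=1}^{n}\Big[(x_{ij} > t_{j} \wedge y_{i} = 0) \vee (x_{ij} < t_{j} \wedge y_{i} = 1)\Big] \\
\text{keep feature } j &\iff s_{j} \le p
\end{aligned}
```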

In step 4), the new feature matrix is the matrix obtained after the feature selection in step 3).

Compared with the prior art, the invention has the following advantages and beneficial effects:

In conventional software testing, collecting the large amount of historical data required is a time-consuming, labor-intensive, and error-prone task. To address this, current research falls into two categories: cross-project software defect prediction and unsupervised learning. However, the data in an actual project usually consist of a small amount of labeled data and a large amount of unlabeled data, and both schemes assume that the current project has no historical labeled data at all, so part of the available information is lost and model performance suffers. Furthermore, existing research focuses mainly on constructing a usable model and pays little attention to feature selection, although using a large number of irrelevant or redundant features not only wastes computational resources but also generally degrades the performance of the prediction model. Compared with the prior art, the invention provides a software defect prediction method based on spectral clustering and semi-supervised learning that adds feature selection, concentrating on the effective features and reducing the dimensionality of the data set; at the same time, the labeled data already present in the project are used to improve the model.

The spectral clustering adopted by the invention is rooted in graph theory. The data are first viewed as discrete points in space connected by edges, with the edge weights reflecting the distances between the data points; the points and edges form an undirected graph. Clustering is then the process of dividing the graph into two disjoint parts, and the guiding idea of the division is that points within the same cluster should be as close to each other as possible, while points in different clusters should be as far apart as possible.

Drawings

FIG. 1 is a schematic diagram of an overall scheme of a software defect prediction method based on spectral clustering semi-supervised learning.

FIG. 2 is a flow chart of a software defect prediction method based on semi-supervised learning of spectral clustering according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

Referring to fig. 1 and 2, a software defect prediction method based on semi-supervised learning of spectral clustering includes the following steps:

1) acquiring original data from a database, and performing data preprocessing operation to obtain a processed feature matrix, wherein the method specifically comprises the following steps:

1.1) Because different attributes have different value ranges, z-score standardization is applied to each feature so that the influence of features with small absolute values is not masked by features with large absolute values, and so that every feature is treated equally by the classifier.

1.2) After z-score standardization each feature has zero mean and unit variance, and missing data in the database are replaced by the mean of the existing data for the corresponding feature.
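Steps 1.1) and 1.2) can be sketched in Python with NumPy. This is a minimal illustration, not the patent's implementation: the function name `preprocess`, the use of NaN to mark missing values, and the handling of constant columns are all assumptions.

```python
import numpy as np

def preprocess(X):
    """Z-score standardize each column, then fill missing values (NaN).

    Rows are software entities, columns are attributes.
    """
    X = np.asarray(X, dtype=float)
    # Column statistics computed from the observed (non-NaN) values only.
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0  # assumption: leave constant columns at zero
    Z = (X - mean) / std
    # Replace each missing value with the mean of the existing values
    # of the same feature (0 after standardization).
    col_mean = np.nanmean(Z, axis=0)
    return np.where(np.isnan(Z), col_mean, Z)
```

For example, `preprocess([[1, 10], [2, float("nan")], [3, 30]])` returns a 3×2 matrix with no missing entries, in which the missing value takes the standardized column mean.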

2) Based on the feature matrix, the unlabeled part of the data is clustered by spectral clustering, specifically as follows:

2.1) Computing the adjacency matrix W of edge weights between software entities: the weight between a software entity and itself is zero; the weight between two different software entities is the exponential of the negative sum of squared differences over all attributes of the two entities.

2.2) Computing the degree matrix D of the feature matrix. For any point v in the graph, its degree d is defined as the sum of the weights of all edges connected to it. This yields a diagonal degree matrix in which only the main diagonal is non-zero, the i-th diagonal entry being the degree of the i-th point.

2.3) Computing the Laplacian matrix L as L = D - W.

2.4) Performing eigendecomposition of the Laplacian matrix L to obtain its eigenvectors.

2.5) Following the normalized cut algorithm, splitting the data into two clusters according to whether each entry of the eigenvector associated with the second-smallest eigenvalue is greater than 0.
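Steps 2.1) to 2.5) can be sketched as below. The function names are hypothetical, and the sketch follows the text literally by using the unnormalized Laplacian L = D - W; the normalized cut algorithm as usually formulated works with a degree-normalized Laplacian instead, so this should be read as one possible reading rather than a definitive implementation.

```python
import numpy as np

def adjacency_matrix(X):
    """W[i, j] = exp(-sum_k (x_ik - x_jk)^2) for i != j, and 0 for i == j."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]   # pairwise attribute differences
    sq_dist = np.sum(diff ** 2, axis=2)    # sum of squared differences
    W = np.exp(-sq_dist)
    np.fill_diagonal(W, 0.0)               # no self-edges
    return W

def spectral_bipartition(X):
    """Split the rows of X into two clusters; returns a boolean mask."""
    W = adjacency_matrix(X)
    D = np.diag(W.sum(axis=1))             # degree matrix (diagonal)
    L = D - W                              # Laplacian matrix
    # eigh handles symmetric matrices and returns eigenvalues in ascending
    # order, so column 1 belongs to the second-smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, 1] > 0
```

The sign of an eigenvector is arbitrary, so which of the two clusters comes out as `True` carries no meaning at this stage; the labeling in step 3) decides which cluster is treated as defective.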

3) For the unlabeled part of the data, the clusters obtained in step 2) are labeled using the heuristic rule of software defect prediction to obtain pseudo labels, specifically as follows:

3.1) Calculating the row-average attribute value of each of the two clusters, that is, dividing the sum of all attribute values in a cluster by the number of rows in that cluster.

3.2) If the attributes are positively correlated with software complexity, the cluster with the larger row-average attribute value is labeled as the defective cluster; if the attributes are negatively correlated with software complexity, the cluster with the smaller row-average attribute value is labeled as the defective cluster.
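The labeling heuristic of steps 3.1) and 3.2) might look like the following sketch. The name `label_clusters` and the `direction` parameter (+1 for an attribute positively correlated with complexity, -1 for a negatively correlated one) are hypothetical; flipping the sign of negatively correlated attributes before averaging is an assumption about how mixed attributes would be handled. Both clusters are assumed to be non-empty.

```python
import numpy as np

def label_clusters(Z, in_first_cluster, direction=None):
    """Return a boolean pseudo label per row (True = defective)."""
    Z = np.asarray(Z, dtype=float)
    in_first = np.asarray(in_first_cluster, dtype=bool)
    if direction is None:
        direction = np.ones(Z.shape[1])    # assume all attributes positive
    # Orient every attribute so that larger always means more complex.
    oriented = Z * np.asarray(direction, dtype=float)
    row_sums = oriented.sum(axis=1)
    # Row average: sum of all attribute values in a cluster divided by
    # the number of rows in that cluster.
    avg_first = row_sums[in_first].sum() / in_first.sum()
    avg_second = row_sums[~in_first].sum() / (~in_first).sum()
    first_is_defective = avg_first > avg_second
    # Rows of the more complex cluster get True, the others False.
    return in_first == first_is_defective
```

Because the result depends only on which cluster has the larger average, it is unaffected by the arbitrary sign of the eigenvector used to form the clusters.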

4) Calculating the feature deviation score from the data distribution and selecting features, specifically as follows:

4.1) According to the distribution of the label data, with the weight of the originally labeled data set to 2 times that of the pseudo-labeled data, calculating the proportion of defect-free software entities among all software entities and using this defect-free proportion as the threshold percentile.

4.2) Obtaining, for each feature, the value at the threshold percentile. If an entity's value for a feature is greater than that value but its label is non-defective, or is less than that value but its label is defective, the feature's deviation count is incremented by one; the deviation count of a feature over all software entities divided by the total number of software entities gives the feature deviation score of that feature.

4.3) If the deviation score of a feature is greater than the defect-free proportion, the feature is only weakly correlated with the label and is discarded; if the deviation score is not greater than the defect-free proportion, the feature is more strongly correlated with the label and is retained.
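Steps 4.1) to 4.3) can be sketched as follows. The function and parameter names are hypothetical. The text does not spell out exactly how the 2:1 weighting enters the computation; this sketch assumes it applies only to the defect-free proportion, while the deviation count itself is unweighted.

```python
import numpy as np

def select_features(Z, defective, is_original_label, original_weight=2.0):
    """Return (keep_mask, deviation_scores, defect_free_ratio)."""
    Z = np.asarray(Z, dtype=float)
    defective = np.asarray(defective, dtype=bool)
    original = np.asarray(is_original_label, dtype=bool)
    # 4.1) Weighted defect-free proportion: original labels count twice
    # as much as pseudo labels (assumed interpretation of the weighting).
    weights = np.where(original, original_weight, 1.0)
    clean_ratio = weights[~defective].sum() / weights.sum()
    # 4.2) Per-feature value at the threshold percentile.
    thresholds = np.percentile(Z, 100.0 * clean_ratio, axis=0)
    # A deviation: high value on a defect-free entity, or low value on a
    # defective entity.
    too_high = (Z > thresholds) & ~defective[:, None]
    too_low = (Z < thresholds) & defective[:, None]
    scores = (too_high | too_low).mean(axis=0)
    # 4.3) Keep features whose score does not exceed the proportion.
    keep = scores <= clean_ratio
    return keep, scores, clean_ratio
```

The reduced feature matrix used in the next step would then be `Z[:, keep]`.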

5) Performing the spectral clustering of step 2) and the labeling of step 3) again on the feature matrix obtained after the feature selection of step 4) to obtain the final prediction result.
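The overall two-pass flow of steps 2) to 5) can be outlined as below. This is only a control-flow sketch under assumptions: `cluster`, `label` and `select` are hypothetical callables standing in for the clustering, labeling and feature-selection steps, and re-clustering only the unlabeled rows in the second pass is one reading of step 5).

```python
import numpy as np

def two_pass_prediction(Z, has_label, given_defective, cluster, label, select):
    """Predict defect labels for the unlabeled rows of Z.

    cluster(X)        -> boolean cluster mask for the rows of X
    label(X, mask)    -> boolean defect labels for the rows of X
    select(Z, y, has) -> boolean mask over the columns of Z
    Entries of given_defective at unlabeled rows are ignored.
    """
    Z = np.asarray(Z, dtype=float)
    has_label = np.asarray(has_label, dtype=bool)
    unlabeled = ~has_label

    # First pass: cluster the unlabeled rows and derive pseudo labels.
    Z_unl = Z[unlabeled]
    pseudo = label(Z_unl, cluster(Z_unl))

    # Combine original labels with pseudo labels for feature selection.
    defective = np.array(given_defective, dtype=bool)
    defective[unlabeled] = pseudo
    keep = np.asarray(select(Z, defective, has_label), dtype=bool)

    # Second pass: cluster and label again on the selected features only.
    Z_sel = Z_unl[:, keep]
    return label(Z_sel, cluster(Z_sel))
```

If feature selection discards every column, the second pass has nothing to cluster on; a practical implementation would need to handle that case explicitly.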

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A software defect prediction method based on spectral clustering and semi-supervised learning, characterized by comprising the following steps:

1) acquiring original sample data from a database, and performing data preprocessing operation to obtain a processed feature matrix;

2) clustering the unlabeled data through spectral clustering;

3) labeling the clusters obtained in step 2) using a heuristic rule of software defect prediction to obtain pseudo labels; the heuristic rule is: the more complex a software entity is, the more likely it is to be defective; the pseudo labels are the labels obtained by clustering the unlabeled data extracted from the database in step 2) and labeling them in step 3); step 3) is specifically as follows:

a) calculating the row-average attribute value of each of the two clusters, that is, dividing the sum of all attribute values in a cluster by the number of rows in that cluster;

b) if the attributes are positively correlated with software complexity, the cluster with the larger row-average attribute value is labeled as the defective cluster; if the attributes are negatively correlated with software complexity, the cluster with the smaller row-average attribute value is labeled as the defective cluster;

4) calculating a feature deviation score from the data distribution and performing feature selection, wherein the original labeled data are given a greater weight than the pseudo-labeled data; specifically:

a) calculating the proportion of defect-free software entities among all software entities according to the distribution of the label data, and using this defect-free proportion as the threshold percentile;

b) obtaining, for each feature, the value at the threshold percentile, and calculating the feature deviation score from it;

c) if the deviation score of a feature is greater than the defect-free proportion, the feature is only weakly correlated with the label and is discarded; if the deviation score is not greater than the defect-free proportion, the feature is more strongly correlated with the label and is retained;

5) clustering and labeling again on the new feature matrix to obtain the prediction result.

2. The software defect prediction method based on spectral clustering and semi-supervised learning according to claim 1, wherein in step 1), the database is a repository storing sample data for a plurality of software entities; a software entity is the smallest unit in which data are represented in a given database, including but not limited to: class, function, file; the sample data comprise unlabeled data and labeled data, where the unlabeled data contain only the attribute data of a software entity and the labeled data additionally contain a label indicating whether the software entity is defective; the attributes are the features of a software entity stored in the database, including but not limited to: number of lines of code, number of methods in a class, bytecode size of a class.

3. The software defect prediction method based on spectral clustering and semi-supervised learning according to claim 1, wherein in step 1), the data preprocessing includes, but is not limited to, data standardization and missing-data handling; the feature matrix is the matrix whose rows are software entities and whose columns are attributes.

4. The software defect prediction method based on spectral clustering and semi-supervised learning according to claim 1, wherein step 2) specifically comprises the following steps:

a) computing an adjacency matrix W whose entries are the weights of the edges between software entities;

b) computing the degree matrix D of the feature matrix;

c) computing the Laplacian matrix L as L = D - W;

d) performing eigendecomposition of the Laplacian matrix to obtain its eigenvectors;

e) following the normalized cut algorithm, splitting the data into two clusters according to whether each entry of the eigenvector associated with the second-smallest eigenvalue is greater than 0.

5. The software defect prediction method based on spectral clustering and semi-supervised learning according to claim 1, wherein in step 4), the feature deviation score of a feature is the ratio of the number of software entities that violate the heuristic rule of software defect prediction on that feature to the total number of software entities; a violation of the heuristic rule covers the following two cases:

a) the complexity indicated by the feature exceeds the threshold, but the corresponding software entity is determined to be defect-free;

b) the complexity indicated by the feature is less than the threshold, but the corresponding software entity is determined to be defective.

6. The software defect prediction method based on spectral clustering and semi-supervised learning according to claim 1, wherein the new feature matrix is the matrix obtained after the feature selection of step 4); step 5) is specifically: repeating the clustering of step 2) and the labeling of step 3) on the new feature matrix to obtain the final prediction result.