CN112148605B - Software defect prediction method based on spectral clustering and semi-supervised learning - Google Patents

Software defect prediction method based on spectral clustering and semi-supervised learning

Info

Publication number
CN112148605B
Authority
CN
China
Prior art keywords
data
software
label
feature
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202010999235.1A
Other languages
Chinese (zh)
Other versions
CN112148605A (en)
Inventor
陆璐
周璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meizhou Institute Of Technology South China University Of Technology
South China University of Technology SCUT
Original Assignee
Meizhou Institute Of Technology South China University Of Technology
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Meizhou Institute Of Technology South China University Of Technology, South China University of Technology SCUT
Priority to CN202010999235.1A
Publication of CN112148605A
Application granted
Publication of CN112148605B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a software defect prediction method based on spectral clustering and semi-supervised learning, comprising the following steps: 1) acquiring original data and preprocessing it to obtain a processed feature matrix; 2) determining whether the data in the feature matrix has a label: unlabeled data is clustered by spectral clustering, the resulting clusters are labeled using a heuristic rule of software defect prediction to obtain pseudo labels, and the method proceeds to step 3); labeled data goes directly to step 3); 3) calculating a feature deviation score according to the data distribution and performing feature selection, wherein the original labeled data is given a greater weight than the pseudo-labeled data; 4) clustering and labeling again on the new feature matrix to obtain a prediction result. The method reduces the influence of irrelevant and redundant features on the model result, makes use of the information in the project's original labeled data, can effectively improve the accuracy of the software defect prediction result, and increases the applicability of the model.

Description

Software defect prediction method based on spectral clustering and semi-supervised learning
Technical Field
The invention relates to the field of software defect prediction, in particular to a software defect prediction method based on spectral clustering and semi-supervised learning.
Background
Software defect prediction is the process of predicting whether a software entity is defective. As software continues to grow in scale, defect prediction is receiving more and more attention as a technology that can reduce the burden on software testers and optimize the allocation of developers and testers. It has been found that the cost of finding and repairing defects after development is much higher than the cost of finding and repairing them during development. It is therefore important to introduce software defect prediction early in the software lifecycle. However, software defect prediction is still rarely applied in industry. This is mainly because most research in the field relies on supervised learning, whereas in practice labeled data is often scarce, and collecting it is a time-consuming, labor-intensive and error-prone task.
To solve this problem, cross-project software defect prediction has been proposed. It trains a model on projects with sufficient historical data and uses the resulting model to make predictions for a new project. The main problem with this approach is heterogeneity between projects: on the one hand, different projects may collect different attributes; on the other hand, even if a source project and a target project with the same attributes are selected, their attribute distributions still differ, so a classifier trained on the source project is not necessarily suitable for the target project. Unsupervised learning is also used to address the shortage of labeled data, and it mainly consists of the following two steps: 1) clustering the data; 2) labeling the clusters to decide whether each cluster is defective. However, the data owned by an actual project is often a small portion of labeled data together with a large portion of unlabeled data, and both schemes assume that the target project has no historical labeled data at all, so part of the available known information is lost and the performance of the model is reduced.
Furthermore, the main focus of the existing research is to construct a usable model without concern for feature selection. However, the large number of extraneous or redundant features used not only wastes computational resources, but also degrades the performance of the predictive model.
Disclosure of Invention
The main purpose of the invention is to overcome the shortcomings of the prior art and to provide a software defect prediction method based on spectral clustering and semi-supervised learning. The method uses spectral clustering with additional feature selection, which reduces the influence of irrelevant and redundant features on the model result. It also suits the situation commonly found in actual production, where only a small amount of labeled data and a large amount of unlabeled data are available, and it makes use of the information in the project's original labeled data. It can thereby effectively improve the accuracy of the software defect prediction result and increase the applicability of the model.
The purpose of the invention is realized by the following technical scheme:
A software defect prediction method based on spectral clustering and semi-supervised learning comprises the following steps:
1) acquiring original data from a database, and performing a data preprocessing operation to obtain a processed feature matrix;
2) determining whether the data in the feature matrix has a label:
clustering the unlabeled data by spectral clustering; labeling the resulting clusters using a heuristic rule of software defect prediction to obtain pseudo labels, and then proceeding to step 3);
for the labeled data, proceeding directly to step 3);
3) calculating a feature deviation score according to the data distribution and performing feature selection, wherein the weight given to the original labeled data is greater than the weight given to the pseudo-labeled data;
4) clustering and labeling again according to the new feature matrix to obtain a prediction result.
In step 1), the database is a repository storing sample data of a plurality of software entities; a software entity is the smallest unit in which data is represented in a given database, and includes classes, functions and files; the sample data comprises unlabeled data and labeled data: the unlabeled data contains only the attribute data of a software entity, and the labeled data additionally contains a label indicating whether the software entity is defective; the attributes are the features of a software entity stored in the database, including the number of lines of code, the number of methods in a class, and the bytecode size of a class.
In step 1), the data preprocessing operation comprises data standardization and missing-data handling; the feature matrix is a matrix with one row per software entity and one column per attribute.
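The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: missing values are represented as `None`, the population standard deviation is used, and the function name `preprocess` is hypothetical. Filling a gap with the column mean before standardizing is equivalent to filling it with 0 afterwards, because a z-scored column has mean 0.

```python
import math

def preprocess(rows):
    """Column-wise mean imputation plus z-score standardization.

    rows: list of lists, one row per software entity, one column per
    attribute; a missing value is given as None (an assumed convention).
    """
    out = [list(row) for row in rows]
    for c in range(len(rows[0])):
        present = [row[c] for row in rows if row[c] is not None]
        mean = sum(present) / len(present)
        # Population standard deviation of the observed values.
        std = math.sqrt(sum((x - mean) ** 2 for x in present) / len(present))
        for row in out:
            value = mean if row[c] is None else row[c]
            # A constant column carries no information; map it to 0.
            row[c] = 0.0 if std == 0 else (value - mean) / std
    return out

rows = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]
print(preprocess(rows)[1])  # [0.0, 0.0]
```

The second entity sits exactly at the mean of the first attribute and has a missing second attribute, so both of its standardized values are 0.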
In step 2), clustering the unlabeled data by spectral clustering comprises the following specific steps:
computing an adjacency matrix W representing the weights of the edges between the software entities;
calculating the degree matrix D, whose diagonal holds the sum of the edge weights of each software entity;
calculating the Laplacian matrix L as L = D - W;
performing eigendecomposition on the Laplacian matrix L to obtain its eigenvectors;
and, according to the normalized cut algorithm, using whether each entry of the eigenvector corresponding to the second-smallest eigenvalue is greater than 0 as the clustering criterion to obtain two clusters.
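The clustering steps above can be illustrated with a dependency-free Python sketch. Several choices here are assumptions rather than details given in the text: the edge weight is taken as exp(-squared Euclidean distance), the normalized-cut relaxation is taken in its usual form (D - W) y = λ D y, and the eigendecomposition is replaced by power iteration with deflation so that no numerical library is needed (a library eigensolver would normally be used). The function name `spectral_bipartition` is hypothetical, and every entity is assumed to have a positive degree.

```python
import math

def spectral_bipartition(rows, iterations=200):
    """Split rows into two clusters by the sign of the normalized-cut vector."""
    n = len(rows)
    # Adjacency: w_ij = exp(-squared distance), zero on the diagonal.
    W = [[0.0 if i == j else
          math.exp(-sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
          for j in range(n)] for i in range(n)]
    # Degree d_i = sum of the weights of the edges at entity i.
    d = [sum(row) for row in W]
    s = [math.sqrt(x) for x in d]
    # M = (I + D^-1/2 W D^-1/2) / 2 has the same eigenvectors as the
    # normalized Laplacian D^-1/2 (D - W) D^-1/2, with the eigenvalue
    # order reversed and all eigenvalues in [0, 1].
    M = [[(0.5 if i == j else 0.0) + 0.5 * W[i][j] / (s[i] * s[j])
          for j in range(n)] for i in range(n)]
    # Known top eigenvector of M (eigenvalue 1): proportional to sqrt(d).
    norm1 = math.sqrt(sum(d))
    u1 = [x / norm1 for x in s]
    v = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        # Remove the u1 component, apply M, renormalize.
        dot = sum(a * b for a, b in zip(v, u1))
        v = [a - dot * b for a, b in zip(v, u1)]
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    # y = D^-1/2 v solves (D - W) y = lambda D y; threshold it at 0.
    return [v[i] / s[i] > 0 for i in range(n)]

rows = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [2.0, 2.0], [2.1, 2.0], [2.0, 2.1]]
print(spectral_bipartition(rows))
```

Which cluster comes out as `True` carries no meaning by itself; deciding which of the two clusters is the defective one is left to the labeling heuristic.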
In step 2), the heuristic rule is specifically: the more complex a software entity is, the more likely it is to be defective; the pseudo label is the label obtained by clustering on all attributes extracted from the database and then labeling the clusters, specifically as follows:
calculating the row-average attribute value of each of the two clusters, namely dividing the sum of all attribute values in a cluster by the number of rows in that cluster;
if the attributes are positively correlated with software complexity, labeling the cluster with the larger row-average attribute value as the defective cluster; if the attributes are negatively correlated with software complexity, labeling the cluster with the smaller row-average attribute value as the defective cluster.
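The labeling rule above can be sketched as follows. The function and parameter names are hypothetical, both clusters are assumed to be non-empty, and a single flag stands in for the direction of the correlation between the attributes and software complexity, which is a simplification for illustration.

```python
def label_clusters(rows, in_first_cluster, positively_correlated=True):
    """Return one boolean per row; True means 'pseudo-labeled defective'."""
    def average_attribute_value(cluster_rows):
        # Sum of all attribute values in the cluster / number of rows.
        return sum(sum(row) for row in cluster_rows) / len(cluster_rows)

    first = [r for r, f in zip(rows, in_first_cluster) if f]
    second = [r for r, f in zip(rows, in_first_cluster) if not f]
    first_is_larger = (average_attribute_value(first)
                       > average_attribute_value(second))
    # Positive correlation: the larger cluster average marks the defective
    # cluster; negative correlation: the smaller one does.
    first_is_defective = first_is_larger == positively_correlated
    return [f == first_is_defective for f in in_first_cluster]

rows = [[0.1, 0.2], [0.0, 0.3], [2.0, 1.5], [1.8, 2.2]]
membership = [True, True, False, False]
print(label_clusters(rows, membership))  # [False, False, True, True]
```

The second cluster has the larger average (3.75 against 0.3), so with positively correlated attributes its two entities receive the defective pseudo label.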
In step 3), the feature deviation score of a feature is the ratio of the number of software entities that violate the heuristic rule of software defect prediction on that feature to the total number of software entities; a violation of the heuristic rule covers the following two cases:
a) the complexity of the feature exceeds a threshold, but the corresponding software entity is determined to be a non-defective software entity;
b) the complexity of the feature does not exceed the threshold, but the corresponding software entity is determined to be a defective software entity.
Step 3) is specifically as follows:
calculating the defect-free proportion of all software entities according to the distribution of the label data, and taking the defect-free proportion as the threshold percentile;
obtaining, for each feature, the threshold value located at the threshold percentile of that feature's values; if the value of a feature for a software entity is greater than the threshold value but the corresponding label is non-defective, or the value is less than the threshold value but the corresponding label is defective, adding one to the deviation count of that feature; dividing the deviation count of a feature over all software entities by the total number of software entities to obtain the feature deviation score of that feature;
if the deviation score of a feature is greater than the defect-free proportion, the feature is only weakly correlated with the label and is discarded; if the deviation score of a feature is not greater than the defect-free proportion, the feature is more strongly correlated with the label and is retained.
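A small Python sketch of this feature-selection rule follows. It rests on assumptions the text leaves open: the percentile uses the nearest-rank convention, every feature is taken to grow with complexity, all entities are weighted equally, and the function names are hypothetical.

```python
import math

def percentile_value(values, fraction):
    # Nearest-rank percentile (an assumed convention): the smallest value
    # with at least `fraction` of the data at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered) - 1e-9))
    return ordered[rank - 1]

def deviation_scores(rows, defective, clean_ratio):
    # One score per feature: share of entities that violate the heuristic.
    scores = []
    for c in range(len(rows[0])):
        column = [row[c] for row in rows]
        threshold = percentile_value(column, clean_ratio)
        violations = sum(
            1 for value, bad in zip(column, defective)
            if (value > threshold and not bad) or (value < threshold and bad)
        )
        scores.append(violations / len(rows))
    return scores

def select_features(rows, defective):
    # The defect-free proportion serves as percentile and as cut-off.
    clean_ratio = defective.count(False) / len(defective)
    scores = deviation_scores(rows, defective, clean_ratio)
    kept = [c for c, score in enumerate(scores) if score <= clean_ratio]
    return kept, scores

rows = [[1, 9], [2, 8], [3, 10], [4, 7], [7, 2], [8, 1], [9, 3], [10, 4]]
defective = [False, False, False, False, True, True, True, True]
print(select_features(rows, defective))  # ([0], [0.0, 0.875])
```

In this toy data the first feature separates the labels perfectly (score 0), while the second runs against them (score 0.875, above the defect-free proportion of 0.5) and is dropped.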
In step 4), the new feature matrix is the matrix obtained after the feature selection in step 3).
Compared with the prior art, the invention has the following advantages and beneficial effects:
In conventional software testing, collecting the large amount of historical data required is a time-consuming, labor-intensive and error-prone task. To solve this problem, current research can be divided into cross-project software defect prediction and unsupervised learning. However, the data in an actual project is often a small portion of labeled data together with a large portion of unlabeled data, and both schemes assume that the current project has no historical labeled data at all, so part of the available known information is lost and the performance of the model is reduced. Furthermore, existing research focuses mainly on constructing a usable model and pays little attention to feature selection. However, using a large number of irrelevant or redundant features not only wastes computational resources but also generally degrades the performance of the predictive model. Compared with the prior art, the invention provides a software defect prediction method based on spectral clustering and semi-supervised learning with added feature selection, which focuses on effective features and reduces the dimensionality of the data set; at the same time, it uses the labeled data already present in the project to improve the model.
The central idea of the spectral clustering adopted by the invention comes from graph theory. The data is first viewed as discrete points in space connected by edges, where the weight of an edge represents the similarity between two data points and decreases as the distance between them grows, so that the points and edges form an undirected weighted graph. Clustering divides this graph into two disjoint parts, and the guiding idea of the division is that the edge weights within the same cluster should be as high as possible and the edge weights between different clusters should be as low as possible.
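For reference, the standard normalized-cut objective can be written out explicitly. The patent names the normalized cut algorithm without stating its objective, so the formulation below is the usual textbook one and is an assumption about what is intended. It treats each edge weight w_ij as a similarity that decays with distance, such as the exponential weight used in the detailed description, with A and B the two clusters, V the set of all entities, and d_i the degree of entity i.

```latex
\begin{aligned}
\mathrm{cut}(A,B) &= \sum_{i \in A,\; j \in B} w_{ij}, \qquad
\mathrm{assoc}(A,V) = \sum_{i \in A} d_i, \qquad
d_i = \sum_{j} w_{ij}, \\
\mathrm{Ncut}(A,B) &= \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)}
                    + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}, \\
(D - W)\, y &= \lambda\, D\, y .
\end{aligned}
```

Minimizing Ncut exactly is combinatorial; relaxing the cluster indicator to real values leads to the generalized eigenproblem in the last line, and the eigenvector y of the second-smallest eigenvalue is thresholded at 0 to obtain the two clusters.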
Drawings
FIG. 1 is a schematic diagram of an overall scheme of a software defect prediction method based on spectral clustering semi-supervised learning.
FIG. 2 is a flow chart of a software defect prediction method based on semi-supervised learning of spectral clustering according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Referring to fig. 1 and 2, a software defect prediction method based on semi-supervised learning of spectral clustering includes the following steps:
1) acquiring original data from a database, and performing data preprocessing operation to obtain a processed feature matrix, wherein the method specifically comprises the following steps:
1.1) Because different attributes have different value ranges, z-score standardization is applied to each feature so that features with small absolute values are not overshadowed by features with large absolute values and every feature is treated equally by the classifier.
1.2) After z-score standardization each feature has zero mean and unit variance; missing data in the database is replaced by the mean of the existing data of the corresponding feature.
2) Clustering the unlabeled part of the data by spectral clustering according to the feature matrix, specifically as follows:
2.1) Computing the adjacency matrix W of the edge weights between the software entities: the weight between a software entity and itself is zero; the weight between two different software entities is obtained by exponentiating the negative of the sum of the squared differences of all their attributes, that is, w_ij = exp(-sum_k (x_ik - x_jk)^2).
2.2) Calculating the degree matrix D. For any point v in the graph, its degree d is defined as the sum of the weights of all edges connected to it. The degree matrix is therefore a diagonal matrix: only the main diagonal has non-zero values, and the i-th diagonal entry is the degree of the i-th point.
2.3) Calculating the Laplacian matrix L as L = D - W.
2.4) Performing eigendecomposition on the Laplacian matrix L to obtain its eigenvectors.
2.5) According to the normalized cut algorithm, using whether each entry of the eigenvector corresponding to the second-smallest eigenvalue is greater than 0 as the clustering criterion, obtaining two clusters.
3) For the unlabeled part of the data, labeling the clusters obtained in step 2) using the heuristic rule of software defect prediction to obtain pseudo labels, specifically as follows:
3.1) Calculating the row-average attribute value of each of the two clusters, namely dividing the sum of all attribute values in a cluster by the number of rows in that cluster.
3.2) If the attributes are positively correlated with software complexity, the cluster with the larger row-average attribute value is labeled as the defective cluster; if the attributes are negatively correlated with software complexity, the cluster with the smaller row-average attribute value is labeled as the defective cluster.
4) Calculating the feature deviation score according to the data distribution and selecting features, specifically as follows:
4.1) According to the distribution of the label data, with the weight of the originally labeled data set to twice that of the pseudo-labeled data, calculating the defect-free proportion of all software entities and taking it as the threshold percentile.
4.2) Obtaining, for each feature, the threshold value located at the threshold percentile of that feature's values. If the value of a feature for a software entity is greater than the threshold value but the corresponding label is non-defective, or the value is less than the threshold value but the corresponding label is defective, the deviation count of that feature is increased by one; the deviation count of a feature over all software entities divided by the total number of software entities gives the feature deviation score of that feature.
4.3) If the deviation score of a feature is greater than the defect-free proportion, the feature is only weakly correlated with the label and is discarded; if the deviation score is not greater than the defect-free proportion, the feature is more strongly correlated with the label and is retained.
5) Performing the spectral clustering of step 2) and the labeling operation of step 3) again on the feature matrix obtained after the feature selection of step 4) to obtain the final prediction result.
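The weighting in step 4.1) can be read as a weighted defect-free proportion, in which each originally labeled entity counts twice as much as a pseudo-labeled one. The text does not give a formula, so the Python sketch below is one plausible interpretation; the function and parameter names are hypothetical.

```python
def weighted_clean_ratio(true_labels, pseudo_labels,
                         true_weight=2.0, pseudo_weight=1.0):
    """Weighted share of non-defective entities (True means defective)."""
    clean = (true_weight * true_labels.count(False)
             + pseudo_weight * pseudo_labels.count(False))
    total = (true_weight * len(true_labels)
             + pseudo_weight * len(pseudo_labels))
    return clean / total

true_labels = [False, False, False, True]    # labels from the project
pseudo_labels = [False, True, True, True]    # labels from clustering
print(weighted_clean_ratio(true_labels, pseudo_labels))
```

Here the weighted proportion is 7/12, compared with 4/8 when every entity counts equally, so the original labels pull the threshold percentile toward their own distribution.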
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (6)

1. A software defect prediction method of semi-supervised learning based on spectral clustering is characterized by comprising the following steps:
1) acquiring original sample data from a database, and performing data preprocessing operation to obtain a processed feature matrix;
2) clustering the unlabeled data through spectral clustering;
3) performing label operation on the clusters obtained in the step 2) through heuristic rules of software defect prediction to obtain pseudo labels; the heuristic rule is specifically as follows: the more complex a software entity is, the higher the probability that it is defective; the pseudo label is a label obtained by clustering label-free data extracted from a database in the step 2) and performing label operation in the step 3); the step 3) is specifically as follows:
a) respectively calculating the row average attribute values of the two clusters, and dividing the sum of all the attribute values of one cluster by the row number of the corresponding cluster to obtain the average attribute value;
b) if the attribute is positively correlated with the software complexity, the cluster label with the larger row average attribute value is a defective cluster; if the attribute is negatively correlated with the software complexity, labeling the cluster with the smaller row average attribute value as a defective cluster;
4) calculating a feature deviation score according to the data distribution and performing feature selection, wherein the weight occupied by the original label data is greater than the weight occupied by the pseudo label data; the method comprises the following specific steps:
a) calculating the defect-free proportion of all software entities according to the distribution of the label data, and setting the defect-free proportion as a threshold percentile;
b) according to the threshold percentile, the threshold percentile corresponding to each feature can be obtained, and the feature deviation score is calculated according to the threshold percentile;
c) if the deviation score of the feature is larger than the defect-free proportion, the correlation between the feature and the label is not high, and the corresponding feature is discarded; if the deviation score of the feature is not larger than the defect-free proportion, the feature has a higher correlation with the label, and the corresponding feature is retained;
5) clustering and labeling again according to the new feature matrix to obtain a prediction result.
2. The method for predicting software defects based on semi-supervised learning of spectral clustering according to claim 1, wherein in step 1), the database is a warehouse storing sample data of a plurality of software entities; the software entity refers to the smallest unit of data representation in different databases, including but not limited to: class, function, file; the sample data comprises two parts of non-label data and labeled data, the non-label data only comprises attribute data of the software entity, and the labeled data also comprises label data indicating whether the software entity has defects or not; the attributes refer to the characteristics of the database storage about the software entity, including but not limited to: number of code lines, number of methods in a class, bytecode size of a class.
3. The software defect prediction method based on spectral clustering semi-supervised learning as recited in claim 1, wherein in step 1), the data preprocessing operation includes, but is not limited to, data normalization processing, missing data processing; the feature matrix is a matrix formed by taking a software entity as a row and taking an attribute as a column.
4. The software defect prediction method based on spectral clustering semi-supervised learning of claim 1, wherein the step 2) specifically comprises the following steps:
a) computing an adjacency matrix W representing weights of edges between the software entities;
b) calculating a degree matrix D of the feature matrix;
c) calculating the Laplacian matrix as L = D - W;
d) performing eigendecomposition on the Laplacian matrix to obtain eigenvectors;
e) according to the normalized cut algorithm, using whether the eigenvector corresponding to the second-smallest eigenvalue is larger than 0 as the clustering basis to obtain two clusters.
5. The method for predicting software defects based on semi-supervised learning of spectral clustering as recited in claim 1, wherein in the step 4), the feature deviation score is a ratio of the number of software entities of a feature that violates a heuristic rule of software defect prediction to the total number of software entities; the heuristic rule violating the software defect prediction specifically includes the following two cases:
a) the complexity of the feature exceeds a threshold value, but the corresponding software entity is judged to be a defect-free software entity;
b) the complexity of the feature is less than a threshold, but its corresponding software entity is determined to be a defective software entity.
6. The software defect prediction method based on spectral clustering semi-supervised learning of claim 1, wherein the new feature matrix is a matrix obtained after feature selection in step 4); and 5), specifically, repeating the clustering of the step 2) and the labeling operation of the step 3) on the new feature matrix to obtain a final prediction result.
CN202010999235.1A 2020-09-22 2020-09-22 Software defect prediction method based on spectral clustering and semi-supervised learning Expired - Fee Related CN112148605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010999235.1A CN112148605B (en) 2020-09-22 2020-09-22 Software defect prediction method based on spectral clustering and semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010999235.1A CN112148605B (en) 2020-09-22 2020-09-22 Software defect prediction method based on spectral clustering and semi-supervised learning

Publications (2)

Publication Number Publication Date
CN112148605A (en) 2020-12-29
CN112148605B (en) 2022-05-20

Family

ID=73892716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010999235.1A Expired - Fee Related CN112148605B (en) 2020-09-22 2020-09-22 Software defect prediction method based on spectral clustering and semi-supervised learning

Country Status (1)

Country Link
CN (1) CN112148605B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326198A (en) * 2021-06-15 2021-08-31 深圳前海微众银行股份有限公司 Code defect state determination method and device, electronic equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105430099A (en) * 2015-12-22 2016-03-23 湖南科技大学 Collaborative Web service performance prediction method based on position clustering
CN107133176A (en) * 2017-05-09 2017-09-05 武汉大学 A kind of spanned item mesh failure prediction method based on semi-supervised clustering data screening
WO2019143542A1 (en) * 2018-01-21 2019-07-25 Microsoft Technology Licensing, Llc Time-weighted risky code prediction
CN111340086A (en) * 2020-02-21 2020-06-26 同济大学 Method, system, medium and terminal for processing label-free data
CN111338950A (en) * 2020-02-25 2020-06-26 北京高质系统科技有限公司 Software defect feature selection method based on spectral clustering

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106201871B (en) * 2016-06-30 2018-10-02 重庆大学 Based on the Software Defects Predict Methods that cost-sensitive is semi-supervised
US10737904B2 (en) * 2017-08-07 2020-08-11 Otis Elevator Company Elevator condition monitoring using heterogeneous sources
CN110751186B (en) * 2019-09-26 2022-04-08 北京航空航天大学 Cross-project software defect prediction method based on supervised expression learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
一种半监督集成学习软件缺陷预测方法;张肖 等;《小型微型计算机系统》;20181031(第10期);全文 *

Also Published As

Publication number Publication date
CN112148605A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
CN110825644B (en) Cross-project software defect prediction method and system
Hsu et al. Ensemble convolutional neural networks with weighted majority for wafer bin map pattern classification
CN106708738B (en) Software test defect prediction method and system
CN112633601A (en) Method, device, equipment and computer medium for predicting disease event occurrence probability
CN113221960B (en) Construction method and collection method of high-quality vulnerability data collection model
CN113138920B (en) Software defect report allocation method and device based on knowledge graph and semantic role labeling
CN113590396A (en) Method and system for diagnosing defect of primary device, electronic device and storage medium
CN112148595A (en) Software change level defect prediction method for removing repeated change
CN112148605B (en) Software defect prediction method based on spectral clustering and semi-supervised learning
US20210166362A1 (en) Wafer map identification method and computer-readable recording medium
US20220207302A1 (en) Machine learning method and machine learning apparatus
CN113127342B (en) Defect prediction method and device based on power grid information system feature selection
CN112306731B (en) Two-stage defect-distinguishing report severity prediction method based on space word vector
CN112306730B (en) Defect report severity prediction method based on historical item pseudo label generation
CN114186644A (en) Defect report severity prediction method based on optimized random forest
CN110348005B (en) Distribution network equipment state data processing method and device, computer equipment and medium
Silva et al. Classifying feature models maintainability based on machine learning algorithms
Ayesha et al. Review on code examination proficient system in software engineering by using machine learning approach
CN117539920B (en) Data query method and system based on real estate transaction multidimensional data
Shahid et al. Machine learning-based false positive software vulnerability analysis
CN116821672A (en) Semi-supervision-based data management method for operation and maintenance fault samples
CN115033493A (en) Workload sensing instant software defect prediction method based on linear programming
CN116401153A (en) Software defect prediction method based on cluster fusion oversampling
BR102022016587A2 (en) METHOD FOR IDENTIFYING, MAPPING, CLASSIFYING AND DETERMINING THE SEVERITY OF WEAR MECHANISMS IN A METALLIC PLATE BASED ON COMPUTER VISION
Dash Design of data scoring model for big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220520