CN106991047B

CN106991047B - Method and system for predicting object-oriented software defects

Info

Publication number: CN106991047B
Application number: CN201710187847.9A
Authority: CN
Inventors: 朱朝阳; 韩丽芳; 张信明; 王志宏; 陈相舟; 应欢; 李怡康; 李梦涛
Original assignee: University of Science and Technology of China USTC; State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI; State Grid Hebei Electric Power Co Ltd
Current assignee: University of Science and Technology of China USTC; State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI; State Grid Hebei Electric Power Co Ltd
Priority date: 2017-03-27
Filing date: 2017-03-27
Publication date: 2020-11-17
Anticipated expiration: 2037-03-27
Also published as: CN106991047A

Abstract

The invention discloses a method for predicting object-oriented software defects, which comprises the following steps: processing the training data set to obtain effective characteristic attributes, and establishing a new training data set according to the effective characteristic attributes; training a Support Vector Machine (SVM) according to the new training data set, and performing parameter optimization through Particle Swarm Optimization (PSO), wherein the parameters comprise: the penalty factor and the Gaussian kernel bandwidth; and performing defect prediction on the prediction data by using an SVM model according to the optimized parameters, and acquiring a prediction result. The invention has the following beneficial effects: the training data set is processed to obtain effective characteristic attributes, and a new training data set is established according to the effective characteristic attributes, so that the curse of dimensionality is effectively avoided, processing cost is reduced, and data processing speed is increased; and PSO is used to optimize the parameters and select the optimal ones, so that the accuracy of defect prediction is improved.

Description

Method and system for predicting object-oriented software defects

Technical Field

The present invention relates to the field of software defect prediction, and more particularly, to a method and system for predicting object-oriented software defects.

Background

Over the long-term development of information system software, object-oriented design techniques have become the main development approach. In essence, object-oriented system design is the process of finding solutions for the software's structural and functional models. As software structures and models grow more complex and the number of objects increases, the security problems of the whole information system's software become more serious. Finding and fixing software defects and bugs as early as possible during development helps safeguard national production and normal market operation, and is also an important way to reduce later testing cost and time and to improve software quality.

Software defect prediction techniques can be divided into static and dynamic defect prediction techniques. Existing static defect prediction techniques are mostly based on machine learning algorithms, such as the classification algorithms of decision trees, random forests, naive Bayes, BP neural networks, and artificial immune systems. All of these methods have some defect prediction capability, but each has its problems. For example, decision trees are prone to overfitting and ignore the correlation between feature attributes; naive Bayes requires known prior probabilities and places high demands on attribute independence; neural networks easily fall into local optima or underfit and, like Bayesian models, require defect-related factors to be obtained from expert experience, and their computational efficiency is low; support vector machines have good learning and generalization capabilities, but there is no uniform and efficient method for setting their optimal parameters. Moreover, when handling object-oriented software, these algorithms inevitably have to process a large number of class and object feature attributes used to measure the software, which causes the curse of dimensionality, makes detection time too long, and reduces the practicality of the prediction model.

Therefore, it is necessary to provide a software defect prediction method to improve the accuracy of the software prediction result.

Disclosure of Invention

The invention provides a method for predicting object-oriented software defects, which is used for solving the problem of low accuracy of software prediction results.

In order to solve the above problem, according to an aspect of the present invention, there is provided a method for predicting object-oriented software defects, the method comprising:

processing a training data set to obtain effective characteristic attributes, and establishing a new training data set according to the effective characteristic attributes, wherein the training data set comprises: a defective data set and a non-defective data set;

training a Support Vector Machine (SVM) according to the new training data set, and performing parameter optimization through a Particle Swarm Optimization (PSO), wherein the parameters comprise: penalty factor and gaussian kernel bandwidth; and

performing defect prediction on the prediction data by using an SVM model according to the optimized parameters, and acquiring a prediction result.

Preferably, the processing the training data set to obtain effective characteristic attributes, and establishing a new training data set according to the effective characteristic attributes includes:

normalizing the weight of the characteristic attribute corresponding to each sample data in the training data set;

randomly selecting a sample data, and respectively selecting a sample data of the same type and a sample data of a different type with the minimum Euclidean distance from the sample data;

calculating and updating the weight of each characteristic attribute corresponding to the sample data according to a weight calculation formula, wherein the weight calculation formula is as follows:

wherein t_k is a feature attribute, Weight(t_k) is the weight corresponding to feature attribute t_k, T_i is the sample data, T_miss is the sample data of a different type with the smallest Euclidean distance from T_i, T_hit is the sample data of the same type with the smallest Euclidean distance from T_i, D(T_i, T_miss, t_k) is the Euclidean distance between T_i and T_miss on feature attribute t_k, D(T_i, T_hit, t_k) is the Euclidean distance between T_i and T_hit on feature attribute t_k, max(D(t_k)) is the maximum Euclidean distance between all samples on attribute t_k, and n is the number of iterations;

repeating the above two steps a preset number of times, and calculating the average weight of each characteristic attribute;

comparing the average weight of each characteristic attribute with a preset threshold, and selecting the characteristic attribute with the average weight larger than the preset threshold as an effective characteristic attribute; and

selecting data corresponding to the effective characteristic attributes to establish a new training data set.

Preferably, the training a support vector machine SVM according to the new training data set and performing parameter optimization through a Particle Swarm Optimization (PSO) algorithm includes:

initializing settings for data, wherein the data comprises: a first learning factor, a second learning factor, an inertial weight, an iteration number, and a particle swarm, the particle swarm comprising: the position and velocity of the particle;

calculating the prediction accuracy of the SVM model according to the position of each particle to serve as fitness;

comparing the fitness with the individual fitness extreme value, and if the fitness is superior to the individual fitness extreme value, updating the individual fitness extreme value of the particle and the best position corresponding to the fitness by using the fitness; if the fitness is better than the individual fitness extreme value of all other particles and the group fitness extreme value in the previous iteration, updating the group fitness extreme value in the current iteration and the best position corresponding to the fitness extreme value by using the fitness; and

judging whether the maximum iteration times is reached or the population fitness extreme value is larger than a preset population fitness extreme value,

if the maximum iteration times is reached or the group fitness extreme value is not less than the preset group fitness extreme value, outputting the best position corresponding to the group fitness extreme value at the moment as an optimal parameter value;

and if the maximum iteration times are not reached and the population fitness extreme value is smaller than the preset population fitness extreme value, updating the position and the speed of each particle, returning to the step to calculate the prediction accuracy of the SVM model according to the position of each particle to be used as the fitness until the maximum iteration times are reached or the population fitness extreme value is not smaller than the preset population fitness extreme value, and outputting the best position corresponding to the population fitness extreme value at the moment as the optimal parameter value.

Preferably, wherein said updating the position and velocity of each particle comprises:

wherein V_i^{k+1} is the updated velocity of particle i at the (k+1)th iteration, w^k is the inertial weight at the kth iteration, V_i^k is the velocity of particle i at the kth iteration, c₁ and c₂ are fixed parameters, PBest_i^k is the best position corresponding to the individual fitness extremum of particle i at the kth iteration, S_i^k is the position of particle i at the kth iteration, GBest^k is the best position corresponding to the group fitness extremum at the kth iteration, S_i^{k+1} is the position of particle i at the (k+1)th iteration, num is the number of particles, close_k is the degree of population aggregation at the kth iteration, the distance term in close_k is the Euclidean distance of each particle from the mean position (center of gravity) of the population, |S_max − S_min| is the maximum diameter length of the solution space, w is the inertial weight, w_min is the lower bound of the inertial weight, and w_max is the upper bound of the inertial weight.

Preferably, before the predicting the defect of the prediction data by using the SVM model according to the optimized parameter and obtaining the prediction result, the method further comprises:

and performing defect prediction on the test data set by utilizing an SVM model according to the optimized parameters, and verifying the accuracy of the optimized parameters.

According to another aspect of the present invention, there is provided a system for predicting object-oriented software defects, the system comprising: a data processing unit, a parameter optimization unit and a defect prediction unit,

the data processing unit is configured to process a training data set, acquire an effective characteristic attribute, and establish a new training data set according to the effective characteristic attribute, where the training data set includes: a defective data set and a non-defective data set;

the parameter optimization unit is configured to train a Support Vector Machine (SVM) according to the new training data set, and perform parameter optimization through a Particle Swarm Optimization (PSO), where the parameters include: penalty factor and gaussian kernel bandwidth; and

the defect prediction unit is used for predicting the defects of the prediction data by utilizing an SVM model according to the optimized parameters and acquiring a prediction result.

Preferably, the processing the training data set by the data processing unit to obtain effective feature attributes, and establishing a new training data set according to the effective feature attributes includes:

Calculating the weight of each characteristic attribute corresponding to a plurality of sample data according to the preset times, and calculating the average weight of each characteristic attribute;

Preferably, the training of the support vector machine SVM by the parameter optimization unit according to the new training data set and the parameter optimization by the particle swarm optimization PSO include:

Preferably, wherein the system further comprises:

a verification unit, configured to, before the defect prediction unit 803 performs defect prediction on the prediction data by using the SVM model according to the optimized parameters and obtains a prediction result, perform defect prediction on the test data set by using the SVM model according to the optimized parameters, and verify the accuracy of the optimized parameters.

The invention has the beneficial effects that:

1. The technical scheme of the invention processes the training data set to obtain the effective characteristic attributes, and establishes a new training data set according to the effective characteristic attributes, thereby effectively avoiding the curse of dimensionality, reducing processing cost and improving data processing speed.

2. According to the technical scheme, the SVM is trained according to the new training data set, parameter optimization is performed through a Particle Swarm Optimization (PSO), the optimal parameter is selected, and accuracy of defect prediction is improved.

Drawings

A more complete understanding of exemplary embodiments of the present invention may be had by reference to the following drawings in which:

FIG. 1 is a flow diagram of a method 100 for predicting object-oriented software defects, according to an embodiment of the present invention;

FIG. 2 is a flow diagram of a method 200 of processing a training data set according to an embodiment of the present invention;

FIG. 3 is a flow chart of a method 300 for parameter optimization using PSO according to an embodiment of the present invention;

FIG. 4 is a graph comparing accuracy across data sets according to embodiments of the present invention;

FIG. 5 is a graph comparing precision across data sets according to embodiments of the present invention;

FIG. 6 is a comparison graph of recall on various data sets according to an embodiment of the present invention;

FIG. 7 is a graph comparing F values on various data sets according to an embodiment of the present invention; and

FIG. 8 is a block diagram illustrating a system 800 for predicting object-oriented software bugs, according to an embodiment of the present invention.

Detailed Description

The exemplary embodiments of the present invention will now be described with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein, which are provided so that this disclosure is thorough and complete and fully conveys the scope of the present invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, the same units/elements are denoted by the same reference numerals.

Unless otherwise defined, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Further, it will be understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.

FIG. 1 is a flow diagram of a method 100 for predicting object-oriented software defects, according to an embodiment of the present invention. As shown in fig. 1, the method 100 processes the training data set to obtain the effective characteristic attributes and establishes a new training data set according to them, then trains a Support Vector Machine (SVM) according to the new training data set and performs parameter optimization through the Particle Swarm Optimization (PSO) algorithm, and finally performs defect prediction on the prediction data by using an SVM model according to the optimized parameters and obtains the prediction result, thereby improving the accuracy of software defect prediction. The method 100 starts at step 101, in which a training data set is processed to obtain valid feature attributes and a new training data set is established according to the valid feature attributes, where the training data set includes: a defective data set and a non-defective data set.

FIG. 2 is a flow chart of a method 200 for processing a training data set according to an embodiment of the present invention. As shown in fig. 2, the method 200 for processing a training data set starts from step 201, in which the weight value of the feature attribute corresponding to each sample data in the training data set is normalized. In an embodiment of the present invention, the training data set is set to T = {T₁, T₂, …, T_i}, T_i = (t₁, t₂, …, t_k), wherein t_k is a feature attribute. The weight value of each feature attribute of the samples in the training data set is initialized to 0, attributes represented by non-numerical values are digitized, and all numerical values are normalized according to the max-min normalization method.

Preferably, in step 202, a sample data is randomly selected, and a sample data of the same type and a sample data of a different type having the smallest Euclidean distance from the sample data are respectively selected. In an embodiment of the invention, a sample T_i is randomly selected from the training data set, and the same-type sample data T_hit and the different-type sample data T_miss with the smallest Euclidean distance from T_i are respectively selected, wherein the type means whether or not there is a defect. If T_i is a defective sample, then T_hit is a defective sample and T_miss is a defect-free sample; if T_i is a defect-free sample, then T_hit is a defect-free sample and T_miss is a defective sample.

Preferably, in step 203, the weight value of each feature attribute corresponding to the sample data is calculated and updated according to a weight value calculation formula, where the weight value calculation formula is:

wherein t_k is a feature attribute, Weight(t_k) is the weight corresponding to feature attribute t_k, T_i is the sample data, T_miss is the sample data of a different type with the smallest Euclidean distance from T_i, T_hit is the sample data of the same type with the smallest Euclidean distance from T_i, D(T_i, T_miss, t_k) is the Euclidean distance between T_i and T_miss on feature attribute t_k, D(T_i, T_hit, t_k) is the Euclidean distance between T_i and T_hit on feature attribute t_k, max(D(t_k)) is the maximum Euclidean distance between all samples on attribute t_k, and n is the number of iterations.
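The weight calculation formula itself is not reproduced in this text. One form consistent with the symbol definitions above is the standard Relief update, given here only as a hedged reconstruction and not as the formula of the original filing:

$$\mathrm{Weight}(t_k) \leftarrow \mathrm{Weight}(t_k) - \frac{D(T_i, T_{hit}, t_k)}{n \cdot \max(D(t_k))} + \frac{D(T_i, T_{miss}, t_k)}{n \cdot \max(D(t_k))}$$

Under this form, an attribute gains weight when the nearest sample of a different type is far away on that attribute and loses weight when the nearest sample of the same type is far away on it.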

Preferably, the above two steps are repeated according to a preset number of times in step 204, and an average weight of each feature attribute is calculated.

Preferably, in step 205, the average weight of each feature attribute is compared with a preset threshold, and the feature attribute with the average weight greater than the preset threshold is selected as the effective feature attribute. In the embodiment of the invention, the average weight of each feature attribute is sorted, and then the feature attributes with the average weight larger than a preset threshold are selected for reservation.

Preferably, in step 206, data corresponding to the valid feature attributes are selected to create a new training data set. In the embodiment of the invention, the retained feature attributes form a feature attribute set, and the training data corresponding to the feature attribute set are then selected for the training and classification of the support vector machine model.
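Steps 201 to 206 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention (which the text says was written in MATLAB): the function names are hypothetical, the update is the standard Relief rule, and because the data are min-max normalized first, the per-attribute maximum distance is taken as 1 and the max(D(t_k)) divisor is omitted. Each class is assumed to contain at least two samples.

```python
import random

def min_max_normalize(rows):
    """Scale every attribute column to [0, 1]; constant columns become 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
            for row in rows]

def nearest(rows, idx, candidates):
    """Candidate index with the smallest (squared) Euclidean distance to rows[idx]."""
    def dist2(j):
        return sum((a - b) ** 2 for a, b in zip(rows[idx], rows[j]))
    return min(candidates, key=dist2)

def relief_weights(rows, labels, n_iter, seed=0):
    """Accumulate Relief weights over n_iter randomly drawn samples."""
    rng = random.Random(seed)
    rows = min_max_normalize(rows)
    n_attr = len(rows[0])
    weights = [0.0] * n_attr  # every weight starts at 0
    for _ in range(n_iter):
        i = rng.randrange(len(rows))
        same = [j for j in range(len(rows)) if j != i and labels[j] == labels[i]]
        diff = [j for j in range(len(rows)) if labels[j] != labels[i]]
        hit, miss = nearest(rows, i, same), nearest(rows, i, diff)
        for k in range(n_attr):
            # Reward separation from the nearest miss, penalize separation
            # from the nearest hit; dividing by n_iter yields an average.
            d_miss = abs(rows[i][k] - rows[miss][k])
            d_hit = abs(rows[i][k] - rows[hit][k])
            weights[k] += (d_miss - d_hit) / n_iter
    return weights

def select_attributes(weights, threshold):
    """Indices of attributes whose weight exceeds the preset threshold."""
    return [k for k, w in enumerate(weights) if w > threshold]

# Toy data: attribute 0 separates the two classes, attribute 1 does not.
rows = [[0.0, 5.0], [0.1, 1.0], [0.9, 4.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]
weights = relief_weights(rows, labels, n_iter=20)
kept = select_attributes(weights, threshold=0.0)
```

On this toy data only attribute 0 is kept, so the new training data set would contain just that column.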

Preferably, in step 102, a support vector machine SVM is trained according to the new training data set, and parameter optimization is performed through particle swarm optimization (PSO), wherein the parameters include: the penalty factor and the Gaussian kernel bandwidth. Fig. 3 is a flowchart of a method 300 for performing parameter optimization using PSO according to an embodiment of the present invention. As shown in fig. 3, the method 300 starts at step 301, in which the data are initialized, wherein the data include: a first learning factor, a second learning factor, an inertial weight, an iteration number, and a particle swarm, the particle swarm comprising the position and velocity of the particles. In the embodiment of the invention, the parameters optimized by the PSO algorithm are the penalty factor C and the bandwidth of the Gaussian kernel. Data initialization includes initializing a velocity interval, the learning factors c₁ and c₂, the inertial weight w, the iteration number n, and the particle swarm, wherein the particle swarm is expressed as S = {(s_{1_c}, s_{1_σ}), (s_{2_c}, s_{2_σ}), ..., (s_{num_c}, s_{num_σ})} and includes the position (s_{i_c}, s_{i_σ}) and velocity (v_{i_c}, v_{i_σ}) of each particle and the population size num. The position of a particle corresponds to the penalty factor and Gaussian kernel bandwidth of the SVM model, i.e. the penalty factor s_{i_c} and the Gaussian kernel bandwidth s_{i_σ}.

Preferably, the prediction accuracy of the SVM model is calculated as the fitness according to the position of each particle in step 302. In an embodiment of the invention, the fitness of the current iteration is calculated for the position of each particle. In the present invention, the prediction accuracy of the SVM model obtained with the current penalty factor s_{i_c} and Gaussian kernel bandwidth s_{i_σ} is used as the return value of the fitness function.

Preferably, the fitness is compared with an individual fitness extreme value in step 303, and if the fitness is better than the individual fitness extreme value, the individual fitness extreme value of the particle and the best position corresponding to the fitness are updated by using the fitness; and if the fitness is better than the individual fitness extreme value of all other particles and the group fitness extreme value in the previous iteration, the group fitness extreme value in the current iteration and the best position corresponding to the fitness extreme value are updated by using the fitness.

Preferably, in step 304, it is determined whether the maximum iteration number is reached or the population fitness extreme value is greater than a preset population fitness extreme value, and if the maximum iteration number is reached or the population fitness extreme value is not less than the preset population fitness extreme value, the step 305 is performed; if the maximum iteration number is not reached and the population fitness extreme value is smaller than the preset population fitness extreme value, step 306 is entered.

Preferably, the best position corresponding to the extreme value of the population fitness at this time is output as the optimal parameter value in step 305.

Preferably, the position and velocity of each particle are updated in step 306, and the method returns to step 302. If the maximum iteration number is not reached and the population fitness extreme value is less than the preset population fitness extreme value, the position and the velocity of each particle are updated and the method returns to step 302, until the maximum iteration number is reached or the population fitness extreme value is not less than the preset population fitness extreme value, whereupon the best position corresponding to the population fitness extreme value at that moment is output as the optimal parameter value. In the embodiment of the present invention, if the fitness of position S_i^k is superior to the individual fitness extremum fitness(PBest_i), the best position corresponding to the individual fitness extremum of the particle is updated with that position; if it is also superior to the individual extrema of all other particles and to the population extremum fitness(GBest^{k-1}) of the previous iteration, the best position corresponding to the population extremum in the current iteration is updated with that position. If the maximum number of iterations is reached or the current population extremum fitness(GBest^k) meets the accuracy requirement, the iteration can be stopped, and the best position GBest^k corresponding to the population extremum is output as the optimal parameters for training the SVM model.
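Steps 301 to 306 can be sketched in Python as follows. This is a minimal sketch, not the implementation of the invention: the velocity update is the textbook PSO rule with random factors r1 and r2 (an assumption, since the update formula is not reproduced in this text), the inertia weight w is held fixed here, positions are clamped to the search bounds, and `toy_fitness` is a hypothetical stand-in for the SVM prediction accuracy obtained with a given penalty factor and Gaussian kernel bandwidth.

```python
import random

def pso(fitness, bounds, num=10, max_iter=50, target=None,
        w=1.0, c1=1.6, c2=1.5, seed=0):
    """Maximize fitness over the box given by bounds; returns (gbest, gbest_fit)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Step 301: initialize positions and velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(num)]
    vel = [[0.0] * dim for _ in range(num)]
    # Steps 302-303: initial fitness, individual and population extrema.
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(num), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(max_iter):
        # Step 304: stop once the preset population fitness is reached.
        if target is not None and gbest_fit >= target:
            break
        for i in range(num):
            # Step 306: update velocity, then position (clamped to bounds).
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            # Steps 302-303 again: re-evaluate and update the extrema.
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    # Step 305: best position found is the output parameter pair.
    return gbest, gbest_fit

def toy_fitness(p):
    """Hypothetical stand-in for SVM accuracy; peaks at C = 10, sigma = 0.5."""
    c, sigma = p
    return 1.0 / (1.0 + (c - 10.0) ** 2 + (sigma - 0.5) ** 2)

bounds = [(0.1, 100.0), (0.01, 10.0)]  # assumed search ranges for (C, sigma)
best, best_fit = pso(toy_fitness, bounds)
```

In a real setting the fitness callable would train an SVM with the candidate parameters and return its prediction accuracy, which makes each evaluation expensive; that is why the loop also stops as soon as the preset fitness is met.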

In the particle update, the Euclidean distance of each particle is measured from the mean position (center of gravity) of the population, |S_max − S_min| is the maximum diameter length of the solution space, w is the inertial weight, w_min is the lower bound of the inertial weight, and w_max is the upper bound of the inertial weight. In an embodiment of the present invention, c₁ and c₂ mainly influence the balance between the individual memory and the population memory of the particles; when the velocity and position of the particles are updated, c₁ is set to 1.6 and c₂ to 1.5. The inertia weight w mainly influences the balance between the particles' historical memory and their current state. If its value is too large, particles approaching the optimal solution emphasize global search, neglect local search, and overshoot the optimal solution; if it is too small, the moving speed is too slow and the particles cannot approach the optimal solution as fast as possible. The invention therefore provides a dynamic inertia weight particle swarm optimization algorithm in which the moving speed is gradually reduced as the population quickly concentrates near the optimal solution, so that each particle can search the fitness of its surrounding space more finely and accurately, enhancing the overall performance of the standard PSO algorithm.

A variable close is defined to represent the aggregation degree of the population at each iteration k. The value range of close is (0,1). It is computed from the Euclidean distance of each particle from the average position (center of gravity) of the population and from |S_max − S_min|, the maximum diameter length of the solution space. close describes how near the particle swarm is to the optimal solution space after each iteration: the larger its value, the more dispersed the particle swarm, and the smaller its value, the more concentrated. Once the particles have aggregated, the value of the inertial weight w needs to be gradually reduced; by quantifying the aggregation degree of the particles, the aggregation degree can be mapped onto the solution space of the inertial weight, giving the value of the inertial weight at different degrees of concentration. The formula for w is set accordingly, so that after quickly approaching the optimal space the method refines the local search in order to converge as soon as possible and obtain the optimal solution. w_min and w_max are the lower and upper bounds of w, set to 0.8 and 1.2, respectively.
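The aggregation degree and its mapping to the inertia weight can be sketched as follows. Both formulas are assumptions made for illustration, since the original expressions are not reproduced in this text: close is taken as the mean particle-to-centroid distance divided by the solution-space diameter, and w is taken as a linear map from close onto [w_min, w_max]. The function names are hypothetical.

```python
import math

def aggregation_degree(positions, s_min, s_max):
    """Mean distance of the particles from their centroid, divided by
    the diameter |S_max - S_min| of the solution space."""
    num, dim = len(positions), len(positions[0])
    center = [sum(p[d] for p in positions) / num for d in range(dim)]
    mean_dist = sum(math.dist(p, center) for p in positions) / num
    return mean_dist / math.dist(s_max, s_min)

def inertia_weight(close, w_min=0.8, w_max=1.2):
    """Assumed linear map: a dispersed swarm (large close) gets a weight
    near w_max, a concentrated swarm (small close) a weight near w_min."""
    return w_min + (w_max - w_min) * close

# Two particles in a box whose diagonal has length 5.
close = aggregation_degree([[0.0, 0.0], [2.0, 0.0]], s_min=[0.0, 0.0], s_max=[4.0, 3.0])
w = inertia_weight(close)
```

Recomputing w from the current swarm at every iteration makes the particles slow down as the swarm contracts, which matches the behavior described above.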

Preferably, in step 103, defect prediction is performed on the prediction data by using an SVM model according to the optimized parameters, and a prediction result is obtained. Preferably, before step 103, the method further includes: performing defect prediction on the test data set by using the SVM model according to the optimized parameters, and verifying the accuracy of the optimized parameters.

In the embodiment of the invention, in order to verify the performance of the defect prediction method provided by the invention, the advantages of the proposed model are illustrated by its performance on four indexes (accuracy, precision, recall, and F value) on four data sets. The proposed model was implemented in MATLAB and compared with LE-SVM and LE-KNN. Four experimental data sets conforming to the CK metrics were used to verify the effectiveness of the defect prediction method. The first is the class-level data for KC1 provided by the National Aeronautics and Space Administration (NASA), which comprises 145 samples in total with 89 feature attributes, of which 60 samples are defect-free and 85 are defective. The second is the eclipse2.0 data set, based on real data from the open-source Eclipse project, with 6728 samples comprising 975 defective samples and 5753 defect-free samples. The third is the eclipse3.0 data set, comprising 9470 samples, of which 1522 are defective. The last is the ant-1.7 data set, which has 745 samples, and 166 samples without defects. For the eclipse and ant data sets, defect status is represented by the number of bugs, so it is first converted to the logical variables 1 and 0, representing defective and defect-free. Meanwhile, because the manifold learning algorithm suffers from data point loss during high-dimensional dimensionality reduction, 700 samples are randomly selected from the last three data sets and randomly divided into two groups of equal size, used as the training set and the test set, respectively.
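The label conversion and the random split described above can be sketched in Python as follows. The function names are hypothetical, and treating any bug count above zero as defective is an assumption about how the counts were mapped to the logical variables.

```python
import random

def to_labels(bug_counts):
    """Map a bug count to 1 (defective) or 0 (defect-free)."""
    return [1 if n > 0 else 0 for n in bug_counts]

def sample_and_split(samples, k=700, seed=0):
    """Randomly draw k samples and divide them into two equal groups."""
    rng = random.Random(seed)
    chosen = rng.sample(samples, k)  # drawn without replacement, in random order
    half = k // 2
    return chosen[:half], chosen[half:]

labels = to_labels([0, 3, 1, 0])
train, test = sample_and_split(list(range(1000)))
```

Because `random.Random.sample` draws without replacement, the two groups never share a sample.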

Table 1 is the confusion matrix of actual defect status and prediction results. As shown in Table 1, the total number of test samples is N1 + N2 + N3 + N4, the number of correctly predicted samples is N1 + N4, and the number of incorrectly predicted samples is N2 + N3.

TABLE 1 Confusion matrix of actual defect status and prediction results

Accuracy represents the ratio of the number of samples with correct prediction results (defective modules are successfully detected as defective, and non-defective modules are not misjudged) to the total number of samples to be predicted. With the counts of Table 1:

$$\mathrm{Accuracy} = \frac{N1 + N4}{N1 + N2 + N3 + N4}$$

Precision represents the ratio of the number of samples that are actually defective and predicted to be defective to the number of all samples predicted to be defective.

Recall represents the ratio of the number of samples that are actually defective and predicted to be defective to the number of all actually defective samples.

The F value is the harmonic mean of precision and recall:

$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
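The four indexes can be computed from a confusion matrix as sketched below. The argument names `tp`, `fn`, `fp`, and `tn` are assumptions introduced for illustration: the text establishes that N1 and N4 are the correct predictions, but which of N2 and N3 is the false positive count cannot be recovered from it.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F value from confusion-matrix counts.

    tp: defective, predicted defective      fn: defective, predicted defect-free
    fp: defect-free, predicted defective    tn: defect-free, predicted defect-free
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_value = 2 * precision * recall / (precision + recall)
    else:
        f_value = 0.0
    return accuracy, precision, recall, f_value

accuracy, precision, recall, f_value = metrics(tp=40, fn=10, fp=20, tn=30)
```

For these example counts the accuracy is 0.7, the precision 2/3, the recall 0.8, and the F value 8/11.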

FIG. 4 is a graph comparing accuracy across data sets according to embodiments of the present invention. As shown in fig. 4, the prediction method of the present invention outperforms the comparison algorithms on all four data sets. This is mainly because the method uses the Relief algorithm to remove attributes that are unfavorable for classification, such as attributes with too little numerical variation, which makes the prediction result more accurate; at the same time, the penalty factor and Gaussian kernel bandwidth that optimize the performance of the SVM training model are obtained through PSO, further improving the accuracy of the prediction result. The parameters of the comparison algorithms can only be set by experience and lack an optimization process, so their results are some distance from the optimal solution and their predictions fall short of the proposed model: the accuracy of the proposed model on the four data sets is 8.2-12.2% higher than that of the comparison models.

FIG. 5 is a graph comparing precision across data sets according to an embodiment of the present invention. FIG. 6 is a graph comparing recall across data sets according to an embodiment of the present invention. FIG. 7 is a graph comparing F values across data sets according to an embodiment of the present invention. FIG. 5, FIG. 6, and FIG. 7 respectively compare the precision, recall, and F value of the present method and the comparison algorithms on the four data sets. The three indexes of the LE-SVM and LE-KNN algorithms are similar on the latter three data sets, and their difference on CL-KC1 comes mainly from recall. The similarity arises because manifold learning, when it reduces the feature dimension, does not retain a subset of the original feature attributes but stores the main information of the original data set in a newly generated low-dimensional data set, and for such data the choice of prediction method has less influence on the prediction result. This also explains why the difference between the accuracies of the two algorithms in FIG. 4 is relatively small.
The prediction method provided by the invention uses the efficient Relief algorithm, which is well suited to binary classification problems, for dimension reduction. Because the solution space of the penalty factor and the Gaussian kernel bandwidth is optimized, the algorithm has corresponding optimal parameters for each test set, and the improved PSO algorithm with dynamic inertia weight avoids, as far as possible, the problems of fixed convergence speed and local optima, so that the best defect prediction result is obtained. Averaging each index over the four data sets for the three methods shows that, compared with the LE-SVM algorithm, the prediction method achieves 9.9% higher precision, 5.6% higher recall, and a 7.7% higher F value.

FIG. 8 is a block diagram illustrating a system 800 for predicting object-oriented software defects according to an embodiment of the present invention. As shown in FIG. 8, the system 800 for predicting object-oriented software defects includes a data processing unit 801, a parameter optimization unit 802, and a defect prediction unit 803. Preferably, the data processing unit 801 processes a training data set to obtain effective feature attributes and establishes a new training data set according to the effective feature attributes, where the training data set includes a defective data set and a non-defective data set. Preferably, processing the training data set in the data processing unit 801 to obtain effective feature attributes and establishing a new training data set according to the effective feature attributes includes:

Preferably, the parameter optimization unit 802 trains a support vector machine (SVM) according to the new training data set and performs parameter optimization through particle swarm optimization (PSO), where the parameters include a penalty factor and a Gaussian kernel bandwidth. Preferably, training the SVM according to the new training data set in the parameter optimization unit 802 and performing parameter optimization through PSO includes:

Preferably, the defect prediction unit 803 performs defect prediction on the prediction data by using an SVM model according to the optimized parameters and obtains a prediction result. Preferably, the system further includes a verification unit 804 configured to, before the defect prediction unit 803 performs defect prediction on the prediction data by using the SVM model according to the optimized parameters and obtains a prediction result, perform defect prediction on a test data set by using the SVM model according to the optimized parameters and verify the accuracy of the optimized parameters.

The system 800 for predicting object-oriented software defects according to the embodiment of the present invention corresponds to the method 100 for predicting object-oriented software defects according to another embodiment of the present invention, and will not be described herein again.

The invention has been described with reference to a few embodiments. However, other embodiments of the invention than the one disclosed above are equally possible within the scope of the invention, as would be apparent to a person skilled in the art from the appended patent claims.

Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the [ device, component, etc ]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

Claims

1. A method for predicting object-oriented software defects, the method comprising:

processing an original training data set to obtain effective characteristic attributes, and establishing an updated training data set according to the effective characteristic attributes, wherein the updated training data set comprises: a defective data set and a non-defective data set;

training a Support Vector Machine (SVM) according to the updated training data set, and performing parameter optimization through a Particle Swarm Optimization (PSO), wherein the parameters comprise: penalty factor and gaussian kernel bandwidth; and

performing defect prediction on the prediction data by utilizing an SVM model according to the optimized parameters, and acquiring a prediction result;

wherein, the training of the SVM according to the updated training data set and the parameter optimization through PSO comprise:

step 1, initializing and setting data, wherein the data comprises: a first learning factor, a second learning factor, an inertial weight, an iteration number, and a particle swarm, the particle swarm comprising: the position and velocity of the particle;

step 2, calculating the prediction accuracy of the SVM model according to the position of each particle to serve as fitness;

step 3, comparing the fitness with an individual fitness extreme value, and if the fitness is better than the individual fitness extreme value, updating the individual fitness extreme value of the particle and the best position corresponding to the fitness by using the fitness; if the fitness is better than the individual fitness extreme value of all other particles and the group fitness extreme value in the previous iteration, updating the group fitness extreme value in the current iteration and the best position corresponding to the fitness extreme value by using the fitness;

step 4, judging whether the maximum iteration times is reached or the group fitness extreme value is larger than a preset group fitness extreme value,

if the maximum iteration times is reached or the population fitness extreme value is not less than the preset population fitness extreme value, entering the step 5;

if the maximum iteration times are not reached and the population fitness extreme value is smaller than the preset population fitness extreme value, entering step 6;

step 5, outputting the best position corresponding to the extreme value of the group fitness at the moment as an optimal parameter value; and

step 6, updating the position and the speed of each particle, and returning to the step 2;

the updating the position and the velocity of each particle comprises:

wherein $V_i^{k+1}$ is the updated velocity of particle $i$ at the $(k+1)$-th iteration, $w_k$ is the inertial weight at the $k$-th iteration, $V_i^k$ is the velocity of particle $i$ at the $k$-th iteration, $c_1$ and $c_2$ are fixed parameters, $PBest_i^k$ is the best position corresponding to the individual fitness extremum of particle $i$ at the $k$-th iteration, $S_i^k$ is the position of particle $i$ at the $k$-th iteration, $GBest^k$ is the best position corresponding to the group fitness extremum at the $k$-th iteration, $S_i^{k+1}$ is the position of particle $i$ at the $(k+1)$-th iteration, $num$ is the number of particles, $close_k$ is the degree of population clustering at the $k$-th iteration,

is the Euclidean distance of each particle from the mean position (center of gravity) of the population, $|S_{max} - S_{min}|$ is the maximum diameter length of the solution space, $w$ is the inertial weight, $w_{min}$ is the lower bound of the inertial weight, and $w_{max}$ is the upper bound of the inertial weight; rand1() and rand2() are both random functions.
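For illustration only, the loop of steps 1 to 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed formulas: the update equations are not reproduced in the text, so the standard PSO velocity and position updates are used, and the rule that maps the swarm's mean distance from its center of gravity to the inertial weight is a hypothetical stand-in. The name `pso_optimize`, its default values, and the clamping of positions to the bounds are likewise assumptions. The `fitness` callable stands for the SVM prediction accuracy evaluated at a (penalty factor, Gaussian kernel bandwidth) pair.

```python
import math
import random


def pso_optimize(fitness, bounds, num=20, max_iter=50, target=None,
                 c1=2.0, c2=2.0, w_min=0.4, w_max=0.9, seed=0):
    """Maximize `fitness` over the box `bounds` = [(low, high), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Maximum diameter of the solution space, |S_max - S_min|.
    diameter = math.sqrt(sum((hi - lo) ** 2 for lo, hi in bounds))

    # Step 1: initialise particle positions and velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(num)]
    vel = [[0.0] * dim for _ in range(num)]
    pbest = [p[:] for p in pos]
    pbest_fit = [-math.inf] * num
    gbest, gbest_fit = pos[0][:], -math.inf

    for _ in range(max_iter):
        # Steps 2-3: evaluate fitness, update individual and group extrema.
        for i in range(num):
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest_fit[i], pbest[i] = f, pos[i][:]
            if f > gbest_fit:
                gbest_fit, gbest = f, pos[i][:]
        # Steps 4-5: stop early once the preset fitness is reached.
        if target is not None and gbest_fit >= target:
            break
        # ASSUMPTION: inertia grows with the normalised mean distance of the
        # particles from the swarm's center of gravity.
        centre = [sum(p[d] for p in pos) / num for d in range(dim)]
        spread = sum(math.dist(p, centre) for p in pos) / (num * diameter)
        w = w_min + (w_max - w_min) * min(1.0, spread)
        # Step 6: standard velocity and position update, clamped to bounds.
        for i in range(num):
            for d, (lo, hi) in enumerate(bounds):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
    return gbest, gbest_fit
```

With a real classifier, `fitness` would train an SVM using the particle's two coordinates as penalty factor and kernel bandwidth and return its accuracy on held-out data; any smooth test function can stand in for it when trying the loop.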

2. The method of claim 1, wherein the processing the original training data set to obtain valid feature attributes and establishing an updated training data set according to the valid feature attributes comprises:

step 1, normalizing the weight of the characteristic attribute corresponding to each sample data in the original training data set;

step 2, randomly selecting one sample data, and respectively selecting one sample data of the same type and one sample data of a different type with the minimum Euclidean distance from the sample data;

step 3, calculating and updating the weight of each characteristic attribute corresponding to the sample data according to a weight calculation formula, wherein the weight calculation formula is as follows:

wherein $t_k$ is a characteristic attribute, $Weight(t_k)$ is the weight corresponding to the characteristic attribute $t_k$, $T_i$ is sample data, $T_{miss}$ is the sample data of a different type with the smallest Euclidean distance from $T_i$, $T_{hit}$ is the sample data of the same type with the smallest Euclidean distance from $T_i$, $D(T_i, T_{miss}, t_k)$ is the Euclidean distance between $T_i$ and $T_{miss}$ on the characteristic attribute $t_k$, $D(T_i, T_{hit}, t_k)$ is the Euclidean distance between $T_i$ and $T_{hit}$ on the characteristic attribute $t_k$, $\max(D(t_k))$ is the maximum Euclidean distance over all samples on the attribute $t_k$, and $n$ is the number of iterations;

step 4, repeating the step 2 and the step 3 according to preset times, and calculating the average weight of each characteristic attribute;

step 5, comparing the average weight of each characteristic attribute with a preset threshold, and selecting the characteristic attribute with the average weight larger than the preset threshold as an effective characteristic attribute; and

step 6, selecting data corresponding to the effective characteristic attributes to establish an updated training data set.
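For illustration only, the feature selection of steps 1 to 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the weight calculation formula is not reproduced in the text, so the classic Relief update is used (add the distance to the nearest miss, subtract the distance to the nearest hit, each divided by the maximum per-attribute distance and by the number of iterations n). Step 1 is read here as min-max scaling of the attribute values, which makes the maximum per-attribute distance equal to 1. The names `relief_weights`, `_nearest`, and `select_features` are hypothetical, and each class is assumed to contain at least two samples.

```python
import math
import random


def _nearest(data, labels, i, same_class):
    """Index of the sample closest to data[i] in the same or the other class."""
    best, best_d = None, math.inf
    for j, row in enumerate(data):
        if j == i or (labels[j] == labels[i]) != same_class:
            continue
        d = math.dist(row, data[i])
        if d < best_d:
            best, best_d = j, d
    return best


def relief_weights(samples, labels, n_iter=100, seed=0):
    """Average Relief weight of every attribute for a two-class data set."""
    rng = random.Random(seed)
    n_feat = len(samples[0])
    # Step 1 (assumed reading): min-max scale every attribute to [0, 1].
    lo = [min(s[k] for s in samples) for k in range(n_feat)]
    hi = [max(s[k] for s in samples) for k in range(n_feat)]
    span = [(hi[k] - lo[k]) or 1.0 for k in range(n_feat)]
    data = [[(s[k] - lo[k]) / span[k] for k in range(n_feat)] for s in samples]

    weights = [0.0] * n_feat
    for _ in range(n_iter):
        # Step 2: pick a random sample, find its nearest hit and nearest miss.
        i = rng.randrange(len(data))
        hit = _nearest(data, labels, i, same_class=True)
        miss = _nearest(data, labels, i, same_class=False)
        # Steps 3-4: accumulate the per-attribute update, averaged over n_iter.
        for k in range(n_feat):
            weights[k] += (abs(data[i][k] - data[miss][k])
                           - abs(data[i][k] - data[hit][k])) / n_iter
    return weights


def select_features(weights, threshold):
    """Steps 5-6: keep attributes whose average weight exceeds the threshold."""
    return [k for k, w in enumerate(weights) if w > threshold]
```

The returned indices identify the columns to keep when building the updated training data set; the threshold value is a tuning choice that the text leaves as preset.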

3. The method of claim 1, further comprising, before said performing a defect prediction on the prediction data by using an SVM model according to the optimized parameters and obtaining a prediction result:

4. A system for predicting object-oriented software defects, the system comprising: a data processing unit, a parameter optimization unit and a defect prediction unit,

the data processing unit is configured to process an original training data set, obtain an effective characteristic attribute, and establish an updated training data set according to the effective characteristic attribute, where the updated training data set includes: a defective data set and a non-defective data set;

the parameter optimization unit is configured to train a Support Vector Machine (SVM) according to the updated training data set, and perform parameter optimization through a Particle Swarm Optimization (PSO), where the parameters include: penalty factor and gaussian kernel bandwidth; and

the defect prediction unit is used for predicting defects of prediction data by utilizing an SVM model according to the optimized parameters and acquiring a prediction result;

wherein the parameter optimization unit training the SVM according to the updated training data set and performing parameter optimization through Particle Swarm Optimization (PSO) comprises:

if the maximum number of iterations is not reached and the population fitness extreme value is smaller than the preset population fitness extreme value, updating the position and the velocity of each particle and returning to the step of calculating the prediction accuracy of the SVM model according to the position of each particle as the fitness, until the maximum number of iterations is reached or the population fitness extreme value is not smaller than the preset population fitness extreme value, and outputting the best position corresponding to the population fitness extreme value at that moment as the optimal parameter value;

the updating the position and the velocity of each particle comprises:

wherein $V_i^{k+1}$ is the updated velocity of particle $i$ at the $(k+1)$-th iteration, $w_k$ is the inertial weight at the $k$-th iteration, $V_i^k$ is the velocity of particle $i$ at the $k$-th iteration, $c_1$ and $c_2$ are fixed parameters, $PBest_i^k$ is the best position corresponding to the individual fitness extremum of particle $i$ at the $k$-th iteration, $S_i^k$ is the position of particle $i$ at the $k$-th iteration, $GBest^k$ is the best position corresponding to the group fitness extremum at the $k$-th iteration, $S_i^{k+1}$ is the position of particle $i$ at the $(k+1)$-th iteration, $num$ is the number of particles, $close_k$ is the degree of population clustering at the $k$-th iteration,

5. The system of claim 4, wherein the data processing unit processes the original training data set to obtain valid feature attributes, and establishes an updated training data set according to the valid feature attributes, comprising:

normalizing the weight of the characteristic attribute corresponding to each sample data in the original training data set;

wherein $t_k$ is a characteristic attribute, $Weight(t_k)$ is the weight corresponding to the characteristic attribute $t_k$, $T_i$ is sample data, $T_{miss}$ is the sample data of a different type with the smallest Euclidean distance from $T_i$, $T_{hit}$ is the sample data of the same type with the smallest Euclidean distance from $T_i$, $D(T_i, T_{miss}, t_k)$ is the Euclidean distance between $T_i$ and $T_{miss}$ on the characteristic attribute $t_k$, $D(T_i, T_{hit}, t_k)$ is the Euclidean distance between $T_i$ and $T_{hit}$ on the characteristic attribute $t_k$, $\max(D(t_k))$ is the maximum Euclidean distance over all samples on the attribute $t_k$, and $n$ is the number of iterations;

and selecting data corresponding to the effective characteristic attributes to establish an updated training data set.

6. The system of claim 4, further comprising:

a verification unit configured to, before the defect prediction unit performs defect prediction on the prediction data by using the SVM model according to the optimized parameters and obtains a prediction result, perform defect prediction on a test data set by using the SVM model according to the optimized parameters and verify the accuracy of the optimized parameters.