CN104899135A

CN104899135A - Software defect prediction method and system

Info

Publication number: CN104899135A
Application number: CN201510247157.9A
Authority: CN
Inventors: 杨春晖; 熊婧; 高岩; 林军; 李冬
Original assignee: Fifth Electronics Research Institute of Ministry of Industry and Information Technology
Current assignee: Fifth Electronics Research Institute of Ministry of Industry and Information Technology
Priority date: 2015-05-14
Filing date: 2015-05-14
Publication date: 2015-09-09
Anticipated expiration: 2035-05-14
Also published as: CN104899135B

Abstract

The invention relates to a software defect prediction method and system. The method comprises the steps of: obtaining sample software modules and clustering them to obtain cluster subsets; calculating Gaussian parameters of the cluster subsets, generating pseudo-defect samples according to the Gaussian parameters, and obtaining an updated defect sample set from the software defect sample set and the pseudo-defect samples; training on the updated defect sample set to obtain a defect prediction model; and performing defect prediction on the software to be tested according to the defect prediction model and outputting a prediction result. The sample software modules are clustered into cluster subsets, Gaussian analysis is performed on each cluster subset to obtain its Gaussian parameters, and pseudo-defect samples are then generated from those parameters. Adding this defect data yields an updated defect sample set for training, so that the defect prediction model can better estimate and fit the defect data, which improves the accuracy of the model and of software defect prediction.

Description

Software defect prediction method and system

Technical Field

The invention relates to the technical field of software security, in particular to a software defect prediction method and a software defect prediction system.

Background

With the development of information technology, software keeps growing in complexity and scale. A good software quality control and prediction mechanism not only helps enterprises develop high-quality software products and reduce production and maintenance costs, but is also important for improving customer satisfaction, building a good corporate image, and strengthening market competitiveness. Software quality therefore receives more and more attention, and how to predict and improve it has become one of the hot spots of current research.

The traditional software defect prediction method uses a machine-learning-based software defect prediction model. The model takes the metric data vector of a software module as input and predicts whether the module has defects through steps such as preprocessing, feature extraction, model training and prediction. Because of inherent problems in the model, such as its performance evaluation criteria and inductive bias, defective and non-defective software modules are treated equally and the overall prediction accuracy is taken as the objective to maximize, so the detection rate of software defects remains low. The traditional software defect prediction method therefore suffers from low prediction accuracy.

Disclosure of Invention

In view of the above, it is necessary to provide a software defect prediction method and system with high prediction accuracy.

A software defect prediction method comprises the following steps:

acquiring a sample software module and carrying out clustering processing to obtain a clustering subset;

calculating Gaussian parameters of the clustering subsets, and generating pseudo-defect samples according to the Gaussian parameters;

obtaining an updated defect sample set according to the software defect sample set and the pseudo defect sample;

training according to the updated defect sample set to obtain a defect prediction model;

and performing defect prediction on the software module to be tested according to the defect prediction model, and outputting a prediction result.

A software defect prediction system comprising:

the clustering module is used for acquiring the sample software module and carrying out clustering processing to obtain a clustering subset;

the calculation module is used for calculating the Gaussian parameters of the clustering subsets and generating pseudo-defect samples according to the Gaussian parameters;

the updating module is used for obtaining an updated defect sample set according to the software defect sample set and the pseudo defect sample;

the training module is used for training according to the updated defect sample set to obtain a defect prediction model;

and the prediction module is used for predicting the defects of the software module to be tested according to the defect prediction model and outputting a prediction result.

According to the software defect prediction method and system, sample software modules are obtained and clustered to obtain cluster subsets. Gaussian parameters of the cluster subsets are calculated, pseudo-defect samples are generated according to the Gaussian parameters, and an updated defect sample set is obtained from the software defect sample set and the pseudo-defect samples. A defect prediction model is obtained by training on the updated defect sample set, defect prediction is performed on the software module to be tested according to the model, and the prediction result is output. In other words, the sample software modules are clustered into cluster subsets, Gaussian analysis is performed on each cluster subset to obtain its Gaussian parameters, and pseudo-defect samples are then generated from those parameters. Because the updated defect sample set used for training contains more defect data, the defect prediction model can better estimate and fit the defect data, which improves the accuracy of the model and of software defect prediction.

Drawings

FIG. 1 is a flowchart of a software defect prediction method in one embodiment;

FIG. 2 is a flow chart of obtaining sample software modules and performing clustering to obtain a cluster subset in one embodiment;

FIG. 3 is a flowchart illustrating an embodiment of obtaining a center point of a sample vector by using each sample vector as a starting point;

FIG. 4 is a flowchart illustrating the calculation of Gaussian parameters of a cluster subset and the generation of pseudo-defect samples based on the Gaussian parameters according to an embodiment;

FIG. 5 is a flowchart illustrating defect prediction for a software module under test according to the defect prediction model in an embodiment;

FIG. 6 is a flowchart of a software defect prediction method in another embodiment;

FIG. 7 is a block diagram of a software defect prediction system in one embodiment;

FIG. 8 is a block diagram of a clustering module in one embodiment;

FIG. 9 is a block diagram of a center point calculation unit in an embodiment;

FIG. 10 is a block diagram of a compute module in one embodiment;

FIG. 11 is a block diagram of a prediction module in one embodiment;

FIG. 12 is a block diagram of a software defect prediction system in accordance with another embodiment.

Detailed Description

A software defect prediction method, as shown in fig. 1, includes the following steps:

Step S110: acquire sample software modules and perform clustering to obtain cluster subsets. A sample software module is a software module whose defect status (defective or not) is already known; the sample software modules are grouped by clustering to obtain the cluster subsets. In one embodiment, as shown in fig. 2, step S110 includes steps S112 to S118.

Step S112: mark each sample software module to obtain its defect mark. For example, for the $i$-th sample software module, $i = 1, 2, \dots, M$, the defect mark is $y_i = 1$ if the module has a defect and $y_i = 0$ if it has none. It can be understood that neither the marking scheme nor the values of the defect mark are unique; in other embodiments, a defective sample software module may be marked 0 and a defect-free one marked 1, and so on.

Step S114: perform static measurement on each sample software module to obtain its sample vector. Static measurement is performed on the $i$-th sample software module, $i = 1, 2, \dots, M$. The static metrics in this embodiment may specifically include Halstead metrics, McCabe metrics, Khoshgoftaar metrics, CK metrics, and the like. The $n$ metric values obtained are denoted $x_{i1}, x_{i2}, \dots, x_{in}$ and form the sample vector $x_i = \{x_{i1}, x_{i2}, \dots, x_{in}\}$ of the sample software module. All sample vectors together constitute the software defect sample set $\{x_i \mid i = 1, 2, \dots, M\}$.
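As a minimal sketch only, the following shows how a static measurement could turn one module's source text into a sample vector. The three metrics used here (non-blank line count, a cyclomatic-complexity-style count of decision points plus one, and function count) are stand-ins chosen for illustration and are not the Halstead, McCabe, Khoshgoftaar or CK metric suites themselves; the function name `metric_vector` and the sample module are hypothetical.

```python
import ast

# Node types counted as decision points (an assumption of this sketch).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def metric_vector(source):
    """Return [non-blank lines, decision points + 1, function count] for one module."""
    nodes = list(ast.walk(ast.parse(source)))
    code_lines = sum(1 for line in source.splitlines() if line.strip())
    decisions = sum(isinstance(node, DECISION_NODES) for node in nodes)
    # Each additional operand of an and/or expression adds one branch.
    decisions += sum(len(node.values) - 1 for node in nodes
                     if isinstance(node, ast.BoolOp))
    functions = sum(isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
                    for node in nodes)
    return [float(code_lines), float(decisions + 1), float(functions)]


# Hypothetical module under measurement.
SAMPLE = "def f(a, b):\n    if a and b:\n        return 1\n    return 0\n"
x_i = metric_vector(SAMPLE)  # sample vector with n = 3 metric values
```

Applying the same function to every sample software module would give the $M$ sample vectors that make up the software defect sample set.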

Step S116: taking each sample vector in turn as a starting point, obtain the center point of that sample vector. The center point computed from each starting point serves as the basis for clustering. In this embodiment, the MeanShift method is used for clustering; its computational cost is small, which speeds up the cluster analysis. As shown in fig. 3, step S116 includes steps S1162 to S1166.

Step S1162: taking the sample vector as the starting point, calculate the mean shift vector of the sample vector, specifically:

$$M_h = \frac{1}{K} \sum_{x_i \in S_h(x)} (x_i - x)$$

where $M_h$ denotes the mean shift vector of the sample vector $x$; $S_h(x)$ denotes the set of the $K$ sample vectors lying in the high-dimensional spherical region of radius $h$ (a constant), i.e. those satisfying $(x - x_i)^T (x - x_i) < h^2$; $x_i$ is a sample vector in $S_h(x)$; and $T$ denotes transpose.

Step S1164: determine whether the mean shift vector of the sample vector is greater than a preset threshold. If yes, take the sum of the sample vector and the mean shift vector as the new sample vector and return to step S1162; if not, go to step S1166. The threshold is set in advance and can be adjusted according to the actual situation. That is, if the mean shift vector $M_h$ is greater than the preset threshold, $x_i + M_h$ is taken as the new starting point and a new mean shift vector $M_h$ is calculated.

Step S1166: take the sum of the sample vector and the mean shift vector as the center point of the sample vector. That is, if $M_h$ is less than or equal to the preset threshold, $x_i + M_h$ is confirmed as the center point.

Steps S1162 to S1166 are repeated for each sample vector until all sample vectors have been traversed, generating M center points.

Step S118: cluster the sample vectors according to their center points to obtain the cluster subsets. Sample vectors that converge to the same center point are placed in one class, forming L cluster subsets.
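Steps S1162 to S118 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the size of the mean shift vector is compared with the threshold through its Euclidean norm, two center points are treated as "the same" when they lie within a small tolerance `merge_tol`, and the names `center_point`, `cluster`, `eps` and `max_iter` are hypothetical.

```python
import math


def mean_shift_vector(x, samples, h):
    """M_h = (1/K) * sum(x_i - x) over samples whose squared distance to x is below h^2."""
    neighbours = [s for s in samples
                  if sum((a - b) ** 2 for a, b in zip(x, s)) < h * h]
    if not neighbours:
        return [0.0] * len(x)
    k = len(neighbours)
    return [sum(s[d] - x[d] for s in neighbours) / k for d in range(len(x))]


def center_point(start, samples, h, eps=1e-6, max_iter=1000):
    """Shift the start point until the norm of the mean shift vector is <= eps."""
    x = list(start)
    for _ in range(max_iter):
        shift = mean_shift_vector(x, samples, h)
        x = [a + m for a, m in zip(x, shift)]  # x + M_h, as in S1164 and S1166
        if math.sqrt(sum(m * m for m in shift)) <= eps:
            break
    return x


def cluster(samples, h, merge_tol=1e-3):
    """Group sample indices whose center points coincide within merge_tol."""
    centers, subsets = [], []
    for idx, sample in enumerate(samples):
        c = center_point(sample, samples, h)
        for j, known in enumerate(centers):
            if math.dist(c, known) <= merge_tol:
                subsets[j].append(idx)
                break
        else:
            centers.append(c)
            subsets.append([idx])
    return centers, subsets


# Two well-separated groups of hypothetical 2-dimensional sample vectors.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centers, subsets = cluster(points, h=1.0)
```

With the radius `h = 1.0` the two groups never see each other's points, so every starting point drifts to the mean of its own group and two cluster subsets result.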

Step S120: calculate the Gaussian parameters of the cluster subsets and generate pseudo-defect samples according to the Gaussian parameters. Constructing a Gaussian-mixture software defect distribution function describes the software defect distribution better. The Gaussian parameters are then calculated from the relation between the defect distribution and the sample vectors, laying the foundation for generating pseudo-defect samples.

Gaussian parameter estimation is performed on each cluster subset. Assume the $k$-th cluster subset is $\{x_i^k \mid i = 1, 2, \dots, M_k\}$, containing $M_k$ samples. In one embodiment, the Gaussian parameters include the mean and the variance. As shown in fig. 4, step S120 includes steps S121 to S125.

Step S121: calculate the mean of the cluster subset, specifically:

$$\mu^k = \frac{1}{M_k} \sum_{i=1}^{M_k} x_i^k = \{\mu_1^k, \mu_2^k, \dots, \mu_n^k\}$$

where $\mu^k$ denotes the mean, $M_k$ is the number of sample vectors $x_i^k$ in the cluster subset, and $\mu_n^k$ denotes the metric value of the mean $\mu^k$ in the $n$-th dimension.

Step S122: calculate the variance of the cluster subset in each dimension, specifically:

$$(\sigma_j^k)^2 = \frac{1}{M_k} \sum_{i=1}^{M_k} (x_{ij}^k - \mu_j^k)^2, \quad j = 1, 2, \dots, n$$

where $(\sigma_j^k)^2$ denotes the variance of the cluster subset in the $j$-th dimension, $n$ is the number of dimensions, $x_{ij}^k$ denotes the metric value of the sample vector $x_i^k$ in the $j$-th dimension, and $\mu_j^k$ denotes the metric value of the mean $\mu^k$ in the $j$-th dimension.

Step S123: generate a random number for the cluster subset in each dimension according to the variance of the cluster subset in that dimension.

Specifically, for the $j$-th dimension, a random number $\Delta\lambda_j^k$ is generated according to the Gaussian distribution $N(0, (\sigma_j^k)^2)$. Concretely, 12 random variables $r_1, r_2, \dots, r_{12}$ uniformly distributed on $[0, 1]$ are selected, and then

$$\Delta\lambda_j^k = \sigma_j^k \left( \sum_{i=1}^{12} r_i - 6 \right)$$

It can be understood that the constants used in calculating the random number are not unique and can be adjusted according to the actual situation.

Step S124: obtain the random vector of the cluster subset from its random numbers in each dimension. After the random number has been calculated for each dimension according to step S123, the random vector is obtained, specifically:

$$\Delta\Lambda^k = \{\Delta\lambda_1^k, \Delta\lambda_2^k, \dots, \Delta\lambda_n^k\}$$

where $\Delta\Lambda^k$ is the random vector and $\Delta\lambda_n^k$ is the random number of the cluster subset in the $n$-th dimension.

Step S125: obtain a pseudo-defect sample from the mean and the random vector of the cluster subset, specifically:

$$t = \mu^k + \Delta\Lambda^k$$

where $t$ is the pseudo-defect sample, $\mu^k$ is the mean of the cluster subset, and $\Delta\Lambda^k$ is the random vector.

Steps S121 to S125 are repeated for each cluster subset to obtain L pseudo-defect samples. A pseudo-defect sample represents the sample vector of a virtually obtained defective software module, and the defect mark corresponding to each pseudo-defect sample may be set to 0.
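Steps S121 to S125 can be sketched for a single cluster subset as follows. The sketch rests on assumptions: the variance is divided by the number of samples in the subset (a population variance), and the random number is formed as the standard deviation times the sum of 12 uniform variables minus 6. The latter works because one uniform variable on $[0, 1]$ has mean $1/2$ and variance $1/12$, so the sum of 12 has mean 6 and variance 1. The function names are hypothetical.

```python
import random


def gaussian_parameters(subset):
    """Per-dimension mean and variance of one cluster subset (a list of vectors)."""
    m_k = len(subset)
    n = len(subset[0])
    mean = [sum(v[j] for v in subset) / m_k for j in range(n)]
    # Dividing by m_k (population variance) is an assumption of this sketch.
    var = [sum((v[j] - mean[j]) ** 2 for v in subset) / m_k for j in range(n)]
    return mean, var


def approx_gaussian(sigma, rng):
    """Approximate draw from N(0, sigma^2) built from 12 uniform variables."""
    return sigma * (sum(rng.random() for _ in range(12)) - 6.0)


def pseudo_defect_sample(subset, rng):
    """t = mean + random vector, with one random number per dimension."""
    mean, var = gaussian_parameters(subset)
    delta = [approx_gaussian(v ** 0.5, rng) for v in var]
    return [m + d for m, d in zip(mean, delta)]


rng = random.Random(0)  # seeded only to make the sketch repeatable
subset_k = [[1.0, 2.0], [3.0, 2.0]]  # hypothetical cluster subset
t = pseudo_defect_sample(subset_k, rng)
```

Because the second dimension of the hypothetical subset has zero variance, the pseudo-defect sample keeps the mean value there and varies only in the first dimension, within 6 standard deviations of the mean.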

Step S130: obtain an updated defect sample set from the software defect sample set and the pseudo-defect samples. Assume the set of pseudo-defect samples is $\{t_i \mid i = 1, 2, \dots, L\}$, L pseudo-defect samples in total. The original software defect sample set $\{x_i \mid i = 1, 2, \dots, M\}$ and the pseudo-defect sample set $\{t_i \mid i = 1, 2, \dots, L\}$ are merged to obtain the updated defect sample set $\{x'_i \mid i = 1, 2, \dots, M+L\}$.

Step S140: train on the updated defect sample set to obtain a defect prediction model. Because the updated defect sample set used for training contains more defect data, the defect prediction model can better estimate and fit the defect data, which improves its accuracy.

In one embodiment, step S140 trains a support-vector-machine-based defect prediction model on the updated defect sample set, specifically:

$$\arg\max_{\lambda} \left[ \sum_{i=1}^{M+L} \lambda_i - \frac{1}{2} \sum_{i=1}^{M+L} \sum_{j=1}^{M+L} \lambda_i \lambda_j y_i y_j K(x_i, x_j) \right]$$

$$\text{S.T.} \quad \sum_{i=1}^{M+L} \lambda_i y_i = 0, \quad 0 \le \lambda_i \le C, \quad i = 1, 2, \dots, M+L$$

where $x_i$ and $x_j$ are the $i$-th and $j$-th sample vectors in the updated defect sample set; $y_i$ and $y_j$ are the defect marks corresponding to them; $\lambda_i$ and $\lambda_j$ are their weights; S.T. denotes the constraint conditions; $C$ is a constant; and $M + L$ is the number of sample vectors in the updated defect sample set. In this embodiment, $K(x_i, x_j)$ denotes the dot product of $x_i$ and $x_j$; in other embodiments it may denote other operations.

$\arg\max_{\lambda}$ denotes the value of $\lambda$ at which the bracketed expression attains its maximum. The sample vectors $x_i$ and $x_j$ of the updated defect sample set are substituted into the expression, the weights $\lambda_i$ of the sample vectors at the maximum are determined, and the weights of all sample vectors in the updated defect sample set are thereby obtained.
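To make the training objective concrete, the sketch below evaluates the standard soft-margin support vector machine dual objective and checks its constraints for a candidate weight vector; it does not solve the optimization. It assumes that the defect marks have been mapped to +1 and -1, as the standard formulation requires; the text does not state such a mapping, and the names `dual_objective` and `is_feasible` are hypothetical.

```python
def dot(u, v):
    """Linear kernel K(u, v): the dot product of two sample vectors."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(lam, xs, ys, kernel=dot):
    """W(lam) = sum_i lam_i - 1/2 * sum_i sum_j lam_i lam_j y_i y_j K(x_i, x_j)."""
    n = len(xs)
    quad = sum(lam[i] * lam[j] * ys[i] * ys[j] * kernel(xs[i], xs[j])
               for i in range(n) for j in range(n))
    return sum(lam) - 0.5 * quad


def is_feasible(lam, ys, c, tol=1e-9):
    """Check the constraints sum_i lam_i y_i = 0 and 0 <= lam_i <= C."""
    balanced = abs(sum(l * y for l, y in zip(lam, ys))) <= tol
    boxed = all(-tol <= l <= c + tol for l in lam)
    return balanced and boxed


# Hypothetical two-sample problem; marks are assumed to be mapped to +1 / -1.
xs = [[1.0], [-1.0]]
ys = [1, -1]
best = dual_objective([0.5, 0.5], xs, ys)   # 2*0.5 - 2*0.5**2
other = dual_objective([0.4, 0.4], xs, ys)  # 2*0.4 - 2*0.4**2
```

For this two-sample problem the equality constraint forces both weights to be equal, the objective reduces to $2\lambda - 2\lambda^2$, and the maximum lies at $\lambda = 0.5$, which is why `best` exceeds `other`. A practical system would hand this maximization to a quadratic programming or SMO solver.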

Step S150: perform defect prediction on the software module to be tested according to the defect prediction model, and output the prediction result. The defect prediction model is applied to a software module whose defect status is unknown; the prediction result is obtained and output, informing the staff that defect prediction for the module has been completed.

In one embodiment, as shown in fig. 5, step S150 includes step S152 and step S154.

Step S152: perform static measurement on the software module to be tested to obtain its sample vector. The specific process is similar to step S114 and is not repeated here.

Step S154: perform defect prediction on the software module to be tested according to its sample vector and the defect prediction model, specifically:

$$g(x) = \operatorname{sgn}\left( \sum_{i=1}^{M+L} \lambda_i y_i K(x_i, x) + b \right)$$

where $g(x)$ denotes the defect mark of the software module to be tested; sgn maps $\sum_{i=1}^{M+L} \lambda_i y_i K(x_i, x) + b$ to an integer, taking 1 when the value is greater than 0 and 0 when it is less than or equal to 0; $x_i$ is the $i$-th sample vector in the updated defect sample set; $y_i$ is the defect mark corresponding to the $i$-th sample vector; $\lambda_i$ is the weight of the $i$-th sample vector obtained by the defect prediction model; $M + L$ is the number of sample vectors in the updated defect sample set; $x$ is the sample vector of the software module to be tested; and $b$ is a constant. As in training, $K(x_i, x)$ in this embodiment denotes the dot product of $x_i$ and $x$. The integer mapping in this embodiment corresponds to the marking of step S112 ($y_i = 1$ if a defect exists, $y_i = 0$ if not) and may be adjusted according to how the defect mark is defined.
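The decision rule of step S154 can be sketched as follows, again as an illustration under assumptions rather than the patented implementation: the training marks are assumed to have been mapped to +1 and -1 inside the score, the output follows the 1 / 0 convention described above, and the weights, constant `b` and function name `predict_defect` are hypothetical values chosen for the example.

```python
def dot(u, v):
    """Linear kernel K(u, v): the dot product of two sample vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_defect(x, train_xs, train_ys, lam, b, kernel=dot):
    """Return 1 (defect) if sum_i lam_i y_i K(x_i, x) + b > 0, otherwise 0."""
    score = sum(l * y * kernel(xi, x)
                for l, y, xi in zip(lam, train_ys, train_xs)) + b
    return 1 if score > 0 else 0


# Hypothetical trained model: two training vectors with marks +1 / -1.
train_xs = [[1.0], [-1.0]]
train_ys = [1, -1]
lam = [0.5, 0.5]
b = 0.0

flag_a = predict_defect([2.0], train_xs, train_ys, lam, b)   # score 2.0
flag_b = predict_defect([-3.0], train_xs, train_ys, lam, b)  # score -3.0
```

A score of exactly 0 falls on the "less than or equal to 0" side and therefore yields 0, matching the rule stated for sgn in this embodiment.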

In practical application, because software fails less often than it runs normally, defective software modules make up a relatively small share of all software modules. However, if these few defective modules are mispredicted as defect-free and put into practical use, the resulting economic and social losses can be immeasurable.

According to the software defect prediction method, a sample software module forms a clustering subset in a clustering mode, Gaussian analysis calculation is carried out on the clustering subset to obtain a Gaussian parameter, and then a pseudo defect sample is generated according to the Gaussian parameter. The defect prediction model can better estimate and fit the defect data by increasing more defect data to generate an updated defect sample set for training, so that the accuracy of the defect prediction model is improved, and the prediction accuracy of software defects is improved.

In one embodiment, as shown in fig. 6, after step S150, the software defect prediction method further includes step S160.

Step S160: output alarm information when the software module to be tested has a defect. The alarm promptly alerts the staff so that defective software modules can be identified in time for subsequent repair, which improves operating convenience. Specifically, the alarm may be an audible and visual alarm, a voice alarm, or a preset picture or text shown on a display screen.

The invention also provides a software defect prediction system, as shown in fig. 7, which comprises a clustering module 110, a calculating module 120, an updating module 130, a training module 140 and a prediction module 150.

The clustering module 110 is configured to obtain a sample software module and perform clustering processing to obtain a clustering subset. The sample software module refers to a software module which is known to have defects or not, and the sample software module is classified through clustering processing to obtain a clustering subset. In one embodiment, as shown in FIG. 8, clustering module 110 includes a labeling unit 112, a metric unit 114, a center point calculation unit 116, and a clustering unit 118.

The marking unit 112 is configured to mark each sample software module to obtain its defect mark. For example, for the $i$-th sample software module, $i = 1, 2, \dots, M$, the defect mark is $y_i = 1$ if the module has a defect and $y_i = 0$ if it has none. It can be understood that neither the marking scheme nor the values of the defect mark are unique; in other embodiments, a defective sample software module may be marked 0 and a defect-free one marked 1, and so on.

The measurement unit 114 is configured to perform static measurement on each sample software module to obtain its sample vector. Static measurement is performed on the $i$-th sample software module, $i = 1, 2, \dots, M$. In this embodiment, the static metrics may specifically include Halstead metrics, McCabe metrics, Khoshgoftaar metrics, CK metrics, and the like. The $n$ metric values obtained are denoted $x_{i1}, x_{i2}, \dots, x_{in}$ and form the sample vector $x_i = \{x_{i1}, x_{i2}, \dots, x_{in}\}$ of the sample software module. All sample vectors together constitute the software defect sample set $\{x_i \mid i = 1, 2, \dots, M\}$.

The central point calculating unit 116 is configured to obtain a central point of each sample vector by using each sample vector as a starting point. And respectively taking each sample vector as a starting point, and calculating the central point of the corresponding sample vector to be used as a basis for clustering. Further, in the embodiment, the MeanShift method is adopted for clustering, so that the calculated amount is small, and the clustering analysis speed can be improved. As shown in fig. 9 in particular, the center point calculating unit 116 includes a first unit 1162, a second unit 1164, and a third unit 1166.

The first unit 1162 is configured to calculate the mean shift vector of the sample vector, taking the sample vector as the starting point, specifically:

$$M_h = \frac{1}{K} \sum_{x_i \in S_h(x)} (x_i - x)$$

with $M_h$, $K$, $S_h(x)$ and $h$ as defined for step S1162.

The second unit 1164 is configured to determine whether the meanshift vector of the sample vector is greater than a preset threshold. The preset threshold is preset and can be adjusted according to actual conditions.

The third unit 1166 is configured to, when the mean shift vector of the sample vector is greater than the preset threshold, take the sum of the sample vector and the mean shift vector as a new sample vector, and control the first unit 1162 to calculate the mean shift vector again; and when the mean shift vector of the sample vector is less than or equal to a preset threshold value, taking the sum of the sample vector and the mean shift vector as the central point of the sample vector.

If the mean shift vector $M_h$ is greater than the preset threshold, $x_i + M_h$ is taken as the new starting point and the first unit 1162 is controlled to calculate a new mean shift vector $M_h$. If $M_h$ is less than or equal to the preset threshold, $x_i + M_h$ is confirmed as the center point.

The calculation is repeated for each sample vector until all sample vectors are traversed, generating M center points.

The clustering unit 118 is configured to cluster the sample vectors according to their center points to obtain the cluster subsets. Sample vectors that converge to the same center point are placed in one class, forming L cluster subsets.

The calculation module 120 is configured to calculate the Gaussian parameters of the cluster subsets and generate pseudo-defect samples according to the Gaussian parameters. Constructing a Gaussian-mixture software defect distribution function describes the software defect distribution better. The Gaussian parameters are then calculated from the relation between the defect distribution and the sample vectors, laying the foundation for generating pseudo-defect samples.

Gaussian parameter estimation is performed on each cluster subset. Assume the $k$-th cluster subset is $\{x_i^k \mid i = 1, 2, \dots, M_k\}$, containing $M_k$ samples. In one embodiment, the Gaussian parameters include the mean and the variance. As shown in fig. 10, the calculation module 120 includes a mean calculation unit 121, a variance calculation unit 122, a random number generation unit 123, a random vector generation unit 124, and a pseudo-defect sample generation unit 125.

The mean calculation unit 121 is configured to calculate the mean of the cluster subset, specifically:

$$\mu^k = \frac{1}{M_k} \sum_{i=1}^{M_k} x_i^k = \{\mu_1^k, \mu_2^k, \dots, \mu_n^k\}$$

with the symbols as defined for step S121.

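The variance calculation unit 122 is configured to calculate the variance of the cluster subset in each dimension, specifically:

$$(\sigma_j^k)^2 = \frac{1}{M_k} \sum_{i=1}^{M_k} (x_{ij}^k - \mu_j^k)^2, \quad j = 1, 2, \dots, n$$

with the symbols as defined for step S122.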

The random number generating unit 123 is configured to correspondingly generate a random number of the clustering subset in each dimension according to the variance of the clustering subset in each dimension.

Specifically, for the $j$-th dimension, a random number $\Delta\lambda_j^k$ is generated according to the Gaussian distribution $N(0, (\sigma_j^k)^2)$. Concretely, 12 random variables $r_1, r_2, \dots, r_{12}$ uniformly distributed on $[0, 1]$ are selected, and then

$$\Delta\lambda_j^k = \sigma_j^k \left( \sum_{i=1}^{12} r_i - 6 \right)$$

It can be understood that the constants used in calculating the random number are not unique and can be adjusted according to the actual situation.

The random vector generation unit 124 is configured to obtain the random vector of the cluster subset from its random numbers in each dimension. After the random number has been calculated for each dimension, the random vector is obtained, specifically:

$$\Delta\Lambda^k = \{\Delta\lambda_1^k, \Delta\lambda_2^k, \dots, \Delta\lambda_n^k\}$$

The pseudo-defect sample generation unit 125 is configured to obtain a pseudo-defect sample from the mean and the random vector of the cluster subset, specifically:

$$t = \mu^k + \Delta\Lambda^k$$

The calculation is repeated for each cluster subset to obtain L pseudo-defect samples. A pseudo-defect sample represents the sample vector of a virtually obtained defective software module, and the defect mark corresponding to each pseudo-defect sample may be set to 0.

The updating module 130 is configured to obtain an updated defect sample set from the software defect sample set and the pseudo-defect samples. Assume the set of pseudo-defect samples is $\{t_i \mid i = 1, 2, \dots, L\}$, L pseudo-defect samples in total. The original software defect sample set $\{x_i \mid i = 1, 2, \dots, M\}$ and the pseudo-defect sample set $\{t_i \mid i = 1, 2, \dots, L\}$ are merged to obtain the updated defect sample set $\{x'_i \mid i = 1, 2, \dots, M+L\}$.

The training module 140 is configured to perform training according to the updated defect sample set to obtain a defect prediction model. And more defect data are added to generate an updated defect sample set for training, so that the accuracy of the defect prediction model is improved, and the defect prediction model can better estimate and fit the defect data.

In one embodiment, the training module 140 trains a support-vector-machine-based defect prediction model on the updated defect sample set, specifically:

$$\arg\max_{\lambda} \left[ \sum_{i=1}^{M+L} \lambda_i - \frac{1}{2} \sum_{i=1}^{M+L} \sum_{j=1}^{M+L} \lambda_i \lambda_j y_i y_j K(x_i, x_j) \right]$$

$$\text{S.T.} \quad \sum_{i=1}^{M+L} \lambda_i y_i = 0, \quad 0 \le \lambda_i \le C, \quad i = 1, 2, \dots, M+L$$

where $x_i$ and $x_j$ are the $i$-th and $j$-th sample vectors in the updated defect sample set; $y_i$ and $y_j$ are the defect marks corresponding to them; $\lambda_i$ and $\lambda_j$ are their weights; S.T. denotes the constraint conditions; $C$ is a constant; and $M + L$ is the number of sample vectors in the updated defect sample set. In this embodiment, $K(x_i, x_j)$ denotes the dot product of $x_i$ and $x_j$; in other embodiments it may be calculated in other ways.

$\arg\max_{\lambda}$ denotes the value of $\lambda$ at which the bracketed expression attains its maximum. The sample vectors $x_i$ and $x_j$ of the updated defect sample set are substituted into the expression, the weights $\lambda_i$ of the sample vectors at the maximum are determined, and the weights of all sample vectors in the updated defect sample set are thereby obtained.

The prediction module 150 is configured to perform defect prediction on the software module to be tested according to the defect prediction model, and output a prediction result. And performing defect prediction on the unknown software module to be tested by using the defect prediction model to obtain and output a prediction result, and informing a worker of completing the defect prediction on the software module to be tested.

In one embodiment, as shown in FIG. 11, prediction module 150 includes a processing unit 152 and a prediction unit 154.

The processing unit 152 is configured to perform static measurement on the software module to be tested, so as to obtain a sample vector of the software module to be tested. The specific process of performing static measurement on the software module to be tested is similar to the operation principle of the measurement unit 114, and is not described herein again.

The prediction unit 154 is configured to perform defect prediction on the software module to be tested according to its sample vector and the defect prediction model, specifically:

$$g(x) = \operatorname{sgn}\left( \sum_{i=1}^{M+L} \lambda_i y_i K(x_i, x) + b \right)$$

where $g(x)$ denotes the defect mark of the software module to be tested; sgn maps $\sum_{i=1}^{M+L} \lambda_i y_i K(x_i, x) + b$ to an integer, taking 1 when the value is greater than 0 and 0 when it is less than or equal to 0; $x_i$ is the $i$-th sample vector in the updated defect sample set; $y_i$ is the defect mark corresponding to the $i$-th sample vector; $\lambda_i$ is the weight of the $i$-th sample vector obtained by the defect prediction model; $M + L$ is the number of sample vectors in the updated defect sample set; $x$ is the sample vector of the software module to be tested; and $b$ is a constant. As in training, $K(x_i, x)$ in this embodiment denotes the dot product of $x_i$ and $x$. The integer mapping in this embodiment corresponds to the marking performed by the marking unit 112 ($y_i = 1$ if a defect exists, $y_i = 0$ if not) and may be adjusted according to how the defect mark is defined.

According to the software defect prediction system, the sample software module forms the clustering subset in a clustering mode, Gaussian analysis calculation is carried out on the clustering subset to obtain the Gaussian parameter, and then the pseudo defect sample is generated according to the Gaussian parameter. The defect prediction model can better estimate and fit the defect data by increasing more defect data to generate an updated defect sample set for training, so that the accuracy of the defect prediction model is improved, and the prediction accuracy of software defects is improved.

In one embodiment, as shown in FIG. 12, the software defect prediction system may further include an alarm module 160.

The alarm module 160 is configured to output alarm information when the software module to be tested has a defect. The alarm promptly alerts the staff so that defective software modules can be identified in time for subsequent repair, which improves operating convenience. Specifically, the alarm may be an audible and visual alarm, a voice alarm, or a preset picture or text shown on a display screen.

The technical features of the embodiments described above may be combined arbitrarily. For brevity, not every possible combination of these features is described; however, any combination that involves no contradiction should be considered within the scope of this specification.

The embodiments described above represent only several implementations of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art may make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims

1. A software defect prediction method, characterized by comprising the following steps:

2. The software defect prediction method of claim 1, wherein the step of obtaining sample software modules and performing clustering to obtain a cluster subset comprises the steps of:

respectively marking the sample software modules to obtain defect marks of the sample software modules;

respectively carrying out static measurement on the sample software modules to obtain sample vectors of all the sample software modules;

respectively taking each sample vector as a starting point, and acquiring a central point of the sample vector;

and clustering the sample vectors according to the central points of the sample vectors to obtain a clustering subset.

3. The software defect prediction method of claim 2, wherein the step of calculating the center point of the sample vector using each sample vector as a starting point comprises the steps of:

taking the sample vector as a starting point, calculating a meanshift vector of the sample vector, specifically:

where $M_h$ denotes the mean shift vector of the sample vector $x$; $S_h(x)$ denotes the set of $K$ sample vectors that lie in the high-dimensional sphere region of constant radius $h$, that is, that satisfy $(x - x_i)^T (x - x_i) < h^2$; $x_i$ is a sample vector in $S_h(x)$; and $T$ denotes the transpose;

judging whether the meanshift vector of the sample vector is larger than a preset threshold value or not;

if not, taking the sum of the sample vector and the meanshift vector as the central point of the sample vector;

and if so, taking the sum of the sample vector and the meanshift vector as a new sample vector, and returning to the step of calculating the meanshift vector of the sample vector by taking the sample vector as a starting point.
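
The iteration in claim 3, together with the grouping by center point in claim 2, can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the claimed formula: the mean shift vector is taken to be the average of $(x_i - x)$ over the $K$ sample vectors in $S_h(x)$, the comparison with the preset threshold uses the Euclidean norm of that vector, and sample vectors are grouped when their center points agree after rounding. The function names, `max_iter`, and `decimals` are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def mean_shift_vector(x: Sequence[float], samples: Sequence[Sequence[float]],
                      h: float) -> List[float]:
    """Average offset (x_i - x) over the samples inside the sphere of radius h."""
    inside = [s for s in samples
              if sum((a - b) ** 2 for a, b in zip(x, s)) < h * h]
    if not inside:  # no neighbours: nothing to shift towards
        return [0.0] * len(x)
    k = len(inside)
    return [sum(s[d] - x[d] for s in inside) / k for d in range(len(x))]


def center_point(x: Sequence[float], samples: Sequence[Sequence[float]],
                 h: float, threshold: float = 1e-6,
                 max_iter: int = 1000) -> List[float]:
    """Repeat x <- x + shift until the shift is no larger than the threshold."""
    current = list(x)
    for _ in range(max_iter):
        shift = mean_shift_vector(current, samples, h)
        current = [c + s for c, s in zip(current, shift)]
        if sqrt(sum(s * s for s in shift)) <= threshold:
            break  # the sum of sample vector and shift is the center point
    return current


def cluster_by_center(samples: Sequence[Sequence[float]], h: float,
                      decimals: int = 3) -> Dict[Tuple[float, ...], list]:
    """Group sample vectors whose (rounded) center points coincide."""
    clusters: Dict[Tuple[float, ...], list] = {}
    for s in samples:
        key = tuple(round(c, decimals) for c in center_point(s, samples, h))
        clusters.setdefault(key, []).append(s)
    return clusters
```

Each key of the returned dictionary is a center point and each value is the clustering subset of sample vectors that converged to it. Rounding is only one simple way to decide that two center points coincide; a distance tolerance would serve the same purpose.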

4. The software defect prediction method of claim 1, wherein the Gaussian parameters comprise a mean and a variance, and the step of calculating Gaussian parameters of the clustering subsets and generating pseudo-defect samples according to the Gaussian parameters comprises the following steps:

calculating the mean value of the clustering subset, specifically:

where $\mu^k$ denotes the mean; $M_k$ is the number of sample vectors $x_i^k$ in the clustering subset; and $\mu_n^k$ denotes the metric value of the mean $\mu^k$ in the n-th dimension;

calculating the variance of the clustering subset in each dimension, specifically:

where $(\sigma_j^k)^2$ denotes the variance of the clustering subset in the j-th dimension; $n$ is the number of dimensions; $x_{ij}^k$ denotes the metric value of the sample vector $x_i^k$ in the j-th dimension; and $\mu_j^k$ denotes the metric value of the mean $\mu^k$ in the j-th dimension;

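according to the variance of the clustering subset in each dimension, correspondingly generating a random number of the clustering subset in each dimension, specifically: for dimension j, generating a random number $\Delta\Lambda_j^k$ according to the Gaussian distribution $N(0, (\sigma_j^k)^2)$, whose variance $(\sigma_j^k)^2$ is the variance of the clustering subset in that dimension, by taking 12 random variables $r_1, \dots, r_{12}$ uniformly distributed on $[0,1]$ and then setting $\Delta\Lambda_j^k = \sigma_j^k \left( \sum_{i=1}^{12} r_i - 6 \right)$;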

obtaining a random vector of the clustering subset according to the random number of the clustering subset in each dimension, specifically:

where $\Delta\Lambda^k$ is the random vector and $\Delta\Lambda_n^k$ is the random number of the clustering subset in the n-th dimension;

obtaining a pseudo-defect sample according to the mean value and the random vector of the clustering subset, specifically:
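
Taken together, the steps of this claim (mean, per-dimension variance, Gaussian random numbers, random vector, pseudo-defect sample) can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not a reproduction of the claimed formulas: the variance is normalized by the number of sample vectors in the clustering subset, the random number in each dimension is a zero-mean Gaussian offset approximated from 12 uniform random variables, and the pseudo-defect sample is the sum of the mean and the random vector. All function names are hypothetical.

```python
import random
from math import sqrt
from typing import List, Sequence, Tuple


def gaussian_parameters(cluster: Sequence[Sequence[float]]
                        ) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and variance of one clustering subset.

    Assumption: the variance is divided by the number of sample vectors.
    """
    m, n = len(cluster), len(cluster[0])
    mean = [sum(v[j] for v in cluster) / m for j in range(n)]
    var = [sum((v[j] - mean[j]) ** 2 for v in cluster) / m for j in range(n)]
    return mean, var


def gaussian_offset(variance: float, rng: random.Random) -> float:
    """Approximate an N(0, variance) draw from 12 uniform variables on [0, 1].

    The sum of 12 such variables has mean 6 and variance 1, so subtracting 6
    and scaling by the standard deviation gives the desired spread.
    """
    return sqrt(variance) * (sum(rng.random() for _ in range(12)) - 6.0)


def pseudo_defect_sample(cluster: Sequence[Sequence[float]],
                         rng: random.Random) -> List[float]:
    """Mean of the clustering subset plus a random vector of Gaussian offsets."""
    mean, var = gaussian_parameters(cluster)
    random_vector = [gaussian_offset(v, rng) for v in var]
    return [mu + delta for mu, delta in zip(mean, random_vector)]
```

Calling `pseudo_defect_sample(cluster, random.Random())` once per required sample yields pseudo-defect samples scattered around the mean of the clustering subset, with a spread in each dimension that follows the variance of that dimension.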

5. The software defect prediction method of claim 1, wherein the step of training to obtain a defect prediction model according to the updated defect sample set specifically comprises:

where $\arg\max_{\lambda}$ denotes taking the value of $\lambda$ at which the objective function attains its maximum;

$x_i$ and $x_j$ are respectively the i-th and j-th sample vectors in the updated defect sample set; $y_i$ and $y_j$ are respectively the defect marks corresponding to the i-th and j-th sample vectors in the updated defect sample set; $\lambda_i$ and $\lambda_j$ are respectively the weights of the i-th and j-th sample vectors in the updated defect sample set; s.t. denotes the constraint conditions; $C$ is a constant; and $M+L$ is the number of sample vectors in the updated defect sample set.
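
For orientation, one objective that is consistent with the symbols defined in this claim is the standard soft-margin support vector machine dual problem. The block below is a sketch under that assumption, not a reproduction of the claimed formula; the kernel $K$ (for example the dot product) and the use of signed defect marks are likewise assumptions.

```latex
\lambda^{*} = \arg\max_{\lambda}
  \left[ \sum_{i=1}^{M+L} \lambda_i
  - \frac{1}{2} \sum_{i=1}^{M+L} \sum_{j=1}^{M+L}
    \lambda_i \lambda_j \, y_i y_j \, K(x_i, x_j) \right]
\quad \text{s.t.} \quad
  \sum_{i=1}^{M+L} \lambda_i y_i = 0, \qquad
  0 \le \lambda_i \le C, \quad i = 1, \dots, M+L
```

In this form the constant $C$ bounds every weight $\lambda_i$, and sample vectors whose weight is zero at the maximum play no part in the resulting defect prediction model.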

6. The software defect prediction method of claim 1, wherein the step of performing defect prediction on the software module to be tested according to the defect prediction model comprises:

performing static measurement on the software module to be tested to obtain a sample vector of the software module to be tested;

performing defect prediction on the software module to be tested according to the sample vector of the software module to be tested and a defect prediction model, specifically:

where $g(x)$ denotes the defect mark of the software module to be tested; sgn denotes an integer-valued function of its argument, taking the value 1 when the argument is greater than 0 and the value 0 when it is less than or equal to 0; $x_i$ is the i-th sample vector in the updated defect sample set; $y_i$ is the defect mark corresponding to the i-th sample vector in the updated defect sample set; $\lambda_i$ is the weight of the i-th sample vector in the updated defect sample set, as obtained by the defect prediction model; $M+L$ is the number of sample vectors in the updated defect sample set; $x$ is the sample vector of the software module to be tested; and $b$ is a constant.

7. The software defect prediction method of claim 1, further comprising, after the step of performing defect prediction on the software module to be tested according to the defect prediction model, a step of outputting alarm information when the software module to be tested has a defect.

8. A software defect prediction system, comprising:

9. The software defect prediction system of claim 8, wherein the clustering module comprises:

the marking unit is used for respectively marking the sample software modules to obtain defect marks of the sample software modules;

the measurement unit is used for respectively carrying out static measurement on the sample software modules to obtain sample vectors of all the sample software modules;

the central point calculating unit is used for respectively taking each sample vector as a starting point and acquiring the central point of the sample vector;

and the clustering unit is used for clustering the sample vectors according to the central points of the sample vectors to obtain a clustering subset.

10. The software defect prediction system of claim 8, wherein the Gaussian parameters include a mean and a variance; the calculation module comprises:

the mean value calculating unit is used for calculating a mean value of the clustering subset, specifically:

where $\mu^k$ denotes the mean; $M_k$ is the number of sample vectors $x_i^k$ in the clustering subset; and $\mu_n^k$ denotes the metric value of the mean $\mu^k$ in the n-th dimension;

the variance calculating unit is used for calculating the variance of the clustering subset in each dimension, specifically:

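a random number generating unit, configured to correspondingly generate a random number of the clustering subset in each dimension according to the variance of the clustering subset in each dimension, specifically: for dimension j, generating a random number $\Delta\Lambda_j^k$ according to the Gaussian distribution $N(0, (\sigma_j^k)^2)$, whose variance $(\sigma_j^k)^2$ is the variance of the clustering subset in that dimension, by taking 12 random variables $r_1, \dots, r_{12}$ uniformly distributed on $[0,1]$ and then setting $\Delta\Lambda_j^k = \sigma_j^k \left( \sum_{i=1}^{12} r_i - 6 \right)$;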

a random vector generating unit, configured to obtain a random vector of the clustering subset according to the random number of the clustering subset in each dimension, specifically:

a pseudo-defect sample generating unit, configured to obtain a pseudo-defect sample according to the mean value and the random vector of the clustering subset, specifically: