CN108647138A

CN108647138A - Software defect prediction method, device, storage medium and electronic equipment

Info

Publication number: CN108647138A
Application number: CN201810162377.5A
Authority: CN
Inventors: 张雪莹; 李瑞贤; 杨云祥; 郭静; 吉祥; 胡校成; 唐先超; 宋超; 江逸楠; 段锐; 阳兵
Original assignee: China Electronics Technology Group Corp CETC
Current assignee: China Electronics Technology Group Corp CETC; Electronic Science Research Institute of CTEC
Priority date: 2018-02-27
Filing date: 2018-02-27
Publication date: 2018-10-12

Abstract

The invention discloses a software defect prediction method, device, storage medium and electronic equipment. The method includes: selecting a preset number of samples from a first preset raw data set according to a first preset selection rule, to obtain a first prototype data set; calculating, according to a first preset distance algorithm, a first data set of dissimilarities between the first preset raw data set and the first prototype data set; and inputting the data in the first data set into a preset software defect prediction model, to obtain a software defect prediction result for the software corresponding to the first preset raw data set, wherein the preset software defect prediction model is a model built from preset dissimilarities. With the invention, whether software is defective can be determined from dissimilarity, which fundamentally improves prediction performance, gives higher accuracy and a better user experience, and solves the problems of the prior art.

Description

Software defect prediction method, device, storage medium and electronic equipment

Technical field

The present invention relates to the field of data processing, and in particular to a software defect prediction method, device, storage medium and electronic equipment.

Background art

In software defect data sets, defective samples are usually far fewer than defect-free samples, so software defect prediction can be regarded as a class-imbalanced learning problem. In class-imbalanced learning, the misclassification costs of the different classes are also unequal: the cost of misclassifying the minority class (defective) is far higher than that of misclassifying the majority class (defect-free). To minimize misclassification cost, a prediction algorithm should therefore focus on improving prediction accuracy for the defective minority-class samples. In practice, traditional classification algorithms generally assume a balanced class distribution and equal misclassification costs, and take minimizing classification error as their final goal. Directly applying traditional machine learning algorithms such as decision tree classification, neural networks, Bayesian classification, support vector machines and k-nearest-neighbor classification therefore does not yield good software defect prediction performance.

In recent years, the class-imbalanced learning problem has received extensive attention from academia, and experts in machine learning and data mining have proposed many effective solutions at both the data level and the algorithm level.

Data-level methods restore a balanced class distribution mainly by resampling or by generating new samples, for example random under-sampling (RUS) and random over-sampling (ROS). Resampling can balance the class distribution, but under-sampling often discards important samples and causes information loss; conversely, over-sampling introduces a large number of copies, which creates redundancy and leads to over-fitting.
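The trade-off between the two resampling schemes can be illustrated with a minimal sketch. The function names and the toy data below are hypothetical and are not taken from the patent; the sketch only shows that RUS throws majority samples away while ROS duplicates minority samples.

```python
import random


def random_undersample(majority, minority, seed=0):
    """RUS: keep only as many majority samples as there are minority samples."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, seed=0):
    """ROS: duplicate randomly chosen minority samples until the classes match."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


# Hypothetical toy data: integers stand in for samples.
majority = list(range(100, 110))   # 10 defect-free samples
minority = [1, 2, 3]               # 3 defective samples

rus_majority, rus_minority = random_undersample(majority, minority)
ros_majority, ros_minority = random_oversample(majority, minority)

print(len(rus_majority), len(rus_minority))   # 7 majority samples were discarded
print(len(ros_majority), len(ros_minority))   # 7 minority copies were added
```

In this toy case RUS drops seven of the ten defect-free samples (information loss), and ROS adds seven copies drawn from only three distinct defective samples (redundancy).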

Algorithm-level methods focus on improving existing classification algorithms or developing new ones to better handle the class-imbalanced learning problem. One-class learning builds the classification model on the majority class only, which makes it difficult to predict the minority class accurately. Ensemble learning builds multiple classification models by resampling, by iteratively updating the weights of the training samples, or by combining multiple decision trees, and thereby obtains stable classification accuracy; examples are Bagging, Boosting and Random Forest. In particular, when the individual classification models differ significantly from one another, the ensemble model is more accurate than a base classification model, but its computational cost and complexity are higher. Cost-sensitive analysis takes minimizing misclassification cost as the learning objective; MetaCost, for example, does not depend on the classification algorithm and can be applied to any form of cost matrix, but how to determine the cost matrix is still an open problem.

Existing class-imbalanced learning methods therefore focus on how to adjust the class distribution or improve the algorithm. They cannot fundamentally improve prediction performance on this kind of problem, so prediction accuracy is low and user experience is poor.

Summary of the invention

The present invention provides a software defect prediction method, device, storage medium and electronic equipment to solve the following problem of the prior art: existing class-imbalanced learning methods focus on how to adjust the class distribution or improve the algorithm and cannot fundamentally improve prediction performance on this kind of problem, so prediction accuracy is low and user experience is poor.

To solve the above technical problem, in one aspect the present invention provides a software defect prediction method, including: selecting a preset number of samples from a first preset raw data set according to a first preset selection rule, to obtain a first prototype data set; calculating, according to a first preset distance algorithm, a first data set of dissimilarities between the first preset raw data set and the first prototype data set; and inputting the data in the first data set into a preset software defect prediction model, to obtain a software defect prediction result for the software corresponding to the first preset raw data set, wherein the preset software defect prediction model is a model built from preset dissimilarities.

Optionally, before selecting the preset number of samples from the first preset raw data set according to the first preset selection rule to obtain the first prototype data set, the method further includes: selecting a preset number of samples from a second preset raw data set according to a second preset selection rule, to obtain a second prototype data set; calculating, according to a second preset distance algorithm, a second data set of dissimilarities between the second preset raw data set and the second prototype data set; and building the preset software defect prediction model from the data in the second data set and a preset classification algorithm.

Optionally, the first preset selection rule or the second preset selection rule includes at least one of the following: a random selection method, and a cluster-based linear programming method.

Optionally, the first preset distance algorithm or the second preset distance algorithm includes at least one of the following: a Euclidean distance algorithm, a Manhattan distance algorithm, and a Minkowski distance algorithm.

In another aspect, the present invention also provides a software defect prediction device, including: a first selection module, configured to select a preset number of samples from a first preset raw data set according to a first preset selection rule, to obtain a first prototype data set; a first calculation module, configured to calculate, according to a first preset distance algorithm, a first data set of dissimilarities between the first preset raw data set and the first prototype data set; and a prediction module, configured to input the data in the first data set into a preset software defect prediction model, to obtain a software defect prediction result for the software corresponding to the first preset raw data set, wherein the preset software defect prediction model is a model built from preset dissimilarities.

Optionally, the device further includes: a second selection module, configured to select a preset number of samples from a second preset raw data set according to a second preset selection rule, to obtain a second prototype data set; a second calculation module, configured to calculate, according to a second preset distance algorithm, a second data set of dissimilarities between the second preset raw data set and the second prototype data set; and a building module, configured to build the preset software defect prediction model from the data in the second data set and a preset classification algorithm.

In another aspect, the present invention also provides a storage medium storing a computer program which, when executed by a processor, implements the steps of the above software defect prediction method.

In another aspect, the present invention also provides an electronic equipment including at least a memory and a processor, wherein a computer program is stored on the memory, and the processor implements the steps of the above software defect prediction method when executing the computer program on the memory.

The present invention takes into account the discriminating power of the attribute features in class-imbalanced data sets. It therefore calculates the dissimilarity between the selected prototype data set and the raw data set corresponding to that prototype data set, obtains a data set of dissimilarity data, and inputs the data in that data set into a software defect prediction model built in advance from dissimilarities. Whether software is defective can thus be determined from dissimilarity, which fundamentally improves prediction performance and gives higher accuracy and a better user experience. This solves the following problem of the prior art: existing class-imbalanced learning methods focus on how to adjust the class distribution or improve the algorithm and cannot fundamentally improve prediction performance on this kind of problem, so prediction accuracy is low and user experience is poor.

Description of the drawings

Fig. 1 is a flow chart of the software defect prediction method in the first embodiment of the present invention;

Fig. 2 is a structural schematic diagram of the software defect prediction device in the second embodiment of the present invention;

Fig. 3 is an architecture diagram of the software defect prediction program in the third embodiment of the present invention.

Detailed description of the embodiments

To solve the following problem of the prior art, namely that existing class-imbalanced learning methods focus on how to adjust the class distribution or improve the algorithm and cannot fundamentally improve prediction performance on this kind of problem, so that prediction accuracy is low and user experience is poor, the present invention provides a software defect prediction method, device, storage medium and electronic equipment. The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

The first embodiment of the present invention provides a software defect prediction method. The flow of the method is shown in Fig. 1 and includes steps S101 to S103:

S101: select a preset number of samples from a first preset raw data set according to a first preset selection rule, to obtain a first prototype data set;

S102: calculate, according to a first preset distance algorithm, a first data set of dissimilarities between the first preset raw data set and the first prototype data set;

S103: input the data in the first data set into a preset software defect prediction model, to obtain a software defect prediction result for the software corresponding to the first preset raw data set, wherein the preset software defect prediction model is a model built from preset dissimilarities.

The embodiment of the present invention takes into account the discriminating power of the attribute features in class-imbalanced data sets. It therefore calculates the dissimilarity between the selected prototype data set and the raw data set corresponding to that prototype data set, obtains a data set of dissimilarity data, and inputs the data in that data set into a software defect prediction model built in advance from dissimilarities. Whether software is defective can thus be determined from dissimilarity, which fundamentally improves prediction performance and gives higher accuracy and a better user experience. This solves the following problem of the prior art: existing class-imbalanced learning methods focus on how to adjust the class distribution or improve the algorithm and cannot fundamentally improve prediction performance on this kind of problem, so prediction accuracy is low and user experience is poor.

Before the preset number of samples is selected from the first preset raw data set according to the first preset selection rule to obtain the first prototype data set, the preset software defect prediction model must first be built. This is likewise done by calculating dissimilarities, and specifically includes: selecting a preset number of samples from a second preset raw data set according to a second preset selection rule, to obtain a second prototype data set; calculating, according to a second preset distance algorithm, a second data set of dissimilarities between the second preset raw data set and the second prototype data set; and building the preset software defect prediction model from the data in the second data set and a preset classification algorithm. This completes the construction of the preset software defect prediction model.

Because building the preset software defect prediction model and predicting with it rest on the same principle, the rules and algorithms used in the two processes are similar. For example, the first preset selection rule or the second preset selection rule includes at least one of the following: a random selection method, and a cluster-based linear programming method. The first preset distance algorithm or the second preset distance algorithm includes at least one of the following: a Euclidean distance algorithm, a Manhattan distance algorithm, and a Minkowski distance algorithm.

The second embodiment of the present invention provides a software defect prediction device. The structure of the device is shown schematically in Fig. 2 and includes:

a first selection module 10, configured to select a preset number of samples from a first preset raw data set according to a first preset selection rule, to obtain a first prototype data set; a first calculation module 20, coupled with the first selection module 10 and configured to calculate, according to a first preset distance algorithm, a first data set of dissimilarities between the first preset raw data set and the first prototype data set; and a prediction module 30, coupled with the first calculation module 20 and configured to input the data in the first data set into a preset software defect prediction model, to obtain a software defect prediction result for the software corresponding to the first preset raw data set, wherein the preset software defect prediction model is a model built from preset dissimilarities.

Before the above modules perform software defect prediction, the preset software defect prediction model must first be built. The device can therefore also include the following modules, coupled in sequence: a second selection module, configured to select a preset number of samples from a second preset raw data set according to a second preset selection rule, to obtain a second prototype data set; a second calculation module, configured to calculate, according to a second preset distance algorithm, a second data set of dissimilarities between the second preset raw data set and the second prototype data set; and a building module, configured to build the preset software defect prediction model from the data in the second data set and a preset classification algorithm.

To improve the discriminating power of the attribute features of software defect data sets, the third embodiment of the present invention provides an electronic equipment whose storage medium holds a software defect prediction program. The program uses a dissimilarity-based representation that replaces the original attribute features with the dissimilarities between samples. This not only preserves the original statistical information of the data set but also captures the structural information within it, and thereby improves software defect prediction performance.

When machine learning algorithms are used to predict software defects, the prediction model is usually built on a software defect data set of static metrics. The present invention instead first maps the raw data set into a dissimilarity space and then builds the software defect prediction model in that space. The embodiment of the present invention consists mainly of three parts: prototype selection, dissimilarity conversion and classification. Fig. 3 shows the overall framework of the embodiment, which consists mainly of two stages: construction of the dissimilarity-based prediction model, and software defect prediction.

(1) Construction of the dissimilarity-based prediction model.

Building the dissimilarity-based software defect prediction algorithm consists mainly of three parts: prototype selection, dissimilarity conversion and building the prediction model. First, a prototype selection method picks representative samples from the raw data set as prototypes and creates the prototype set. Then, the dissimilarity between the samples of the raw data set and the samples of the prototype set is calculated, which maps the raw data set into the corresponding dissimilarity space. Finally, a traditional classification algorithm is used to build the software defect prediction model in the dissimilarity space.

1. Prototype selection.

Prototype selection aims to choose representative samples from the raw data set as prototypes, which serve as the reference for dissimilarity conversion. To choose prototypes well, researchers have proposed methods such as the Jarvis-Patrick clustering (JPC) algorithm based on shared nearest neighbors (SNN), random selection (RandomC, RC), linear programming (LinPro), feature selection (FeaSel), mode seeking (ModeSeek), cluster-based linear programming (KCenters-LP) and editing and condensing (EdiCon). Among these, the random selection method (RC) randomly draws a specified number of samples from the raw data set as prototypes, and is one of the simplest and most effective prototype selection methods. Pekalska et al. compared the influence of the above prototype selection methods on the performance of dissimilarity-based classification, and their experimental results show that RC and KCenters perform best overall, but RC is more convenient.

To ensure the prediction performance of the present invention, RC is selected as the prototype selection method: representative samples are drawn from the original software defect data set to create the prototype set. Suppose $D$ denotes a software defect data set that poses a two-class problem, i.e. $C = \{c_1, c_2\}$; $D_t$ is the training set, and $D_1$ and $D_2$ denote the defective and defect-free training subsets respectively. To extract $r$ samples from $D$ as the prototype set $P$, the random selection method draws $r_1$ samples from $D_1$ and $r_2$ samples from $D_2$ at random, so that the prototype set $P$ is formed from these $r_1 + r_2 = r$ samples.
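The per-class random draw described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_prototypes`, the label encoding (1 for defective, 0 for defect-free) and the toy data are hypothetical and do not come from the patent.

```python
import random


def select_prototypes(samples, labels, r1, r2, defective_label=1, seed=0):
    """Draw r1 defective and r2 defect-free samples at random, without replacement."""
    rng = random.Random(seed)
    defective = [x for x, c in zip(samples, labels) if c == defective_label]
    defect_free = [x for x, c in zip(samples, labels) if c != defective_label]
    if r1 > len(defective) or r2 > len(defect_free):
        raise ValueError("a class has fewer samples than the requested prototypes")
    # The prototype set P holds r = r1 + r2 samples.
    return rng.sample(defective, r1) + rng.sample(defect_free, r2)


# Hypothetical toy data: two attributes per sample.
samples = [(1.0, 2.0), (1.5, 1.8), (8.0, 9.0), (7.5, 8.5), (9.0, 9.5), (8.2, 7.9)]
labels = [1, 1, 0, 0, 0, 0]

prototypes = select_prototypes(samples, labels, r1=1, r2=2)
print(prototypes)
```

Drawing from each class separately guarantees that the minority (defective) class is represented in the prototype set, which a single draw from the whole data set would not.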

2. Dissimilarity conversion.

Dissimilarity conversion aims to map the raw data set into the dissimilarity space.

Suppose $D = \{x_1, x_2, \ldots, x_n\}$ denotes a software defect data set with $n$ samples, where $x_i = \{a_{i1}, a_{i2}, \ldots, a_{im}, c_i\}$ denotes the $i$-th sample in $D$; $x_i$ consists of $m$ independent attributes and one class attribute. $P = \{p_1, p_2, \ldots, p_r\}$ denotes the prototype set formed by $r$ representative samples, where $p_i = \{a'_{i1}, a'_{i2}, \ldots, a'_{im}, c'_i\}$ denotes the $i$-th prototype.

With the prototype set $P$ as reference, the dissimilarity between each sample in the original software defect data set $D$ and each sample in $P$ is calculated. This maps $D$ into the corresponding dissimilarity space and yields the dissimilarity-based data set $D^{dis} = \{x^{dis}_1, x^{dis}_2, \ldots, x^{dis}_n\}$, which consists of $n$ samples. Each sample $x^{dis}_i = \{d(x_i, p_1), d(x_i, p_2), \ldots, d(x_i, p_r), c_i\}$ consists of $r+1$ attributes, where $d(x_i, p_j)$ denotes the dissimilarity between sample $x_i$ and prototype $p_j$.

When assessing the dissimilarity between dense, continuous samples, the dissimilarity is usually described by a metric distance: the larger the distance, the more dissimilar the samples; conversely, the smaller the distance, the more similar they are. The most common distance metrics at present are the Euclidean distance, the Manhattan distance and the Minkowski distance. The Minkowski distance generalizes the Euclidean and Manhattan distances. For a sample $x_i$ with attributes $a_{ik}$ and a prototype $p_j$ with attributes $a'_{jk}$, it is computed as:

$$d(x_i, p_j) = \left( \sum_{k=1}^{m} \left| a_{ik} - a'_{jk} \right|^{l} \right)^{1/l}$$

In the formula, $l$ is a real number with $l \geq 1$.

When $l = 1$, it is the Manhattan distance, i.e. the $L_1$ norm;

When $l = 2$, it is the Euclidean distance, i.e. the $L_2$ norm, which is commonly used to measure the dissimilarity between samples in dense, continuous data sets;

When $l = \infty$, it is the supremum distance, also known as the $L_\infty$ norm or the Chebyshev distance, which measures the maximum attribute-wise difference between samples.
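The Minkowski distance and its three special cases can be checked with a short sketch. The function name `minkowski` and the sample values are hypothetical; the function follows the standard definition of the Minkowski distance, with the limit case handled as the largest attribute-wise difference.

```python
import math


def minkowski(x, p, l=2.0):
    """Minkowski distance of order l (l >= 1) between two attribute vectors."""
    if l < 1:
        raise ValueError("l must be >= 1")
    diffs = [abs(a - b) for a, b in zip(x, p)]
    if math.isinf(l):
        # Limit case: supremum (Chebyshev) distance.
        return max(diffs)
    return sum(d ** l for d in diffs) ** (1.0 / l)


x, p = (1.0, 2.0), (4.0, 6.0)         # attribute-wise differences are 3 and 4

manhattan = minkowski(x, p, 1)         # 3 + 4 = 7
euclid = minkowski(x, p, 2)            # sqrt(9 + 16) = 5
chebyshev = minkowski(x, p, math.inf)  # max(3, 4) = 4
print(manhattan, euclid, chebyshev)
```

As the order grows, the largest attribute-wise difference dominates the sum, which is why the distance never increases with the order (7, then 5, then 4 for this pair).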

Because most attribute features in software defect data sets are dense and continuous, the metric-based Euclidean distance is chosen to measure the dissimilarity between samples, thereby converting the software defect data set from the original feature space to the dissimilarity space.
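A minimal sketch of the conversion itself, assuming the Euclidean distance chosen above: every sample is replaced by its distances to the r prototypes, followed by its class label. The function names and toy data are hypothetical.

```python
import math


def euclidean(x, p):
    """Euclidean distance between two attribute vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))


def to_dissimilarity_space(samples, labels, prototypes, dist=euclidean):
    """Map each sample to r dissimilarities (one per prototype) plus its class label."""
    return [[dist(x, p) for p in prototypes] + [c] for x, c in zip(samples, labels)]


# Hypothetical toy data: m = 2 attributes, n = 3 samples, r = 2 prototypes.
samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
labels = [0, 1, 1]
prototypes = [(0.0, 0.0), (3.0, 4.0)]

mapped = to_dissimilarity_space(samples, labels, prototypes)
for row in mapped:
    print(row)
```

Each row has r + 1 = 3 entries regardless of the original number of attributes m, so the dimensionality of the new representation is set by the size of the prototype set.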

(2) Software defect prediction.

When an unknown sample arrives, the dissimilarity between the unknown sample and each sample in the prototype set is first calculated, which maps the unknown sample into the dissimilarity space. The software defect prediction model that has already been built is then used to predict the unknown sample in the dissimilarity space, which shows whether the unknown sample is defective.
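The two stages can be put together in a small end-to-end sketch. It assumes a 1-nearest-neighbor classifier in the dissimilarity space (nearest neighbor is one of the classifiers named in the experiments, but any classifier could stand in), fixed prototypes instead of a random draw so that the result is reproducible, and hypothetical names and data throughout. It needs Python 3.8 or later for `math.dist`.

```python
import math


def embed(x, prototypes):
    """Represent x by its Euclidean distance to every prototype."""
    return [math.dist(x, p) for p in prototypes]


def build_model(train_x, train_y, prototypes):
    """Stage 1: map the training set into the dissimilarity space and store it."""
    return prototypes, [embed(x, prototypes) for x in train_x], list(train_y)


def predict(model, unknown):
    """Stage 2: map the unknown sample, then return the label of its nearest neighbor."""
    prototypes, vectors, labels = model
    u = embed(unknown, prototypes)
    nearest = min(range(len(vectors)), key=lambda i: math.dist(u, vectors[i]))
    return labels[nearest]


# Hypothetical toy training data with two attributes per sample.
train_x = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (9.0, 7.5)]
train_y = ["defective", "defective", "defect-free", "defect-free"]
prototypes = [train_x[0], train_x[2]]          # one prototype per class

model = build_model(train_x, train_y, prototypes)
first = predict(model, (1.1, 0.9))
second = predict(model, (8.5, 8.0))
print(first, second)
```

The unknown sample never meets the original attribute space at prediction time except through its distances to the prototypes, which is the point of the representation.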

The present invention consists of three stages: prototype selection, dissimilarity conversion and classification. The time complexity of the algorithm is the sum of the time complexities of these stages. Given a class-imbalanced software defect data set $D$ with $n$ samples and $m$ attributes, the time complexity of each stage when the present invention performs defect prediction on $D$ is computed as follows:

1) Time complexity of prototype selection: when $r$ prototypes are chosen from a class-imbalanced data set with $n$ samples, different prototype selection methods have different time complexities. The random selection method samples without replacement $r$ times, so its time complexity is $T_{RC} = O(r)$.

2) Time complexity of dissimilarity conversion: dissimilarity conversion calculates the dissimilarity between the samples of the original software defect data set and the samples of the prototype set in order to convert the data set to the dissimilarity space. Its time complexity is $T_{DT} = O(n \cdot r)$.

3) Time complexity of defect prediction: when prediction is performed on the dissimilarity-based software defect data set with $n$ samples and $r+1$ attributes, the time complexity depends on the selected machine learning algorithm: $T_C = O(C(n, r+1))$.

From the above it follows that the time complexity of the present invention is $T_{DSDPA} = T_{RC} + T_{DT} + T_C$, i.e. $T_{DSDPA} = O(r) + O(n \cdot r) + O(C(n, r+1))$. The formula shows that the scheme of the embodiment of the present invention has a relatively low time complexity, is relatively easy to implement, and processes data quickly.
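The O(n·r) term can be made concrete by counting distance evaluations during the conversion. The sketch below is an illustration only: the wrapper class, the sizes n = 50, m = 4, r = 5 and the random data are hypothetical. It counts one unit per distance evaluation, which is the assumption under which the conversion costs n·r; each Euclidean evaluation itself reads all m attributes.

```python
import math
import random


class CountingDistance:
    """Euclidean distance that records how many times it is evaluated."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, p):
        self.calls += 1
        return math.dist(x, p)


rng = random.Random(0)
n, m, r = 50, 4, 5                                  # hypothetical sizes
data = [[rng.random() for _ in range(m)] for _ in range(n)]

prototypes = rng.sample(data, r)                    # prototype selection: r draws
dist = CountingDistance()
mapped = [[dist(x, p) for p in prototypes] for x in data]   # dissimilarity conversion

print(dist.calls)                                   # one evaluation per (sample, prototype) pair
```

Doubling either the number of samples or the number of prototypes doubles the count, while the classifier term depends on whichever learning algorithm is plugged in.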

The embodiment of the present invention compares the prediction performance (AUC), on 18 software defect data sets, of six machine learning methods, namely nearest neighbor (IB1), decision tree (J48), neural network (MLP), naive Bayes (NB), random forest (RF) and support vector machine (SVM), when used with the present invention and when used on the raw data sets. Table 1 below compares the performance of each algorithm.

Table 1

As Table 1 shows, the present invention effectively improves the prediction performance of conventional machine learning methods on software defect data sets, especially when IB1 and the support vector machine are used for software defect prediction: the classification performance of the IB1 algorithm on the software defect data sets improves by 11.11% on average, and that of the SVM algorithm by 12% on average. The average classification performance of the J48, MLP, NB and RF algorithms on the software defect data sets also improves, by 3.12%, 2.5%, 2.6% and 2.7% respectively.
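For readers unfamiliar with the AUC measure used in the comparison, the following sketch computes it from its pairwise-ranking definition: the fraction of (defective, defect-free) pairs in which the defective sample receives the higher score, with ties counted as one half. The scores and labels are hypothetical illustration data and have no connection to the results reported in Table 1.

```python
def auc(scores, labels, positive=1):
    """Area under the ROC curve via the pairwise-ranking definition."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores; label 1 marks a defective sample.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

value = auc(scores, labels)   # 8 of the 9 pairs are ranked correctly
print(round(value, 4))
```

Because it depends only on how samples are ranked, the measure is not affected by the ratio of defective to defect-free samples, which makes it a common choice for class-imbalanced problems.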

To improve the discriminating power of the attribute features in software defect data sets, the embodiment of the present invention proposes a dissimilarity-based software defect prediction algorithm that replaces the original attribute features with the dissimilarities between samples. This not only preserves the original statistical information of the data set but also captures the structural information within it, which essentially improves the discriminating power of the attribute features of the data set and in turn improves the learning performance of conventional machine learning algorithms on the software defect prediction problem.

The embodiment of the present invention uses the random selection method for prototype selection and the Euclidean distance to measure the dissimilarity between samples, and converts the 18 software defect data sets to the dissimilarity space. It then uses six conventional machine learning algorithms, namely nearest-neighbor classification (1-NN, IB1), decision tree (DecisionTree, DT), neural network (Multi-layer Perceptron, MLP), naive Bayes (Bayes, NB), random forest (Random Forest, RF) and support vector machine (Support Vector Machine, SVM), to build prediction models on the software defect data sets in the dissimilarity space. The results verify that the present invention effectively improves the accuracy of software defect prediction.

When evaluating the performance of the imbalanced learning methods, the embodiment of the present invention uses 10 × 10-fold cross-validation (10-fold cross-validation repeated 10 times), which makes full use of the data while reducing, as far as possible, the chance error introduced by random ordering.
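The evaluation protocol can be sketched as a generator of train/test index splits. This is an assumption-laden illustration: the function name is hypothetical, and the patent does not say whether the folds are stratified by class, so the sketch uses plain (unstratified) shuffling.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                         # a fresh random order per repeat
        folds = [indices[i::k] for i in range(k)]    # k nearly equal-sized folds
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield rep, f, train, test


splits = list(repeated_kfold(n_samples=25, k=10, repeats=10))
print(len(splits))                                   # 10 repeats x 10 folds
```

Averaging a metric over all 100 splits smooths out the effect of any single random ordering. On strongly imbalanced data a stratified variant, which keeps the class ratio roughly constant in every fold, is a common alternative.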

Although preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will recognize that various improvements, additions and substitutions are also possible. The scope of the present invention should therefore not be limited to the above embodiments.

Claims

1. A software defect prediction method, characterized by including:

selecting a preset number of samples from a first preset raw data set according to a first preset selection rule, to obtain a first prototype data set;

calculating, according to a first preset distance algorithm, a first data set of dissimilarities between the first preset raw data set and the first prototype data set;

inputting the data in the first data set into a preset software defect prediction model, to obtain a software defect prediction result for the software corresponding to the first preset raw data set, wherein the preset software defect prediction model is a model built from preset dissimilarities.

2. The software defect prediction method according to claim 1, characterized in that, before selecting the preset number of samples from the first preset raw data set according to the first preset selection rule to obtain the first prototype data set, the method further includes:

selecting a preset number of samples from a second preset raw data set according to a second preset selection rule, to obtain a second prototype data set;

calculating, according to a second preset distance algorithm, a second data set of dissimilarities between the second preset raw data set and the second prototype data set;

building the preset software defect prediction model from the data in the second data set and a preset classification algorithm.

3. The software defect prediction method according to claim 1 or 2, characterized in that the first preset selection rule or the second preset selection rule includes at least one of the following: a random selection method, and a cluster-based linear programming method.

4. The software defect prediction method according to claim 1 or 2, characterized in that the first preset distance algorithm or the second preset distance algorithm includes at least one of the following: a Euclidean distance algorithm, a Manhattan distance algorithm, and a Minkowski distance algorithm.

5. A software defect prediction device, characterized by including:

a first selection module, configured to select a preset number of samples from a first preset raw data set according to a first preset selection rule, to obtain a first prototype data set;

a first calculation module, configured to calculate, according to a first preset distance algorithm, a first data set of dissimilarities between the first preset raw data set and the first prototype data set;

a prediction module, configured to input the data in the first data set into a preset software defect prediction model, to obtain a software defect prediction result for the software corresponding to the first preset raw data set, wherein the preset software defect prediction model is a model built from preset dissimilarities.

6. The software defect prediction device according to claim 5, characterized by further including:

a second selection module, configured to select a preset number of samples from a second preset raw data set according to a second preset selection rule, to obtain a second prototype data set;

a second calculation module, configured to calculate, according to a second preset distance algorithm, a second data set of dissimilarities between the second preset raw data set and the second prototype data set;

a building module, configured to build the preset software defect prediction model from the data in the second data set and a preset classification algorithm.

7. The software defect prediction device according to claim 5 or 6, characterized in that the first preset selection rule or the second preset selection rule includes at least one of the following: a random selection method, and a cluster-based linear programming method.

8. The software defect prediction device according to claim 5 or 6, characterized in that the first preset distance algorithm or the second preset distance algorithm includes at least one of the following: a Euclidean distance algorithm, a Manhattan distance algorithm, and a Minkowski distance algorithm.

9. A storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the software defect prediction method according to any one of claims 1 to 4.

10. An electronic equipment including at least a memory and a processor, a computer program being stored on the memory, characterized in that the processor, when executing the computer program on the memory, implements the steps of the software defect prediction method according to any one of claims 1 to 4.