CN110147321A: Method for identifying high-defect-risk modules based on a software network
 Publication number
 CN110147321A (application CN201910318037.1A)
 Authority
 CN
 China
 Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
 Granted
Classifications

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F11/00—Error detection; Error correction; Monitoring
 G06F11/36—Preventing errors by testing or debugging software
 G06F11/3604—Software analysis for verifying properties of programs
 G06F11/3608—Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
 G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
 G06K9/62—Methods or arrangements for recognition using electronic means
 G06K9/6217—Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
 G06K9/6256—Obtaining sets of training patterns; Bootstrap methods, e.g. bagging, boosting

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
 G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
 G06K9/62—Methods or arrangements for recognition using electronic means
 G06K9/6267—Classification techniques
Abstract
The present invention proposes a method for identifying high-defect-risk modules based on a software network, and belongs to the field of complex software networks. The method comprises: step 1, constructing an adaptive classifier that contains multiple classifiers; step 2, performing adaptive feature selection; step 3, performing adaptive threshold optimization; step 4, tuning the internal parameters of the adaptive classifier; and step 5, adaptively selecting the optimal prediction model and then using it to identify the high-defect-risk modules of the software network under test. Whatever the type of defect data set, the method completes these five tasks (construction of the adaptive classifier, adaptive feature selection, adaptive threshold optimization, tuning of classifier internal parameters, and selection of the optimal prediction model) according to the characteristics of the data set itself, obtains the best defect prediction result, and identifies the high-risk software modules.
Description
Technical field
The present invention applies to the field of complex software networks and is a method for identifying high-defect-risk modules based on a software network.
Background art
With the rapid development of the Internet, software offers increasingly rich functionality and supports people in all kinds of production activities. The safety and reliability of software receive more and more attention, and identifying the high-defect-risk modules in software as early as possible has become a popular research area. Accurately identifying modules with a high defect risk can improve software quality and reduce development cost.
A large body of research at home and abroad has shown that 80% of software defects reside in 20% of the software code. In actual software testing, however, a uniform-coverage principle is often applied that requires 100% coverage of requirements and statements, which wastes a large amount of test resources. In some third-party software testing, the hit rate of test cases is often below 1%, or even lower. As software engineering matures and big data technology becomes practical, many enterprises and users have accumulated more and more historical product defect cases, and carrying out targeted testing has become a domestic trend for improving test efficiency and finding more substantive defects. Some studies abroad show that the probability of detection (PD) obtained with defect prediction models can reach 71%, which is higher than the detection probability of software code review (60%) and higher than the detection rate of black-box testing. Studies of a large number of cases have found that models built with machine learning classifier algorithms, association rules, and the like achieve good high-defect-risk prediction. Compared with ordinary software testing, software testing built on defect prediction significantly improves the defect detection rate, reduces the number of test cases, shortens testing time, and clearly helps to improve software reliability.
The techniques currently used in defect prediction models mainly include univariate statistical analysis, multivariate statistical analysis, statistical analysis combined with expert analysis, machine learning, and machine learning combined with statistical analysis (reference [1]). Univariate statistical analysis considers few features, for example only whether the number of lines of code is related to defects; multivariate statistical analysis can analyze the correlation of multiple features with defects. Machine learning methods mainly include classifier algorithms, clustering algorithms, and association rules. According to Catal's analysis, defect prediction models based on machine learning algorithms appeared on a large scale after 2005, and by the end of 2009 papers based on machine learning methods accounted for 66% of all related papers.
The machine learning method used differs according to the defect prediction target. Chen Xiang, Gu Qing et al., as cited in reference [1], give a good summary of machine learning methods. They point out that if program modules are fine-grained, such as class level or file level, the target is to predict the defect proneness of a module and classification methods are commonly used, including logistic regression, Bayesian networks, and decision trees. If modules are coarse-grained, such as package level or subsystem level, the target is to predict the number of defects or the defect density, and regression analysis is commonly used. Beyond these common machine learning methods, others such as semi-supervised learning and active learning are also gradually being applied.
Elish et al. used NASA data sets to systematically compare SVM (support vector machine) with LR (logistic regression), KNN (K-nearest neighbors), MLP (multilayer perceptron), RBF (radial basis function), BBN (Bayesian network), NB (naive Bayes), RF (random forest), and DT (decision tree), and concluded that SVM is on the whole better than the other 8 methods (reference [2]). Lessmann et al. also used the public NASA data sets, with the AUC (area under the ROC curve) value as the model performance measure, and comprehensively compared 22 different machine learning methods in 6 categories (ensemble methods, statistical methods, nearest-neighbor methods, decision tree methods, support vector machine methods, and neural networks); the results showed very little difference between most methods. Ghotra et al. found that, after noise is removed from the NASA and PROMISE data sets, different machine learning methods show larger performance differences (reference [3]). Shepperd et al. used random-effects ANOVA (analysis of variance) to analyze the factors that influence defect prediction model performance and found that the choice of machine learning method has a small effect on performance, whereas there are large differences between research groups. Panichella et al. trained prediction models with genetic algorithms (GAs), and their study showed that regression models trained by GAs significantly outperform their traditional counterparts. Fu Yiqi et al. showed that the best machine learning algorithm is not fixed under different evaluation criteria; they therefore combined different machine learning algorithms to build an improved defect prediction model and demonstrated its superiority experimentally on the Eclipse data set (reference [4]).
To address the scarcity of labels in software defect data sets, Liao Shengping et al. proposed a sampling-based semi-supervised support vector machine method for software defect prediction, which ensures that the number of defective samples in the labeled data is not too low. Experimental results show that, compared with supervised learning methods, it achieves comparable prediction performance with fewer training samples (reference [5]). Chang Ruihua et al. proposed a new software defect prediction method based on hyper-Euclidean distance affinity propagation. It overcomes the difficulty that earlier affinity propagation algorithms, which express data similarity with the Euclidean distance, have in handling complex data types, and it improves the accuracy of prediction on unlabeled software defect data (reference [6]). Yang C et al. proposed DLA, a new subspace-based learning algorithm that overcomes the matrix singularity problem and the small sample size problem in software defect prediction, and showed experimentally that DLA performs better than LDA (linear discriminant analysis) and PCA (principal component analysis) in information extraction and prediction performance (reference [7]).
Machine learning classifiers are increasingly mature, but not every classifier algorithm suits every type of data set. Linear regression, for example, classifies linearly separable data sets well but performs poorly on data sets that are not linearly separable. The reason is that the defect prediction model it builds lacks adaptivity to the data. Defect data sets are diverse: their types may be continuous or discrete, their distributions may or may not be Gaussian, and their class proportions may be balanced or imbalanced. At present, each classifier algorithm achieves a different classification effect on different types of defect data sets.
The references are as follows:
[1] Ma Ying. Research on software defect prediction technology based on machine learning [D]. University of Electronic Science and Technology of China, 2012.
[2] Elish KO, Elish MO. Predicting defect-prone software modules using support vector machines. Journal of Systems and Software, 2008, 81(5): 649-660.
[3] Ghotra B, McIntosh S, Hassan AE. Revisiting the impact of classification techniques on the performance of defect prediction models. In: Proc. of the Int'l Conf. on Software Engineering. 2015. 789-800.
[4] Fu Yiqi, Dong Wei, Yin Liangze, et al. Software defect prediction model based on ensemble machine learning algorithms [C]. National Software and Application Academic Conference, 2015.
[5] Liao Shengping, Xu Ling, Yan Meng. Semi-supervised support vector machine software defect prediction method based on sampling [J]. Computer Engineering and Applications, 2017, 53(14): 161-166.
[6] Chang Ruihua, Shen Xiaowei. Software defect prediction method based on hyper-Euclidean distance affinity propagation [J]. Application Research of Computers, 2017, 34(05): 1384-1387.
[7] Yang C, Yang C. Software defect prediction based on manifold learning in subspace selection [C]. International Conference on Intelligent Information Processing. ACM, 2016: 1-7.
Summary of the invention
To meet the need for good prediction results on different types of data sets in defect prediction, the present invention proposes a method for identifying high-defect-risk modules based on a software network. It provides a defect prediction framework suited to the characteristics of software network model metrics: by establishing a feedback mechanism on the prediction performance of machine learning models and taking the data characteristics into account, it selects a mature machine learning prediction model to perform defect prediction.
The method for identifying high-defect-risk modules based on a software network provided by the present invention comprises the following steps:
Step 1: construct an adaptive classifier that contains multiple classifiers.
Step 2: perform adaptive feature selection, comprising: (1) preprocessing: if a feature takes the same value in more than 80% of the instances in the data set, delete that feature; (2) for classifiers with a penalty factor, select features with the recursive feature elimination algorithm; (3) for classifiers without a penalty factor, select features with the chi-square test method of univariate feature selection.
Step 3: perform adaptive threshold optimization. For each classifier, build a prediction model on the training set and input the validation set into the prediction model to obtain a set of predicted values. Traverse the predicted values in this set, using each in turn to replace the classifier's threshold, and each time compute the performance indicator AUC of the prediction model from the predicted label set and the true label set. Select the threshold that gives the highest AUC as the classifier's optimal threshold.
Step 4: tune the internal parameters of the adaptive classifier. For the ridge regression and lasso regression classifiers, find the optimal step size with random search; for the K-nearest-neighbor classification model, select the optimal number of samples k with grid search.
Step 5: adaptively select the optimal prediction model. Build defect prediction models on the training set with the different classifiers, compute the AUC of each defect prediction model on multiple validation sets, and take the defect prediction model with the largest mean AUC as the optimal prediction model. Then use the optimal prediction model to identify the high-defect-risk modules of the software network under test.
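The selection rule of step 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_best_model`, the classifier names, and the AUC values are all hypothetical and do not come from the patent.

```python
from statistics import mean

def select_best_model(auc_by_model):
    """Return the model name whose mean AUC over the validation sets is largest."""
    return max(auc_by_model, key=lambda name: mean(auc_by_model[name]))

# Hypothetical AUC values of three classifiers on three validation sets.
auc_by_model = {
    "logistic_regression": [0.71, 0.74, 0.69],
    "random_forest": [0.78, 0.75, 0.80],
    "k_nearest_neighbors": [0.66, 0.70, 0.68],
}
best_model = select_best_model(auc_by_model)
print(best_model)
```

Averaging over several validation sets, rather than scoring on one, reduces the chance that a classifier is chosen because of a single favorable split.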
Compared with the prior art, the present invention has the following advantages:
(1) The method for identifying high-defect-risk modules based on a software network proposed by the present invention addresses the different difficulties encountered in machine learning modeling by automatically completing five tasks: construction of the adaptive classifier, adaptive feature selection, adaptive threshold optimization, tuning of classifier internal parameters, and selection of the optimal prediction model. It thereby identifies the high-defect-risk modules of a software network and achieves a better identification result.
(2) In the method of the present invention, the entire identification process can run in the background. Because the process is fully automated, labor and time costs are kept as low as possible.
Brief description of the drawings
Fig. 1 is the overall flow chart of the method for identifying high-defect-risk modules of the present invention;
Fig. 2 is a schematic diagram of the classifiers used by the present invention to build models;
Fig. 3 is a schematic diagram of the adaptive feature selection method in the method of the present invention;
Fig. 4 is a flow chart of the adaptive threshold optimization algorithm in the method of the present invention;
Fig. 5 is a schematic diagram of the method for selecting the optimal prediction model in the method of the present invention.
Detailed description of the embodiments
To help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings.
To meet the need for good prediction results on different types of data sets in defect prediction, the present invention proposes a method for identifying high-defect-risk modules based on a software network, comprising: construction of the adaptive classifier, adaptive feature selection, adaptive threshold optimization, tuning of classifier internal parameters, and selection of the optimal prediction model. Based on machine learning model optimization and selection methods, the invention adaptively selects the best features of the defect data set, sets the classifier thresholds, and tunes the internal parameters according to the characteristics of the data set itself. It then selects the optimal defect prediction model, obtains the best defect prediction result, and improves the hit rate of software testing. The method is shown in Fig. 1, and each step is explained below.
Step 1: construct the adaptive classifier.
Machine learning has four main paradigms: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. The present invention focuses on supervised classifiers and, according to the scope of application and the strengths and weaknesses of the different classifiers, constructs 16 different classifiers in 7 major categories. From these 16 classifiers the most suitable one can be selected adaptively according to the characteristics of the input data set, so that the best result is obtained in model building and algorithm selection. Fig. 2 shows the 16 classifiers (classifier algorithms) used by the present invention. They include not only the most popular current classifier algorithms but also improved versions of some classifiers. For example, among the generalized linear models, linear regression and ridge regression are currently the most popular classifier algorithms, while lasso regression, least angle regression, logistic regression, and stochastic gradient descent are improvements on linear regression and ridge regression. The 16 classifiers used by the present invention are: linear regression, ridge regression, lasso regression, least angle regression, logistic regression, and stochastic gradient descent (generalized linear models); support vector machine (vector machine model); K-nearest neighbors (nearest-neighbor model); Gaussian naive Bayes (Bayesian model); decision tree (decision tree model); random forest, extremely randomized trees, AdaBoost, and gradient boosting decision tree (ensemble models); and linear discriminant analysis and quadratic discriminant analysis (discriminant analysis models).
Because differences in metrics and in preprocessing techniques affect prediction performance far more than the choice of machine learning prediction model, selecting suitable and accurate metrics and effectively removing noisy data and redundant dimensions is more beneficial for improving the performance of software defect prediction.
Step 2: perform adaptive feature selection.
The structural features related to defects differ between defect data sets. Reasonable feature selection can strengthen the correlation between the features and the target value and remove redundant features. Feature selection methods currently fall into three main groups (reference [8]: Gao Haoyang. Design and implementation of a P2P financial risk control system based on big data [D]. Beijing Jiaotong University, 2018.): (1) Filter methods score each feature by divergence or correlation, set a threshold or a number of features to keep, and then select features. (2) Wrapper methods build a model with a machine learning algorithm and, according to an objective function (usually the prediction score), select or exclude several features at a time. (3) Embedded methods first train a machine learning algorithm and model to obtain a weight coefficient for each feature, and then select features in descending order of coefficient. For a given data set, however, a single feature selection method often cannot achieve both purposes at once. After analysis, the present invention therefore combines several feature selection methods to strengthen the correlation between the features and the target value and to select the best features for the currently input defect data set.
The adaptive feature selection method of the present invention is shown in Fig. 3. The filter method is used first to remove low-variance features, that is, features that contribute little. Then classifiers that have their own penalty factor use recursive feature elimination, and classifiers without a penalty factor use the chi-square test method of univariate feature selection. The specific steps are as follows:
Step 201: first preprocess the data set by removing low-variance features. If a feature takes the same value in more than 80% of the instances in the data set, the feature can be considered to contribute too little and can be removed. The present invention therefore uses low-variance feature removal as the preprocessing step of feature selection and sets the variance threshold to 0.16. For a Boolean feature that takes one value in a fraction p of the instances, the variance is p(1 - p), so p = 0.8 gives 0.8 × 0.2 = 0.16. The variance of every feature in the data set is computed, and features whose variance is not higher than 0.16 are removed, while features whose variance is greater than 0.16 are retained.
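As a minimal sketch of step 201, the following Python code filters columns by population variance using only the standard library. The helper name `remove_low_variance` and the sample rows are assumptions made for illustration; the patent does not specify an implementation.

```python
from statistics import pvariance

def remove_low_variance(rows, threshold=0.16):
    """Keep only the columns whose population variance exceeds the threshold.

    Returns the kept column indices and the filtered rows.
    """
    n_cols = len(rows[0])
    kept = [j for j in range(n_cols)
            if pvariance([row[j] for row in rows]) > threshold]
    return kept, [[row[j] for j in kept] for row in rows]

# Hypothetical data: column 0 is 1 in 80% of rows (variance 0.8 * 0.2 = 0.16),
# so it is removed; columns 1 and 2 vary more and are kept.
rows = [
    [1, 0, 3.0],
    [1, 1, 1.0],
    [1, 0, 2.0],
    [1, 1, 5.0],
    [0, 0, 4.0],
]
kept, filtered = remove_low_variance(rows)
print(kept)
```

Note that the 0.16 cut-off corresponds exactly to the 80% rule only for two-valued features; for features on other scales the same variance threshold is a heuristic.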
Step 202: for classifiers with a penalty factor, select features with the recursive feature elimination algorithm. Recursive feature elimination is a greedy algorithm for finding the optimal feature subset. Its main idea is to train a base model over multiple rounds; after each round, the features with the smallest weight coefficients are removed, and the next round is trained on the new feature set (reference [9]: Chen Shan. Research on trunk road network traffic state prediction and visualization methods based on machine learning [D]. 2017.). For classifiers that have their own penalty factor, such as the random forest algorithm and the linear regression algorithm, the present invention uses the classifier's own algorithm as the base model, builds a defect prediction model on the training set with 10-fold cross-validation, and uses the AUC value as the scoring criterion. Lower-scoring features are eliminated step by step and the highest-scoring features in the data set are retained, which yields the optimal feature subset for that data set.
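The elimination loop of step 202 can be sketched as follows. This is a simplified illustration, not the patent's implementation: `fit_weights` and `score` are caller-supplied placeholders (in the method described above they would be the base classifier's feature weights and its 10-fold cross-validated AUC), and the feature names and toy numbers are hypothetical.

```python
def recursive_feature_elimination(features, fit_weights, score):
    """Drop the weakest feature each round; return the best-scoring subset seen."""
    current = list(features)
    best_subset, best_score = list(current), score(current)
    while len(current) > 1:
        weights = fit_weights(current)            # one weight per remaining feature
        weakest = min(current, key=lambda f: abs(weights[f]))
        current.remove(weakest)
        s = score(current)
        if s >= best_score:                       # ties prefer the smaller subset
            best_subset, best_score = list(current), s
    return best_subset, best_score

# Toy stand-ins for a trained base model (assumed values, for illustration only).
USEFUL = {"degree", "betweenness"}

def toy_weights(cols):
    table = {"degree": 0.9, "betweenness": 0.5, "noise_a": 0.05, "noise_b": 0.01}
    return {c: table[c] for c in cols}

def toy_score(cols):
    useful = sum(1 for c in cols if c in USEFUL)
    return 0.6 + 0.1 * useful - 0.02 * (len(cols) - useful)

subset, subset_score = recursive_feature_elimination(
    ["degree", "betweenness", "noise_a", "noise_b"], toy_weights, toy_score)
print(subset, round(subset_score, 2))
```

Removing one feature per round is the simplest variant; removing several per round, as the text allows, trades some accuracy of the search for fewer training rounds.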
Step 203: for classifiers without a penalty factor, select features with the chi-square test method of univariate feature selection. Univariate feature selection computes a statistic for each variable separately and removes irrelevant variables according to a given criterion. The method does not build a model with a classifier algorithm; it is a statistical method that is simple, easy to run, and generally useful for understanding the data. The classical chi-square test examines the correlation between a qualitative independent variable and a qualitative dependent variable, and the features most related to defects can be selected according to the p value. When p < 0.05, the feature is correlated with the target value; when p < 0.01, the feature is highly correlated with the target value. For classifiers that have no penalty factor of their own, such as the K-nearest-neighbor algorithm, the present invention analyzes the defect data set with the chi-square test and computes the p value of each feature against the target value. The significance threshold is set to 0.05: features whose p value is not less than 0.05 are removed and features whose p value is less than 0.05 are retained, that is, the features that are correlated with the target value in this data set are kept.
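A minimal sketch of the selection rule in step 203 is given below for the simplest case, a binary feature against a binary defect label (a 2×2 contingency table, one degree of freedom). The function name, feature names, and data are hypothetical, and the sketch assumes every row and column total of the table is nonzero.

```python
import math

def chi2_pvalue_2x2(feature, label):
    """Pearson chi-square test of independence for two 0/1 sequences (1 dof)."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n        # assumes no empty row or column
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

label = [1] * 10 + [0] * 10                       # 1 = defective module
features = {
    "coupled": [1] * 9 + [0] + [1] + [0] * 9,     # strongly associated with label
    "unrelated": ([1] * 5 + [0] * 5) * 2,         # independent of label
}
selected = [name for name, values in features.items()
            if chi2_pvalue_2x2(values, label)[1] < 0.05]
print(selected)
```

For continuous or multi-valued features the same rule applies, but the statistic and its degrees of freedom must be computed for the larger contingency table.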
Step 3: perform adaptive threshold optimization.
A threshold, also called a critical value, is the highest or lowest value at which an effect occurs. In machine learning, the threshold is the critical value that divides samples into classes: samples above the threshold fall into one class and samples below it into the other. In defect prediction, because defect data sets are diverse, a classifier scores differently with different thresholds. For different types of defect data sets, the threshold that gives the classifier its highest score is not necessarily the classifier's default value; the optimal threshold should be adjusted dynamically according to the characteristics of the data set itself. The present invention uses an adaptive threshold optimization method, shown in Fig. 4, that dynamically selects the optimal threshold for each defect data set in order to obtain a better classification effect.
The main idea of the adaptive threshold optimization method of the present invention is as follows. A prediction model is built on the training set data, and the feature data of the validation set are input into the prediction model to obtain a set of predicted labels. The model performance indicator AUC is computed from the predicted label set and the true label set. The predicted values are traversed one by one to replace the threshold, and the threshold that gives the highest AUC is selected as the optimal threshold. For each classifier, steps 301 to 307 below are executed to find the optimal threshold.
Step 301: input the feature set G1 and true labels L1 of the training set, and the feature set G2 and true labels L2 of the validation set. The present invention targets software networks: the input feature sets G1 and G2 are sets of node features of the software network, and the labels L1 and L2 mark whether each node is a faulty node. Each node corresponds to a module in the software.
Step 302: build a defect prediction model from G1 and L1, and input G2 into the model to obtain the prediction result set S1. S1 contains, for each node, the predicted value that the node is a faulty node. The predicted values are usually floating-point numbers that may be positive or negative, and they are compared with the threshold set for the classifier to identify high-defect-risk nodes.
Step 303: sort the predicted values in S1 in ascending order to obtain the set S2.
Step 304: take the median of S2 as the classifier's initial threshold, denoted threshold, and predict on the feature set G2 again to obtain the predicted label set P1. P1 marks whether each node is a faulty node: a node whose predicted value is less than the threshold is judged normal; otherwise it is judged faulty.
Step 305: compute the AUC value from the true labels L2 and the predicted labels P1.
Step 306: starting from the median of S2, traverse S2 in the direction of increasing predicted value, using each predicted value in S2 to update threshold; repeat steps 304 and 305 to compute a new AUC value each time.
Step 307: take the threshold corresponding to the maximum AUC value as the classifier's optimal threshold and output it.
By abovementioned threshold value searching process, each classifier can the optimal threshold of adaptive setting according to the input data
Value.
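The threshold search of steps 303 to 307 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `auc` and `find_best_threshold`, the pairwise AUC computation, and the toy values are assumptions introduced here, and the defect prediction model of step 302 is taken as already having produced the predicted values.

```python
from statistics import median


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def find_best_threshold(pred_values, true_labels):
    """Search thresholds from the median of the predictions upward; keep the best AUC."""
    s2 = sorted(pred_values)                       # Step 303: ascending order
    start = median(s2)                             # Step 304: initial threshold
    candidates = [start] + [v for v in s2 if v > start]  # Step 306: increasing values
    best_threshold, best_auc = None, -1.0
    for threshold in candidates:
        # Step 304: value below threshold -> normal (0), otherwise defective (1)
        p1 = [0 if v < threshold else 1 for v in pred_values]
        score = auc(true_labels, p1)               # Step 305: AUC of predicted labels
        if score > best_auc:
            best_threshold, best_auc = threshold, score
    return best_threshold, best_auc                # Step 307


# Toy validation data (hypothetical): predicted values S1 and true labels L2.
s1 = [-0.8, -0.3, 0.1, 0.4, 0.9, 1.2]
l2 = [0, 0, 0, 1, 1, 1]
threshold, score = find_best_threshold(s1, l2)
print(threshold, score)
```

When several thresholds reach the same AUC, this sketch keeps the first one encountered; the text does not say how ties are broken.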
Step 4: perform adaptive tuning of classifier internal parameters.
The internal parameters of some classifiers need to be adjusted according to the characteristics of the data set itself; otherwise the accuracy of the resulting defect prediction model suffers. For example, the internal parameter of the commonly used ridge regression classifier, the step value alpha, plays a very significant role: if the step is set too large, the model may miss the optimal solution; if the step is too small, defect prediction takes too long to run. Optimizing the internal parameters of the classifiers improves the accuracy of the resulting defect prediction model.
After studying the internal parameters of the 16 classifiers constructed, the present invention found that most classifiers can use their default parameter values. For the ridge regression and lasso regression classification algorithms, however, the step value alpha has a large effect on the prediction accuracy of the model, and its value is a floating-point number between 0 and 1, so the best alpha value can be selected on the training set by random search. For the K-nearest-neighbor model, the setting of the number of samples k has a large effect on the accuracy of the resulting prediction model, and its value is a positive integer, so grid search is suitable for selecting the optimal k value on the training set.
(401) For the ridge regression and lasso regression classifiers, the present invention tunes the internal parameter by random search. Random search is a parameter tuning method that, for a fixed number of iterations, samples parameters from a random distribution and builds and evaluates a model for each parameter combination. For the ridge regression and lasso regression classifiers, the present invention applies random search with cross-validation on the training data set: over 100 iterations, with the AUC value as the evaluation criterion, a value between 0 and 1 is drawn at random as the step value and a prediction model is built with each classifier; the step value corresponding to the model with the highest AUC value is chosen as the optimal step value.
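The random search described in (401) can be sketched as below. This is an illustration under stated assumptions: `random_search_alpha` and `toy_cv_auc` are hypothetical names, uniform sampling is assumed because the text only says the value is drawn at random between 0 and 1, and the `evaluate` callback stands in for building the classifier with a given alpha and returning its cross-validated AUC on the training set.

```python
import random


def random_search_alpha(evaluate, n_iter=100, low=0.0, high=1.0, seed=0):
    """Draw alpha uniformly from [low, high] n_iter times and keep the best score.

    `evaluate` is a hypothetical callback: it should build the classifier with
    the given alpha, cross-validate it on the training set and return the AUC.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    best_alpha, best_score = None, float("-inf")
    for _ in range(n_iter):
        alpha = rng.uniform(low, high)
        score = evaluate(alpha)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


def toy_cv_auc(alpha):
    """Toy stand-in for a cross-validated AUC, peaking at alpha = 0.3."""
    return 0.9 - (alpha - 0.3) ** 2


alpha, score = random_search_alpha(toy_cv_auc)
print(f"best alpha={alpha:.3f}, AUC={score:.3f}")
```

Passing the evaluation as a callback keeps the search loop independent of the classifier, so the same loop can serve both ridge regression and lasso regression.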
(402) For the K-nearest-neighbor classification model, the present invention tunes the internal parameter by grid search. Grid search is also a parameter tuning algorithm: it builds and evaluates a model for every parameter combination in the grid and selects the parameters with the best score. For parameter k of the K-nearest-neighbor classification model, the range of k is set to the positive integers from 1 to 13; using cross-validation on the training data set, with the AUC value as the model evaluation criterion, every positive integer in the range is traversed, a defect prediction model is built with the K-nearest-neighbor algorithm, and the k value corresponding to the model with the highest AUC value is chosen as the optimal k value.
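A minimal sketch of the grid search in (402) follows. Several simplifications are assumptions made here for brevity: each node has a single numeric feature, the cross-validation is leave-one-out with one AUC computed over all held-out scores, and the names `loo_knn_auc` and `grid_search_k` and the toy data are hypothetical.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def loo_knn_auc(xs, ys, k):
    """Leave-one-out: score each sample by the defect rate of its k nearest other samples."""
    scores = []
    for i, x in enumerate(xs):
        others = [(abs(xj - x), yj)
                  for j, (xj, yj) in enumerate(zip(xs, ys)) if j != i]
        nearest = sorted(others)[:k]          # k closest neighbours by distance
        scores.append(sum(y for _, y in nearest) / len(nearest))
    return auc(ys, scores)


def grid_search_k(xs, ys, k_values=range(1, 14)):
    """Try every k in 1..13 and keep the one with the highest AUC."""
    best_k, best_auc = None, -1.0
    for k in k_values:
        score = loo_knn_auc(xs, ys, k)
        if score > best_auc:
            best_k, best_auc = k, score
    return best_k, best_auc


# Toy training data (hypothetical): one feature per node, label 1 = defective.
xs = list(range(1, 17))
ys = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1]
best_k, best_auc = grid_search_k(xs, ys)
print(best_k, best_auc)
```

Because the candidate set is small and discrete, exhaustive traversal is cheap here, which is why grid search suits k while random search suits the continuous alpha.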
Step 5: select the adaptive optimal prediction model.
Different classifiers achieve different prediction performance on the same defect data set, and the gap between individual classifiers can be large. Because different classifiers are sensitive to different types of data, a classifier may predict well on one type of defect data set but poorly on others. In cross-version defect prediction, only the defect distribution of the old versions is known; the defect distribution of the software to be predicted is not, so it cannot be determined in advance which classifier's predictions are most trustworthy. The optimal classifier therefore has to be selected.
To solve the above problem, the present invention uses an adaptive optimal prediction model selection method. Its main idea is to first take part of the input data set as a validation set and build defect prediction models on the remaining data, then select the optimal classifier model according to how well each prediction model performs on the validation set, and finally use it to predict the defects in the test set (the test set is the set of networks to be predicted).
Indices commonly used in machine learning to reflect model quality include the precision P, the recall R, the comprehensive evaluation index F1, and AUC. When machine learning solves a binary classification problem, the defective class is taken as the positive class and the non-defective class as the negative class, and four cases arise: a positive sample predicted as positive, tp (True Positive); a negative sample predicted as negative, tn (True Negative); a negative sample predicted as positive, fp (False Positive); and a positive sample predicted as negative, fn (False Negative).
The precision P is defined with respect to the prediction result: it indicates how many of the samples predicted as positive are predicted correctly, and can be calculated by the formula $P = \frac{tp\_num}{tp\_num + fp\_num}$. The recall R indicates how many of the positive samples are predicted correctly, and can be calculated by the formula $R = \frac{tp\_num}{tp\_num + fn\_num}$. Here tp_num is the number of tp, fp_num is the number of fp, and fn_num is the number of fn. For the defect prediction model one would of course like both P and R to be as high as possible, but in practice the two sometimes conflict; they then need to be considered together, using the comprehensive evaluation index F1. F1 is the weighted harmonic mean of P and R and can be calculated by the formula $F1 = \frac{2 \cdot P \cdot R}{P + R}$. The higher F1 is, the more effective the model. For classification on imbalanced data sets, another commonly used evaluation index is AUC. AUC is defined as the area enclosed by the ROC curve and the coordinate axis; it means that when a positive sample and a negative sample are drawn at random, AUC is the probability that the score computed by the current classifier ranks the positive sample ahead of the negative sample. The larger the AUC value, the better the classification performance of the classifier model.
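As a worked illustration of these indices, the sketch below computes P, R, F1 and AUC for a small hand-made example. The function names and the sample data are assumptions introduced here; AUC is computed directly from its probabilistic meaning, as the share of positive/negative pairs in which the positive sample receives the higher score.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp_num, tn_num, fp_num, fn_num) for binary labels, 1 = defective."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def precision_recall_f1(y_true, y_pred):
    """P = tp/(tp+fp), R = tp/(tp+fn), F1 = 2PR/(P+R); 0.0 when a denominator is 0."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 3 defective and 5 non-defective modules.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]          # tp=2, fp=1, fn=1, tn=4
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

print(precision_recall_f1(y_true, y_pred))  # P = R = F1 = 2/3
print(auc(y_true, scores))                  # 14 of 15 pairs ranked correctly
```

In this example one defective module is scored below one non-defective module, so 14 of the 15 positive/negative pairs are ordered correctly and the AUC is 14/15.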
As shown in Fig. 5, in one embodiment of the adaptive optimal prediction model selection of the present invention, the defect prediction model built with each classifier is evaluated on multiple validation sets to obtain multiple AUC values, and the prediction model with the largest mean AUC is chosen as the optimal defect prediction model. In this embodiment, the adaptive optimal prediction model selection comprises the following steps 5.1 to 5.5.
Step 5.1: Let the training set be L = {G_{1}, G_{2}, …, G_{m}}, where m is the number of software networks in the training set; the software networks include defective networks and networks without defective nodes. Let the test set be G_{t}. Create two sets L_{1} and L_{2}, both initially empty. Traverse the training set L: if a network G_{r} contains no defective node, add G_{r} to set L_{1}; otherwise add it to set L_{2}; r = 1, 2, …, m.
Step 5.2: Sort all defective software networks in set L_{2} by version in ascending order, select the last K networks to form the validation version set VD, and merge the remaining networks in L_{2} with set L_{1} to form a new set H; K is a positive integer.
Step 5.3: Compute the node parameters, i.e. the node feature values, of every network in sets H, VD and G_{t}, to build the complete training set, validation set and test set.
Step 5.4: Apply the different classifiers to build prediction models on the training set and, with the AUC value as the model evaluation criterion, compute the AUC value of each model on the K validation sets.
Step 5.5: Compute the mean AUC that each model obtains on the validation sets and use it as the index to select the optimal defect prediction model. Then predict the test set with the selected optimal prediction model.
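Steps 5.1, 5.2, 5.4 and 5.5 can be sketched as follows. This is an illustration only: representing each network as a dictionary with `version` and `has_defects` keys, the `fit` and `evaluate_auc` callbacks, and the toy AUC table are all assumptions made here, and the feature computation of step 5.3 is omitted.

```python
from statistics import mean


def split_networks(networks, k):
    """Steps 5.1-5.2: return (H, VD) from a list of network records."""
    l1 = [g for g in networks if not g["has_defects"]]  # no defective node
    l2 = [g for g in networks if g["has_defects"]]      # defective networks
    l2.sort(key=lambda g: g["version"])                 # ascending version order
    vd = l2[-k:]                                        # last K versions validate
    h = l2[:-k] + l1                                    # the rest train
    return h, vd


def select_best_model(classifiers, h, vd, fit, evaluate_auc):
    """Steps 5.4-5.5: fit each classifier on H, average its AUC over VD, pick the best."""
    models, mean_auc = {}, {}
    for name, clf in classifiers.items():
        models[name] = fit(clf, h)
        mean_auc[name] = mean(evaluate_auc(models[name], g) for g in vd)
    best = max(mean_auc, key=mean_auc.get)
    return best, models[best], mean_auc


# Toy data (hypothetical): four versions, one of them without defective nodes.
networks = [
    {"version": (1, 0), "has_defects": True},
    {"version": (1, 1), "has_defects": False},
    {"version": (1, 2), "has_defects": True},
    {"version": (1, 3), "has_defects": True},
]
# Pretend AUC values per classifier and validation version.
toy_auc = {
    "ridge": {(1, 2): 0.70, (1, 3): 0.80},
    "knn": {(1, 2): 0.75, (1, 3): 0.65},
}

h, vd = split_networks(networks, k=2)
best, model, scores = select_best_model(
    {"ridge": "ridge", "knn": "knn"},
    h, vd,
    fit=lambda clf, train: clf,                       # stand-in for model training
    evaluate_auc=lambda m, g: toy_auc[m][g["version"]],
)
print(best, scores)
```

In the toy table, "knn" wins on one validation version but "ridge" has the higher mean, which is the case the averaging in step 5.5 is meant to resolve.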
Through the above steps, whatever type of defect data set the method of the present invention is applied to, it can, according to the characteristics of the data set itself, complete five aspects: adaptive classifier construction, adaptive feature selection, adaptive threshold search, adaptive tuning of classifier internal parameters, and adaptive optimal prediction model selection. It thereby obtains the best defect prediction result and identifies the high-risk software modules.
Claims (5)
1. A method for identifying high-defect-risk modules based on a software network, comprising:
Step 1: constructing an adaptive classifier, the adaptive classifier comprising multiple classifiers;
Step 2: adaptive feature selection, comprising: (1) preprocessing, in which a feature is deleted if 80% or more of its instance values in the data set are identical; (2) for classifiers with a penalty factor, selecting features using the recursive feature elimination algorithm; (3) for classifiers without a penalty factor, selecting features using the chi-square test method of univariate feature selection;
Step 3: performing adaptive threshold search; for each classifier, a prediction model is constructed on the training set, the validation set is input to the prediction model to obtain a set of predicted values, the predicted values in the set are traversed one by one to replace the threshold of the classifier, the performance index AUC of the prediction model is computed each time from the predicted label set and the true label set, and the threshold giving the highest AUC value is selected as the optimal threshold of the classifier;
Step 4: performing adaptive tuning of classifier internal parameters; for the ridge regression and lasso regression classifiers, the optimal step value is found by random search, and for the K-nearest-neighbor classification model, the optimal number of samples k is selected by grid search;
Step 5: selecting the adaptive optimal prediction model; defect prediction models are built on the training set with the different classifiers, the AUC value of each defect prediction model on multiple validation sets is computed, the defect prediction model with the largest mean AUC is taken as the optimal prediction model, and the optimal prediction model is then used to identify high-defect-risk modules in the software network under test.
2. The method according to claim 1, characterized in that, in step 1, the adaptive classifier comprises 16 different classifiers: linear regression, ridge regression, lasso regression, least-angle regression, logistic regression and stochastic gradient descent of the generalized linear models; the support vector machine of the vector machine models; K-nearest neighbors of the nearest-neighbor models; Gaussian naive Bayes of the Bayesian models; the decision tree of the decision tree models; random forest, extremely randomized trees, the adaptive boosting algorithm and gradient boosting decision tree of the ensemble models; and linear discriminant analysis and quadratic discriminant analysis of the discriminant analysis models.
3. The method according to claim 1, characterized in that, in step 2, when feature selection is performed using the chi-square test method, a feature is retained if the computed correlation coefficient between the feature and the feature value is less than 0.05.
4. The method according to claim 1, characterized in that the implementation of step 3 comprises:
First, constructing a defect prediction model using the training set, inputting the validation set to the defect prediction model to obtain the predicted value set S1, and sorting the data in S1 in ascending order of value to obtain the set S2;
Second, starting from the median of S2, traversing S2 in the direction of increasing predicted value and taking each predicted value in S2 to replace the threshold; after each replacement of the threshold, predicting on the validation set again to obtain the predicted label set P1, and computing the AUC value in combination with the true label set of the validation set;
Finally, after the traversal ends, choosing the threshold corresponding to the maximum AUC value as the optimal threshold of the classifier.
5. The method according to claim 1, characterized in that the implementation of step 5 comprises:
Step 5.1: letting the training set be L = {G_{1}, G_{2}, …, G_{m}}, where m is the number of software networks in the training set, the software networks including defective networks and networks without defective nodes; letting the test set be G_{t}; creating two sets L_{1} and L_{2}, both initially empty; traversing the training set L and, if a network G_{r} contains no defective node, adding G_{r} to set L_{1}, otherwise adding it to set L_{2}, r = 1, 2, …, m;
Step 5.2: sorting all defective software networks in set L_{2} by version in ascending order, selecting the last K networks to form the validation version set VD, and merging the remaining networks in L_{2} with set L_{1} to form a new set H; K is a positive integer;
Step 5.3: computing the feature values of each node of every network in sets H, VD and G_{t} to obtain the training set, the validation set and the test set;
Step 5.4: applying each classifier of step 1 to build a defect prediction model on the training set, and computing the AUC value of the defect prediction model on the K validation sets;
Step 5.5: selecting the defect prediction model with the largest mean AUC over the K validation sets as the optimal prediction model, and predicting the test set with the optimal prediction model.
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title 

CN201910318037.1A  CN110147321B (en)  2019-04-19  2019-04-19  Software network-based method for identifying defect high-risk module
Publications (2)
Publication Number  Publication Date 

CN110147321A (en)  2019-08-20
CN110147321B (en)  2020-11-24
Family
ID=67588480
Family Applications (1)
Application Number  Title  Priority Date  Filing Date 

CN201910318037.1A  Active  CN110147321B (en)  2019-04-19  2019-04-19  Software network-based method for identifying defect high-risk module
Country Status (1)
Country  Link 

CN (1)  CN110147321B (en) 

Also Published As
Publication number  Publication date 

CN110147321B (en)  2020-11-24
Legal Events
Date  Code  Title  Description 

PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant