CN110147321A - A method for identifying high-defect-risk modules based on a software network

Info

Publication number
CN110147321A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910318037.1A
Other languages
Chinese (zh)
Other versions
CN110147321B (en
Inventor
艾骏
杨益文
苏文翥
王飞
郭皓然
邹卓良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Application filed by Beihang University
Priority to CN201910318037.1A
Publication of CN110147321A
Application granted
Publication of CN110147321B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/62 Methods or arrangements for recognition using electronic means
    • G06K9/6217 Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06K9/6256 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging, boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/62 Methods or arrangements for recognition using electronic means
    • G06K9/6267 Classification techniques

Abstract

The present invention proposes a method for identifying high-defect-risk modules based on a software network, belonging to the field of complex software networks. The method comprises: step 1, constructing an adaptive classifier that contains multiple classifiers; step 2, performing adaptive feature selection; step 3, performing adaptive threshold optimization; step 4, tuning the internal parameters of the adaptive classifier; and step 5, adaptively selecting the optimal prediction model and then using it to identify high-defect-risk modules in the software network under test. Whatever the type of defect data set, the method completes these five tasks (construction of the adaptive classifier, adaptive feature selection, adaptive threshold optimization, internal parameter tuning of the adaptive classifier, and selection of the adaptive optimal prediction model) according to the characteristics of the data set itself, obtains the best defect prediction result, and identifies high-risk software modules.

Description

A method for identifying high-defect-risk modules based on a software network
Technical field
The present invention applies to the field of complex software networks and is a method for identifying high-defect-risk modules based on a software network.
Background art
With the rapid development of the Internet, software offers increasingly rich functionality and supports people in all kinds of production and daily activities. The security and reliability of software receive growing attention, and identifying high-defect-risk modules in software as early as possible has become an active research field. Accurately identifying modules with a high defect risk can improve software quality and reduce development cost.
A large body of research at home and abroad has shown that 80% of software defects reside in 20% of the software code. In actual software testing, however, the principle of uniform coverage is often applied, requiring 100% coverage of requirements and statements, which in effect wastes a large amount of test resources. In some third-party software testing, the hit rate of test cases is often below 1% or even lower. As software engineering practice matures and big data technology becomes practical, many enterprises and users have accumulated more and more historical product defect cases, and carrying out targeted testing on that basis has become a trend in China for improving test efficiency and finding more substantive defects. Some studies abroad show that the probability of detection (PD) achieved with defect prediction models can reach 71%, which is higher than the detection probability of software code review (60%) and higher than the detection rate of black-box testing. Studies of a large number of examples have found that models built with machine learning classifier algorithms, association rules and the like achieve good prediction of high defect risk. Compared with ordinary software testing, software testing based on defect prediction significantly improves the defect detection rate, reduces the number of test cases, shortens testing time, and clearly helps improve software reliability.
The techniques currently used in defect prediction models mainly include univariate statistical analysis, multivariate statistical analysis, statistical analysis combined with expert analysis, machine learning, and machine learning combined with statistical analysis (reference [1]). Univariate statistical analysis considers few features, for example only whether the number of lines of code is related to defects; multivariate statistical analysis can analyze the correlation between multiple features and defects. Machine learning methods mainly include classifier algorithms, clustering algorithms and association rules. According to Catal's analysis, defect prediction models based on machine learning algorithms appeared on a large scale after 2005, and by the end of 2009 papers based on machine learning methods accounted for 66% of all related papers.
The machine learning method used differs with the defect prediction target. Chen Xiang, Gu Qing et al. give a good summary of machine learning methods in reference [1]. They point out that if program modules are set at a fine granularity, such as class level or file level, the goal is to predict the defect proneness of modules and classification methods are commonly used, including logistic regression, Bayesian networks and decision trees; if modules are set at a coarse granularity, such as package level or subsystem level, the goal is to predict the number of defects or the defect density and regression analysis is commonly used. Beyond these common machine learning methods, approaches such as semi-supervised learning and active learning are also gradually being applied.
Elish et al. used NASA data sets to systematically compare SVM (support vector machine) with LR (logistic regression), KNN (K-nearest neighbors), MLP (multilayer perceptron), RBF (radial basis function), BBN (Bayesian network), NB (naive Bayes), RF (random forest) and DT (decision tree), and concluded that SVM is better overall than the other 8 methods (reference [2]). Lessmann et al. also used the public NASA data sets, with the AUC (area under the ROC curve) value as the model performance measure, and compared 22 different machine learning methods in 6 categories: ensemble methods, statistical methods, nearest-neighbor methods, decision tree methods, support vector machine methods and neural networks. Their results showed that the differences between most methods are very small. Ghotra et al. found that, after noise is removed from the NASA and Promise data sets, different machine learning methods show large performance differences (reference [3]). Shepperd et al. used random-effects ANOVA (analysis of variance) to analyze the factors that influence the performance of defect prediction models and found that the choice of machine learning method has little effect on performance, whereas there are large differences between research groups. Panichella A et al. trained prediction models with genetic algorithms (GAs), and their study showed that regression models trained by GAs are significantly better than their traditional counterparts. Fu Yiqi and Fu Y et al. showed that the best machine learning algorithm is not fixed across different evaluation criteria; they therefore combined different machine learning algorithms to build an improved defect prediction model and demonstrated its superiority through experiments on the Eclipse data set (reference [4]).
To address the problem that too few samples in software defect data sets are labeled, Liao Shengping et al. proposed a sampling-based semi-supervised support vector machine method for software defect prediction, which ensures that the number of defective samples in the labeled data is not too low. Experimental results show that, compared with supervised learning methods, it obtains comparable prediction performance with fewer training samples (reference [5]). Chang Ruihua et al. proposed a new software defect prediction method based on neighbor propagation with a super-Euclidean distance. It remedies the shortcoming that earlier neighbor propagation algorithms, which express data similarity by Euclidean distance, struggle to handle complex data types effectively, and it improves the accuracy of prediction on unlabeled software defect data (reference [6]). Yang C et al. proposed a new subspace-based learning algorithm, DLA, which overcomes the singular-matrix problem and the small-sample-size problem in software defect prediction, and showed experimentally that DLA outperforms LDA (linear discriminant analysis) and PCA (principal component analysis) in information extraction and prediction performance (reference [7]).
Machine learning classifiers are increasingly mature, but for a given type of data set not every classifier algorithm is suitable. Linear regression, for example, classifies well on linearly separable data sets but performs poorly on linearly inseparable ones. The reason is that the defect prediction model it builds lacks adaptivity to the data. Defect data sets are diverse: their values may be continuous or discrete, their distributions may or may not be Gaussian, and their class proportions may be balanced or imbalanced. At present, each classifier algorithm achieves different classification results on different types of defect data sets.
The references are as follows:
[1] Ma Ying. Research on software defect prediction technology based on machine learning [D]. University of Electronic Science and Technology of China, 2012.
[2]Elish KO,Elish MO.Predicting defect-prone software modules using support vector machines.Journal of Systems and Software,2008,81(5):649-660.
[3]Ghotra B,McIntosh S,Hassan AE.Revisiting the impact of classification techniques on the performance of defect prediction models.In: Proc.of the Int’l Conf.on Software Engineering.2015.789-800.
[4] Fu Yiqi, Dong Wei, Yin Liangze, et al. Software defect prediction model based on ensemble machine learning algorithms [C]. National Software and Application Academic Conference, 2015.
[5] Liao Shengping, Xu Ling, Yan Meng. Sampling-based semi-supervised support vector machine method for software defect prediction [J]. Computer Engineering and Applications, 2017, 53(14): 161-166.
[6] Chang Ruihua, Shen Xiaowei. Software defect prediction method based on super-Euclidean distance neighbor propagation [J]. Application Research of Computers, 2017, 34(05): 1384-1387.
[7]Yang C,Yang C.Software defect prediction based on manifold learning in subspace selection[C].International Conference on Intelligent Information Processing.ACM,2016:17.
Summary of the invention
To meet the need for good prediction results on different types of data sets in defect prediction, the present invention proposes a method for identifying high-defect-risk modules based on a software network. It studies a defect prediction framework suited to the characteristics of software network model metrics and, by establishing a feedback mechanism on the prediction performance of machine learning models and taking the data characteristics into account, selects a mature machine learning prediction model to perform defect prediction.
The method for identifying high-defect-risk modules based on a software network provided by the invention comprises the following steps:
Step 1: construct an adaptive classifier, which contains multiple classifiers;
Step 2: perform adaptive feature selection, comprising: (1) preprocessing: if more than 80% of the instances in the data set have the same value for a feature, delete that feature; (2) for classifiers with a penalty factor, select features with the recursive feature elimination algorithm; (3) for classifiers without a penalty factor, select features with the chi-square test of univariate feature selection;
Step 3: perform adaptive threshold optimization. For each classifier, build a prediction model on the training set and input the validation set into the model to obtain a set of predicted values; traverse the predicted values in this set, using each in turn as the classifier threshold; each time, compute the model performance indicator AUC from the predicted label set and the true label set; select the threshold that yields the highest AUC as the optimal threshold of the classifier;
Step 4: tune the internal parameters of the adaptive classifier. For the ridge regression and lasso regression classifiers, find the optimal step size by random search; for the K-nearest-neighbor classification model, select the optimal sample-count value k (the number of neighbors) by grid search;
Step 5: adaptively select the optimal prediction model. Build defect prediction models on the training set with the different classifiers, compute the AUC of each model on multiple validation sets, take the model with the largest mean AUC as the optimal prediction model, and then use it to identify high-defect-risk modules in the software network under test.
Compared with the prior art, the present invention has the following advantages:
(1) The method proposed by the invention addresses the different difficulties encountered in machine learning modeling by automatically completing five tasks: construction of the adaptive classifier, adaptive feature selection, adaptive threshold optimization, internal parameter tuning of the adaptive classifier, and selection of the adaptive optimal prediction model. It thereby identifies the high-defect-risk modules of a software network and can achieve better identification results;
(2) In the method of the invention, the entire identification process can run in the background as a fully automated pipeline, which keeps labor and time costs as low as possible.
Brief description of the drawings
Fig. 1 is the overall flow chart of the method of the invention for identifying high-defect-risk modules;
Fig. 2 is a schematic diagram of the classifiers used by the invention to build models;
Fig. 3 is a schematic diagram of the adaptive feature selection method in the method of the invention;
Fig. 4 is a flow chart of the adaptive threshold optimization algorithm in the method of the invention;
Fig. 5 is a schematic diagram of the method for selecting the adaptive optimal prediction model in the method of the invention.
Detailed description of the embodiments
To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the accompanying drawings.
To meet the need for good prediction results on different types of data sets in defect prediction, the present invention proposes a method for identifying high-defect-risk modules based on a software network, comprising: construction of the adaptive classifier, adaptive feature selection, adaptive threshold optimization, internal parameter tuning of the adaptive classifier, and selection of the adaptive optimal prediction model. The invention is based on machine learning model optimization and selection methods: according to the characteristics of the data set itself, it adaptively completes best-feature selection for the defect data set, classifier threshold setting, internal parameter tuning and related processes, selects the optimal defect prediction model, obtains the best defect prediction performance, and improves the hit rate of software testing. The method is shown in Fig. 1, and each step is described below.
Step 1: construct the adaptive classifier.
In the field of machine learning there are four main approaches: supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning. The present invention focuses on supervised learning classifiers. Based on the scope of application, strengths and weaknesses of the different classifiers, 16 different classifiers in 7 major categories are constructed. From these, the most suitable classifier can be selected adaptively according to the characteristics of the input data set, so as to obtain the best result in model building and algorithm selection. Fig. 2 shows the 16 classifiers/classifier algorithms used by the invention; they include not only the currently most popular classifier algorithms but also improved variants of some classifiers. For example, among the generalized linear models, linear regression and ridge regression are currently the most popular classifier algorithms, while lasso regression, least angle regression, logistic regression and stochastic gradient descent are improvements on linear regression and ridge regression. The 16 classifiers used by the invention are: linear regression, ridge regression, lasso regression, least angle regression, logistic regression and stochastic gradient descent (generalized linear models); the support vector machine (vector machine model); K-nearest neighbors (nearest-neighbor model); Gaussian naive Bayes (Bayesian model); the decision tree (decision tree model); random forest, extremely randomized trees, adaptive boosting and gradient boosting decision trees (ensemble models); and linear discriminant analysis and quadratic discriminant analysis (discriminant analysis models).
Because differences in metrics and in preprocessing techniques affect prediction performance far more than the choice of machine learning prediction model does, selecting appropriate and accurate metrics and effectively removing noisy data and redundant dimensions is more beneficial for improving software defect prediction performance.
Step 2: perform adaptive feature selection.
The structural features related to defects differ from one defect data set to another. Reasonable feature selection can strengthen the correlation between the features and the target value and remove redundant features. Current feature selection methods fall into three main kinds (reference [8]: Gao Haoyang. Design and implementation of a P2P financial risk control system based on big data [D]. Beijing Jiaotong University, 2018.): (1) Filter methods score each feature by divergence or correlation, set a threshold or the number of features to keep, and then select features. (2) Wrapper methods build models with a machine learning algorithm and, according to an objective function (usually a prediction performance score), select or exclude several features in each round. (3) Embedded methods first train certain machine learning algorithms and models to obtain a weight coefficient for each feature, and then select features in descending order of coefficient. For a given data set, however, a single feature selection method often struggles to achieve both purposes at once. After analysis, the invention therefore combines several feature selection methods to strengthen the correlation between the features and the target value and to select the best features for the defect data set currently being input.
The adaptive feature selection method of the invention is shown in Fig. 3. First, a filter method removes low-variance features, that is, features that contribute little. Then classifiers that have their own penalty factor use recursive feature elimination, and classifiers without a penalty factor use the chi-square test of univariate feature selection. The specific steps are as follows:
Step 201: first preprocess the data set by removing low-variance features. If more than 80% of the instances in the data set have the same value for a feature, that feature can be regarded as contributing too little and can be removed. The invention therefore uses low-variance feature removal as the preprocessing stage of feature selection. The variance threshold is set to 0.16, the variance of every feature in the data set is computed, and features whose value is identical in 80% of the instances are removed: features with a variance not greater than 0.16 are eliminated, and features with a variance greater than 0.16 are retained.
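The variance filter of step 201 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation; the function name `remove_low_variance` and the sample rows are assumptions. For a 0/1 feature that takes the same value in 80% of instances the variance is 0.8 × (1 − 0.8) = 0.16, which is where the threshold comes from; for non-binary features the variance rule and the "80% identical" rule are not strictly equivalent.

```python
from statistics import pvariance

def remove_low_variance(rows, threshold=0.16):
    """Keep only the columns whose population variance exceeds the threshold.

    rows: list of equally long feature vectors (one per module/node).
    Returns (kept_column_indices, filtered_rows).
    """
    columns = list(zip(*rows))  # transpose: one tuple per feature
    kept = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    filtered = [[row[j] for j in kept] for row in rows]
    return kept, filtered

# Hypothetical data: column 0 is identical in 80% of the rows (variance 0.16).
rows = [
    [1, 0, 3.0],
    [1, 1, 1.0],
    [1, 0, 4.0],
    [1, 1, 2.0],
    [0, 0, 5.0],
]
kept, filtered = remove_low_variance(rows)
print(kept)  # indices of the retained features
```

Column 0 sits exactly on the threshold and is dropped because the rule keeps only variances strictly greater than 0.16.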
Step 202: for classifiers with a penalty factor, select features with the recursive feature elimination algorithm. Recursive feature elimination is a greedy algorithm for finding the optimal feature subset. Its main idea is to train a base model over multiple rounds; after each round, several features are removed according to their weight coefficients, and the next round is trained on the new feature set (reference [9]: Chen Shan. Research on traffic state prediction and visualization methods for backbone road networks based on machine learning [D]. 2017.). For classifiers that have their own penalty factor, such as the random forest algorithm and the linear regression algorithm, the invention uses the classifier's own algorithm as the base model, builds defect prediction models on the training set with 10-fold cross-validation, uses the AUC value as the scoring criterion, and eliminates the lower-scoring features step by step, retaining the highest-scoring features in the data set, that is, the optimal feature subset for that data set.
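The elimination loop of step 202 can be sketched generically. In this minimal sketch the base model and the 10-fold cross-validated AUC are hidden behind two hypothetical callbacks, `fit_weights` and `cv_auc`; the feature names and the toy scoring formula are illustrative placeholders, not data or results from the patent.

```python
def recursive_feature_elimination(features, fit_weights, cv_auc):
    """Greedy backward elimination.

    fit_weights(subset) -> {feature: weight} from a base model trained on subset.
    cv_auc(subset)      -> cross-validated AUC of the base model on subset.
    Returns the best-scoring subset seen and its score.
    """
    current = list(features)
    best_subset, best_score = list(current), cv_auc(current)
    while len(current) > 1:
        weights = fit_weights(current)
        weakest = min(current, key=lambda f: abs(weights[f]))
        current.remove(weakest)            # drop the least important feature
        score = cv_auc(current)            # re-score the reduced feature set
        if score >= best_score:            # prefer smaller subsets on ties
            best_subset, best_score = list(current), score
    return best_subset, best_score

# Toy stand-ins for a trained model (assumed values, for illustration only).
TOY_WEIGHTS = {"degree": 0.9, "betweenness": 0.6, "loc": 0.05, "comments": 0.01}

def toy_fit_weights(subset):
    return {f: TOY_WEIGHTS[f] for f in subset}

def toy_cv_auc(subset):
    # Two informative features raise the score; every extra feature costs a little.
    return 0.5 + 0.2 * ("degree" in subset) + 0.15 * ("betweenness" in subset) - 0.01 * len(subset)

subset, score = recursive_feature_elimination(list(TOY_WEIGHTS), toy_fit_weights, toy_cv_auc)
print(subset, round(score, 2))
```

Removing one feature per round is an assumption; the source only says that several features are removed after each round.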
Step 203: for classifiers without a penalty factor, select features with the chi-square test of univariate feature selection. Univariate feature selection computes a statistic for each variable separately and removes uncorrelated variables according to a given criterion. This method does not build a model with a classifier algorithm; it is a statistical method that is fairly simple and easy to run, and it is generally effective for understanding data. The classical chi-square test examines the correlation between a qualitative independent variable and a qualitative dependent variable, and the features most related to defects can be selected according to the p-value. When p < 0.05, the feature is correlated with the target value; when p < 0.01, the feature is highly correlated with the target value. For classifiers that have no penalty factor of their own, such as the K-nearest-neighbor algorithm, the invention analyzes the defect data set with the chi-square test, computes the p-value of each feature against the target value, sets the p-value threshold to 0.05, removes features whose p-value is not less than the threshold, and retains features whose p-value is less than the threshold, that is, the features in the data set that are correlated with the target value.
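The chi-square filter of step 203 can be illustrated for the simplest case, a binary feature tested against a binary defect label. The sketch below is an assumption-laden illustration, not the patent's implementation: it computes Pearson's chi-square statistic without continuity correction, uses the fact that for one degree of freedom the p-value equals erfc(sqrt(chi2 / 2)), and assumes both feature values and both labels occur in the data. The feature names are hypothetical.

```python
import math

def chi2_test_2x2(feature, labels):
    """Pearson chi-square test of independence for 0/1 feature vs 0/1 label."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]            # observed[feature_value][label]
    for f, y in zip(feature, labels):
        observed[f][y] += 1
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # survival function, 1 degree of freedom
    return chi2, p_value

def select_by_chi2(columns, labels, p_threshold=0.05):
    """Keep the features whose p-value is below the threshold."""
    return [name for name, col in columns.items()
            if chi2_test_2x2(col, labels)[1] < p_threshold]

labels = [1] * 10 + [0] * 10
columns = {
    "coupled": [1] * 9 + [0] + [0] * 9 + [1],  # agrees with the label in 18 of 20 rows
    "noise": [1, 0] * 10,                      # independent of the label
}
selected = select_by_chi2(columns, labels)
chi2, p = chi2_test_2x2(columns["coupled"], labels)
print(selected, chi2, p)
```

Features with more than two categories need the general chi-square distribution, which a statistics library would normally provide.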
Step 3: perform adaptive threshold optimization.
A threshold, also called a critical value, is the highest or lowest value at which an effect can occur. In machine learning, a threshold is the critical value that divides samples into different classes: samples above the threshold are assigned to one class, and samples below it are assigned to the other. In defect prediction, because defect data sets are diverse, a classifier obtains different scores with different thresholds. For different types of defect data sets, the threshold that maximizes the classifier's score is not necessarily the classifier's default value; the optimal threshold should be adjusted dynamically according to the characteristics of the data set. The invention adopts an adaptive threshold optimization method, shown in Fig. 4, which dynamically selects the optimal threshold according to the diversity of defect data sets in order to obtain better classification results.
The main idea of the adaptive threshold optimization method of the invention is to build a prediction model from the training set data, input the feature data of the validation set into the model to obtain a set of predicted label values, compute the model performance indicator AUC from the predicted and true label value sets, traverse the predicted values, using each in turn as the threshold, and select the threshold with the highest AUC as the optimal threshold. For each classifier, steps 301 to 307 below are executed to find the optimal threshold.
Step 301: input the feature set G1 and true labels L1 of the training set, and the feature set G2 and true labels L2 of the validation set. The invention targets software networks: the input feature sets G1 and G2 are node feature sets of the software network, and the labels L1 and L2 mark whether a node is a faulty node. Each node corresponds to a module in the software.
Step 302: build a defect prediction model from G1 and L1, and input G2 into the model to obtain the prediction result set S1. S1 contains, for each node, the predicted value of its being a faulty node; predicted values are usually floating-point numbers that may be positive or negative. Each predicted value is compared with the threshold set for the classifier to identify high-defect-risk nodes.
Step 303: sort the predicted values in S1 in ascending order to obtain the set S2;
Step 304: take the median of S2 as the initial classifier threshold, threshold, and predict on the feature set G2 again to obtain the predicted label set P1. P1 marks whether each node is a faulty node: when the predicted value is less than the threshold, the node is judged normal; otherwise it is judged faulty.
Step 305: compute the AUC value from the true labels L2 and the predicted labels P1;
Step 306: starting from the median of S2, traverse S2 in the direction of increasing predicted value, updating threshold with each predicted value in S2 in turn, and repeat steps 304 and 305 to compute a new AUC value each time;
Step 307: take the threshold corresponding to the maximum AUC value as the optimal threshold of the classifier and output it.
Through this threshold optimization process, each classifier can adaptively set its optimal threshold according to the input data.
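Steps 303 to 307 can be sketched as follows, assuming the predicted values S1 have already been produced by some trained model (steps 301 and 302 are not shown). The function names and sample values are illustrative assumptions. Because P1 contains hard 0/1 labels, the ROC curve has a single operating point, so the AUC computed here reduces to (TPR + TNR) / 2; both classes are assumed to be present in the true labels.

```python
def auc_from_labels(true_labels, pred_labels):
    """AUC of the ROC curve for hard 0/1 predictions: (TPR + TNR) / 2."""
    pos = sum(true_labels)
    neg = len(true_labels) - pos
    tp = sum(1 for t, p in zip(true_labels, pred_labels) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true_labels, pred_labels) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)

def find_best_threshold(scores, true_labels):
    """Return (best_threshold, best_auc) following steps 303-307."""
    s2 = sorted(scores)                         # step 303: ascending order
    start = len(s2) // 2                        # step 304: start at the median
    best_threshold, best_auc = None, -1.0
    for threshold in s2[start:]:                # step 306: walk towards larger values
        # step 304: below the threshold means normal (0), otherwise faulty (1)
        pred = [0 if s < threshold else 1 for s in scores]
        auc = auc_from_labels(true_labels, pred)  # step 305
        if auc > best_auc:
            best_threshold, best_auc = threshold, auc
    return best_threshold, best_auc             # step 307

# Hypothetical validation-set predictions (S1) and true labels (L2).
scores = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.5, -0.9, 0.6]
labels = [0, 0, 0, 0, 1, 1, 0, 1]
best = find_best_threshold(scores, labels)
print(best)
```

With these values the candidate thresholds are 0.3, 0.6, 0.8 and 1.5, and only 0.6 separates the faulty nodes from the normal ones completely.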
Step 4: tune the internal parameters of the adaptive classifier.
The internal parameters of some classifiers need to be adjusted according to the characteristics of the data set itself; otherwise the accuracy of the resulting defect prediction model suffers. For example, the step value alpha, an internal parameter of the commonly used ridge regression classifier, plays a very important role: if the step is set too large, the model may miss the optimal solution; if it is too small, defect prediction takes too long to run. Optimizing the internal parameters of the classifier can improve the accuracy of the resulting defect prediction model.
After studying the internal parameters of the 16 constructed classifiers, the invention found that most classifiers can use their default parameter values. For the ridge regression and lasso regression classification algorithms, however, the step value alpha strongly affects the prediction accuracy of the model and takes floating-point values between 0 and 1, so the best alpha can be selected on the training set by random search. For the K-nearest-neighbor model, the sample-count value k strongly affects the accuracy of the resulting prediction model and takes positive integer values, so it is suitable to select the optimal k on the training set by grid search.
(401) For the ridge regression and lasso regression classifiers, the invention tunes the internal parameter by random search. Random search is a parameter tuning method that, for a given number of iterations, samples parameter values from a random distribution and builds and evaluates a model for each parameter combination. For the ridge regression and lasso regression classifiers, the invention applies random search on the training data set with cross-validation: over 100 iterations, with the AUC value as the evaluation criterion, a value between 0 and 1 is drawn at random as the step value and a prediction model is built with each classifier; the step value of the model with the highest AUC is chosen as the optimal step value.
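The random search in (401) can be sketched independently of any particular model. In this minimal sketch, `cv_auc` is a hypothetical callback that would train a ridge or lasso model with the given alpha and return its cross-validated AUC on the training set; the toy callback below only stands in for it and peaks at an arbitrary value. The uniform distribution on [0, 1] is an assumption, since the source only says that values between 0 and 1 are drawn at random.

```python
import random

def random_search_alpha(cv_auc, n_iter=100, seed=0):
    """Sample alpha uniformly from [0, 1] n_iter times and keep the best one."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    best_alpha, best_auc = None, float("-inf")
    for _ in range(n_iter):
        alpha = rng.uniform(0.0, 1.0)
        auc = cv_auc(alpha)                 # evaluate the model built with this alpha
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
    return best_alpha, best_auc

def toy_cv_auc(alpha):
    """Placeholder score with its maximum at alpha = 0.3 (illustration only)."""
    return 0.8 - abs(alpha - 0.3)

best_alpha, best_auc = random_search_alpha(toy_cv_auc)
print(best_alpha, best_auc)
```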
(402) For the K-nearest-neighbor classification model, the present invention tunes the internal parameter by grid search. Grid search is also a parameter-tuning algorithm: it builds and evaluates a model for every parameter combination in the grid and selects the parameters with the best model score. For the parameter k of the K-nearest-neighbor classification model, the value range of k is set to the positive integers between 1 and 13. Using cross-validation on the training data set, with the AUC value as the model evaluation criterion, all positive integers in the range of k are traversed, a defect prediction model is built with the k-nearest-neighbor algorithm for each, and the k value of the model with the highest AUC value is chosen as the optimal k value.
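A minimal sketch of the grid search in (402), under stated assumptions: the data are synthetic one-dimensional points generated here for illustration, the k-nearest-neighbor scorer and the 5-fold split are simplified stand-ins, and all function names are hypothetical. Only the range of k (1 to 13), the use of cross-validation, and AUC as the criterion come from the text.

```python
import random
from typing import Dict, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def knn_score(train_x: List[float], train_y: List[int], x: float, k: int) -> float:
    """Fraction of defective samples among the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cv_auc(xs: List[float], ys: List[int], k: int, n_folds: int = 5) -> float:
    """Mean AUC over n_folds folds; sample i is assigned to fold i % n_folds."""
    fold_aucs = []
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        test = [i for i in range(len(xs)) if i % n_folds == fold]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        scores = [knn_score(tx, ty, xs[i], k) for i in test]
        fold_aucs.append(auc([ys[i] for i in test], scores))
    return sum(fold_aucs) / len(fold_aucs)


def grid_search_k(xs: List[float], ys: List[int],
                  k_values: Sequence[int] = range(1, 14)) -> Tuple[int, Dict[int, float]]:
    """Evaluate every k in the grid and return the one with the highest mean AUC."""
    results = {k: cv_auc(xs, ys, k) for k in k_values}
    best_k = max(results, key=results.get)   # ties resolve to the smallest k
    return best_k, results


# Synthetic data: defective samples (label 1) are centred at 1.0, clean ones at 0.0.
rng = random.Random(0)
ys = [i % 2 for i in range(60)]
xs = [rng.gauss(1.0 if y else 0.0, 0.7) for y in ys]

best_k, results = grid_search_k(xs, ys)
print("best k:", best_k)
```

Because the grid is small (13 values), exhaustive traversal is cheap, which is consistent with the text's choice of grid search for an integer-valued parameter.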
Step 5: select the adaptive optimal prediction model.
Different classifiers achieve different prediction performance on the same defect data set, and the gap between individual classifiers can be large. Because different classifiers are sensitive to different types of data, a classifier may predict well on one type of defect data set but poorly on another. In cross-version defect prediction, only the defect distribution of the historical versions is known; the defect distribution of the software to be predicted is not, so it cannot be determined in advance which classifier's predictions are most credible. An optimal classifier therefore has to be selected.
To solve this problem, the present invention uses an adaptive optimal prediction model selection method. Its main idea is to first take part of the input data set as a validation set, build defect prediction models on the remaining data, select the optimal classifier model according to how well each prediction model performs on the validation set, and finally use that model to predict the defects in the test set (the test set is the set of networks to be predicted).
The indicators commonly used in machine learning to reflect model quality are precision P, recall R, the comprehensive evaluation index F1, and AUC. When machine learning is applied to a two-class problem, with the defective class as the positive class and the non-defective class as the negative class, four cases arise: a positive sample predicted as positive is a true positive tp (True Positive); a negative sample predicted as negative is a true negative tn (True Negative); a negative sample predicted as positive is a false positive fp (False Positive); and a positive sample predicted as negative is a false negative fn (False Negative).
Precision P is defined with respect to the prediction result: it indicates how many of the samples predicted as positive are predicted correctly, and is calculated as P = tp_num / (tp_num + fp_num). Recall R indicates how many of the positive samples are predicted correctly, and is calculated as R = tp_num / (tp_num + fn_num). Here tp_num is the number of tp, fp_num is the number of fp, and fn_num is the number of fn. For a defect prediction model, higher values of both P and R are of course desirable, but in practice the two sometimes conflict, so they need to be considered together through the comprehensive evaluation index F1. F1 is the harmonic mean of P and R, calculated as F1 = 2 × P × R / (P + R). The higher F1 is, the more effective the model. For classification problems on imbalanced data sets, another commonly used evaluation index is AUC. AUC is defined as the area enclosed by the ROC curve and the coordinate axes. Its meaning is that, when one positive sample and one negative sample are selected at random, it is the probability that the score computed by the current classifier ranks the positive sample ahead of the negative sample. The larger the AUC value, the better the classification performance of the classifier model.
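The four indicators can be computed directly from their definitions. The sketch below is illustrative only: the function names are hypothetical, returning 0.0 for an empty denominator is an assumption made here (the text does not cover that case), and AUC is computed from its pairwise-ranking interpretation rather than by integrating the ROC curve.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp_num, fp_num, fn_num, tn_num); label 1 = defective (positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """P = tp/(tp+fp), R = tp/(tp+fn), F1 = 2PR/(P+R); 0.0 when a denominator is zero (assumption)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs in which the positive scores higher; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 4 defective and 4 non-defective modules.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
p, r, f1 = precision_recall_f1(tp, fp, fn)
auc_value = auc(y_true, scores)
print(p, r, f1, auc_value)  # 0.75 0.75 0.75 0.875
```

In the example, 14 of the 16 positive-negative pairs are ranked correctly, which gives the AUC of 0.875.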
As shown in Fig. 5, in one embodiment of the adaptive optimal prediction model selection of the present invention, a defect prediction model is built for each classifier and evaluated on multiple validation sets, yielding multiple values of the performance index AUC; the prediction model with the largest mean AUC is chosen as the optimal defect prediction model. In this embodiment, the selection of the adaptive optimal prediction model comprises the following steps 5.1 to 5.5.
Step 5.1: Let the training set be L={G1,G2,…,Gm}, where m is the number of software networks in the training set; the software networks include defective networks and networks without defective nodes. Let the test set be Gt. Create two sets L1 and L2, both initially empty. Traverse the training set L: if a network Gr in it has no defective node, add Gr to set L1; otherwise add it to set L2; r=1,2,…,m.
Step 5.2: Sort all software defect networks in set L2 by version in ascending order, select the last K networks to form the validation version set VD, and then merge the remaining networks in L2 with set L1 to form a new set H; K is a positive integer.
Step 5.3: Calculate the node parameter data, i.e. the node feature values, of each network in sets H, VD and Gt, to build the complete training set, validation set and test set.
Step 5.4: Apply the different classifiers to build prediction models on the training set and, with the AUC value as the model evaluation criterion, calculate the AUC values of the different models on the K validation sets.
Step 5.5: Calculate the mean AUC obtained by each model on the validation sets and use it as the index to select the optimal defect prediction model. Then predict the test set using the selected optimal prediction model.
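Steps 5.1, 5.2 and 5.5 can be sketched as plain data manipulation. This is an illustrative outline under assumptions, not the patented implementation: the `Network` record, the integer version numbers, the function names and the AUC figures in the usage example are all hypothetical, and feature computation and model training (steps 5.3 and 5.4) are left out.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class Network:
    version: int          # assumption: versions are comparable integers
    defective_nodes: int  # number of defective nodes in the software network


def split_training_set(networks: List[Network], k: int) -> Tuple[List[Network], List[Network]]:
    """Steps 5.1-5.2: return (H, VD), the training part and the K validation networks."""
    l1 = [g for g in networks if g.defective_nodes == 0]          # no defective node
    l2 = sorted((g for g in networks if g.defective_nodes > 0),
                key=lambda g: g.version)                            # ascending by version
    if not 0 < k <= len(l2):
        raise ValueError("K must be between 1 and the number of defective networks")
    vd = l2[-k:]            # last K defective networks form the validation set
    h = l2[:-k] + l1        # the rest, merged with L1, form the training set
    return h, vd


def select_best_classifier(auc_by_classifier: Dict[str, List[float]]) -> Tuple[str, Dict[str, float]]:
    """Step 5.5: pick the classifier with the largest mean AUC over the K validation sets."""
    means = {name: mean(values) for name, values in auc_by_classifier.items()}
    return max(means, key=means.get), means


# Made-up training set of six versions; versions 1 and 4 contain no defects.
training = [Network(v, d) for v, d in zip(range(1, 7), [0, 3, 5, 0, 2, 4])]
h, vd = split_training_set(training, k=2)

# Made-up AUC values of three classifiers on the K = 2 validation networks.
auc_table = {"ridge": [0.70, 0.74], "knn": [0.66, 0.80], "random_forest": [0.78, 0.75]}
best, mean_auc = select_best_classifier(auc_table)
print([g.version for g in vd], [g.version for g in h], best)
```

With these numbers the two most recent defective versions (5 and 6) become the validation set, and the classifier with the highest mean AUC would then be used to predict the test set Gt.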
Through the above steps, whatever the type of defect data set, the method of the present invention completes, according to the characteristics of the data set itself, five aspects: construction of the adaptive classifier, adaptive feature selection, adaptive threshold optimization, adaptive tuning of classifier internal parameters, and selection of the adaptive optimal prediction model. It thereby obtains the best defect prediction result and identifies the high-risk software modules.

Claims (5)

1. A method for identifying defect high-risk modules based on a software network, comprising:
Step 1: constructing an adaptive classifier, the adaptive classifier comprising multiple classifiers;
Step 2: performing adaptive feature selection, comprising: (1) preprocessing, in which a feature is deleted if 80% or more of the instances in the data set have the same value for that feature; (2) for classifiers with a penalty term, selecting features using the recursive feature elimination algorithm; (3) for classifiers without a penalty term, selecting features using the chi-square test method of univariate feature selection;
Step 3: performing adaptive threshold optimization; for each classifier, building a prediction model on the training set, inputting the validation set into the prediction model to obtain a set of predicted values, traversing the predicted values in the set to replace the threshold of the classifier in turn, each time calculating the performance index AUC value of the prediction model from the predicted label set and the true label set, and selecting the threshold with the highest AUC value as the optimal threshold of the classifier;
Step 4: performing adaptive tuning of classifier internal parameters; for the ridge regression and lasso regression classifiers, finding the optimal step size by random search; for the K-nearest-neighbor classification model, selecting the optimal number of samples k by grid search;
Step 5: selecting the adaptive optimal prediction model; applying the different classifiers to build defect prediction models on the training set, calculating the AUC value of each defect prediction model on multiple validation sets, taking the defect prediction model with the largest mean AUC as the optimal prediction model, and then using the optimal prediction model to identify defect high-risk modules in the software network under test.
2. The method according to claim 1, wherein in step 1 the adaptive classifier comprises 16 different classifiers: linear regression, ridge regression, lasso regression, least angle regression, logistic regression and stochastic gradient descent of the generalized linear models; the support vector machine of the vector machine model; K-nearest neighbors of the nearest-neighbor model; Gaussian naive Bayes of the Bayesian model; the decision tree of the decision tree model; random forest, extremely randomized trees, the adaptive boosting algorithm and gradient boosting decision tree of the ensemble models; and linear discriminant analysis and quadratic discriminant analysis of the discriminant analysis models.
3. The method according to claim 1, wherein in step 2, when feature selection is performed using the chi-square test method, the correlation coefficient of each feature is calculated, and the feature is retained if the calculated value is less than 0.05.
4. The method according to claim 1, wherein step 3 is implemented as follows:
First, a defect prediction model is built on the training set, the validation set is input into the defect prediction model to obtain the predicted value set S1, and the data in S1 are sorted in ascending order of value to obtain set S2;
Second, starting from the median of S2, set S2 is traversed in the direction of increasing predicted value, and each predicted value in S2 is taken in turn as the new threshold; after each threshold replacement, the validation set is predicted again to obtain the predicted label set P1, and the AUC value is calculated together with the true label set of the validation set;
Finally, after the traversal ends, the threshold corresponding to the maximum AUC value is chosen as the optimal threshold of the classifier.
5. The method according to claim 1, wherein step 5 is implemented by the following steps:
Step 5.1: let the training set be L={G1,G2,…,Gm}, where m is the number of software networks in the training set, the software networks including defective networks and networks without defective nodes; let the test set be Gt; create two sets L1 and L2, both initially empty; traverse the training set L, and if a network Gr in it has no defective node, add Gr to set L1, otherwise add it to set L2, r=1,2,…,m;
Step 5.2: sort all software defect networks in set L2 by version in ascending order, select the last K networks to form the validation version set VD, and then merge the remaining networks in L2 with set L1 to form a new set H; K is a positive integer;
Step 5.3: calculate the feature value of each node of each network in sets H, VD and Gt to obtain the training set, validation set and test set;
Step 5.4: apply each classifier of step 1 to build a defect prediction model on the training set, and calculate the AUC values of the defect prediction model on the K validation sets;
Step 5.5: select the defect prediction model with the largest mean AUC over the K validation sets as the optimal prediction model, and predict the test set using the optimal prediction model.
CN201910318037.1A 2019-04-19 2019-04-19 Software network-based method for identifying defect high-risk module Active CN110147321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910318037.1A CN110147321B (en) 2019-04-19 2019-04-19 Software network-based method for identifying defect high-risk module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910318037.1A CN110147321B (en) 2019-04-19 2019-04-19 Software network-based method for identifying defect high-risk module

Publications (2)

Publication Number Publication Date
CN110147321A (en) 2019-08-20
CN110147321B CN110147321B (en) 2020-11-24

Family

ID=67588480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910318037.1A Active CN110147321B (en) 2019-04-19 2019-04-19 Software network-based method for identifying defect high-risk module

Country Status (1)

Country Link
CN (1) CN110147321B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688152A (en) * 2019-09-27 2020-01-14 厦门大学 Software reliability quantitative evaluation method combining software development quality information
CN110866030A (en) * 2019-10-23 2020-03-06 中国科学院信息工程研究所 Database abnormal access detection method based on unsupervised learning
CN111143222A (en) * 2019-12-30 2020-05-12 军事科学院系统工程研究院系统总体研究所 Software evaluation method based on defect prediction
CN111782512A (en) * 2020-06-23 2020-10-16 北京高质系统科技有限公司 Multi-feature software defect comprehensive prediction method based on unbalanced noise set
CN111782548A (en) * 2020-07-28 2020-10-16 南京航空航天大学 Software defect prediction data processing method and device and storage medium
CN112580268A (en) * 2021-02-25 2021-03-30 上海冰鉴信息科技有限公司 Method and device for selecting machine learning model based on business processing
WO2021143175A1 (en) * 2020-01-14 2021-07-22 华为技术有限公司 Test case screening method and device, and medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810101A (en) * 2014-02-19 2014-05-21 北京理工大学 Software defect prediction method and system
CN103810102A (en) * 2014-02-19 2014-05-21 北京理工大学 Method and system for predicting software defects
CN105389598A (en) * 2015-12-28 2016-03-09 中国石油大学(华东) Feature selecting and classifying method for software defect data
CN105677564A (en) * 2016-01-04 2016-06-15 中国石油大学(华东) Adaboost software defect unbalanced data classification method based on improvement
CN105701013A (en) * 2016-01-04 2016-06-22 中国石油大学(华东) Software defect data feature selection method based on mutual information
CN106203534A (en) * 2016-07-26 2016-12-07 南京航空航天大学 A kind of cost-sensitive Software Defects Predict Methods based on Boosting
WO2018175496A1 (en) * 2017-03-20 2018-09-27 Versata Development Group, Inc. Code defect prediction by training a system to identify defect patterns in code history
CN108664402A (en) * 2018-05-14 2018-10-16 北京航空航天大学 A kind of failure prediction method based on software network feature learning
CN109165160A (en) * 2018-08-28 2019-01-08 北京理工大学 Software defect prediction model design method based on core principle component analysis algorithm
CN109325543A (en) * 2018-10-10 2019-02-12 南京邮电大学 Software Defects Predict Methods, readable storage medium storing program for executing and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
傅艺绮: "基于机器学习的软件缺陷预测方法与工具", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688152A (en) * 2019-09-27 2020-01-14 厦门大学 Software reliability quantitative evaluation method combining software development quality information
CN110688152B (en) * 2019-09-27 2021-01-01 厦门大学 Software reliability quantitative evaluation method combining software development quality information
CN110866030A (en) * 2019-10-23 2020-03-06 中国科学院信息工程研究所 Database abnormal access detection method based on unsupervised learning
CN111143222A (en) * 2019-12-30 2020-05-12 军事科学院系统工程研究院系统总体研究所 Software evaluation method based on defect prediction
WO2021143175A1 (en) * 2020-01-14 2021-07-22 华为技术有限公司 Test case screening method and device, and medium
CN111782512A (en) * 2020-06-23 2020-10-16 北京高质系统科技有限公司 Multi-feature software defect comprehensive prediction method based on unbalanced noise set
CN111782512B (en) * 2020-06-23 2021-07-09 北京高质系统科技有限公司 Multi-feature software defect comprehensive prediction method based on unbalanced noise set
CN111782548A (en) * 2020-07-28 2020-10-16 南京航空航天大学 Software defect prediction data processing method and device and storage medium
CN112580268A (en) * 2021-02-25 2021-03-30 上海冰鉴信息科技有限公司 Method and device for selecting machine learning model based on business processing

Also Published As

Publication number Publication date
CN110147321B (en) 2020-11-24

Similar Documents

Publication Publication Date Title
CN110147321A (en) A kind of recognition methods of the defect high risk module based on software network
Zhang et al. Multi-objective particle swarm optimization approach for cost-based feature selection in classification
CN107103332B (en) A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
Gaber et al. A survey of classification methods in data streams
Zhan et al. Consensus-driven propagation in massive unlabeled data for face recognition
Mantovani et al. To tune or not to tune: recommending when to adjust SVM hyper-parameters via meta-learning
Packianather et al. A wrapper-based feature selection approach using Bees Algorithm for a wood defect classification system
Kuo et al. Integration of artificial immune network and K-means for cluster analysis
CN107577605A (en) A kind of feature clustering system of selection of software-oriented failure prediction
das Dôres et al. A meta-learning framework for algorithm recommendation in software fault prediction
Lin et al. A new density-based scheme for clustering based on genetic algorithm
Bisht et al. Review Study on Software Defect Prediction Models premised upon Various Data Mining Approaches
Gao et al. An ensemble classifier learning approach to ROC optimization
Li et al. A fuzzy linear programming-based classification method
He et al. Ensemble multiboost based on ripper classifier for prediction of imbalanced software defect data
KR101085066B1 (en) An Associative Classification Method for detecting useful knowledge from huge multi-attributes dataset
Kumar et al. Classification of faults in web applications using machine learning
CN109086291A (en) A kind of parallel method for detecting abnormality and system based on MapReduce
CN112784881B (en) Network abnormal flow detection method, model and system
Choirunnisa et al. Software Defect Prediction using Oversampling Algorithm: A-SUWO
Kim et al. Optimization of average precision with maximal figure-of-merit learning
Ibrahim et al. LLAC: Lazy Learning in Associative Classification
Raamesh et al. Data mining based optimization of test cases to enhance the reliability of the testing
KR102134324B1 (en) Apparatus and method for extracting rules of artficial neural network
Li et al. A novel k-means classification method with genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant