CN108304316A - A software defect prediction method based on collaborative transfer

Info

Publication number: CN108304316A
Application number: CN201711417594.6A
Authority: CN
Inventors: 陈晋音; 胡可科; 杨奕涛; 方航
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2017-12-25
Filing date: 2017-12-25
Publication date: 2018-07-20
Anticipated expiration: 2037-12-25
Also published as: CN108304316B

Abstract

A software defect prediction method based on collaborative transfer, comprising the following steps: 1) combining four different standardization methods with the TCA transfer learning method to expand the original source project data sets with four new source project data sets of the same size; 2) building a collaborative classifier for the target project using a software defect prediction algorithm based on collaborative transfer; 3) performing defect prediction on new samples to be predicted in the target project. The invention combines four different standardization methods with the TCA transfer learning method to expand the source project data sets, which enriches the information representation of the source project data. A sub-classifier is generated for each source project data set, and the PSO algorithm adaptively assigns a weight to each sub-classifier to build the collaborative classifier, which then performs defect prediction on the samples to be tested in the target project.

Description

A software defect prediction method based on collaborative transfer

Technical field

The invention belongs to the field of software defect prediction algorithms, and in particular relates to a software defect prediction method based on collaborative transfer.

Background art

Software defect prediction can be divided into within-project defect prediction and cross-project defect prediction. Within-project defect prediction requires a large number of samples from the project itself, such as files, classes and functions, whose defect status is known. These samples serve as the training set from which a machine learning method generates a classifier that then predicts the target samples. Cross-project defect prediction instead predicts defects in the target project from the samples of other related projects. In real development, the target project is often too new, or labels are too costly to obtain, so the target project has too few training samples and cross-project defect prediction is frequently necessary. In most cross-project defect prediction algorithms, because the target project and the source project follow different development processes, their sample distributions usually differ. This difference is the biggest obstacle to using conventional machine learning algorithms and directly degrades prediction performance.

To address the large difference between the sample distributions of the source project and the target project in cross-project defect prediction, transfer learning has been introduced into software defect prediction. Current defect prediction algorithms based on transfer learning are mainly of two kinds, instance-based and feature-based. The former selects samples from the source project that help prediction on the target project; the latter maps the samples of the source project and the target project into a common latent feature space in which the features are re-expressed. Both can address the difference in sample distribution between the source project and the target project. Turhan et al. used the K-nearest-neighbor method to select, for each unlabeled sample in the target project, the ten most similar samples from the source project as training samples for the prediction model. Similar to the method of Turhan, Peters et al. also used nearest neighbors to select training samples, but with a different selection strategy. Ma et al. proposed the Transfer Naive Bayes (TNB) method, which reduces the difference in data distribution between the source project and the target project by assigning weights to the samples in the training set, and then builds the prediction model from the weighted training samples. Ryu et al. combined Boosting-SVM with a solution to the class imbalance problem to improve on the performance of TNB. Besides the above instance-based transfer learning, Pan et al. proposed a feature-based transfer learning method, Transfer Component Analysis (TCA), which learns a transformation that maps the source project and the target project into a latent space in which the distance between the two is as small as possible. Building on TCA, Nam et al. observed that the choice of standardization method strongly affects the transfer result, and therefore designed a set of rules for selecting a suitable standardization method to combine with TCA, proposing the TCA+ method. However, all of the above transfer learning methods address one-to-one cross-project defect prediction. When it cannot be determined which related source project gives the best prediction for the target project, predicting with only one source project wastes the others. How to use the sample information of the other source projects effectively, that is, multi-source transfer learning, is therefore an important problem.

For multi-source transfer learning, the most effective approach at present is to generate one classifier for each source project and then combine the classifiers to complete the transfer task. Schweikert et al. used a method called Multiple Convex Combination to combine the SVM classifiers generated from the labeled data of each source domain and of the target domain. Sun et al. proposed a method that needs no labeled target samples; based on Bayesian learning principles, it assigns weights by measuring how well each source domain fits the target domain, where the fit is expressed by the mean Euclidean distance between the k nearest samples of the source domain and the target domain. Yang et al. proposed an adaptive support vector regression, based on the support vector machine (SVM) combined with an adaptive function, that can be used for online monitoring in the target domain, but the weights of all sub-classifiers are equal. None of the above algorithms has been applied to software defect prediction.

In summary, current software defect prediction algorithms face the following problems. Transfer learning is very important for cross-project defect prediction, and the question is how to make the transfer learning algorithm fully use the useful information in the source projects so as to improve defect prediction performance on the target project. Different source projects give different prediction results for the target project; when it cannot be determined which source project performs best, the question is how, compared with one-to-one cross-project defect prediction, to take all other related source projects into account at the same time so as to improve prediction performance. To address these problems, a software defect prediction algorithm based on collaborative transfer is proposed here.

Summary of the invention

Current software defect prediction algorithms face the following problems. Transfer learning is very important for cross-project defect prediction, and the question is how to make the transfer learning algorithm fully use the useful information in the source projects so as to improve defect prediction performance on the target project. Different source projects give different prediction results for the target project, and the question is how, compared with one-to-one cross-project defect prediction, to take all other related source projects into account at the same time so as to improve prediction performance. The present invention provides a software defect prediction method based on collaborative transfer. It combines four different standardization methods with the TCA transfer learning method to expand the source project data sets and enrich the information representation of the source project data, generates one sub-classifier for each source project data set, and uses the PSO algorithm to adaptively assign weights to the sub-classifiers, thereby building a collaborative classifier that performs defect prediction on the samples to be tested in the target project.

The technical solution adopted by the present invention to solve this technical problem is:

A software defect prediction method based on collaborative transfer, the method comprising the following steps:

1) the original source project data sets are expanded by combining four different standardization methods with the TCA transfer learning method, as follows:

1.1) first, the labeled target samples in the target project are divided equally into a target training set and a target test set, which are required to contain the same number of defective samples;

1.2) all related source project data sets, together with the target test set, undergo four kinds of standardization, the four standardization methods being min-max standardization, Z-score standardization based on the mean and standard deviation common to the source domain and the target domain, Z-score standardization based on the mean and standard deviation of the source domain, and Z-score standardization based on the mean and standard deviation of the target domain;

1.3) the TCA algorithm is called to perform transfer learning, with respect to the target test set, on each of the four standardized source project data sets and on the original unprocessed source project data set, yielding the expanded new source project data sets and target test sets;

2) a collaborative classifier is built for the target project using the software defect prediction algorithm based on collaborative transfer, as follows:

2.1) a sub-classifier is generated, using a decision tree algorithm from machine learning, for each source project data set in the expanded data sets and for the target training set;

2.2) a weight is adaptively assigned to each sub-classifier to obtain a collaborative classifier;

3) defect prediction is performed on new samples to be predicted in the target project, as follows:

3.1) each new sample undergoes preprocessing consisting of standardization and transfer learning;

3.2) the trained collaborative classifier is called to classify each preprocessed new sample and predict whether it contains defects.

Further, the process of step 1.3) is as follows:

1.3.1) determine the dimension of the re-expressed features obtained after TCA transfer, i.e. the dimension of the latent feature space;

1.3.2) according to the chosen latent space dimension, determine a transformation by means of a Gaussian kernel function such that, after the source project data set and the target data set are transformed from the original feature space into the latent feature space, the difference between their distributions is reduced;

1.3.3) expand the original N source project data sets and 1 target test set into 5*N source project data sets and the corresponding 4*N+1 target test sets.
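As an illustrative aid only, the bookkeeping behind step 1.3.3) can be sketched in Python. The variant labels and function names below are hypothetical, and the reading of the 4*N+1 count (four standardized target test sets per source project plus the one unstandardized target test set) is an assumption inferred from the counts stated above, not something the text spells out.

```python
from itertools import product

# Hypothetical labels: the unstandardized original plus the four
# standardization methods named in step 1.2).
VARIANTS = ("original", "min_max", "z_joint", "z_source", "z_target")

def enumerate_expanded_sets(source_names):
    """List the (source project, variant) pairs; each pair yields one
    source project data set after TCA transfer."""
    return list(product(source_names, VARIANTS))

def count_target_test_sets(n_sources):
    """Assumed reading of 4*N+1: four standardized target test sets per
    source project, plus the single unstandardized target test set."""
    return 4 * n_sources + 1

pairs = enumerate_expanded_sets(["projA", "projB", "projC"])
print(len(pairs))                 # 15 source data sets for N = 3
print(count_target_test_sets(3))  # 13 target test sets for N = 3
```

Each pair in the list corresponds to one sub-classifier trained in step 2.1).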

Further, the process of step 2.2) is as follows:

2.2.1) first define the collaborative classifier and the objective function:

Definition 1 (collaborative classifier): the classifier obtained by combining all sub-classifiers, each weighted according to its contribution, is the collaborative classifier. The collaborative classifier classifies a new sample j as follows:

Comp(j)=Σ_{i=1}^{M} w_i × Score_i(j) (1)

sample j is defective if Comp(j)>threshold, and defect-free otherwise (2)

where Score_i(j) is the confidence given by sub-classifier C_i, i.e. the likelihood that sample j is a defective sample; the confidence lies in the interval from 0 to 1. w_i is the weight of each sub-classifier and expresses that sub-classifier's contribution to the collaborative classifier. M is the number of sub-classifiers, and threshold is the confidence threshold used to judge whether the sample contains defects. If Comp(j), the weighted sum of the confidences of all sub-classifiers, is greater than the threshold, the sample is classified as defective; otherwise it is classified as defect-free.

Definition 2 (objective function): for the optimization process of adaptive weight assignment, F-measure is used as the objective function, computed as:

F=(2 × P × R)/(P+R) (3)

P=TP/ (TP+FP) (4)

R=TP/ (TP+FN) (5)

where TP is the number of true positives, i.e. the number of samples predicted defective that really are defective; FP is the number of false positives, i.e. the number of samples predicted defective that are actually defect-free; FN is the number of false negatives, i.e. the number of samples predicted defect-free that actually contain defects. From these, P is the precision of the classification, the proportion of samples predicted defective that really are defective; the higher this value, the more accurate the classifier. R is the recall of the classification, the proportion of really defective samples that are predicted defective; the higher this value, the more defective samples are found. F-measure is the harmonic mean of precision and recall; the higher this value, the more completely and accurately the collaborative classifier formed from this set of weights and threshold finds the defective samples, i.e. the better its prediction performance.
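Definitions 1 and 2 can be illustrated with a short Python sketch. The function names, the 0/1 label encoding (1 for defective) and the toy confidence values are assumptions made for illustration; the text above fixes only the weighted sum, the threshold test and the F-measure formulas.

```python
def comp(scores, weights):
    """Comp(j): weighted sum of the sub-classifier confidences for one sample."""
    return sum(w * s for w, s in zip(weights, scores))

def classify(scores, weights, threshold):
    """Return 1 (defective) if Comp(j) exceeds the threshold, else 0."""
    return 1 if comp(scores, weights) > threshold else 0

def f_measure(y_true, y_pred):
    """F = 2PR / (P + R), with P = TP/(TP+FP) and R = TP/(TP+FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is found
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 4 samples, M = 2 sub-classifiers (made-up confidences).
score_matrix = [[0.9, 0.8], [0.2, 0.4], [0.6, 0.3], [0.1, 0.2]]
y_true = [1, 0, 1, 0]
y_pred = [classify(row, [0.5, 0.5], 0.5) for row in score_matrix]
print(y_pred, f_measure(y_true, y_pred))
```

In this toy run only the first sample clears the threshold, so precision is 1 and recall is 0.5; lowering the threshold or shifting weight toward the first sub-classifier would recover the second defective sample, which is the kind of trade-off the weight search is meant to resolve.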

2.2.2) the PSO algorithm is introduced into the adaptive weight assignment process. First, a set of particles is generated at random according to the population size to initialize the population. Each combination of weights and threshold is one solution, and the set of all solutions is represented as a population in a search space. The position of a particle is described by a series of coordinate values, each of which represents one part of a solution, i.e. a weight or the threshold.

2.2.3) the fitness of each particle is calculated; the fitness here is the prediction performance on the target test set of the collaborative classifier formed from that set of weights and threshold, measured by F-measure.

2.2.4) the position and velocity of each particle are updated according to the best position the particle itself has visited and the best position the whole population has visited, i.e. the weight assignment and threshold setting at which the F-measure obtained was largest; the velocity indicates the distance and direction of the particle's movement.

2.2.5) return to step 2.2.3) until the maximum number of iterations is reached, then output the position of the particle in the population with the largest F-measure value as the optimal weights and threshold.

2.2.6) according to the optimal weights and threshold, all sub-classifiers are jointly built into a final collaborative classifier.

In step 1.3), four different standardization methods are combined with the TCA transfer learning method to expand the source project data sets and enrich the information representation of the source project data. This is the first time that multiple standardization methods combined with transfer learning have been used in a software defect prediction algorithm, and the transfer performance is clearly better than that of other methods.

In step 2), the software defect prediction algorithm based on collaborative transfer is used. The algorithm can make full use of the different information that each source project expresses after processing by the multiple standardization methods, so as to build more comprehensive sub-classifiers. It adaptively assigns a weight to each sub-classifier according to the prediction performance on the target test set and builds the collaborative classifier from these weights, thereby optimizing multi-source transfer learning and ultimately improving cross-project software defect prediction performance.

In step 3), when defect prediction is performed on each new sample to be tested in the target project, the sample is first preprocessed together with the other related source projects, the preprocessing consisting of the multiple standardization methods and TCA transfer learning, so as to provide a new training set for each sub-classifier. The threshold and weights of the trained collaborative classifier are then used to classify the new sample, accomplishing cross-project defect prediction for each target sample to be tested.

The technical concept of the present invention is as follows. A software defect prediction algorithm based on collaborative transfer is proposed. The algorithm first combines the TCA algorithm with multiple standardization methods to fully extract the rich information in the source project data sets and to reduce the difference in data distribution between the source projects and the target project, thereby expanding the source project data sets. Next, a sub-classifier is trained with a decision tree algorithm on each data set in the expanded source project data sets; for the same target sample to be tested, each sub-classifier gives a confidence that the sample is defective. The software defect prediction algorithm based on collaborative transfer is then called to obtain a collaborative classifier, which combines the sub-classifiers with emphasis according to each one's contribution to the collaborative classifier. Finally, after the target sample to be tested has undergone preprocessing that combines the standardization methods and TCA, the trained collaborative classifier is called to perform defect prediction.

The beneficial effects of the present invention are mainly as follows. By combining the TCA algorithm with multiple standardization methods, the difference in sample distribution between the source projects and the target project is reduced as far as possible while all the information the source projects can provide is fully used, and the newly generated data sets are added to the source project data sets. The software defect prediction algorithm based on collaborative transfer then yields a collaborative classifier that can use all related source projects to perform defect prediction on the target project, and this collaborative classifier is called to carry out cross-project defect prediction on each target sample to be tested. The algorithm was tested on 5 project groups comprising 28 software projects in total, and the results show that it can make full use of the information in all source projects and effectively improve prediction performance.

Description of the drawings

Fig. 1 is a structure diagram of the software defect prediction method based on collaborative transfer.

Fig. 2 is a flow chart of the software defect prediction method based on collaborative transfer.

Detailed description

The invention is further described below with reference to the accompanying drawings.

Referring to Fig. 1 and Fig. 2, a software defect prediction method based on collaborative transfer comprises the following steps:

1) four different standardization methods are combined with the TCA transfer learning method to expand the original source project data sets with four new source project data sets of the same size, as follows:

1.1) first, the labeled target samples in the target project are divided by label into two parts, a target training set and a target test set, where the two sets are required to contain the same number of samples of each label and both must contain defective samples; the other, unlabeled samples in the target project serve as the target samples to be tested;
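A minimal Python sketch of the split in step 1.1) follows. The function name, the label encoding (1 for defective, 0 for defect-free), the random shuffling and the choice to drop an odd leftover sample are all assumptions; the text requires only that both sets hold the same number of samples of each label and that both contain defective samples.

```python
import random

def split_labeled_target(samples, labels, seed=0):
    """Split labeled target samples into a training set and a test set
    that contain the same number of samples of each label."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(labels)):
        group = [s for s, l in zip(samples, labels) if l == label]
        rng.shuffle(group)
        half = len(group) // 2  # an odd leftover sample is dropped
        train += [(s, label) for s in group[:half]]
        test += [(s, label) for s in group[half:2 * half]]
    # Both sets receive the same per-label counts, so checking one suffices.
    if not any(l == 1 for _, l in train):
        raise ValueError("each set needs at least one defective sample")
    return train, test

# Toy example: sample ids 0..6, three defective and four defect-free.
train_set, test_set = split_labeled_target(list(range(7)), [1, 1, 1, 0, 0, 0, 0])
print(train_set)
print(test_set)
```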

1.2) for all current source data sets related to the target project, standardization is performed together with the target test set, using 4 standardization methods:

The first method is min-max standardization, computed as follows:

x'_i=(x_i-min(x))/(max(x)-min(x)) (1)

The second method is Z-score standardization based on the mean and standard deviation common to the source project and the target project, computed as follows:

x'_i=(x_i-mean(x))/std(x) (2)

The third method is Z-score standardization based on the mean and standard deviation of the source project, computed as follows:

x'_i=(x_i-mean(x_src))/std(x_src) (3)

The fourth method is Z-score standardization based on the mean and standard deviation of the target project, computed as follows:

x'_i=(x_i-mean(x_tar))/std(x_tar) (4)

where x is the vector of one feature dimension in the data set obtained by merging the source project with the target training set, x_src and x_tar are the parts of x that come from the source project and the target project respectively, x_i is the value of the i-th sample in x, min() takes the minimum, max() takes the maximum, mean() takes the mean, std() takes the standard deviation, and x'_i is the standardized value of x_i; after the four methods re-express the original data, the information each representation carries differs;
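The four standardization methods of step 1.2) can be sketched for a single feature column as follows. This is a minimal illustration under stated assumptions: the function and key names are hypothetical, the population standard deviation (`statistics.pstdev`) is assumed because the text does not say which estimator is meant, and a constant feature is guarded against by dividing by 1.

```python
from statistics import mean, pstdev

def min_max(src, tar):
    """Scale both columns with the min and max of the merged column."""
    merged = src + tar
    lo, hi = min(merged), max(merged)
    span = (hi - lo) or 1.0  # guard against a constant feature
    return [(v - lo) / span for v in src], [(v - lo) / span for v in tar]

def z_score(src, tar, reference):
    """Z-score both columns with the mean and std of `reference`."""
    mu = mean(reference)
    sigma = pstdev(reference) or 1.0  # guard against a constant feature
    return [(v - mu) / sigma for v in src], [(v - mu) / sigma for v in tar]

def standardize_four_ways(src, tar):
    """Return the four standardized (source, target) column pairs."""
    return {
        "min_max": min_max(src, tar),
        "z_joint": z_score(src, tar, src + tar),  # common mean and std
        "z_source": z_score(src, tar, src),       # source mean and std
        "z_target": z_score(src, tar, tar),       # target mean and std
    }

source_feature = [1.0, 2.0, 3.0]
target_feature = [2.0, 4.0]
for name, (s, t) in standardize_four_ways(source_feature, target_feature).items():
    print(name, s, t)
```

Applying the function to every feature column of a source project and the target data gives the four standardized copies that, together with the unstandardized original, are passed to TCA in step 1.3).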

1.3) the TCA algorithm is called to perform transfer learning, with respect to the corresponding target test set, on each of the source project data sets after the above 4 kinds of standardization and on the original unprocessed source project data set, yielding new source project data sets and target test sets, as follows:

1.3.1) determine the dimension of the re-expressed features of the data set obtained after TCA transfer, i.e. the dimension of the latent feature space, which is set to half of the original dimension;

1.3.2) according to the chosen latent space dimension, determine a transformation by means of a Gaussian kernel function such that, after the source project data set and the target data set are transformed from the original feature space into the latent feature space, the maximum mean discrepancy between the two is minimized; the maximum mean discrepancy is computed as:

MMD(src,tar)=‖(1/n₁) Σ_{i=1}^{n₁} φ(src_i) - (1/n₂) Σ_{i=1}^{n₂} φ(tar_i)‖² (5)

where src is the source project data set, tar is the target project data set, n₁ is the number of samples in the source project data set, n₂ is the number of samples in the target project data set, src_i is the i-th sample in the source project, tar_i is the i-th sample in the target project, and φ(·) denotes the transformation into the latent feature space;

1.3.3) expand the original N source project data sets and 1 target test set into 5*N source project data sets and the corresponding 4*N+1 target test sets;
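To make the quantity minimized in step 1.3.2) concrete, the sketch below computes the standard biased kernel estimate of the squared maximum mean discrepancy between two small sample sets with a Gaussian kernel. It does not implement TCA itself (learning the transformation requires an eigendecomposition); the function names and the kernel width `sigma` are assumptions.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(src, tar, sigma=1.0):
    """Biased estimate of squared MMD: the squared distance between the
    kernel mean embeddings of the source and target samples."""
    n1, n2 = len(src), len(tar)
    k_ss = sum(gaussian_kernel(a, b, sigma) for a in src for b in src) / (n1 * n1)
    k_tt = sum(gaussian_kernel(a, b, sigma) for a in tar for b in tar) / (n2 * n2)
    k_st = sum(gaussian_kernel(a, b, sigma) for a in src for b in tar) / (n1 * n2)
    return k_ss + k_tt - 2.0 * k_st

src = [[0.0, 0.0], [1.0, 0.0]]
near = [[0.1, 0.0], [0.9, 0.0]]   # distribution close to src
far = [[5.0, 5.0], [6.0, 5.0]]    # distribution far from src
print(mmd_squared(src, near), mmd_squared(src, far))
```

A smaller value means the two sample sets look more alike in the kernel feature space, which is the property the transformation in step 1.3.2) is chosen to achieve.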

2) a collaborative classifier is built for the target project using the software defect prediction algorithm based on collaborative transfer, as follows:

2.1) a sub-classifier is generated, using a decision tree algorithm from machine learning, for each source project data set in the expanded data sets and for the target training set; the decision tree algorithm is the J48 algorithm in WEKA;

2.2) weights are adaptively assigned to the sub-classifiers according to the performance of the collaborative classifier, as follows:

2.2.1) first define the collaborative classifier and the F-measure index used to evaluate it:

Definition 1 (collaborative classifier): the classifier obtained by combining all sub-classifiers, each weighted according to its contribution, is the collaborative classifier. The collaborative classifier classifies a new sample j as follows:

Comp(j)=Σ_{i=1}^{M} w_i × Score_i(j) (6)

sample j is defective if Comp(j)>threshold, and defect-free otherwise (7)

where Score_i(j) is the confidence given by sub-classifier C_i, i.e. the likelihood that sample j is a defective sample; the confidence lies in the interval from 0 to 1. w_i is the weight of each sub-classifier and expresses that sub-classifier's contribution to the collaborative classifier. M is the number of sub-classifiers, and threshold is the confidence threshold used to judge whether the sample contains defects. If Comp(j), the weighted sum of the confidences of all sub-classifiers, is greater than the threshold, the sample is classified as defective; otherwise it is classified as defect-free;

Definition 2 (objective function): for the optimization process of adaptive weight assignment, F-measure is used as the objective function, computed as:

F=(2 × P × R)/(P+R) (8)

P=TP/ (TP+FP) (9)

R=TP/ (TP+FN) (10)

where TP, FP and FN are the numbers of true positive, false positive and false negative samples respectively, P is the precision and R is the recall.

2.2.2) the PSO algorithm is used to adaptively assign weights to the sub-classifiers: all sub-classifiers are assigned a series of weights (w₁,w₂,...,w_M) and one defect decision threshold, threshold. First the population size and the maximum number of iterations are set, and then a set of particles is generated at random according to the population size to initialize the population. Each combination of weights and threshold is one solution, and the set of all solutions is represented as a population in a search space. The position of a particle is described by a series of coordinate values, each of which represents one part of a solution, i.e. a weight or the threshold.

2.2.3) the fitness of each particle is calculated; the fitness here is the prediction performance on the target test set of the collaborative classifier formed from that set of weights and threshold, expressed by F-measure and computed as in Definition 2 of 2.2.1).

2.2.4) the position and velocity of each particle are then updated according to the best position the particle itself has visited and the best position the whole population has visited, i.e. the weight assignment and threshold setting at which the F-measure obtained was largest; the velocity indicates the distance and direction of the particle's movement.

2.2.5) return to step 2.2.3) and iterate until the maximum number of iterations is reached, then output the position of the particle in the population with the largest F-measure value as the optimal weights and threshold.

2.2.6) according to the optimal weights and threshold, all sub-classifiers are jointly built into a final collaborative classifier.
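Steps 2.2.2) to 2.2.6) can be sketched as a small particle swarm search in Python. This is a sketch under stated assumptions, not the patented implementation: the text does not give the inertia weight, the acceleration coefficients `c1` and `c2`, the search bounds or the population size, so the values below are common defaults chosen for illustration, and positions are clamped to [0, 1]. The sub-classifier confidences in `scores` are made-up toy values standing in for J48 outputs on the target test set.

```python
import random

def f_measure(y_true, y_pred):
    """F-measure from TP, FP and FN; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def fitness(position, score_matrix, y_true):
    """Fitness of a particle: F-measure of the weighted vote it encodes."""
    weights, threshold = position[:-1], position[-1]
    y_pred = [1 if sum(w * s for w, s in zip(weights, row)) > threshold else 0
              for row in score_matrix]
    return f_measure(y_true, y_pred)

def pso_weights(score_matrix, y_true, n_particles=20, n_iter=50,
                inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for weights and a threshold that maximize F-measure."""
    rng = random.Random(seed)
    dim = len(score_matrix[0]) + 1  # M weights plus one threshold
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p, score_matrix, y_true) for p in pos]
    best = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i], score_matrix, y_true)
            if fit > pbest_fit[i]:   # best position this particle has visited
                pbest[i], pbest_fit[i] = pos[i][:], fit
            if fit > gbest_fit:      # best position the population has visited
                gbest, gbest_fit = pos[i][:], fit
    return gbest[:-1], gbest[-1], gbest_fit

# Toy confidences of M = 2 sub-classifiers on 4 target test samples.
scores = [[0.9, 0.1], [0.8, 0.9], [0.2, 0.8], [0.1, 0.2]]
labels = [1, 1, 0, 0]
weights, threshold, best_f = pso_weights(scores, labels)
print(weights, threshold, best_f)
```

In the toy data the first sub-classifier separates the classes and the second does not, so a good solution is expected to give the first one the larger weight.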

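3) defect prediction is performed on new samples to be predicted in the target project, as follows:

3.1) each new sample is preprocessed, the preprocessing consisting of the four standardization methods and TCA transfer learning;

3.2) the trained collaborative classifier is called to classify each preprocessed new sample and predict whether it contains defects.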

Claims

1. A software defect prediction method based on collaborative transfer, characterized in that the method comprises the following steps:

1) four different standardization methods are combined with the TCA transfer learning method to expand the original source project data sets with four new source project data sets of the same size, as follows:

1.1) first, the labeled target samples in the target project are divided by label into two parts, a target training set and a target test set, where the two sets are required to contain the same number of samples of each label and both must contain defective samples; the other, unlabeled samples in the target project serve as the target samples to be tested;

1.2) for all current source data sets related to the target project, standardization is performed together with the target test set, using 4 standardization methods:

The first method is min-max standardization, computed as follows:

x'_i=(x_i-min(x))/(max(x)-min(x)) (1)

The second method is Z-score standardization based on the mean and standard deviation common to the source project and the target project, computed as follows:

x'_i=(x_i-mean(x))/std(x) (2)

The third method is Z-score standardization based on the mean and standard deviation of the source project, computed as follows:

x'_i=(x_i-mean(x_src))/std(x_src) (3)

The fourth method is Z-score standardization based on the mean and standard deviation of the target project, computed as follows:

x'_i=(x_i-mean(x_tar))/std(x_tar) (4)

where x is the vector of one feature dimension in the data set obtained by merging the source project with the target training set, x_src and x_tar are the parts of x that come from the source project and the target project respectively, x_i is the value of the i-th sample in x, min() takes the minimum, max() takes the maximum, mean() takes the mean, std() takes the standard deviation, and x'_i is the standardized value of x_i;

1.3) the TCA algorithm is called to perform transfer learning, with respect to the corresponding target test set, on each of the source project data sets after the above 4 kinds of standardization and on the original unprocessed source project data set, yielding new source project data sets and target test sets;

2) a collaborative classifier is built for the target project using the software defect prediction algorithm based on collaborative transfer, as follows:

2.1) a sub-classifier is generated, using a decision tree algorithm from machine learning, for each source project data set in the expanded data sets and for the target training set; the decision tree algorithm is the J48 algorithm in WEKA;

2.2) weights are adaptively assigned to the sub-classifiers according to the performance of the collaborative classifier;

3) defect prediction is performed on new samples to be predicted in the target project, as follows:

3.1) each new sample is preprocessed, the preprocessing consisting of the four standardization methods and TCA transfer learning;

3.2) the trained collaborative classifier is called to classify each preprocessed new sample and predict whether it contains defects.

2. The software defect prediction method based on collaborative transfer according to claim 1, characterized in that the process of step 1.3) is as follows:

1.3.1) determine the dimension of the re-expressed features of the data set obtained after TCA transfer, i.e. the dimension of the latent feature space, which is set to half of the original dimension;

1.3.2) according to the chosen latent space dimension, determine a transformation by means of a Gaussian kernel function such that, after the source project data set and the target data set are transformed from the original feature space into the latent feature space, the maximum mean discrepancy between the two is minimized; the maximum mean discrepancy is computed as:

MMD(src,tar)=‖(1/n₁) Σ_{i=1}^{n₁} φ(src_i) - (1/n₂) Σ_{i=1}^{n₂} φ(tar_i)‖² (5)

where src is the source project data set, tar is the target project data set, n₁ is the number of samples in the source project data set, n₂ is the number of samples in the target project data set, src_i is the i-th sample in the source project, tar_i is the i-th sample in the target project, and φ(·) denotes the transformation into the latent feature space;

1.3.3) expand the original N source project data sets and 1 target test set into 5*N source project data sets and the corresponding 4*N+1 target test sets.

3. The software defect prediction method based on collaborative transfer according to claim 1 or 2, characterized in that the process of step 2.2) is as follows:

2.2.1) first define the collaborative classifier and the F-measure index used to evaluate it:

Definition 1: the classifier obtained by combining all sub-classifiers, each weighted according to its contribution, is the collaborative classifier; the collaborative classifier classifies a new sample j as follows:

Comp(j)=Σ_{i=1}^{M} w_i × Score_i(j) (6)

sample j is defective if Comp(j)>threshold, and defect-free otherwise (7)

where Score_i(j) is the confidence given by sub-classifier C_i, i.e. the likelihood that sample j is a defective sample; the confidence lies in the interval from 0 to 1; w_i is the weight of each sub-classifier and expresses that sub-classifier's contribution to the collaborative classifier; M is the number of sub-classifiers; threshold is the confidence threshold used to judge whether the sample contains defects; if Comp(j), the weighted sum of the confidences of all sub-classifiers, is greater than the threshold, the sample is classified as defective, and otherwise as defect-free;

Definition 2: for the optimization process of adaptive weight assignment, F-measure is used as the objective function, computed as:

F=(2 × P × R)/(P+R) (8)

P=TP/ (TP+FP) (9)

R=TP/ (TP+FN) (10)

where TP is the number of true positives, i.e. the number of samples predicted defective that really are defective; FP is the number of false positives, i.e. the number of samples predicted defective that are actually defect-free; FN is the number of false negatives, i.e. the number of samples predicted defect-free that actually contain defects; from these, P is the precision of the classification, the proportion of samples predicted defective that really are defective, and the higher this value, the more accurate the classifier; R is the recall of the classification, the proportion of really defective samples that are predicted defective; F-measure is the harmonic mean of precision and recall;

2.2.2) the PSO algorithm is used to adaptively assign weights to the sub-classifiers: all sub-classifiers are assigned a series of weights (w₁,w₂,...,w_M) and one defect decision threshold, threshold; first the population size and the maximum number of iterations are set, and then a set of particles is generated at random according to the population size to initialize the population; each combination of weights and threshold is one solution, and the set of all solutions is represented as a population in a search space; the position of a particle is described by a series of coordinate values, each of which represents one part of a solution, i.e. a weight or the threshold;

2.2.3) the fitness of each particle is calculated; the fitness is the prediction performance on the target test set of the collaborative classifier formed from that set of weights and threshold, expressed by F-measure;

2.2.4) the position and velocity of each particle are then updated according to the best position the particle itself has visited and the best position the whole population has visited, i.e. the weight assignment and threshold setting at which the F-measure obtained was largest; the velocity indicates the distance and direction of the particle's movement;

2.2.5) return to step 2.2.3) and iterate until the maximum number of iterations is reached, then output the position of the particle in the population with the largest F-measure value as the optimal weights and threshold;

2.2.6) according to the optimal weights and threshold, all sub-classifiers are jointly built into a final collaborative classifier.