Disclosure of Invention
In view of the problems in the prior art, the invention discloses a bug report severity identification method based on text feature extraction and imbalance processing strategies, which specifically comprises the following steps:
S1, collecting repaired bugs from a bug repository as an original data set and preprocessing it as follows: removing invalid bug reports from the data set, extracting the text information of the retained bug reports, converting the text information into a text vector matrix with a text preprocessing method, removing low-frequency words from the bug report descriptions, and labeling the severity of each bug report;
S2, to address the class imbalance of the original data, applying four imbalance processing strategies (a cost matrix, random undersampling, random oversampling, and synthetic minority oversampling) to the text matrix of the original data set to obtain a balanced data set;
S3, to address two data noise problems in the balanced data set, namely the noise inherent in the original data set and the noise newly introduced by the balancing operation: combining a genetic algorithm with feature extraction, instance extraction, and simultaneous feature-and-instance extraction to reduce the data set, and taking the reduced data set matrix as the final training set;
S4, learning from the resulting small-scale, high-quality data set with four classification algorithms, namely naive Bayes, multinomial naive Bayes, K-nearest neighbor, and support vector machine, to establish classification models;
and S5, predicting the severity of the newly submitted bug report.
The following method is specifically adopted in S1:
S11: removing from the original data set the bug reports whose severity status is normal or enhancement;
S12: setting the severity of bug reports whose status label is major, critical, or blocker to "severe", and setting the severity of bug reports whose status label is trivial or minor to "non-severe";
S13: extracting the short description and long description of each bug report retained in the data set as its description information, and applying word segmentation, stop-word removal, and stemming to the description of each bug report to form a text matrix;
S14: deleting the feature columns corresponding to low-frequency words in the text matrix and keeping only the columns with high word frequency.
In S3, the following method is specifically adopted:
S31: initializing a population of NP individuals, where each individual represents one extraction scheme for the data set;
S32: calculating the fitness function value of each individual in the population, and recording the extraction scheme of the individual with the largest function value;
S33: performing selection, crossover, and mutation operations on the binary string of each individual in the population;
S34: taking the NP individuals processed in S33 as the new population, calculating the fitness of each individual in the new population, selecting the individual with the largest fitness, comparing its function value with the value recorded in S32, and recording the extraction scheme of whichever individual has the larger value;
S35: returning to S32 and iterating until the set maximum number of iterations is reached; the algorithm then stops, and the extraction scheme of the individual with the largest recorded fitness value is taken as the final extraction scheme for the data set.
By adopting the above technical scheme, the bug report severity identification method based on text feature extraction and imbalance processing strategies provided by the invention applies balancing operations and genetic-algorithm-based extraction to the bug report data set, so that the resulting classification model classifies newly submitted bug reports without bias, fits the two classes evenly, and avoids the limitations of learning from skewed data. Through the extraction operation, features and instances can be extracted simultaneously, yielding a smaller, higher-quality data set. This improves the accuracy of bug report severity identification, saves the time and labor cost of severity assessment, raises working efficiency, and allows developers to handle the more severe bug reports first.
Detailed Description
In order to make the technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments:
As shown in FIG. 1, the method for identifying bug report severity based on text feature extraction and imbalance processing strategies specifically includes the following steps:
S1, collecting repaired bugs from a bug repository as an original data set and preprocessing it as follows: removing invalid bug reports from the data set, extracting the text information of the retained bug reports, converting the text information into a text vector matrix with a text preprocessing method, removing low-frequency words from the bug report descriptions, and labeling the severity of each bug report;
Further, the preprocessing of the original data set specifically comprises the following steps:
S11: removing from the original data set the bug reports whose severity status is normal or enhancement;
S12: setting the severity of bug reports whose status label is major, critical, or blocker to "severe", and setting the severity of bug reports whose status label is trivial or minor to "non-severe";
S13: extracting the short description and long description of each bug report retained in the data set as its description information, and applying word segmentation, stop-word removal, and stemming to each description to form a text matrix M_BR. Each row of the matrix represents a bug report and each column represents a word; s_ij denotes the number of occurrences of the j-th word in the i-th bug report, where i ∈ [1, M], j ∈ [1, N], M is the total number of samples in the data set, and N is the total number of distinct words.
S14: deleting the characteristic columns corresponding to the low-frequency words in the text matrix, and only keeping the characteristic columns with high word frequency;
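For illustration only, the preprocessing of S11 to S14 might be sketched in Python as follows; the stop-word list, the crude suffix-stripping stemmer, and the min_freq threshold are illustrative assumptions rather than the exact choices of the invention:

```python
import re
from collections import Counter

# Illustrative stop-word list; a real pipeline would use a full list.
STOP_WORDS = {"the", "a", "is", "to", "in", "on", "when", "of"}

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_text_matrix(reports, min_freq=2):
    """Tokenize, remove stop words, stem, and build an M x N count matrix;
    feature columns whose total frequency is below min_freq are dropped (S14)."""
    docs = []
    for text in reports:
        tokens = [stem(t) for t in re.findall(r"[a-z]+", text.lower())
                  if t not in STOP_WORDS]
        docs.append(Counter(tokens))
    total = Counter()
    for d in docs:
        total.update(d)
    vocab = sorted(w for w, c in total.items() if c >= min_freq)
    matrix = [[d[w] for w in vocab] for d in docs]
    return vocab, matrix
```

Each returned row corresponds to one bug report and each column to a retained word, matching the matrix M_BR described above.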
S2, to address the class imbalance of the original data, applying four imbalance processing strategies (a cost matrix, random undersampling, random oversampling, and synthetic minority oversampling) to the text matrix of the original data set to obtain a balanced data set. The strategy that works best can be selected for the specific data set. The specific content of the algorithms is as follows:
Sampling algorithms address imbalanced learning at the data level. At the algorithm level, imbalanced learning is mainly addressed by cost-sensitive learning, whose core element is the cost matrix.
Random undersampling balances the class distribution by randomly eliminating majority-class samples until the majority and minority classes reach a balance, after which classification proceeds.
Random oversampling increases the number of minority-class instances by randomly replicating them until the data is balanced, after which classification proceeds.
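A minimal sketch of the two random sampling strategies, assuming plain Python lists for the text matrix X and the label vector y (the function names are illustrative):

```python
import random

def random_undersample(X, y, seed=0):
    """Randomly discard majority-class instances until all classes balance."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    groups = {c: [i for i, lab in enumerate(y) if lab == c] for c in classes}
    n_min = min(len(idx) for idx in groups.values())
    keep = []
    for c in classes:
        keep.extend(rng.sample(groups[c], n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

def random_oversample(X, y, seed=0):
    """Randomly duplicate minority-class instances until all classes balance."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    groups = {c: [i for i, lab in enumerate(y) if lab == c] for c in classes}
    n_max = max(len(idx) for idx in groups.values())
    out_X, out_y = list(X), list(y)
    for c in classes:
        for _ in range(n_max - len(groups[c])):
            i = rng.choice(groups[c])
            out_X.append(X[i])
            out_y.append(y[i])
    return out_X, out_y
```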
Synthetic minority oversampling analyzes the minority-class samples and artificially synthesizes new samples from them to add to the data set. The flow of the algorithm is as follows:
(1) For each sample x in the minority class, compute the Euclidean distance from x to every other sample in the minority-class set S_min and obtain its k nearest neighbors.
(2) Set a sampling ratio according to the class imbalance ratio to determine a sampling multiplier N, and randomly select several samples from the k nearest neighbors of each minority-class sample x; let a selected neighbor be x_n.
(3) For each randomly selected neighbor x_n, construct a new sample according to:
x_new = x + rand(0,1) * |x - x_n|
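The synthesis step can be sketched as follows; note that the code uses the standard SMOTE interpolation x + rand(0,1) * (x_n - x), which moves x toward its neighbour, and that the parameters k, n_new, and seed are illustrative assumptions:

```python
import math
import random

def smote_samples(minority, k=2, n_new=2, seed=0):
    """Generate n_new synthetic minority samples by interpolating each chosen
    point toward one of its k nearest minority neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x, excluding x itself.
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        xn = rng.choice(neighbours)
        gap = rng.random()  # rand(0,1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, xn)])
    return synthetic
```

Because each new point is a convex combination of two minority samples, it always lies on the segment between them.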
S3, to address two data noise problems in the balanced data set, namely the noise inherent in the original data set and the noise newly introduced by the balancing operation: a small-scale, high-quality data set is obtained through three processing modes, namely feature extraction, instance extraction, and simultaneous feature-and-instance extraction. The concrete steps are as follows:
the method comprises the following steps: the gene encodes. And representing the feature sequences in the data set as a feature vector of 1 × n (feature total number), and representing the selected feature combinations by using a {0,1} binary string, wherein 0 represents that the corresponding features are not selected, and 1 represents that the corresponding features are selected. Similarly, the sequence of instances in the dataset is represented as a feature vector of 1 × m (total number of instances), the selected instances are combined and represented by a {0,1} binary string, 0 represents that no corresponding instance is selected, and 1 represents that a corresponding instance is selected.
Step two: population initialization. When the genetic algorithm performs feature extraction only, the initial population is generated as follows: several basic feature ranking algorithms (such as IG, CHI, OneR, and Relief) are used to sort the features of the data set from high to low importance, and the top 10%, 20%, and so on up to 90% of each ranked feature set are taken, yielding the NP individuals of the initial population. When the genetic algorithm performs instance extraction only: for each gene position of each individual in the population, a random number in [0,1] is generated; the corresponding binary position of the individual is set to 1 if the number is greater than or equal to 0.5, and to 0 otherwise. When the genetic algorithm extracts features and instances simultaneously: the binary strings generated for feature-only extraction are concatenated with the binary strings generated for instance-only extraction to obtain the individuals of the initial population.
Step three: selection. The fitness of each individual in the population is calculated from the fitness function. The individuals with the largest fitness values, amounting to one sixth of the population, are copied directly into the next generation (preserving the elite genes), and roulette-wheel selection is then used to generate the remaining five sixths of the population.
Step four: crossover. The population is divided into two groups, [1, mid] and [mid+1, NP], where NP is the population size, and individuals from the two groups are paired in order. Single-point crossover is then performed: a crossover probability Cross_Ratio in [0,1] is randomly generated, and if Cross_Ratio is between the lower limit Cross_L and the upper limit Cross_H of the crossover probability, the two individuals cross; otherwise they do not. During crossover, a crossover point is randomly generated, and the gene positions after the crossover point of the two individuals are exchanged.
Step five: mutation. For each individual in the population, a mutation probability is randomly generated; if it is smaller than the mutation rate Variation_Ratio, the individual mutates, otherwise it does not. During mutation, Variation_Num loci are randomly selected from the individual's binary string, and each selected locus is flipped: a 0 becomes 1 and a 1 becomes 0.
Step six: termination judgment. When the set maximum number of iterations is reached, the iteration terminates.
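Steps one to six might be sketched as the following simplified genetic algorithm over {0,1} bit strings; the elitism fraction, the fixed crossover window, and the single-bit mutation (the invention mutates Variation_Num loci) are simplifying assumptions:

```python
import random

def genetic_select(n_bits, fitness, pop_size=12, n_iter=40,
                   cross_low=0.4, cross_high=0.9, mut_rate=0.1, seed=0):
    """Minimal GA sketch: elitism for the top sixth, roulette-wheel selection,
    single-point crossover within a probability window, single-bit mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)[:]
    for _ in range(n_iter):
        scores = [fitness(ind) for ind in pop]
        # Elitism: copy the fittest sixth of the population forward unchanged.
        order = sorted(range(pop_size), key=lambda i: scores[i], reverse=True)
        nxt = [pop[i][:] for i in order[: max(1, pop_size // 6)]]
        # Roulette-wheel selection fills the rest of the new population.
        total = sum(scores) or 1.0
        while len(nxt) < pop_size:
            r, acc = rng.random() * total, 0.0
            for ind, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    nxt.append(ind[:])
                    break
            else:
                nxt.append(pop[-1][:])
        # Single-point crossover when the drawn probability falls in the window.
        for i in range(0, pop_size - 1, 2):
            if cross_low <= rng.random() <= cross_high:
                pt = rng.randrange(1, n_bits)
                nxt[i][pt:], nxt[i + 1][pt:] = nxt[i + 1][pt:], nxt[i][pt:]
        # Mutation: flip one randomly chosen bit of the individual.
        for ind in nxt:
            if rng.random() < mut_rate:
                j = rng.randrange(n_bits)
                ind[j] ^= 1
        pop = nxt
        cand = max(pop, key=fitness)
        if fitness(cand) > fitness(best):
            best = cand[:]
    return best
```

A bit string returned by genetic_select is the extraction scheme: positions set to 1 mark the features or instances to keep, and the fitness argument would be the J(x) criterion defined below.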
The fitness function mentioned in step three is defined as follows. When the genetic algorithm reduces the data set, individuals are selected according to their fitness values, and the fitness function measures the classification capability of an individual: the larger its value, the better the individual.
J(x) = S_b - S_w
where S_b denotes the inter-class fuzzy distance and S_w the intra-class fuzzy distance. Samples can be separated because they lie in different regions of the feature space: the larger the distance between classes and the smaller the distance between samples within a class, the better the classification effect. The specific calculation is described as follows:
There are various ways to express the fuzzy distance between two patterns, such as the Hamming distance and the Euclidean distance; the Euclidean distance is used here:
d(A, B) = sqrt( sum_i (u_A(x_i) - u_B(x_i))^2 )
When calculating the inter-class distance S_b, u_A(x_i) and u_B(x_i) denote the mean vectors of class A and class B respectively, which can be found by:
c_i = (1 / n_i) * sum_{x_j in w_i} x_j
where w_i ranges over the two categories "severe" and "non-severe", c_i is the class-center feature vector of the i-th class, and class w_i contains n_i sample data.
The inter-class distance between every two classes is calculated using the Euclidean distance, and the distances are summed to obtain S_b.
When calculating the intra-class distance S_w, u_A(x_i) and u_B(x_i) are the feature vectors of two samples A and B within the same class. The intra-class distances between the data in each class are calculated and the intra-class distances of all classes are summed to obtain S_w. Suppose the "severe" class contains m_1 samples and the "non-severe" class contains m_2 samples; S_w is then the sum of the pairwise distances within each of the two classes.
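Under one simple reading of the criterion J(x) = S_b - S_w, with S_b taken as the Euclidean distance between the two class centres and S_w as the summed pairwise intra-class distances, a sketch might be:

```python
import math

def class_mean(samples):
    """Class-center feature vector c_i: the per-dimension mean of the class."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def fitness(severe, non_severe):
    """J = S_b - S_w for the two-class case: inter-class distance between the
    class centres minus the summed pairwise intra-class distances."""
    c1, c2 = class_mean(severe), class_mean(non_severe)
    s_b = math.dist(c1, c2)
    s_w = 0.0
    for group in (severe, non_severe):
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                s_w += math.dist(group[i], group[j])
    return s_b - s_w
```

Tighter, better-separated classes yield a larger J, so the genetic algorithm prefers extraction schemes that produce such data sets.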
S4, learning from the resulting small-scale, high-quality data set with four classification algorithms, namely naive Bayes, multinomial naive Bayes, K-nearest neighbor, and support vector machine, to establish classification models;
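As an illustrative sketch of one of the four classifiers named in S4, a minimal K-nearest-neighbour vote over the reduced matrix might look like this (pure Python; in practice a library implementation such as scikit-learn would be used):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify vector x by majority vote among its k nearest training
    vectors under the Euclidean distance."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

A newly submitted bug report, preprocessed into the same feature space, would be passed as x to obtain its predicted severity (S5).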
and S5, predicting the severity of the newly submitted bug report.
The above description covers only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto. Any equivalent replacement or change that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention, according to the technical solutions and the inventive concept thereof, shall fall within the scope of the present invention.