CN109934286B - Bug report severity recognition method based on text feature extraction and imbalance processing strategy - Google Patents

Publication number
CN109934286B
CN109934286B CN201910183106.2A CN201910183106A CN109934286B CN 109934286 B CN109934286 B CN 109934286B CN 201910183106 A CN201910183106 A CN 201910183106A CN 109934286 B CN109934286 B CN 109934286B
Authority
CN
China
Prior art keywords
bug
data set
severity
bug report
individual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910183106.2A
Other languages
Chinese (zh)
Other versions
CN109934286A (en)
Inventor
郭世凯
陈荣
魏苗苗
张佳丽
李辉
唐文君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University
Priority to CN201910183106.2A
Publication of CN109934286A
Application granted
Publication of CN109934286B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for identifying the severity of bug reports based on text feature extraction and an imbalance processing strategy. The method applies imbalance processing to a bug report data set and then performs text feature extraction based on a genetic algorithm, so that the resulting classification model is not biased when classifying a newly submitted bug report, its fit tends to be balanced, and the limitations that arise during classification are avoided. The extraction operation can select features and instances simultaneously, yielding a smaller, higher-quality data set. This improves the accuracy of bug report severity identification, saves the time and labor cost of that identification, improves working efficiency, and lets developers handle bug reports with higher severity first.

Description

Bug report severity identification method based on text feature extraction and imbalance processing strategy
Technical Field
The invention relates to the technical field of data processing, in particular to a Bug report severity identification method based on text feature extraction and an imbalance processing strategy.
Background
As software applications gain more functions and grow in scale, the logic within each application becomes more complex, and software defects inevitably increase. Mozilla, for example, may receive an average of 128.29 bug reports per day. As the number of bug reports grows, manually assigning them for repair becomes increasingly burdensome. Because submitted bug reports differ in severity, those with higher severity should be handled first.
Analysis of bug report data reveals two challenges in identifying bug report severity. First, the bug reports in a bug repository are usually imbalanced: high-severity reports account for only a small fraction of all reports, and existing work often sacrifices performance on this minority class in order to maximize overall accuracy. Second, bug repositories are large in scale and low in quality: many bug reports are submitted every day, they are written in natural language, and submitters differ in how they understand and express a problem, so the data contain considerable noise, which degrades classification performance. Existing research therefore lacks a method that both balances the data set and reduces and denoises it.
Disclosure of Invention
To address the problems in the prior art, the invention discloses a bug report severity identification method based on text feature extraction and an imbalance processing strategy, which comprises the following steps:
S1: Collect repaired bugs from a bug repository as the original data set and preprocess it as follows: remove invalid bug reports, extract the text of the retained bug reports, convert the text into a text vector matrix using a text preprocessing method, remove low-frequency words from the bug report descriptions, and label the severity of each bug report.
S2: To address the class imbalance of the original data, apply four imbalance processing strategies, namely cost matrix, random undersampling, random oversampling, and synthetic minority oversampling, to the text matrix of the original data set to obtain a balanced data set.
S3: To address the two sources of noise in the balanced data set, namely the noise inherent in the original data set and the noise newly introduced by the balancing operation, combine a genetic algorithm with feature extraction, instance extraction, and simultaneous feature and instance extraction to reduce the data set, and use the reduced data set matrix as the final training set.
S4: Train on the small, high-quality data set with four classification algorithms (naive Bayes, multinomial naive Bayes, K-nearest neighbors, and support vector machine) to build a classification model.
S5: Predict the severity of newly submitted bug reports.
S1 specifically adopts the following method:
S11: Remove bug reports whose state is normal or enhancement from the original data set.
S12: Set the severity of bug reports labeled major, critical, or blocker to severe, and the severity of bug reports labeled trivial or minor to non-severe.
S13: Extract the short description and long description of each retained bug report as its description information, and apply word segmentation, stop-word removal, and stemming to the description information of each bug report to form a text matrix.
S14: Delete the feature columns corresponding to low-frequency words from the text matrix, keeping only the feature columns with high word frequency.
S3 specifically adopts the following method:
S31: Initialize a population of NP individuals, where each individual represents one extraction scheme for the data set.
S32: Calculate the fitness function value of each individual in the population, and record the extraction scheme of the individual with the largest value.
S33: Apply selection, crossover, and mutation operations to the binary strings of the individuals in the population.
S34: Take the NP individuals produced by S33 as the new population, calculate the fitness value of each individual in it, select the individual with the largest fitness value, compare that value with the one recorded in S32, and record the extraction scheme of the individual with the larger value.
S35: Return to S32 and iterate until the set maximum number of iterations is reached, then stop the algorithm and take the extraction scheme of the recorded individual with the largest fitness value as the final extraction scheme for the data set.
With the above technical scheme, the bug report severity identification method provided by the invention applies imbalance processing and genetic-algorithm-based text feature extraction to the bug report data set, so that the resulting classification model is not biased when classifying a newly submitted bug report, its fit tends to be balanced, and the limitations that arise during classification are avoided. The extraction operation can select features and instances simultaneously, yielding a smaller, higher-quality data set. This improves the accuracy of bug report severity identification, saves the time and labor cost of that identification, improves working efficiency, and lets developers handle bug reports with higher severity first.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the description below are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In order to make the technical solutions and advantages of the present invention clearer, the following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the drawings in the embodiments of the present invention:
As shown in FIG. 1, a bug report severity identification method based on text feature extraction and an imbalance processing strategy specifically includes the following steps:
S1: Collect repaired bugs from a bug repository as the original data set and preprocess it as follows: remove invalid bug reports, extract the text of the retained bug reports, convert the text into a text vector matrix using a text preprocessing method, remove low-frequency words from the bug report descriptions, and label the severity of each bug report.
Further, the preprocessing of the original data set specifically comprises the following steps:
S11: Remove bug reports whose state is normal or enhancement from the original data set.
S12: Set the severity of bug reports labeled major, critical, blocker, and the like to "severe", and the severity of bug reports labeled trivial, minor, and the like to "non-severe".
S13: Extract the short description and long description of each retained bug report as its description information, and apply word segmentation, stop-word removal, and stemming to the description information of each bug report to form a text matrix. Each row of the matrix represents a bug report and each column represents a word; $s_{ij}$ is the number of occurrences of the $j$-th word in the $i$-th bug report, where $i \in [1, M]$ and $j \in [1, N]$. The matrix $M_{BR}$ is shown below, where $M$ is the total number of samples in the data set and $N$ is the total number of distinct words.
$$M_{BR} = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1N} \\ s_{21} & s_{22} & \cdots & s_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ s_{M1} & s_{M2} & \cdots & s_{MN} \end{bmatrix}$$
S14: Delete the feature columns corresponding to low-frequency words from the text matrix, keeping only the feature columns with high word frequency.
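Steps S11 to S14 can be illustrated with a minimal sketch. It is not the patented implementation: the stop-word list, the `min_count` threshold, and the function names are assumptions made for the example, and stemming is omitted to keep the code dependency-free.

```python
import re
from collections import Counter

# Label groups follow S11-S12; the stop-word list is a tiny placeholder (assumption).
SEVERE = {"major", "critical", "blocker"}
NON_SEVERE = {"trivial", "minor"}
STOP_WORDS = {"the", "a", "an", "is", "on", "in", "when", "to", "of", "and"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_matrix(reports, min_count=2):
    """reports: list of (label, description). Returns (vocab, rows, labels)."""
    # S11: reports labeled normal or enhancement are dropped.
    kept = [(sev, tokenize(desc)) for sev, desc in reports
            if sev in SEVERE or sev in NON_SEVERE]
    # S12: 1 = severe, 0 = non-severe.
    labels = [1 if sev in SEVERE else 0 for sev, _ in kept]
    # S14: keep only words that occur at least min_count times overall.
    totals = Counter(w for _, toks in kept for w in toks)
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    # S13: rows[i][j] is s_ij, the count of word j in report i.
    rows = []
    for _, toks in kept:
        counts = Counter(toks)
        rows.append([counts[w] for w in vocab])
    return vocab, rows, labels

reports = [
    ("critical", "Crash when saving the file"),
    ("minor", "Typo in the save dialog"),
    ("normal", "Save is slow"),
    ("blocker", "Crash on startup"),
]
vocab, rows, labels = build_matrix(reports)
```

In this toy input only "crash" appears twice among the retained reports, so it is the only column that survives the low-frequency filter.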
S2: To address the class imbalance of the original data, apply four imbalance processing strategies, namely cost matrix, random undersampling, random oversampling, and synthetic minority oversampling, to the text matrix of the original data set to obtain a balanced data set. The strategy that works best on the specific data set can be selected. The strategies are described below.
Cost matrix. Sampling algorithms address learning from imbalanced data at the data level, whereas methods that address it at the algorithm level are mainly based on cost-sensitive learning, whose core element is the cost matrix.
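As an illustration of how a cost matrix can influence a decision, the sketch below picks the label with the lowest expected cost. The text does not give cost values, so the numbers in `COST` and the function name `min_cost_label` are hypothetical.

```python
# Hypothetical cost matrix: COST[true][predicted]; 1 = severe, 0 = non-severe.
COST = [[0.0, 1.0],   # true non-severe: a false alarm costs 1
        [5.0, 0.0]]   # true severe: a missed severe report costs 5

def min_cost_label(p_severe, cost=COST):
    """Return the label with the lowest expected cost given P(severe)."""
    probs = [1.0 - p_severe, p_severe]
    expected = [sum(probs[t] * cost[t][pred] for t in (0, 1)) for pred in (0, 1)]
    return 0 if expected[0] <= expected[1] else 1
```

With these assumed costs, a report is labeled severe once its estimated probability of being severe exceeds 1/6, rather than the 1/2 threshold implied by equal costs.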
Random undersampling balances the class distribution by randomly eliminating majority-class samples until the majority and minority classes are balanced, after which classification proceeds.
Random oversampling increases the number of minority-class instances by randomly replicating them until the data are balanced, after which classification proceeds.
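The two random sampling strategies can be sketched as follows, assuming binary labels (1 = severe, 0 = non-severe) and at least one sample in each class. The function names are illustrative.

```python
import random

def split_by_class(rows, labels):
    """Group rows by their binary label."""
    groups = {0: [], 1: []}
    for row, lab in zip(rows, labels):
        groups[lab].append(row)
    return groups

def random_undersample(rows, labels, rng):
    """Randomly drop majority-class rows until both classes have equal size."""
    groups = split_by_class(rows, labels)
    n = min(len(g) for g in groups.values())
    pairs = [(row, lab) for lab, g in groups.items() for row in rng.sample(g, n)]
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [l for _, l in pairs]

def random_oversample(rows, labels, rng):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    groups = split_by_class(rows, labels)
    n = max(len(g) for g in groups.values())
    pairs = []
    for lab, g in groups.items():
        extra = [rng.choice(g) for _ in range(n - len(g))]
        pairs.extend((row, lab) for row in g + extra)
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [l for _, l in pairs]
```

Passing a seeded `random.Random` instance as `rng` makes the resampling reproducible.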
Synthetic minority oversampling. The basic idea of this algorithm is to analyze the minority-class samples and artificially synthesize new samples from them, which are then added to the data set. The algorithm proceeds as follows:
(1) For each sample $x$ in the minority class, compute the Euclidean distance from $x$ to every sample in the minority-class sample set $S_{min}$ and obtain the $k$ nearest neighbors of $x$.
(2) Set a sampling ratio according to the sample imbalance ratio to determine the sampling rate $N$, and for each minority-class sample $x$ randomly select several samples from its $k$ nearest neighbors; denote a selected neighbor by $x_n$.
(3) For each randomly selected neighbor $x_n$, construct a new sample according to the following equation:
$$x_{new} = x + \mathrm{rand}(0,1) \cdot |x - x_n|$$
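The three steps above can be sketched as below. One deliberate difference is an assumption of this example: the sketch uses the usual interpolation $x + r\,(x_n - x)$, which places the new point on the segment between $x$ and its neighbor, whereas the equation in the text is written with $|x - x_n|$. The function name `smote` and the parameter `rate` (samples generated per minority sample) are illustrative.

```python
import math
import random

def smote(minority, k, rate, rng):
    """Generate `rate` synthetic samples for every minority-class row."""
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: k nearest minority neighbors of x by Euclidean distance.
        others = [m for j, m in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda m: math.dist(x, m))[:k]
        for _ in range(rate):
            # Step 2: pick one neighbor x_n at random.
            x_n = rng.choice(neighbors)
            # Step 3: interpolate between x and x_n (assumed form, see lead-in).
            r = rng.random()
            synthetic.append([a + r * (b - a) for a, b in zip(x, x_n)])
    return synthetic
```

The sketch needs at least two minority samples and Python 3.8 or later for `math.dist`.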
S3: To address the two sources of noise in the balanced data set, namely the noise inherent in the original data set and the noise newly introduced by the balancing operation, a small, high-quality data set is obtained through three processing modes: feature extraction, instance extraction, and simultaneous extraction of features and instances. The specific steps are as follows:
Step one: gene encoding. The feature sequence of the data set is represented as a 1 × n vector (n is the total number of features), and a selected feature combination is represented by a {0,1} binary string, where 0 means the corresponding feature is not selected and 1 means it is selected. Similarly, the instance sequence of the data set is represented as a 1 × m vector (m is the total number of instances), and a selected instance combination is represented by a {0,1} binary string, where 0 means the corresponding instance is not selected and 1 means it is selected.
Step two: population initialization. When the genetic algorithm performs only feature extraction on the data set, the initial population is generated as follows: several basic feature extraction algorithms (such as IG, CHI, OneR, and Relief) each rank the features of the data set from most to least important, and the top 10%, 20%, ..., 90% of each ranked feature set are taken, giving the NP individuals of the initial population. When the genetic algorithm performs only instance extraction, the initial population is generated as follows: for each gene position of each individual in the population, a random number in [0,1] is generated; if it is greater than or equal to 0.5, the corresponding bit of the individual is set to 1, otherwise it is set to 0. When the genetic algorithm extracts features and instances simultaneously, the binary strings initially generated for feature-only extraction and those initially generated for instance-only extraction are combined to obtain the individuals of the initial population.
Step three: selection. The fitness of each individual in the population is calculated according to the fitness function. The individual with the largest fitness value is copied into one sixth of the next generation (retaining the elite genes), and roulette-wheel selection is then used to generate the remaining five sixths of the population.
Step four: crossover. The individuals of the population are divided into two groups, [1, mid] and [mid+1, NP], where NP is the population size, and the individuals in the groups are paired in sequence. Single-point crossover is then applied: a crossover probability Cross_Ratio in [0,1] is randomly generated; if Cross_Ratio is greater than or equal to the lower limit of the crossover probability (Cross_L) and less than or equal to the upper limit (Cross_H), the two individuals exchange gene positions; otherwise they do not cross. For a crossover, a crossover point is randomly generated and the gene positions after that point are exchanged between the two individuals.
Step five: mutation. For each individual in the population, a mutation probability is randomly generated; if it is smaller than the mutation rate (Variation_Ratio), the individual mutates; otherwise it does not. To mutate, Variation_Num gene positions are randomly selected from the individual's binary string and flipped: a 0 becomes 1 and a 1 becomes 0.
Step six: termination check. The iteration terminates when the maximum number of iterations N is reached.
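Steps three to six can be sketched as a small loop over binary-string individuals. This is a sketch under stated assumptions, not the patented implementation: the fitness function is passed in by the caller, individual i of the first group is paired with individual i of the second group, roulette weights are shifted so they stay positive, and the default parameter values are placeholders. Individuals are assumed to have at least two gene positions.

```python
import random

def select(population, fitness, rng):
    """Elite copies fill one sixth of the slots; roulette wheel fills the rest."""
    scores = [fitness(ind) for ind in population]
    best = population[scores.index(max(scores))]
    n_elite = max(1, len(population) // 6)
    low = min(scores)
    weights = [s - low + 1e-9 for s in scores]  # shifted so every weight is positive
    rest = rng.choices(population, weights=weights, k=len(population) - n_elite)
    return [best[:] for _ in range(n_elite)] + [ind[:] for ind in rest]

def crossover(population, cross_l, cross_h, rng):
    """Single-point crossover between paired individuals of the two halves."""
    mid = len(population) // 2
    for a, b in zip(population[:mid], population[mid:]):
        if cross_l <= rng.random() <= cross_h:
            point = rng.randrange(1, len(a))
            a[point:], b[point:] = b[point:], a[point:]

def mutate(population, variation_ratio, variation_num, rng):
    """Flip variation_num random gene positions in each individual chosen to mutate."""
    for ind in population:
        if rng.random() < variation_ratio:
            for pos in rng.sample(range(len(ind)), variation_num):
                ind[pos] = 1 - ind[pos]

def run_ga(population, fitness, max_iter, rng,
           cross_l=0.0, cross_h=0.8, variation_ratio=0.1, variation_num=1):
    """Iterate selection, crossover, and mutation; return the best scheme seen."""
    best = max(population, key=fitness)[:]
    for _ in range(max_iter):
        population = select(population, fitness, rng)
        crossover(population, cross_l, cross_h, rng)
        mutate(population, variation_ratio, variation_num, rng)
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate[:]
    return best
```

Because the best scheme is recorded separately from the population, the returned individual is never worse than the best individual of the initial population.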
The fitness function mentioned in step three is defined as follows.
When the genetic algorithm is used to reduce the data set, individuals are selected according to their fitness function values, which measure the classification ability of an individual. The larger the function value, the better the individual. The fitness function is defined as:
$$J(x) = S_b - S_w$$
where $S_b$ is the inter-class fuzzy distance and $S_w$ is the intra-class fuzzy distance. Samples can be separated because they lie in different regions of the feature space: the larger the distance between classes and the smaller the distance between samples within a class, the better the classification. The calculation is described below.
The fuzzy distance between two patterns can be expressed in several ways, such as the Hamming distance or the Euclidean distance. The Euclidean distance is used here:
$$d(A, B) = \sqrt{\sum_{i=1}^{n} \left( u_A(x_i) - u_B(x_i) \right)^2}$$
When calculating the inter-class distance $S_b$, $u_A(x_i)$ and $u_B(x_i)$ are the mean vectors of class A and class B, which are obtained by:
$$c_i = \frac{1}{n_i} \sum_{x \in w_i} x$$
where $w_i$ denotes one of the two classes, "severe" and "non-severe", $c_i$ is the class-center feature vector of the $i$-th class, and $n_i$ is the number of samples in class $w_i$.
The Euclidean distance is computed between every pair of classes, and the results are summed to obtain $S_b$.
When calculating the intra-class distance $S_w$, $u_A(x_i)$ and $u_B(x_i)$ are the feature vectors of sample A and sample B within the same class. The intra-class distances between the data in each class are calculated, and the intra-class distances of the classes are added to obtain $S_w$. The definition is as follows:
[Equation image BDA0001991935810000063: definition of $S_w$]
Assuming the "severe" class contains $m_1$ samples, its intra-class distance is obtained by:
[Equation image BDA0001991935810000064]
Assuming the "non-severe" class contains $m_2$ samples, its intra-class distance is obtained by:
[Equation image BDA0001991935810000065]
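The following sketch shows one way to evaluate $J = S_b - S_w$ for an individual. Because the exact intra-class formulas are not legible in this text, the sketch makes an assumption: $S_w$ is taken as the mean pairwise Euclidean distance within each class, summed over the two classes. The helper `reduce_dataset`, which applies the feature and instance bit strings from step one, is also illustrative. Both classes are assumed to be non-empty after reduction.

```python
import math
from itertools import combinations

def reduce_dataset(rows, labels, feature_mask, instance_mask):
    """Keep only the rows and columns whose mask bit is 1."""
    kept = [([v for v, f in zip(row, feature_mask) if f], lab)
            for row, lab, keep in zip(rows, labels, instance_mask) if keep]
    return [r for r, _ in kept], [l for _, l in kept]

def centroid(rows):
    """Class-center vector: the per-feature mean of the rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fitness(rows, labels):
    """J = S_b - S_w for two classes (1 = severe, 0 = non-severe)."""
    groups = [[r for r, l in zip(rows, labels) if l == c] for c in (0, 1)]
    # S_b: Euclidean distance between the two class centers.
    s_b = math.dist(centroid(groups[0]), centroid(groups[1]))
    # S_w (assumed form): mean pairwise distance inside each class, summed.
    s_w = 0.0
    for g in groups:
        pairs = list(combinations(g, 2))
        if pairs:
            s_w += sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return s_b - s_w
```

In a small example where the third feature is noise, masking it out raises the fitness value, which is the behavior the selection step relies on.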
S4: Train on the small, high-quality data set with four classification algorithms (naive Bayes, multinomial naive Bayes, K-nearest neighbors, and support vector machine) to build a classification model.
S5: Predict the severity of newly submitted bug reports.
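To make S4 and S5 concrete, here is a minimal sketch of one of the four classifiers, multinomial naive Bayes, trained on word-count rows. It uses Laplace smoothing with `alpha=1.0`, which is an assumption of this example rather than a setting stated in the text, and the function names are illustrative.

```python
import math

def train_mnb(rows, labels, alpha=1.0):
    """Fit multinomial naive Bayes on word-count rows; returns per-class log terms."""
    model = {}
    n_features = len(rows[0])
    for c in set(labels):
        class_rows = [r for r, l in zip(rows, labels) if l == c]
        word_totals = [sum(col) for col in zip(*class_rows)]
        denom = sum(word_totals) + alpha * n_features
        model[c] = (
            math.log(len(class_rows) / len(rows)),                # log prior
            [math.log((t + alpha) / denom) for t in word_totals]  # log likelihoods
        )
    return model

def predict_mnb(model, row):
    """Return the class with the highest posterior log score for a count row."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(x * ll for x, ll in zip(row, log_like))
    return max(model, key=score)
```

A new bug report would be tokenized with the same vocabulary as the training matrix, turned into a count row, and passed to `predict_mnb` to obtain 1 (severe) or 0 (non-severe).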
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed herein, according to the technical solutions and inventive concept of the present invention, shall fall within the scope of protection of the present invention.

Claims (3)

1. A bug report severity identification method based on text feature extraction and an imbalance processing strategy, characterized by comprising the following steps:
S1: collecting repaired bugs from a bug repository as an original data set, and preprocessing the original data set as follows: removing invalid bug reports, extracting the text of the retained bug reports, converting the text into a text vector matrix using a text preprocessing method, removing low-frequency words from the bug report descriptions, and labeling the severity of each bug report;
S2: applying four imbalance processing strategies, namely cost matrix, random undersampling, random oversampling, and synthetic minority oversampling, to the text matrix of the original data set to obtain a balanced data set;
S3: reducing the data set by combining a genetic algorithm with feature extraction, instance extraction, and simultaneous feature and instance extraction, and taking the reduced data set matrix as the final training set;
S4: modeling the reduced and balanced training set matrix with four classification algorithms, namely naive Bayes, multinomial naive Bayes, K-nearest neighbors, and support vector machine, and selecting the classifier with the best prediction performance;
S5: predicting the severity of newly submitted bug reports.
2. The bug report severity identification method based on text feature extraction and an imbalance processing strategy of claim 1, further characterized in that S1 specifically adopts the following method:
S11: removing bug reports whose state is normal or enhancement from the original data set;
S12: setting the severity of bug reports whose state is major, critical, or blocker to severe, and the severity of bug reports whose state is trivial or minor to non-severe;
S13: extracting the short description and long description of each retained bug report as its description information, and applying word segmentation, stop-word removal, and stemming to the description information of each bug report to form a text matrix;
S14: deleting the feature columns corresponding to low-frequency words from the text matrix, keeping only the feature columns with high word frequency.
3. The bug report severity identification method based on text feature extraction and an imbalance processing strategy of claim 1, further characterized in that S3 specifically adopts the following method:
S31: initializing a population of NP individuals, wherein each individual represents one extraction scheme for the data set;
S32: calculating the fitness function value of each individual in the population, and recording the extraction scheme of the individual with the largest value;
S33: applying selection, crossover, and mutation operations to the binary strings of the individuals in the population;
S34: taking the NP individuals produced by S33 as the new population, calculating the fitness value of each individual in it, selecting the individual with the largest fitness value, comparing that value with the one recorded in S32, and recording the extraction scheme of the individual with the larger value;
S35: returning to S32 and iterating until the set maximum number of iterations is reached, then stopping the algorithm and taking the extraction scheme of the recorded individual with the largest fitness value as the final extraction scheme for the data set.
CN201910183106.2A 2019-03-12 2019-03-12 Bug report severity recognition method based on text feature extraction and imbalance processing strategy Active CN109934286B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910183106.2A CN109934286B (en) 2019-03-12 2019-03-12 Bug report severity recognition method based on text feature extraction and imbalance processing strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910183106.2A CN109934286B (en) 2019-03-12 2019-03-12 Bug report severity recognition method based on text feature extraction and imbalance processing strategy

Publications (2)

Publication Number Publication Date
CN109934286A CN109934286A (en) 2019-06-25
CN109934286B true CN109934286B (en) 2022-11-11

Family

ID=66986759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910183106.2A Active CN109934286B (en) 2019-03-12 2019-03-12 Bug report severity recognition method based on text feature extraction and imbalance processing strategy

Country Status (1)

Country Link
CN (1) CN109934286B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287124B (en) * 2019-07-03 2023-04-25 Dalian Maritime University Method for automatically marking software error report and carrying out severity identification
CN110413792B (en) * 2019-08-08 2022-10-21 Dalian Maritime University High-influence defect report identification method
CN110471856A (en) * 2019-08-21 2019-11-19 Dalian Maritime University Software defect prediction method based on data imbalance

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6240804B1 (en) * 2017-04-13 2017-11-29 Dalian University Filtered feature selection algorithm based on improved information measurement and GA
CN109213865A (en) * 2018-09-14 2019-01-15 Dalian Maritime University Software bug report classification system and classification method
CN109255029A (en) * 2018-09-05 2019-01-22 Dalian Maritime University Method for enhancing automatic bug report assignment using a weighted optimized training set

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Classification-based software defect severity prediction; Wang Jingyu et al.; Computer & Digital Engineering; 2016-08-20 (No. 08); full text *

Also Published As

Publication number Publication date
CN109934286A (en) 2019-06-25

Similar Documents

Publication Publication Date Title
WO2021164382A1 (en) Method and apparatus for performing feature processing for user classification model
CN109934286B (en) Bug report severity recognition method based on text feature extraction and imbalance processing strategy
CN107169504B (en) A kind of hand-written character recognition method based on extension Non-linear Kernel residual error network
WO2019179403A1 (en) Fraud transaction detection method based on sequence width depth learning
CN107392919B (en) Adaptive genetic algorithm-based gray threshold acquisition method and image segmentation method
CN109491914B (en) High-impact defect report prediction method based on unbalanced learning strategy
CN110046634B (en) Interpretation method and device of clustering result
CN110188827B (en) Scene recognition method based on convolutional neural network and recursive automatic encoder model
CN112215278B (en) Multi-dimensional data feature selection method combining genetic algorithm and dragonfly algorithm
CN111833310B (en) Surface defect classification method based on neural network architecture search
CN114118369B (en) Image classification convolutional neural network design method based on group intelligent optimization
CN112036476A (en) Data feature selection method and device based on two-classification service and computer equipment
WO2024036709A1 (en) Anomalous data detection method and apparatus
CN114328048A (en) Disk fault prediction method and device
CN111767216B (en) Cross-version depth defect prediction method capable of relieving class overlap problem
CN111723856A (en) Image data processing method, device and equipment and readable storage medium
CN112418320A (en) Enterprise association relation identification method and device and storage medium
CN110796260B (en) Neural network model optimization method based on class expansion learning
CN115812210A (en) Method and apparatus for enhancing performance of machine learning classification tasks
CN115344693A (en) Clustering method based on fusion of traditional algorithm and neural network algorithm
CN114373092A (en) Progressive training fine-grained vision classification method based on jigsaw arrangement learning
CN112488188A (en) Feature selection method based on deep reinforcement learning
CN110020675A (en) A kind of dual threshold AdaBoost classification method
JP2011257805A (en) Information processing device, method and program
CN114862404A (en) Credit card fraud detection method and device based on cluster samples and limit gradients

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Guo Shikai

Inventor after: Chen Rong

Inventor after: Wei Miaomiao

Inventor after: Zhang Jiali

Inventor after: Li Hui

Inventor after: Tang Wenjun

Inventor before: Chen Rong

Inventor before: Wei Miaomiao

Inventor before: Zhang Jiali

Inventor before: Li Hui

Inventor before: Guo Shikai

Inventor before: Tang Wenjun

GR01 Patent grant