CN105701013A - Software defect data feature selection method based on mutual information - Google Patents
Software defect data feature selection method based on mutual information
- Publication number
- CN105701013A
- Authority
- CN
- China
- Prior art keywords
- feature
- software
- data
- mutual information
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3604—Software analysis for verifying properties of programs
- G06F11/3616—Software analysis for verifying properties of programs using software metrics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Stored Programmes (AREA)
Abstract
The invention belongs to the field of software engineering and relates to a software defect data feature selection method based on mutual information. The method comprises the steps of: A. obtaining software module data and software feature data from a software data set and preprocessing the data, which includes label processing and feature classification; B. screening for the features most relevant to the class using mutual information theory, then calculating the correlation between features to remove redundant software features, with an imbalance coefficient introduced to account for the class imbalance in the software data; and C. building a classification model from the resulting optimal feature subset, classifying the software modules, and verifying the effectiveness of the selected features. Because redundant software features are removed using mutual information theory, computation is much faster; and because the mutual information is modified to account for the imbalance in the software defect data, the classification accuracy for the minority class improves.
Description
Technical field
The invention belongs to the field of software engineering and relates to a software defect data feature selection method based on mutual information.
Background
Software systems keep growing in scale and logical complexity, and the number of defective modules in software grows with them. This threatens software reliability, degrades software quality, and can cause immeasurable losses. Software defect prediction is an important way to guide and evaluate software testing, but high-dimensional software features increase the time and space complexity of classifying software modules and hinder improvements in classification accuracy.
Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing an optimal feature subset from the full set of features so that the resulting model performs better. In practice the number of features is often large; some features may be irrelevant, and some may depend on one another. This has two consequences. First, the more features there are, the longer it takes to analyze them and train a model. Second, more features make the "curse of dimensionality" more likely, the model becomes more complex, and its generalization ability declines. For example, in 2004 NASA released the NASA MDP software data sets, whose software features, extracted from source code, fall into three main classes: LOC, McCabe, and Halstead. Within each class, only the basic features are extracted directly from the source code; the others are computed indirectly from those basic feature values. Experiments have also shown that only three important software features are needed to predict whether a software module contains defects. Each class of software features therefore contains considerable redundancy. Feature selection can eliminate irrelevant or redundant features, thereby reducing the number of features, improving model accuracy, and reducing running time.
Feature selection methods based on mutual information perform well and run fast, because they do not need a classifier's results to judge the quality of the selected features. However, most current mutual-information-based feature selection methods operate on balanced data sets, and they do not work well on the many imbalanced data sets found in practice. The invention therefore takes software defect data as its subject and proposes a mutual-information-based feature selection method for imbalanced data.
Summary of the invention
The object of the invention is to overcome the class imbalance in software defect data by providing a software defect data feature selection method based on mutual information that selects an optimal feature subset.
To achieve this object, the technical solution of the invention comprises the following three steps:
A. Obtain data from the software data set and preprocess it
(1) Collect the software data, comprising software feature data and software module data, and preprocess it. Divide the software module data into a training set and a test set. The invention uses ten-fold cross-validation: the data set is divided into ten parts, nine of which are used for training and one for accuracy testing.
(2) Classify the feature set according to existing empirical knowledge, obtaining three feature sets: the LOC class, the McCabe class, and the Halstead class (abbreviated L, M, and H).
(3) From the nine parts of training data, obtain the imbalance coefficient p of the software modules using formula (1).
Formula (1)
B. Remove redundant software features using mutual information theory
(1) To account for the imbalance of the data set, introduce the imbalance coefficient and use formula (2) to calculate the relevance μ(f_i) of each feature f_i in the three feature sets to the classes y_1 and y_2. Sort the features in descending order of relevance and keep only the top 70% of each of the three feature sets; t features then remain in total.
μ(f_i) = log p × MI(f_i, y_1) + MI(f_i, y_2)    Formula (2)
where P(f_i) is the total probability that feature f_i occurs across the two classes, and p(f_i|y_1) and p(f_i|y_2) are the probabilities that f_i occurs in each of the two classes.
(2) Again using mutual information theory, calculate the pairwise correlation between the features within each of the three screened feature sets, obtaining feature correlation matrices. According to the minimal-redundancy criterion, formula (3), remove the features that are highly correlated with higher-ranked features, obtaining the final optimal feature subset S, whose size is fixed at k.
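As an illustration of step B(1), the following Python sketch computes the weighted relevance of formula (2) for discretized features and keeps the top 70% of one feature group. The text does not reproduce the expression for MI(f_i, y), so `class_term` is an assumption: it uses the per-class contribution to mutual information, built from the quantities the text names, P(f_i) and p(f_i|y). The natural logarithm, the rounding rule for the 70% cutoff, and all function names are likewise hypothetical. The coefficient p is passed in as an argument because formula (1) is not reproduced here.

```python
import math
from collections import Counter


def class_term(values, labels, cls):
    """Assumed form of MI(f, y): the contribution of class `cls` to I(F; Y)."""
    n = len(values)
    n_cls = sum(1 for y in labels if y == cls)
    value_counts = Counter(values)
    joint_counts = Counter(v for v, y in zip(values, labels) if y == cls)
    total = 0.0
    for v, n_vc in joint_counts.items():
        p_joint = n_vc / n             # p(f = v, y = cls)
        p_cond = n_vc / n_cls          # p(f = v | y = cls)
        p_value = value_counts[v] / n  # P(f = v) over both classes
        total += p_joint * math.log(p_cond / p_value)
    return total


def relevance(values, labels, p, y1=+1, y2=-1):
    """mu(f) = log p * MI(f, y1) + MI(f, y2), as in formula (2)."""
    return math.log(p) * class_term(values, labels, y1) + class_term(values, labels, y2)


def keep_top_fraction(scores, fraction=0.7):
    """Rank feature names by score (descending) and keep the top fraction.

    Rounding down, with at least one feature kept, is an assumption.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, math.floor(len(ranked) * fraction + 1e-9))
    return ranked[:n_keep]
```

In the method as described, `keep_top_fraction` would be applied separately to the L, M, and H groups, and the survivors pooled to give the t remaining features.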
C. Using the selected software features, evaluate the classification performance with an SVM and build the final classification model. For software module classification, the G-mean of the classification results is used to verify the effectiveness and accuracy of the feature selection result. The prediction model is finally built from the optimal feature subset S.
Here A is the number of modules that are actually defective and correctly classified, B is the number that are actually defective but misclassified, C is the number that are actually defect-free but misclassified, and D is the number that are actually defect-free and correctly classified.
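The text does not reproduce the G-mean expression. The sketch below uses the standard definition, the geometric mean of the recall on the defective class and the recall on the defect-free class, written in terms of the counts A, B, C, and D defined above. Whether the patent's own formula matches this exactly is an assumption, and the function name is hypothetical.

```python
import math


def g_mean(a, b, c, d):
    """Geometric mean of the per-class recalls.

    a: defective, classified correctly    b: defective, misclassified
    c: defect-free, misclassified         d: defect-free, classified correctly
    """
    recall_defective = a / (a + b) if (a + b) else 0.0
    recall_clean = d / (c + d) if (c + d) else 0.0
    return math.sqrt(recall_defective * recall_clean)
```

Unlike plain accuracy, this score drops to zero when every defective module is missed, which is why it suits imbalanced defect data.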
Brief description of the drawings
Fig. 1 is a flow chart of the software defect data feature selection method based on mutual information.
Detailed description
The invention is described in further detail below with reference to Fig. 1.
Step 1: Obtain data from the software data set and preprocess it.
(1) First obtain the software feature set and the software module data, and label the data. The feature set is F = {f_1, f_2, …, f_m}. The software module data set is {X, Y}, with X = {x_1, x_2, …, x_n} and Y = {y_1, y_2} = {+1, -1}. If software module x_i is defect-free, then (x_i, y_i) = (x_i, -1); otherwise (x_i, y_i) = (x_i, +1).
(2) Classify the feature set, obtaining three feature sets: the LOC class, the McCabe class, and the Halstead class (abbreviated L, M, and H).
(3) From the nine parts of training data, obtain the imbalance coefficient p of the software modules using the following formula.
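The labeling rule and the ten-part split of Step 1 can be sketched as follows. The text specifies ten-fold cross-validation but not how the parts are drawn, so the stratified assignment (each class spread evenly across the parts) is an assumption, as are the function names and the use of raw defect counts as input.

```python
import random


def label_modules(defect_counts):
    """Label each module +1 if it has any defect, otherwise -1."""
    return [+1 if count > 0 else -1 for count in defect_counts]


def stratified_folds(labels, n_folds=10, seed=0):
    """Deal the indices of each class round-robin into n_folds parts."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in (+1, -1):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists: nine parts to train, one to test."""
    for k, test_indices in enumerate(folds):
        train_indices = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_indices, test_indices
```

Each of the ten splits uses a different part as the test set, so every module is tested exactly once.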
Step 2: Remove redundant software features using mutual information theory.
(1) Use the following formula to calculate the relevance μ(f_i) of each feature f_i in the three feature sets to the classes y_1 and y_2. Sort the features in descending order of relevance and keep only the top 70% of each of the three feature sets; t features then remain in total.
μ(f_i) = log p × MI(f_i, y_1) + MI(f_i, y_2)
where P(f_i) is the total probability that feature f_i occurs across the two classes, and p(f_i|y_1) and p(f_i|y_2) are the probabilities that f_i occurs in each of the two classes.
(2) Using mutual information theory again, calculate the pairwise correlation between the features within each of the three screened feature sets, obtaining feature correlation matrices.
[Feature correlation matrices for the L-class, M-class, and H-class features]
According to the following minimal-redundancy criterion, remove the features that are highly correlated with higher-ranked features, obtaining the final optimal feature subset S, whose size is fixed at k.
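The redundancy-removal step can be illustrated with the sketch below, which builds a pairwise mutual-information matrix for one group of discretized features and then prunes the group down to k features. The minimal-redundancy formula itself is not reproduced in the text, so the greedy rule used here (repeatedly find the most strongly related pair and drop its lower-ranked member) is a stand-in assumption, not the patent's criterion. All names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """I(X; Y) in nats for two equally long sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def correlation_matrix(columns):
    """Symmetric table of pairwise mutual information, keyed by name pairs."""
    matrix = {}
    for a, b in combinations(list(columns), 2):
        matrix[a, b] = matrix[b, a] = mutual_information(columns[a], columns[b])
    return matrix


def prune_redundant(ranked_names, matrix, k):
    """Shrink a relevance-ranked list (best first) to k features.

    Assumed rule: drop the lower-ranked member of the most related pair.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    kept = list(ranked_names)
    while len(kept) > k:
        # combinations yields i < j, so position j is the lower-ranked feature
        i, j = max(combinations(range(len(kept)), 2),
                   key=lambda pair: matrix[kept[pair[0]], kept[pair[1]]])
        del kept[j]
    return kept
```

In the method as described, one such matrix is computed for each of the L, M, and H groups after the 70% screening.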
Step 3: Using the selected software features, evaluate the classification performance with an SVM and build the final classification model.
(1) Input: the optimal feature subset S and the module data. Output: the classification results.
(2) For software defect prediction, the G-mean of the classification results is used to verify the performance of the feature selection result. The prediction model is finally built from the optimal feature subset S.
Here A is the number of modules that are actually defective and correctly classified, B is the number that are actually defective but misclassified, C is the number that are actually defect-free but misclassified, and D is the number that are actually defect-free and correctly classified.
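For concreteness, the four counts can be tallied from the true labels and the classifier's predictions as in the sketch below. It assumes the +1 (defective) and -1 (defect-free) labels from Step 1; the function name is hypothetical and the classifier that produces the predictions is left out.

```python
def confusion_counts(actual, predicted):
    """Return (A, B, C, D) for labels +1 (defective) and -1 (defect-free)."""
    a = b = c = d = 0
    for y_true, y_pred in zip(actual, predicted):
        if y_true == +1:
            if y_pred == +1:
                a += 1  # defective, classified correctly
            else:
                b += 1  # defective, misclassified
        else:
            if y_pred == -1:
                d += 1  # defect-free, classified correctly
            else:
                c += 1  # defect-free, misclassified
    return a, b, c, d
```

Summing these counts over the ten test parts gives one table for the whole cross-validation run.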
The invention provides a software defect data feature selection method based on mutual information. It should be noted that those of ordinary skill in the art may make improvements without departing from the principles of the invention, and such improvements shall also fall within the scope of protection of the invention. Any component not explicitly specified in this embodiment can be implemented using the prior art.
Claims (1)
1. A software defect data feature selection method based on mutual information, characterized in that it mainly comprises the following three steps:
A. Obtain data from the software data set and preprocess it
(1) Collect the software data, comprising software feature data and software module data, and preprocess it; divide the software module data into a training set and a test set; the invention uses ten-fold cross-validation, in which the data set is divided into ten parts, nine used for training and one for accuracy testing;
(2) Classify the feature set according to existing empirical knowledge, obtaining three feature sets: the LOC class, the McCabe class, and the Halstead class (abbreviated L, M, and H);
(3) From the nine parts of training data, obtain the imbalance coefficient p of the software modules using the following formula;
B. Remove redundant software features using mutual information theory
(1) To account for the imbalance of the data set, introduce the imbalance coefficient and use the following formula to calculate the relevance μ(f_i) of each feature f_i in the three feature sets to the classes y_1 and y_2; sort the features in descending order of relevance and keep only the top 70% of each of the three feature sets, after which t features remain in total;
μ(f_i) = log p × MI(f_i, y_1) + MI(f_i, y_2)
where P(f_i) is the total probability that feature f_i occurs across the two classes, and p(f_i|y_1) and p(f_i|y_2) are the probabilities that f_i occurs in each of the two classes;
(2) Again using mutual information theory, calculate the pairwise correlation between the features within each of the three screened feature sets, obtaining feature correlation matrices;
[Feature correlation matrices for the L-class, M-class, and H-class features]
According to the minimal-redundancy criterion, remove the features that are highly correlated with higher-ranked features, obtaining the final optimal feature subset S, whose size is fixed at k;
C. Using the selected software features, evaluate the classification performance with an SVM and build the final classification model; for software defect prediction, the G-mean of the classification results is used to verify the effectiveness and accuracy of the feature selection result; the classification model is finally built from the optimal feature subset S;
Here A is the number of modules that are actually defective and correctly classified, B is the number that are actually defective but misclassified, C is the number that are actually defect-free but misclassified, and D is the number that are actually defect-free and correctly classified.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610004279.XA CN105701013A (en) | 2016-01-04 | 2016-01-04 | Software defect data feature selection method based on mutual information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610004279.XA CN105701013A (en) | 2016-01-04 | 2016-01-04 | Software defect data feature selection method based on mutual information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105701013A (en) | 2016-06-22 |
Family
ID=56226952
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610004279.XA (Pending), CN105701013A (en) | 2016-01-04 | 2016-01-04 | Software defect data feature selection method based on mutual information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105701013A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106201871A (en) * | 2016-06-30 | 2016-12-07 | 重庆大学 | Based on the Software Defects Predict Methods that cost-sensitive is semi-supervised |
CN106201871B (en) * | 2016-06-30 | 2018-10-02 | 重庆大学 | Based on the Software Defects Predict Methods that cost-sensitive is semi-supervised |
CN107273295A (en) * | 2017-06-23 | 2017-10-20 | 中国人民解放军国防科学技术大学 | A kind of software problem reporting sorting technique based on text randomness |
CN107273295B (en) * | 2017-06-23 | 2020-03-20 | 中国人民解放军国防科学技术大学 | Software problem report classification method based on text chaos |
CN107391365A (en) * | 2017-07-06 | 2017-11-24 | 武汉大学 | A kind of hybrid characteristic selecting method of software-oriented failure prediction |
CN107391365B (en) * | 2017-07-06 | 2020-10-13 | 武汉大学 | Mixed feature selection method oriented to software defect prediction |
CN110147321A (en) * | 2019-04-19 | 2019-08-20 | 北京航空航天大学 | A kind of recognition methods of the defect high risk module based on software network |
CN110147321B (en) * | 2019-04-19 | 2020-11-24 | 北京航空航天大学 | Software network-based method for identifying defect high-risk module |
CN113707330A (en) * | 2021-07-30 | 2021-11-26 | 电子科技大学 | Mongolian medicine syndrome differentiation model construction method, system and method |
CN113707330B (en) * | 2021-07-30 | 2023-04-28 | 电子科技大学 | Construction method of syndrome differentiation model of Mongolian medicine, syndrome differentiation system and method of Mongolian medicine |
CN115830235A (en) * | 2022-12-09 | 2023-03-21 | 皖南医学院第一附属医院(皖南医学院弋矶山医院) | Three-dimensional model reconstruction method for room defect image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160622 |