CN105701013A - Software defect data feature selection method based on mutual information - Google Patents

Software defect data feature selection method based on mutual information

Info

Publication number
CN105701013A
Authority
CN
China
Prior art keywords
feature
software
data
mutual information
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610004279.XA
Other languages
Chinese (zh)
Inventor
李克文
邹晶杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China
Priority to CN201610004279.XA
Publication of CN105701013A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3616 Software analysis for verifying properties of programs using software metrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Stored Programmes (AREA)

Abstract

The present invention belongs to the field of software engineering, and specifically relates to a software defect data feature selection method based on mutual information. The method comprises the steps of: A. obtaining software module data and software feature data from a software data set, and preprocessing the data, which comprises label processing and feature classification; B. obtaining, through screening, most class-related features by using a mutual information theory, and then calculating correlation between the features to remove redundant software features, during which an unbalance coefficient is introduced to take imbalance between software data into account; and C. establishing a classification model according to an obtained optimal feature subset, classifying software modules, and verifying effectiveness of the selected features. According to the method provided by the present invention, the redundant software features are removed by using the mutual information theory, so that the calculation speed can be greatly increased, and for the imbalance between the software defect data, mutual information is modified, so as to improve classification accuracy of minority classes.

Description

Software defect data feature selection method based on mutual information
Technical field
The invention belongs to the field of software engineering and specifically relates to a software defect data feature selection method based on mutual information.
Background art
At present, software systems keep growing in scale and in logical complexity, and the number of defective modules in software grows with them. This inevitably threatens software reliability, degrades software quality, and can cause immeasurable losses. Software defect prediction is an important means of guiding and evaluating software testing, but high-dimensional software features increase the time and space complexity of classifying software modules and hinder improvements in classification accuracy.
Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing an optimal feature subset from the full feature set so that the resulting model performs better. In practice, the number of features is often large; some features may be irrelevant, and some may depend on one another. This easily leads to the following consequences: the more features there are, the longer it takes to analyze the features and train the model; and the more features there are, the more likely the "curse of dimensionality" becomes, the more complex the model gets, and the weaker its generalization ability. For example, in 2004 NASA released its software data sets (NASA MDP), in which the software features extracted from source code fall into three main classes: LOC, McCabe, and Halstead. Within each class, only the basic features are extracted directly from the source code; the others are computed indirectly from those basic feature values. Experiments have also shown that as few as three important software features suffice to predict whether a software module contains defects. Each class of software features therefore contains a good deal of redundancy. Feature selection can eliminate irrelevant or redundant features, thereby reducing the number of features, improving model accuracy, and shortening running time.
Feature selection methods based on mutual information perform well because they do not need a classifier's results to judge the quality of the selected features, which makes them fast. However, most current mutual-information-based feature selection methods operate on balanced data sets, and they do not work well on the many imbalanced data sets found in practice. The invention therefore takes software defect data as its object of study and proposes a mutual-information-based feature selection method for imbalanced data.
Summary of the invention
The object of the invention is to overcome the imbalance present in software defect data by providing a software defect data feature selection method based on mutual information that selects an optimal feature subset.
To achieve this object, the technical solution of the invention mainly comprises the following three steps:
A. Obtain data from the software data set and preprocess it
(1) Collect software data, including software feature data and software module data, and preprocess it. Divide the software module data into a training set and a test set for training and testing. The invention adopts ten-fold cross-validation: the data set is divided into ten parts, nine of which are used for training and one for accuracy testing.
(2) Classify the feature set according to existing empirical knowledge, obtaining three feature sets: the LOC class, the McCabe class, and the Halstead class (abbreviated L, M, and H respectively).
(3) From the nine parts of training data, obtain the unbalance coefficient p of the software modules using formula (1).
Formula (1)
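The ten-fold split described in step A(1) can be sketched as follows. This is a minimal illustration rather than the patent's own procedure: the function name `ten_fold_splits`, the shuffling, the fixed seed, and the round-robin assignment of modules to folds are all assumptions, since the text only states that the data set is divided into ten parts with nine used for training and one for testing.

```python
import random

def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffling with a fixed seed is an assumption; the source does not say
    # how modules are assigned to the ten parts.
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, test

# Example: 25 modules give ten (train, test) pairs, each using nine parts for training.
splits = list(ten_fold_splits(25))
```

Each module index appears in exactly one test part, so every module is used once for accuracy testing across the ten rounds.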
B. Remove redundant software features using mutual information theory
(1) Taking the imbalance of the data set fully into account, introduce the unbalance coefficient and compute, according to formula (2), the relevance $\mu(f_i)$ of each feature $f_i$ in the three feature sets to the classes $y_1$ and $y_2$. Sort the features in descending order of relevance and keep only the features ranked in the top 70% by relevance in the three feature sets; at this point, t features remain in total.
$\mu(f_i) = \log p \times MI(f_i, y_1) + MI(f_i, y_2)$    Formula (2)
where $MI(f_i, y_1) = \log \frac{p(f_i \mid y_1)}{p(f_i)}$ and $MI(f_i, y_2) = \log \frac{p(f_i \mid y_2)}{p(f_i)}$. Here $p(f_i)$ is the total probability that feature $f_i$ occurs across the two classes, and $p(f_i \mid y_1)$ and $p(f_i \mid y_2)$ are the probabilities that feature $f_i$ occurs in each of the two classes respectively.
(2) Again using mutual information theory, compute the pairwise correlation between features within each of the three filtered feature sets, obtaining the feature correlation matrices. According to the minimal-redundancy criterion in formula (3), remove the features that are highly correlated with higher-ranked features to obtain the final optimal feature subset S, ensuring that S has size k.
$f_{ij} = \min(f_i, f_j) = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} MI(f_i, f_j)$    Formula (3)
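The relevance score of step B(1) and the top-70% filter can be sketched as below. Several points are assumptions made only for illustration: each feature is treated as a 0/1 "occurs" column (the source does not say how metric values are discretized), natural logarithms are used, the unbalance coefficient `p` is passed in as a plain number because its formula is not reproduced here, and the 70% cut-off is rounded up. The function names are hypothetical.

```python
import math

def pointwise_mi(feature_col, labels, target):
    """log( p(f | y = target) / p(f) ) for a 0/1 'feature occurs' column."""
    p_f = sum(feature_col) / len(feature_col)
    in_class = [f for f, y in zip(feature_col, labels) if y == target]
    p_f_given_y = sum(in_class) / len(in_class)
    # A zero probability would make the logarithm undefined; the source does
    # not discuss smoothing, so none is applied in this sketch.
    return math.log(p_f_given_y / p_f)

def relevance(feature_col, labels, p):
    """mu(f) = log(p) * MI(f, y1) + MI(f, y2), taking y1 = +1 and y2 = -1."""
    return (math.log(p) * pointwise_mi(feature_col, labels, +1)
            + pointwise_mi(feature_col, labels, -1))

def keep_top_fraction(scores, fraction=0.7):
    """Rank feature names by descending score and keep the top fraction."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Rounding up is an assumption; the text only says "top 70%".
    return ranked[:math.ceil(fraction * len(ranked))]

# Toy data: two defective modules (+1) and four defect-free ones (-1).
labels = [+1, +1, -1, -1, -1, -1]
feature_a = [1, 1, 1, 0, 0, 0]
score_a = relevance(feature_a, labels, p=math.e)
kept = keep_top_fraction({"a": 3.0, "b": 1.0, "c": 2.0, "d": 0.5})
```

The weight `log(p)` multiplies only the term for class $y_1$, which is how the formula lets the unbalance coefficient shift the score toward one class.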
C. Using the selected software features, test the classification performance with an SVM and build the final classification model. For software module classification, the Gmeans value of the classification results is used to verify the effectiveness and accuracy of the feature selection result. The prediction model is finally built from the obtained optimal feature subset S.
$Gmeans = \sqrt{\frac{A}{A+B} \times \frac{D}{C+D}}$    Formula (4)
where A is the number of modules that are actually defective and correctly classified, B is the number of modules that are actually defective but misclassified, C is the number of modules that are actually defect-free but misclassified, and D is the number of modules that are actually defect-free and correctly classified.
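As a worked illustration of the Gmeans measure, take hypothetical counts A = 40, B = 10, C = 20, D = 80 (these numbers are invented for the example and are not results from the patent). The calculation assumes the usual geometric-mean form, that is, the square root of the product of the two per-class accuracy rates.

```latex
\frac{A}{A+B} = \frac{40}{50} = 0.8, \qquad
\frac{D}{C+D} = \frac{80}{100} = 0.8, \qquad
Gmeans = \sqrt{0.8 \times 0.8} = 0.8
```

Because the two rates are multiplied, a classifier that labels every module defect-free would score D/(C+D) = 1 but A/(A+B) = 0, giving a Gmeans of 0; this is why the measure suits imbalanced defect data.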
Brief description of the drawings
Fig. 1 is a flow chart of the software defect data feature selection method based on mutual information.
Detailed description of the embodiments
The invention is described in further detail below with reference to Fig. 1.
Step 1: Obtain data from the software data set and preprocess it.
(1) First obtain the software feature set and the software module data, and perform label processing. The feature set is $F = \{f_1, f_2, \ldots, f_m\}$. The software module data set is $\{X, Y\}$, with $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2\} = \{+1, -1\}$. If software module $x_i$ is defect-free, then $(x_i, y_i) = (x_i, -1)$; otherwise, $(x_i, y_i) = (x_i, +1)$.
(2) Classify the feature set, obtaining three feature sets: the LOC class, the McCabe class, and the Halstead class (abbreviated L, M, and H respectively).
(3) From the nine parts of training data, obtain the unbalance coefficient p of the software modules using the following formula.
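The label processing and feature grouping of Step 1 can be sketched as follows. The feature names and their L/M/H assignments in `category_of` are hypothetical placeholders chosen for illustration; the source only states that the grouping follows existing empirical knowledge.

```python
def label_modules(defect_flags):
    """Map each module's defect flag to +1 (defective) or -1 (defect-free)."""
    return [+1 if flag else -1 for flag in defect_flags]

def group_features(feature_names, category_of):
    """Split feature names into the L, M and H groups using a lookup table."""
    groups = {"L": [], "M": [], "H": []}
    for name in feature_names:
        groups[category_of[name]].append(name)
    return groups

# Hypothetical feature names and category assignments, for illustration only.
category_of = {
    "loc_total": "L",
    "loc_blank": "L",
    "cyclomatic_complexity": "M",
    "halstead_volume": "H",
}
labels = label_modules([True, False, False])
groups = group_features(list(category_of), category_of)
```

Keeping the three groups separate matters later, because the pairwise feature correlations are computed within each group.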
Step 2: Remove redundant software features using mutual information theory.
(1) Compute, according to the following formula, the relevance $\mu(f_i)$ of each feature $f_i$ in the three feature sets to the classes $y_1$ and $y_2$. Sort the features in descending order of relevance and keep only the features ranked in the top 70% by relevance in the three feature sets; at this point, t features remain in total.
$\mu(f_i) = \log p \times MI(f_i, y_1) + MI(f_i, y_2)$
where $MI(f_i, y_1) = \log \frac{p(f_i \mid y_1)}{p(f_i)}$ and $MI(f_i, y_2) = \log \frac{p(f_i \mid y_2)}{p(f_i)}$. Here $p(f_i)$ is the total probability that feature $f_i$ occurs across the two classes, and $p(f_i \mid y_1)$ and $p(f_i \mid y_2)$ are the probabilities that feature $f_i$ occurs in each of the two classes respectively.
(2) Again using mutual information theory, compute the pairwise correlation between features within each of the three filtered feature sets, obtaining the feature correlation matrices.
[L-class feature correlation matrix; M-class feature correlation matrix; H-class feature correlation matrix]
According to the following minimal-redundancy criterion, remove the features that are highly correlated with higher-ranked features to obtain the final optimal feature subset S, ensuring that S has size k.
$f_{ij} = \min(f_i, f_j) = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} MI(f_i, f_j)$
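One way to read the redundancy-removal step is sketched below for a single feature group. It is an interpretation, not the patent's stated algorithm: the pairwise score is the standard mutual information between two discrete columns (in nats), and the pruning rule repeatedly drops the lower-ranked member of the most correlated pair until k features remain. The source does not say how k is shared among the L, M, and H groups, so that choice is left open; the function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information between two discrete columns, in nats."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def prune_redundant(ranked, columns, k):
    """Drop the lower-ranked member of the most redundant pair until k remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    selected = list(ranked)  # assumed sorted by descending relevance
    while len(selected) > k:
        # Pairs (i, j) with i < j, so selected[j] is always the lower-ranked one.
        pairs = [(mutual_information(columns[selected[i]], columns[selected[j]]), j)
                 for i in range(len(selected))
                 for j in range(i + 1, len(selected))]
        _, worst = max(pairs)
        del selected[worst]
    return selected

# Toy columns: "b" duplicates "a", while "c" is independent of both.
columns = {"a": [0, 0, 1, 1], "b": [0, 0, 1, 1], "c": [0, 1, 0, 1]}
subset = prune_redundant(["a", "b", "c"], columns, k=2)
```

In the toy data the duplicated feature "b" is removed first, because its mutual information with the higher-ranked "a" is the largest entry in the pairwise matrix.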
Step 3: Using the selected software features, test the classification performance with an SVM and build the final classification model.
(1) Input: the optimal feature subset S and the module data. Output: the classification results.
(2) For software defect prediction, the Gmeans value of the classification results is used to verify the performance of the feature selection result. The prediction model is finally built from the obtained optimal feature subset S.
$Gmeans = \sqrt{\frac{A}{A+B} \times \frac{D}{C+D}}$
where A is the number of modules that are actually defective and correctly classified, B is the number of modules that are actually defective but misclassified, C is the number of modules that are actually defect-free but misclassified, and D is the number of modules that are actually defect-free and correctly classified.
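The evaluation in Step 3 can be sketched as follows, starting from true and predicted labels rather than from a trained SVM. The helper names are hypothetical, the labels follow the +1 (defective) and -1 (defect-free) convention from Step 1, and the Gmeans value is assumed to be the usual geometric mean, that is, the square root of the product of the two per-class rates.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (A, B, C, D) as defined in the text, with +1 = defective."""
    pairs = list(zip(y_true, y_pred))
    a = sum(1 for t, p in pairs if t == +1 and p == +1)  # defective, classified correctly
    b = sum(1 for t, p in pairs if t == +1 and p == -1)  # defective, misclassified
    c = sum(1 for t, p in pairs if t == -1 and p == +1)  # defect-free, misclassified
    d = sum(1 for t, p in pairs if t == -1 and p == -1)  # defect-free, classified correctly
    return a, b, c, d

def g_mean(a, b, c, d):
    """Geometric mean of the accuracy on defective and defect-free modules."""
    return math.sqrt(a / (a + b) * d / (c + d))

# Toy labels: four defective and four defect-free modules, one error in each class.
y_true = [+1, +1, +1, +1, -1, -1, -1, -1]
y_pred = [+1, +1, +1, -1, -1, -1, -1, +1]
score = g_mean(*confusion_counts(y_true, y_pred))
```

The predicted labels would normally come from the classifier trained on the selected feature subset S; any SVM implementation can supply them.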
The invention provides a software defect data feature selection method based on mutual information. It should be noted that those of ordinary skill in the art can make a number of improvements without departing from the principles of the invention, and such improvements shall also be regarded as falling within the scope of protection of the invention. Any component not explicitly described in this embodiment can be implemented using the prior art.

Claims (1)

1. A software defect data feature selection method based on mutual information, characterized in that it mainly comprises the following three steps:
A. Obtain data from the software data set and preprocess it
(1) Collect software data, including software feature data and software module data, and preprocess it; divide the software module data into a training set and a test set for training and testing; ten-fold cross-validation is adopted, in which the data set is divided into ten parts, nine of which are used for training and one for accuracy testing;
(2) Classify the feature set according to existing empirical knowledge, obtaining three feature sets: the LOC class, the McCabe class, and the Halstead class (abbreviated L, M, and H respectively);
(3) From the nine parts of training data, obtain the unbalance coefficient p of the software modules using the following formula;
B. Remove redundant software features using mutual information theory
(1) Taking the imbalance of the data set fully into account, introduce the unbalance coefficient and compute, according to the following formula, the relevance $\mu(f_i)$ of each feature $f_i$ in the three feature sets to the classes $y_1$ and $y_2$; sort the features in descending order of relevance and keep only the features ranked in the top 70% by relevance in the three feature sets; at this point, t features remain in total;
$\mu(f_i) = \log p \times MI(f_i, y_1) + MI(f_i, y_2)$
where $MI(f_i, y_1) = \log \frac{p(f_i \mid y_1)}{p(f_i)}$ and $MI(f_i, y_2) = \log \frac{p(f_i \mid y_2)}{p(f_i)}$; here $p(f_i)$ is the total probability that feature $f_i$ occurs across the two classes, and $p(f_i \mid y_1)$ and $p(f_i \mid y_2)$ are the probabilities that feature $f_i$ occurs in each of the two classes respectively;
(2) Again using mutual information theory, compute the pairwise correlation between features within each of the three filtered feature sets, obtaining the feature correlation matrices;
[L-class feature correlation matrix; M-class feature correlation matrix; H-class feature correlation matrix]
According to the minimal-redundancy criterion, remove the features that are highly correlated with higher-ranked features to obtain the final optimal feature subset S, ensuring that S has size k;
$f_{ij} = \min(f_i, f_j) = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} MI(f_i, f_j)$
C. Using the selected software features, test the classification performance with an SVM and build the final classification model; for software defect prediction, the Gmeans value of the classification results is used to verify the effectiveness and accuracy of the feature selection result; the classification model is finally built from the obtained optimal feature subset S;
$Gmeans = \sqrt{\frac{A}{A+B} \times \frac{D}{C+D}}$
where A is the number of modules that are actually defective and correctly classified, B is the number of modules that are actually defective but misclassified, C is the number of modules that are actually defect-free but misclassified, and D is the number of modules that are actually defect-free and correctly classified.
CN201610004279.XA 2016-01-04 2016-01-04 Software defect data feature selection method based on mutual information Pending CN105701013A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610004279.XA CN105701013A (en) 2016-01-04 2016-01-04 Software defect data feature selection method based on mutual information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610004279.XA CN105701013A (en) 2016-01-04 2016-01-04 Software defect data feature selection method based on mutual information

Publications (1)

Publication Number Publication Date
CN105701013A true CN105701013A (en) 2016-06-22

Family

ID=56226952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610004279.XA Pending CN105701013A (en) 2016-01-04 2016-01-04 Software defect data feature selection method based on mutual information

Country Status (1)

Country Link
CN (1) CN105701013A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160622
