CN105389598A - Feature selecting and classifying method for software defect data - Google Patents

Info

Publication number
CN105389598A
Authority
CN
China
Prior art keywords
feature
software
data
triple
optimal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201511003241.2A
Other languages
Chinese (zh)
Inventor
李克文
邹晶杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China
Priority to CN201511003241.2A
Publication of CN105389598A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a feature selection and classification method for software defect data that can guide the whole process of classifying software defect data. It comprises the following steps: A, obtaining data from a software dataset and preprocessing it, where preprocessing includes labeling the data and dividing the software features into three categories according to known empirical knowledge; B, using mutual information theory to find the features with maximum correlation to the class and minimum correlation to other features, and adding them to an optimal software feature set; C, evaluating the selected software features with a classifier and arranging them in ascending order of classification result; and D, predicting the defects of software modules with a two-dimensional loop cascade Adaboost and the optimal feature subset, accurately removing defect-free samples in real time and thus reducing running time. The method addresses the imbalance of software defect data: majority-class samples are removed in real time to balance the dataset, which reduces running time and improves computational efficiency.

Description

Feature selection and classification method for software defect data
Technical field
The invention belongs to the field of software engineering applications and specifically relates to a feature selection and classification method for software defect data.
Background art
At present, software systems keep growing in scale and logical complexity, and the number of defective modules grows with them. This threatens software reliability, degrades software quality, and can cause immeasurable losses. Software defect prediction, an important means of guiding and assessing software testing work, can predict the distribution of software defects accurately, which has real practical value for improving software quality. For a software system, sound defect prediction can estimate the number and distribution of defects that have not yet been found but still exist. The key to software defect prediction is finding the defective modules. This is in essence a binary classification problem: software modules are divided into two classes, "defective" and "defect-free". Classification presupposes feature selection, and modules are then classified using the selected optimal feature subset. In practice, however, software defect prediction faces the following two difficulties:
(1) Software features contain a large number of redundant features
In 2004, NASA released a public software dataset (NASA MDP). The software features extracted from source code fall into three main categories: LOC, McCabe, and Halstead. Within each category, apart from the basic features extracted directly from source code, all other features are computed indirectly from those basic feature values. Experiments have also shown that as few as three important software features are enough to predict whether a software module contains defects. Each category therefore contains many redundant features. Including many redundant or irrelevant features in the computation reduces its speed and efficiency. Software features therefore need dimensionality reduction: within each feature category, the features with the greatest influence on defect prediction are selected.
(2) Software module data is severely imbalanced
In real software, "defective" modules (the minority class) are far fewer than "defect-free" modules (the majority class). Software defect prediction is therefore also an imbalanced-data classification problem, which has been a research focus in data mining in recent years. In prediction, the goal is to detect the "defective" modules so that they can be repaired. The dataset, however, contains a large number of majority-class, defect-free modules, which consume a great deal of time and resources during computation. Detecting the "defect-free" majority class accurately and as early as possible, and removing it from the dataset to reduce subsequent computation, is therefore important for the efficiency of the whole classification process.
To address these two problems, a complete feature selection and classification method suited to the characteristics of software data is developed, which is significant for improving software defect prediction and reducing computation time.
Summary of the invention
The object of the invention is to solve the low efficiency and long running time of existing feature selection and classification methods for software defect data by providing a feature selection and classification method for software defect data that reduces computation time and improves computational efficiency.
To achieve this object, the technical solution of the invention mainly comprises the following four steps:
A. Obtain data from the software dataset and preprocess it
(1) The data comprise the software feature set and the software modules. The software module data are divided into a training set and a test set for training and testing. The invention uses ten-fold cross-validation: the dataset is divided into ten parts, nine of which are used for training and one for accuracy testing. The data are also labeled.
(2) The feature set is classified according to existing knowledge, yielding three feature sets: the LOC class, the McCabe class, and the Halstead class.
B. Obtain the optimal software feature set according to mutual information theory
(1) For each feature f_i in the three feature sets, compute its correlation with the classes y_1 and y_2 according to mutual information theory, and sort the features in descending order of correlation. In each of the three feature sets, keep only the features ranked in the top 50% by correlation, which gives three reduced feature subsets.
(2) For each of the three reduced feature subsets, compute the correlation between its features and remove the features whose correlation with other features ranks in the top 30%, so that the final optimal feature subset is S, of size t, with S = {L, M, H}.
C. Rank the selected software features by classification performance
(1) Input the optimal features obtained above into an SVM one at a time and train it.
(2) Apply the trained classification model to the test set. After the classification results are obtained, sort the software features in ascending order of G-mean value. Taking one element from each feature category in order forms a triple of the optimal feature subset, denoted (l, h, m); the optimal feature subset S can then be expressed as S = {(l, h, m) | l ∈ L, h ∈ H, m ∈ M}.
D. Classify the software modules using the two-dimensional loop cascade Adaboost and the feature set S
(1) The cascade structure is set to five levels, and each level is an Adaboost classifier. Each Adaboost classifier is a weighted ensemble of several weak classifiers (classification error rate < 0.5).
(2) Select the first feature triple (l, h, m) from S; the G-mean values of its three elements are the smallest in S. Input this feature triple and carry out the first-level classification. Samples identified as defective pass directly to the next level, while samples identified as defect-free enter the loop structure of the current level for a second discrimination. If a sample is identified as defective in the second discrimination, it enters the next level of the loop; otherwise the sample is discarded.
(3) The second-level discrimination takes the second-ranked feature triple from S, the third level takes the third-ranked one, and the same operations are carried out, and so on up to the fifth level, so that five feature triples are used in total. The classification result is finally obtained.
Brief description of the drawings
Fig. 1 is a flowchart of the feature selection and classification method for software defect data.
Fig. 2 is the two-dimensional loop cascade Adaboost software prediction model.
Detailed description
The invention is described in further detail below with reference to Fig. 1.
Step 1: Obtain data from the software dataset and preprocess it
(1) First obtain the software feature set and the software module data, and label the training set. The feature set is F = {f_1, f_2, ..., f_m}. The software module dataset is {X, Y}, where X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2} = {+1, -1}. If software module x_i is defect-free, then (x_i, y_i) = (x_i, -1); otherwise (x_i, y_i) = (x_i, +1).
(2) The feature set is classified according to existing knowledge, yielding three feature sets: the LOC class, the McCabe class, and the Halstead class, abbreviated as L, M, and H.
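The labeling rule of this step, together with the ten-fold split described in the summary, can be sketched roughly as follows. This is only an illustrative sketch: the module names, the set of defective modules, the shuffling seed, and the helper names `label_modules` and `ten_fold_splits` are all assumptions, not part of the patent.

```python
import random

def label_modules(modules, defective_ids):
    """Attach +1 to defective modules and -1 to defect-free ones."""
    return [(m, +1 if m in defective_ids else -1) for m in modules]

def ten_fold_splits(samples, k=10, seed=0):
    """Yield (train, test) pairs: k-1 folds for training, one for testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# Hypothetical data: 50 modules, three of them known to be defective.
modules = [f"module_{i}" for i in range(50)]
defective = {"module_3", "module_17", "module_42"}
labeled = label_modules(modules, defective)
splits = list(ten_fold_splits(labeled))
```

With heavily imbalanced data, a plain random split like this can leave a test fold with no defective module at all; a stratified split would be the usual remedy, but the patent does not say which variant is used.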
Step 2: Obtain the optimal software feature set according to mutual information theory
According to mutual information theory, the correlation between any two variables can be computed by the following formula:
$$MI(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$
where p(x) and p(y) are the marginal probability distributions of x and y, and p(x, y) is their joint probability distribution.
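As a rough illustration, the formula above gives the log-ratio for one pair of values (x, y); a common way to turn it into a single score for two variables is to average it over their empirical joint distribution. The sketch below does that under two assumptions that are not stated in the patent: the feature values have already been discretized, and probabilities are estimated by simple counting. The function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        total += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return total

# A feature that mirrors the label carries log(2) nats about it ...
informative = mutual_information([0, 0, 1, 1], [-1, -1, 1, 1])
# ... while a feature independent of the label carries none.
uninformative = mutual_information([0, 0, 1, 1], [-1, 1, -1, 1])
```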
(1) For each feature f_i in the three feature sets, compute its correlation with the classes y_1 and y_2 according to mutual information theory, and sort the features in descending order of correlation. In each of the three feature sets, keep only the features ranked in the top 50% by correlation, which gives three reduced feature subsets.
(2) For each of the three reduced feature subsets, compute the correlation between its features and remove the features whose correlation with other features ranks in the top 30%, so that the final optimal feature subset is S, of size t, with S = {L, M, H}.
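The two filtering passes of this step might look like the sketch below for a single feature category. Several details are assumptions made only for illustration: the redundancy of a feature is scored as its mean mutual information with the other retained features, the cut-offs are rounded as shown, and the feature names and values are invented.

```python
import math
from collections import Counter

def mi(xs, ys):
    """Empirical mutual information (nats) of two discrete sequences."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())

def select_features(columns, labels, keep_top=0.5, drop_top=0.3):
    """Filter one feature category (e.g. LOC); columns maps name -> values."""
    # Pass 1: keep the top 50% of features by relevance to the class label.
    ranked = sorted(columns, key=lambda f: mi(columns[f], labels), reverse=True)
    kept = ranked[:max(1, math.ceil(len(ranked) * keep_top))]

    # Pass 2: drop the 30% of retained features that are most redundant.
    def redundancy(f):
        others = [g for g in kept if g != f]
        if not others:
            return 0.0
        return sum(mi(columns[f], columns[g]) for g in others) / len(others)

    most_redundant = sorted(kept, key=redundancy, reverse=True)
    dropped = set(most_redundant[:int(len(kept) * drop_top)])
    return [f for f in kept if f not in dropped]

# Hypothetical, already discretized LOC-style features for eight modules.
labels = [1, 1, 1, 1, -1, -1, -1, -1]
columns = {
    "loc_total":            [1, 1, 1, 1, 0, 0, 0, 0],
    "loc_executable":       [1, 1, 1, 1, 0, 0, 0, 0],  # duplicate of loc_total
    "loc_comments":         [1, 1, 1, 0, 0, 0, 0, 0],
    "loc_blank":            [1, 1, 0, 0, 0, 0, 0, 0],
    "loc_code_and_comment": [1, 0, 0, 0, 0, 0, 0, 0],
    "noise_a":              [1, 0, 1, 0, 1, 0, 1, 0],
    "noise_b":              [1, 1, 0, 0, 1, 1, 0, 0],
    "noise_c":              [0, 1, 1, 0, 1, 0, 0, 1],
}
selected = select_features(columns, labels)
```

In this toy data the two identical columns are both highly relevant, so both survive the first pass, and the second pass then removes one of them as redundant. The same routine would be run separately on the McCabe and Halstead categories.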
Step 3: Rank the selected software features by classification performance
(1) Input the optimal features obtained above into an SVM one at a time and train it.
(2) Apply the trained classification model to the test set. After the classification results are obtained, sort the software features in ascending order of G-mean value. Taking one element from each feature category in order forms a triple of the optimal feature subset, denoted (l, h, m); the optimal feature subset S can then be expressed as S = {(l, h, m) | l ∈ L, h ∈ H, m ∈ M}.
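The ranking and triple construction can be illustrated as follows. The G-mean is taken here in its usual sense, the geometric mean of the true positive rate and the true negative rate, which the patent does not spell out. The per-feature scores are invented numbers standing in for the results of the SVM runs, and the helper names are hypothetical.

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for labels +1 / -1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

def build_triples(scores_l, scores_h, scores_m):
    """Sort each category by ascending G-mean and pair features by rank."""
    def ascending(scores):
        return sorted(scores, key=scores.get)
    # zip stops at the shortest category, so surplus features are left out.
    return list(zip(ascending(scores_l), ascending(scores_h), ascending(scores_m)))

# One defective module out of two is caught, three of four clean ones are kept.
score = g_mean([1, 1, -1, -1, -1, -1], [1, -1, -1, -1, -1, 1])

# Hypothetical per-feature G-mean values for the three categories.
triples = build_triples(
    {"l_a": 0.7, "l_b": 0.5},
    {"h_a": 0.6, "h_b": 0.8},
    {"m_a": 0.9, "m_b": 0.4},
)
```

The first triple collects the lowest-scoring feature of each category, which matches the statement that the triple used at the first cascade level has the smallest G-mean values in S.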
Step 4: Classify the software modules using the two-dimensional loop cascade Adaboost and the feature set S
(1) The cascade structure is set to five levels, and each level is an Adaboost classifier. Each Adaboost classifier is a weighted ensemble of several weak classifiers (classification error rate < 0.5).
(2) Select the first feature triple (l, h, m) from S; the G-mean values of its three elements are the smallest in S. Input this feature triple and carry out the first-level classification. Samples identified as defective pass directly to the next level, while samples identified as defect-free enter the loop structure of the current level for a second discrimination. If a sample is identified as defective in the second discrimination, it enters the next level of the loop; otherwise the sample is discarded.
(3) The second-level discrimination takes the second-ranked feature triple from S, the third level takes the third-ranked one, and the same operations are carried out, and so on up to the fifth level, so that five feature triples are used in total. The classification result is finally obtained.
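The control flow of the five-level cascade can be sketched as below. This shows only the routing of samples, not the trained model: each Adaboost classifier is replaced by a simple threshold stub, and the second discrimination at each level is modeled as a separate, more lenient check, which is an assumption because the patent does not specify how the loop structure decides. All names, thresholds, and feature values are hypothetical.

```python
def cascade_predict(sample, stages):
    """Return +1 (defective) or -1 (defect-free) for one sample.

    stages is a list of (primary, secondary) callables returning +1 or -1.
    """
    for primary, secondary in stages:
        if primary(sample) == 1:
            continue      # flagged defective: pass straight to the next level
        if secondary(sample) == 1:
            continue      # flagged defective on the second check: move on
        return -1         # judged defect-free twice: discard the sample
    return 1              # survived all levels

def threshold_stub(features, threshold):
    """Stand-in for a trained Adaboost classifier on one feature triple."""
    return lambda s: 1 if sum(s[f] for f in features) > threshold else -1

# Five hypothetical feature triples, one per cascade level.
triples = [(f"l{k}", f"h{k}", f"m{k}") for k in range(1, 6)]
stages = [(threshold_stub(t, 2.5), threshold_stub(t, 1.5)) for t in triples]

def make_sample(value):
    """Build a sample whose fifteen features all share one value."""
    return {name: value for t in triples for name in t}

clearly_defective = cascade_predict(make_sample(1.0), stages)
borderline = cascade_predict(make_sample(0.6), stages)
clearly_clean = cascade_predict(make_sample(0.0), stages)
```

A clearly clean sample is rejected at the first level and never reaches the later ones, which is how the cascade removes majority-class samples early and saves computation.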
The invention provides a feature selection and classification method for software defect data. It should be noted that those of ordinary skill in the art can make further improvements without departing from the principles of the invention, and such improvements shall also be regarded as falling within the scope of protection of the invention. Any component not specified in this embodiment can be implemented with existing technology.

Claims (1)

1. A feature selection and classification method for software defect data, characterized in that it mainly comprises the following four steps:
A. Obtain data from the software dataset and preprocess it
(1) The data comprise the software feature set and the software modules; the software module data are divided into a training set and a test set for training and testing; the invention uses ten-fold cross-validation, in which the dataset is divided into ten parts, nine of which are used for training and one for accuracy testing; the data are also labeled;
(2) The feature set is classified according to existing knowledge, yielding three feature sets: the LOC class, the McCabe class, and the Halstead class;
B. Obtain the optimal software feature set according to mutual information theory
(1) For each feature f_i in the three feature sets, compute its correlation with the classes y_1 and y_2 according to mutual information theory, and sort the features in descending order of correlation; in each of the three feature sets, keep only the features ranked in the top 50% by correlation, which gives three reduced feature subsets;
(2) For each of the three reduced feature subsets, compute the correlation between its features and remove the features whose correlation with other features ranks in the top 30%, so that the final optimal feature subset is S, of size t, with S = {L, M, H};
C. Rank the selected software features by classification performance
(1) Input the optimal features obtained above into an SVM one at a time and train it;
(2) Apply the trained classification model to the test set; after the classification results are obtained, sort the software features in ascending order of G-mean value; taking one element from each feature category in order forms a triple of the optimal feature subset, denoted (l, h, m); the optimal feature subset S can then be expressed as:
S = {(l, h, m) | l ∈ L, h ∈ H, m ∈ M};
D. Classify the software modules using the two-dimensional loop cascade Adaboost and the feature set S
(1) The cascade structure is set to five levels, and each level is an Adaboost classifier; each Adaboost classifier is a weighted ensemble of several weak classifiers (classification error rate < 0.5);
(2) Select the first feature triple (l, h, m) from S, the G-mean values of its three elements being the smallest in S; input this feature triple and carry out the first-level classification; samples identified as defective pass directly to the next level, while samples identified as defect-free enter the loop structure of the current level for a second discrimination; if a sample is identified as defective in the second discrimination, it enters the next level of the loop; otherwise the sample is discarded;
(3) The second-level discrimination takes the second-ranked feature triple from S, the third level takes the third-ranked one, and the same operations are carried out, and so on up to the fifth level, so that five feature triples are used in total; the classification result is finally obtained.
CN201511003241.2A 2015-12-28 2015-12-28 Feature selecting and classifying method for software defect data Pending CN105389598A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511003241.2A CN105389598A (en) 2015-12-28 2015-12-28 Feature selecting and classifying method for software defect data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511003241.2A CN105389598A (en) 2015-12-28 2015-12-28 Feature selecting and classifying method for software defect data

Publications (1)

Publication Number Publication Date
CN105389598A true CN105389598A (en) 2016-03-09

Family

ID=55421868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511003241.2A Pending CN105389598A (en) 2015-12-28 2015-12-28 Feature selecting and classifying method for software defect data

Country Status (1)

Country Link
CN (1) CN105389598A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106201871A (en) * 2016-06-30 2016-12-07 重庆大学 Based on the Software Defects Predict Methods that cost-sensitive is semi-supervised
CN106201871B (en) * 2016-06-30 2018-10-02 重庆大学 Based on the Software Defects Predict Methods that cost-sensitive is semi-supervised
CN107391365A (en) * 2017-07-06 2017-11-24 武汉大学 A kind of hybrid characteristic selecting method of software-oriented failure prediction
CN107391365B (en) * 2017-07-06 2020-10-13 武汉大学 Mixed feature selection method oriented to software defect prediction
CN107577605A (en) * 2017-09-04 2018-01-12 南京航空航天大学 A kind of feature clustering system of selection of software-oriented failure prediction
CN109117380A (en) * 2018-09-28 2019-01-01 中国科学院长春光学精密机械与物理研究所 A kind of method for evaluating software quality, device, equipment and readable storage medium storing program for executing
CN109117380B (en) * 2018-09-28 2022-02-08 中国科学院长春光学精密机械与物理研究所 Software quality evaluation method, device, equipment and readable storage medium
CN109480871A (en) * 2018-10-30 2019-03-19 北京机械设备研究所 A kind of fatigue detection method towards RSVP brain-computer interface
CN110147321A (en) * 2019-04-19 2019-08-20 北京航空航天大学 A kind of recognition methods of the defect high risk module based on software network
CN110188692A (en) * 2019-05-30 2019-08-30 南通大学 A kind of reinforcing that effective target quickly identifies circulation Cascading Methods
CN112200261A (en) * 2020-10-20 2021-01-08 广东工业大学 Steel plate defect sampling method

Similar Documents

Publication Publication Date Title
CN105389598A (en) Feature selecting and classifying method for software defect data
US10706332B2 (en) Analog circuit fault mode classification method
CN106201871B (en) Based on the Software Defects Predict Methods that cost-sensitive is semi-supervised
CN103217960B (en) Automatic selection method of dynamic scheduling strategy of semiconductor production line
CN103632168B (en) Classifier integration method for machine learning
CN107316049A (en) A kind of transfer learning sorting technique based on semi-supervised self-training
CN109190626A (en) A kind of semantic segmentation method of the multipath Fusion Features based on deep learning
CN110263166A (en) Public sentiment file classification method based on deep learning
CN105389583A (en) Image classifier generation method, and image classification method and device
CN106651574A (en) Personal credit assessment method and apparatus
CN103617435A (en) Image sorting method and system for active learning
CN104268514A (en) Gesture detection method based on multi-feature fusion
CN106096661A (en) Zero sample image sorting technique based on relative priority random forest
CN105701013A (en) Software defect data feature selection method based on mutual information
CN104318110B (en) Method for improving risk design and maintenance efficiency of large complex system
CN104915679A (en) Large-scale high-dimensional data classification method based on random forest weighted distance
CN105631477A (en) Traffic sign recognition method based on extreme learning machine and self-adaptive lifting
CN107545038A (en) A kind of file classification method and equipment
CN113407644A (en) Enterprise industry secondary industry multi-label classifier based on deep learning algorithm
CN104966075A (en) Face recognition method and system based on two-dimensional discriminant features
CN106250913A (en) A kind of combining classifiers licence plate recognition method based on local canonical correlation analysis
CN103942415A (en) Automatic data analysis method of flow cytometer
CN104318224A (en) Face recognition method and monitoring equipment
CN109388804A (en) Report core views extracting method and device are ground using the security of deep learning model
Li et al. GADet: A Geometry-Aware X-ray Prohibited Items Detector

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160309
