CN105389598A - Feature selecting and classifying method for software defect data - Google Patents
Feature selecting and classifying method for software defect data
- Publication number
- CN105389598A
- Authority
- CN
- China
- Prior art keywords
- feature
- software
- data
- triple
- optimal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a feature selection and classification method for software defect data that can guide the whole process of classifying software defect data. The method comprises the following steps: A, obtaining data from a software dataset and preprocessing it, where preprocessing includes labeling the data and dividing the software features into three classes according to existing empirical knowledge; B, using mutual information theory to find the features with maximal class relevance and minimal inter-feature redundancy, and adding them to an optimal software feature set; C, classifying with the selected software features using a classifier, and arranging the features in ascending order of their classification results; and D, predicting the defects of software modules using a two-dimensional loop cascade AdaBoost and the optimal feature subset, so that defect-free samples are removed accurately and in a timely manner, reducing computation time. The method addresses the imbalance of software defect data: majority-class samples are removed in a timely manner to balance the dataset, which reduces computation time and improves computational efficiency.
Description
Technical field
The invention belongs to the field of software engineering applications and specifically relates to a feature selection and classification method for software defect data.
Background art
At present, software systems are growing steadily in both scale and logical complexity. As the number of defective modules in software increases, software reliability is threatened and software quality suffers, which can cause immeasurable losses. Software defect prediction, an important means of guiding and evaluating software testing work, can accurately predict the distribution of software defects, which is of practical significance for improving software quality. For a software system, sound defect prediction can estimate the number and distribution of defects that have not yet been found but still exist. The key to software defect prediction is finding the defective modules, which is in essence a binary classification problem: software modules are divided into two classes, "defective" and "defect-free". Classification presupposes feature selection, and classification is carried out on the selected optimal feature subset. In practice, however, software defect prediction faces the following two difficulties:
(1) Software features contain a large number of redundant features
In 2004, NASA made its software datasets (NASA MDP) public. The software features extracted from the source code fall into three main classes: LOC, McCabe, and Halstead. Within each class, only the basic features are extracted directly from the source code; the other features are computed indirectly from these basic feature values. Experiments have also shown that only three important software features are needed to predict whether a software module contains defects. Each class of software features therefore contains considerable redundancy. When large numbers of redundant or irrelevant features take part in the computation, computation speed and efficiency inevitably drop. The software features therefore need dimensionality reduction: within each feature class, the features with the greatest influence on software defect prediction are selected.
(2) Software module data are severely imbalanced
In real software, "defective" modules (the minority class) are far fewer than "defect-free" modules (the majority class). Software defect prediction is therefore also a classification problem on imbalanced data, which has been a research focus in data mining in recent years. In prediction, the goal is to detect the "defective" modules so that they can be repaired. The dataset, however, contains a large number of majority-class, defect-free modules, which consume substantial time and resources during computation. Detecting the "defect-free" majority class accurately and as early as possible and removing it from the dataset, thereby reducing subsequent computation, is therefore important for the efficiency of the whole classification process.
To address these two problems, a complete feature selection and classification method suited to the characteristics of software data is developed, which is significant for improving software defect prediction and reducing computation time.
Summary of the invention
The object of the invention is to solve the low efficiency and long running time of existing feature selection and classification methods for software defect data by providing a feature selection and classification method for software defect data that reduces computation time and improves computational efficiency.
To achieve this object, the technical solution of the invention mainly comprises the following four steps:
A. Obtain data from the software dataset and preprocess the data
(1) The data comprise a software feature set and software modules. The software module data are divided into a training set and a test set for training and testing. The invention uses ten-fold cross-validation: the dataset is divided into ten parts, nine of which are used for training and one for accuracy testing. The data are also labeled.
(2) The feature set is classified according to existing knowledge, yielding three feature sets: the LOC class, the McCabe class, and the Halstead class.
B. Obtain the optimal software feature set according to mutual information theory
(1) Using mutual information theory, compute the correlation of each feature f_i in the three feature sets with the classes y_1 and y_2. Sort the features in descending order of correlation and, within each of the three feature sets, keep only the features ranked in the top 50% by correlation, yielding three reduced feature subsets.
(2) For each of the three reduced feature subsets, compute the correlation between features and remove the features that are highly correlated with the features ranked in the top 30%. The final optimal feature subset is denoted S, with size t and S = {L, M, H}.
C. Sort the selected software features by classification performance
(1) Input the obtained optimal features into an SVM one at a time and train it.
(2) Apply the trained classification model to the test set. After the classification results are obtained, sort the software features in ascending order of G-mean value. Take one element from each feature class in order to form a triple of the optimal feature subset, denoted (l, h, m); the optimal feature subset S can then be expressed as S = {(l, h, m) | l ∈ L, h ∈ H, m ∈ M}.
D. Classify the software modules using the two-dimensional loop cascade AdaBoost and the feature set S
(1) Set the cascade structure to five stages, each of which is an AdaBoost classifier. Each AdaBoost classifier is a weighted ensemble of several weak classifiers (classification error rate < 0.5).
(2) Select the first feature triple (l, h, m) from S; the G-mean values of the three elements of this triple are the smallest in S. Input this feature triple and perform the first-stage classification. Samples identified as defective pass directly to the next stage, while samples identified as defect-free enter the loop structure of the current stage for a second discrimination. If a sample is identified as defective at any point during the second discrimination, it enters the next stage of the loop; otherwise the sample is discarded.
(3) The second-stage discrimination then takes the feature triples ranked second and third from S and performs the operations above. The process continues in the same way up to the fifth stage, using five feature triples. The classification result is finally obtained.
Brief description of the drawings
Fig. 1 is a flow chart of the feature selection and classification method for software defect data.
Fig. 2 shows the two-dimensional loop cascade AdaBoost software prediction model.
Embodiment
The invention is described in further detail below with reference to Fig. 1.
Step 1: Obtain data from the software dataset and preprocess the data
(1) First obtain the software feature set and the software module data, and label the training set. The feature set is F = {f_1, f_2, ..., f_m}. The software module dataset is {X, Y}, where X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2} = {+1, -1}. If software module x_i is defect-free, then (x_i, y_i) = (x_i, -1); otherwise (x_i, y_i) = (x_i, +1).
(2) The feature set is classified according to existing knowledge, yielding three feature sets: the LOC class, the McCabe class, and the Halstead class, abbreviated L, M, and H.
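The labeling convention above and the ten-fold cross-validation named in the summary can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the shuffling, and the fixed seed are assumptions introduced here.

```python
import random


def label_modules(defect_flags):
    """Map per-module defect flags to the labels used in the text:
    +1 for a defective module, -1 for a defect-free module."""
    return [+1 if flag else -1 for flag in defect_flags]


def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Shuffling and the seed are illustrative assumptions; the source only
    states that nine parts are used for training and one for testing.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal parts.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

Each module index appears in exactly one test fold, so every module is used once for accuracy testing and nine times for training.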
Step 2: Obtain the optimal software feature set according to mutual information theory
According to mutual information theory, the correlation between any two variables can be computed as follows:
$$I(x; y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
where p(x) and p(y) are the marginal probability distributions of x and y, and p(x, y) is the joint probability distribution of x and y.
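For discrete variables, the mutual information can be estimated from empirical frequencies. The sketch below assumes the feature values have already been discretized (the source does not say how continuous software metrics are binned, so that step is left out); the function name is hypothetical and the result is in nats.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(x; y) in nats from paired discrete observations,
    using empirical frequencies for p(x), p(y) and p(x, y)."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) written with raw counts.
        ratio = c * n / (count_x[x] * count_y[y])
        mi += p_xy * math.log(ratio)
    return mi
```

A feature that determines the class of a balanced two-class sample exactly yields ln 2, and a feature that is statistically independent of the class yields 0.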
(1) Using mutual information theory, compute the correlation of each feature f_i in the three feature sets with the classes y_1 and y_2. Sort the features in descending order of correlation and, within each of the three feature sets, keep only the features ranked in the top 50% by correlation, yielding three reduced feature subsets.
(2) For each of the three reduced feature subsets, compute the correlation between features and remove the features that are highly correlated with the features ranked in the top 30%. The final optimal feature subset is denoted S, with size t and S = {L, M, H}.
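The two filtering passes of this step can be sketched as below for a single feature class; the function would be applied separately to the L, M, and H sets. The source gives the 50% and 30% cut-offs but no numeric criterion for "highly correlated", so `redundancy_threshold` is an assumption, as are the function names and the dictionary-of-columns input format.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (nats) of two discrete sequences."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())


def select_features(columns, labels, keep_ratio=0.5, top_ratio=0.3,
                    redundancy_threshold=0.5):
    """Return the feature names kept for one feature class.

    columns maps a feature name to its list of discretized values.
    redundancy_threshold is a hypothetical cut-off, not from the source.
    """
    # Pass 1: rank by relevance to the class label, keep the top 50%.
    relevance = {name: mutual_information(values, labels)
                 for name, values in columns.items()}
    ranked = sorted(columns, key=lambda name: relevance[name], reverse=True)
    kept = ranked[:max(1, math.ceil(len(ranked) * keep_ratio))]

    # Pass 2: drop features that are redundant with the top 30% of the kept set.
    n_top = max(1, math.ceil(len(kept) * top_ratio))
    top, rest = kept[:n_top], kept[n_top:]
    selected = list(top)
    for name in rest:
        redundant = any(
            mutual_information(columns[name], columns[t]) > redundancy_threshold
            for t in top)
        if not redundant:
            selected.append(name)
    return selected
```

With six features of which two are identical copies, the copy is removed as redundant while a weaker but distinct feature survives.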
Step 3: Sort the selected software features by classification performance
(1) Input the obtained optimal features into an SVM one at a time and train it.
(2) Apply the trained classification model to the test set. After the classification results are obtained, sort the software features in ascending order of G-mean value. Take one element from each feature class in order to form a triple of the optimal feature subset, denoted (l, h, m); the optimal feature subset S can then be expressed as S = {(l, h, m) | l ∈ L, h ∈ H, m ∈ M}.
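The ranking and triple construction can be sketched as follows. The source does not define the G-mean; the code assumes the usual definition for imbalanced classification, the square root of sensitivity times specificity. The SVM training itself is omitted, and the per-feature scores are taken as given. The function names are illustrative.

```python
import math


def g_mean(y_true, y_pred):
    """G-mean under the common definition sqrt(sensitivity * specificity),
    with +1 as the defective class and -1 as the defect-free class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def build_triples(scores_l, scores_h, scores_m):
    """Sort each class's features in ascending G-mean order and combine them
    position by position into (l, h, m) triples.

    Each argument maps a feature name to the G-mean obtained with it.
    """
    def ranked(scores):
        return sorted(scores, key=scores.get)

    return list(zip(ranked(scores_l), ranked(scores_h), ranked(scores_m)))
```

Because `zip` stops at the shortest list, the number of triples equals the size of the smallest of the three reduced feature classes.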
Step 4: Classify the software modules using the two-dimensional loop cascade AdaBoost and the feature set S
(1) Set the cascade structure to five stages, each of which is an AdaBoost classifier. Each AdaBoost classifier is a weighted ensemble of several weak classifiers (classification error rate < 0.5).
(2) Select the first feature triple (l, h, m) from S; the G-mean values of the three elements of this triple are the smallest in S. Input this feature triple and perform the first-stage classification. Samples identified as defective pass directly to the next stage, while samples identified as defect-free enter the loop structure of the current stage for a second discrimination. If a sample is identified as defective at any point during the second discrimination, it enters the next stage of the loop; otherwise the sample is discarded.
(3) The second-stage discrimination then takes the feature triples ranked second and third from S and performs the operations above. The process continues in the same way up to the fifth stage, using five feature triples. The classification result is finally obtained.
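The control flow of the cascade, with its in-stage second discrimination, can be sketched as below. This shows only the routing of samples; each stage classifier is passed in as a plain callable, whereas in the described method it would be an AdaBoost classifier trained on a feature triple. The source does not specify what the second discrimination uses, so modeling it as a list of extra callables per stage is an assumption, and the thresholds and metric names in the usage example are made up.

```python
def cascade_filter(samples, stages):
    """Route samples through the cascade; return (kept, discarded).

    stages is a list of (primary, rechecks) pairs. primary and every entry
    of rechecks are callables mapping a sample to +1 (defective) or
    -1 (defect-free).
    """
    remaining = list(samples)
    discarded = []
    for primary, rechecks in stages:
        survivors = []
        for sample in remaining:
            if primary(sample) == 1:
                survivors.append(sample)   # defective: straight to the next stage
            elif any(recheck(sample) == 1 for recheck in rechecks):
                survivors.append(sample)   # flagged during the in-stage recheck
            else:
                discarded.append(sample)   # judged defect-free both times
        remaining = survivors
    return remaining, discarded


def threshold(key, limit):
    """Toy stand-in for a trained stage classifier (illustrative only)."""
    return lambda sample: 1 if sample[key] > limit else -1


stages = [
    (threshold("loc", 100), [threshold("cyclomatic", 10)]),
    (threshold("volume", 500), [threshold("loc", 300)]),
]
modules = [
    {"loc": 150, "cyclomatic": 4, "volume": 900},
    {"loc": 40, "cyclomatic": 12, "volume": 200},
    {"loc": 30, "cyclomatic": 2, "volume": 100},
]
kept, dropped = cascade_filter(modules, stages)
```

Each stage shrinks the set passed to the next one, which is how the cascade reduces later computation on the defect-free majority class.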
The invention provides a feature selection and classification method for software defect data. It should be noted that those of ordinary skill in the art can make improvements without departing from the principle of the invention, and such improvements shall also be regarded as falling within the scope of protection of the invention. Any component not specified in this embodiment can be implemented using existing technology.
Claims (1)
1. A feature selection and classification method for software defect data, characterized in that it mainly comprises the following four steps:
A. Obtain data from the software dataset and preprocess the data
(1) The data comprise a software feature set and software modules; the software module data are divided into a training set and a test set for training and testing; the invention uses ten-fold cross-validation, in which the dataset is divided into ten parts, nine of which are used for training and one for accuracy testing; the data are also labeled;
(2) The feature set is classified according to existing knowledge, yielding three feature sets: the LOC class, the McCabe class, and the Halstead class;
B. Obtain the optimal software feature set according to mutual information theory
(1) Using mutual information theory, compute the correlation of each feature f_i in the three feature sets with the classes y_1 and y_2; sort the features in descending order of correlation and, within each of the three feature sets, keep only the features ranked in the top 50% by correlation, yielding three reduced feature subsets;
(2) For each of the three reduced feature subsets, compute the correlation between features and remove the features that are highly correlated with the features ranked in the top 30%; the final optimal feature subset is denoted S, with size t and S = {L, M, H};
C. Sort the selected software features by classification performance
(1) Input the obtained optimal features into an SVM one at a time and train it;
(2) Apply the trained classification model to the test set; after the classification results are obtained, sort the software features in ascending order of G-mean value; take one element from each feature class in order to form a triple of the optimal feature subset, denoted (l, h, m); the optimal feature subset S can then be expressed as:
S = {(l, h, m) | l ∈ L, h ∈ H, m ∈ M};
D. Classify the software modules using the two-dimensional loop cascade AdaBoost and the feature set S
(1) Set the cascade structure to five stages, each of which is an AdaBoost classifier; each AdaBoost classifier is a weighted ensemble of several weak classifiers (classification error rate < 0.5);
(2) Select the first feature triple (l, h, m) from S, the G-mean values of the three elements of this triple being the smallest in S; input this feature triple and perform the first-stage classification; samples identified as defective pass directly to the next stage, while samples identified as defect-free enter the loop structure of the current stage for a second discrimination; if a sample is identified as defective at any point during the second discrimination, it enters the next stage of the loop; otherwise the sample is discarded;
(3) The second-stage discrimination then takes the feature triples ranked second and third from S and performs the operations above; the process continues in the same way up to the fifth stage, using five feature triples; the classification result is finally obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511003241.2A CN105389598A (en) | 2015-12-28 | 2015-12-28 | Feature selecting and classifying method for software defect data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511003241.2A CN105389598A (en) | 2015-12-28 | 2015-12-28 | Feature selecting and classifying method for software defect data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105389598A true CN105389598A (en) | 2016-03-09 |
Family
ID=55421868
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511003241.2A Pending CN105389598A (en) | 2015-12-28 | 2015-12-28 | Feature selecting and classifying method for software defect data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105389598A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106201871A (en) * | 2016-06-30 | 2016-12-07 | 重庆大学 | Based on the Software Defects Predict Methods that cost-sensitive is semi-supervised |
CN107391365A (en) * | 2017-07-06 | 2017-11-24 | 武汉大学 | A kind of hybrid characteristic selecting method of software-oriented failure prediction |
CN107577605A (en) * | 2017-09-04 | 2018-01-12 | 南京航空航天大学 | A kind of feature clustering system of selection of software-oriented failure prediction |
CN109117380A (en) * | 2018-09-28 | 2019-01-01 | 中国科学院长春光学精密机械与物理研究所 | A kind of method for evaluating software quality, device, equipment and readable storage medium storing program for executing |
CN109480871A (en) * | 2018-10-30 | 2019-03-19 | 北京机械设备研究所 | A kind of fatigue detection method towards RSVP brain-computer interface |
CN110147321A (en) * | 2019-04-19 | 2019-08-20 | 北京航空航天大学 | A kind of recognition methods of the defect high risk module based on software network |
CN110188692A (en) * | 2019-05-30 | 2019-08-30 | 南通大学 | A kind of reinforcing that effective target quickly identifies circulation Cascading Methods |
CN112200261A (en) * | 2020-10-20 | 2021-01-08 | 广东工业大学 | Steel plate defect sampling method |
-
2015
- 2015-12-28 CN CN201511003241.2A patent/CN105389598A/en active Pending
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106201871A (en) * | 2016-06-30 | 2016-12-07 | 重庆大学 | Based on the Software Defects Predict Methods that cost-sensitive is semi-supervised |
CN106201871B (en) * | 2016-06-30 | 2018-10-02 | 重庆大学 | Based on the Software Defects Predict Methods that cost-sensitive is semi-supervised |
CN107391365A (en) * | 2017-07-06 | 2017-11-24 | 武汉大学 | A kind of hybrid characteristic selecting method of software-oriented failure prediction |
CN107391365B (en) * | 2017-07-06 | 2020-10-13 | 武汉大学 | Mixed feature selection method oriented to software defect prediction |
CN107577605A (en) * | 2017-09-04 | 2018-01-12 | 南京航空航天大学 | A kind of feature clustering system of selection of software-oriented failure prediction |
CN109117380A (en) * | 2018-09-28 | 2019-01-01 | 中国科学院长春光学精密机械与物理研究所 | A kind of method for evaluating software quality, device, equipment and readable storage medium storing program for executing |
CN109117380B (en) * | 2018-09-28 | 2022-02-08 | 中国科学院长春光学精密机械与物理研究所 | Software quality evaluation method, device, equipment and readable storage medium |
CN109480871A (en) * | 2018-10-30 | 2019-03-19 | 北京机械设备研究所 | A kind of fatigue detection method towards RSVP brain-computer interface |
CN110147321A (en) * | 2019-04-19 | 2019-08-20 | 北京航空航天大学 | A kind of recognition methods of the defect high risk module based on software network |
CN110188692A (en) * | 2019-05-30 | 2019-08-30 | 南通大学 | A kind of reinforcing that effective target quickly identifies circulation Cascading Methods |
CN112200261A (en) * | 2020-10-20 | 2021-01-08 | 广东工业大学 | Steel plate defect sampling method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105389598A (en) | Feature selecting and classifying method for software defect data | |
US10706332B2 (en) | Analog circuit fault mode classification method | |
CN106201871B (en) | Based on the Software Defects Predict Methods that cost-sensitive is semi-supervised | |
CN103217960B (en) | Automatic selection method of dynamic scheduling strategy of semiconductor production line | |
CN103632168B (en) | Classifier integration method for machine learning | |
CN107316049A (en) | A kind of transfer learning sorting technique based on semi-supervised self-training | |
CN109190626A (en) | A kind of semantic segmentation method of the multipath Fusion Features based on deep learning | |
CN110263166A (en) | Public sentiment file classification method based on deep learning | |
CN105389583A (en) | Image classifier generation method, and image classification method and device | |
CN106651574A (en) | Personal credit assessment method and apparatus | |
CN103617435A (en) | Image sorting method and system for active learning | |
CN104268514A (en) | Gesture detection method based on multi-feature fusion | |
CN106096661A (en) | Zero sample image sorting technique based on relative priority random forest | |
CN105701013A (en) | Software defect data feature selection method based on mutual information | |
CN104318110B (en) | Method for improving risk design and maintenance efficiency of large complex system | |
CN104915679A (en) | Large-scale high-dimensional data classification method based on random forest weighted distance | |
CN105631477A (en) | Traffic sign recognition method based on extreme learning machine and self-adaptive lifting | |
CN107545038A (en) | A kind of file classification method and equipment | |
CN113407644A (en) | Enterprise industry secondary industry multi-label classifier based on deep learning algorithm | |
CN104966075A (en) | Face recognition method and system based on two-dimensional discriminant features | |
CN106250913A (en) | A kind of combining classifiers licence plate recognition method based on local canonical correlation analysis | |
CN103942415A (en) | Automatic data analysis method of flow cytometer | |
CN104318224A (en) | Face recognition method and monitoring equipment | |
CN109388804A (en) | Report core views extracting method and device are ground using the security of deep learning model | |
Li et al. | GADet: A Geometry-Aware X-ray Prohibited Items Detector |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160309 |