CN109934278A - High-dimensional feature selection method combining information gain and a neighborhood rough set - Google Patents

High-dimensional feature selection method combining information gain and a neighborhood rough set

Info

Publication number
CN109934278A
CN109934278A (application CN201910168981.3A)
Authority
CN
China
Prior art keywords
attribute
information gain
feature
red
reduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910168981.3A
Other languages
Chinese (zh)
Other versions
CN109934278B (en)
Inventor
陆惠玲
周涛
张飞飞
梁蒙蒙
杨健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningxia Medical University
Original Assignee
Ningxia Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningxia Medical University filed Critical Ningxia Medical University
Priority to CN201910168981.3A priority Critical patent/CN109934278B/en
Publication of CN109934278A publication Critical patent/CN109934278A/en
Application granted granted Critical
Publication of CN109934278B publication Critical patent/CN109934278B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a high-dimensional feature selection method combining information gain with a neighborhood rough set. The specific steps are as follows: step 1, data preprocessing; step 2, image segmentation; step 3, feature extraction; step 4, feature normalization; step 5, feature selection based on information gain; step 6, feature selection based on the neighborhood rough set; step 7, classification of the two-stage reduction result. The invention provides a high-dimensional feature selection method combining information gain and a neighborhood rough set and analyzes the feasibility of the two-stage reduction algorithm at a theoretical level. The algorithm improves accuracy and effectively reduces time complexity; a comprehensive comparison against high-dimensional feature selection algorithms built with different methods confirms the superiority of the proposed method, and the step-by-step selection of model components guarantees the soundness of the results. The method has reference value for the benign/malignant identification of lung tumors.

Description

High-dimensional feature selection method combining information gain and a neighborhood rough set
Technical field
The present invention relates to the technical field of image processing, and more particularly to a high-dimensional feature selection method combining information gain and a neighborhood rough set.
Background technique
Information gain (information gain, IG) and rough set (rough set, RS) are two commonly used feature selection algorithms. IG is an index that measures how much information a feature provides to the classifier when it is included versus when it is not: the amount of information each feature provides to the classifier is computed in turn, the features are sorted in descending order, and the top K features are taken according to a certain rule, thereby achieving feature selection with information gain. Feature selection with IG has low computational complexity and requires only a single pass, so it is efficient and can effectively reject redundant, irrelevant and noisy features. However, as a filter algorithm, IG still has problems: it can only examine the contribution of a feature to the whole system rather than to a particular class, and it does not consider the relationships between features; it is therefore only suitable for "global" feature selection (all classes use the same feature set) and cannot perform "local" feature selection (each class has its own feature set, and some features discriminate a certain class well while being meaningless for other classes). RS is an effective tool for handling uncertain data; because it requires no prior knowledge, it has been widely applied in feature selection, pattern recognition, data mining, knowledge discovery and other fields. The two key concepts studied in RS are concept approximation and attribute reduction, where attribute reduction reduces the dimensionality of the attributes without affecting the distinguishability required by the current recognition task; however, RS is built on equivalence relations, which is rather limiting in many practical applications. Therefore, in order to avoid dependence on a single method and to better reject the redundant and irrelevant attributes in a data set, many scholars have combined the global feature selection ability of IG with the superior attribute reduction ability of RS for high-dimensional feature selection, with successful applications to sentiment analysis, real-estate price analysis, tumor diagnosis classification, fishing-condition prediction and so on. However, Pawlak RS can only handle nominal variables, while the data in practical applications are often continuous numeric variables; although a discretized data set suits the construction of RS equivalence classes, important information may be lost and different discretization strategies also affect the reduction result. For this reason, Hu Qinghua et al. introduced neighborhood relations and proposed an improved Pawlak RS, the neighborhood rough set (neighborhood rough set, NRS), which can handle continuous numeric data directly. Although IG and RS can each perform feature selection on their own, both have limitations, so combining their advantages for feature selection is feasible: a highly relevant feature subset is first selected with IG, and the highly redundant attributes are then rejected with NRS, where NRS overcomes the problem that RS is only suitable for discrete variables and causes a large loss of the original information. Obtaining the optimal feature subset through two attribute reductions can better reject the redundant and irrelevant features in the data set, improve the performance of the algorithm, reduce time complexity, and also avoid dependence on a single method.
Therefore, providing a high-dimensional feature selection method that combines information gain with a neighborhood rough set is an urgent problem for those skilled in the art.
Summary of the invention
In view of this, the present invention provides a high-dimensional feature selection method combining information gain and a neighborhood rough set, and analyzes the feasibility of the two-stage reduction algorithm at a theoretical level. Comparison with the no-reduction algorithm and the Pawlak RS, IG and NRS reduction algorithms shows that the algorithm improves accuracy and effectively reduces time complexity; a comprehensive comparison of high-dimensional feature selection algorithms built with different methods confirms the superiority of the proposed method, and the step-by-step selection of model components guarantees the soundness of the results. The method has reference value for the benign/malignant identification of lung tumors.
To achieve the above objectives, the invention provides the following technical scheme:
A high-dimensional feature selection method combining information gain and a neighborhood rough set, the specific steps of which include the following:
Step 1: data preprocessing; the images are numbered in sequence, converted from pseudo-color to grayscale, the ROI region is delineated in the grayscale image, and the ROI image is normalized;
Step 2: image segmentation; the preprocessed ROI image is segmented with the maximum between-class variance method;
Step 3: feature extraction; features are extracted from the segmented target region of the ROI, and a continuous decision information table S0 is constructed;
Step 4: feature normalization; the condition attributes of the continuous decision information table S0 constructed in step 3 are normalized, obtaining a new continuous decision information table S;
Step 5: feature selection based on information gain; the continuous decision information table S from step 4 is taken as input, and the attribute set red1 after information gain reduction is obtained;
Step 6: feature selection based on the neighborhood rough set; the attribute set red1 after information gain reduction is input and, through neighborhood rough set feature selection, the two-stage reduction result red is obtained;
Step 7: classification is performed on the two-stage reduction result.
Preferably, in the above high-dimensional feature selection method combining information gain and a neighborhood rough set, in step 1 all ROI images are normalized to 50 × 50 pixels in order to eliminate errors introduced during ROI acquisition and to facilitate subsequent image processing.
Preferably, in the above high-dimensional feature selection method combining information gain and a neighborhood rough set, in step 2 the ROI image is split at a certain threshold into two groups, one corresponding to the background and one corresponding to the target.
Preferably, in the above high-dimensional feature selection method combining information gain and a neighborhood rough set, the features extracted in step 3 include: shape features, texture features and gray-level features.
Preferably, in the above high-dimensional feature selection method combining information gain and a neighborhood rough set, in step 4 the features extracted from the segmented target region of the ROI are normalized so that all normalized data fall within [0, 1], using the formula
x' = (x − xmin) / (xmax − xmin)
where xmax and xmin denote the maximum and minimum of the sample array, respectively. Only the condition attributes of the continuous decision information table S0 constructed after the feature extraction of step 3 are normalized; the decision attribute is not normalized. The new continuous decision information table S is thus obtained.
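A minimal NumPy sketch of this column-wise min-max normalization, assuming the condition attributes are stored as a 2-D array with one row per sample; the variable and function names are illustrative, not part of the patent:

```python
import numpy as np

def minmax_normalize(X):
    """Scale every condition attribute (column) of X into [0, 1]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard against constant columns
    return (X - x_min) / span

# Usage sketch: normalize a 3000 x 104 condition-attribute matrix; the
# decision attribute (the benign/malignant label) is kept aside untouched.
# X_norm = minmax_normalize(X)
```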
Preferably, in the above high-dimensional feature selection method combining information gain and a neighborhood rough set, the specific steps of step 5 include:
1) Input the continuous decision information table S = (U, A, V, f), where U denotes the universe; A = C ∪ D, C is the condition attribute set, i.e., the normalized set of features extracted from the segmented target region of the ROI, and D is the set of decision attributes; V is the union of the attribute value domains; f is the information function describing the mapping relation;
2) Initialize the attribute set red1 = ∅, compute the information gain Gain(Ci) of each condition attribute, and compute the average value average of the information gains of the condition attributes;
3) Select the attribute ci with the largest information gain, set red1 = red1 ∪ {ci}, and remove ci from the condition attribute set C;
4) If the information gain of the attribute ci with the largest information gain is less than the average value average, stop and output the attribute set red1 of the information gain reduction; otherwise return to step 2).
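A sketch of this information gain stage in Python, using scikit-learn's mutual_info_classif as a stand-in estimator of information gain for continuous attributes and keeping every attribute whose gain is at least the average gain (a simplification of the iterative max-selection loop above); the names are illustrative, not part of the patent:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def ig_select(X, y):
    """Keep the condition attributes whose estimated information gain is
    at least the average gain, ordered from largest gain to smallest."""
    gains = mutual_info_classif(X, y, random_state=0)  # IG proxy for continuous attributes
    average = gains.mean()
    red1 = np.where(gains >= average)[0]
    return red1[np.argsort(-gains[red1])]

# Usage sketch (X_norm: 3000 x 104 normalized features, y in {1, -1}):
# red1 = ig_select(X_norm, y)
# X_red1 = X_norm[:, red1]
```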
Preferably, in the above high-dimensional feature selection method combining information gain and a neighborhood rough set, the specific steps of step 6 include:
1) Input the attribute set red1 of the information gain reduction as the decision table (U, A', V, f), where A' = C' ∪ D and C' is the set of condition attributes from step 5 whose information gain is greater than or equal to the average information gain; determine the neighborhood radius δ; set the significance lower limit to 0.001;
2) Initialize the two-stage reduction set red = ∅ and the sample set smp = U;
3) For each candidate attribute a ∈ C' \ red, compute the positive region POSred∪{a}(D) = { xi ∈ U | δred∪{a}(xi) ⊆ Xj for some decision class Xj }, where the neighborhood δB(xi) = { xj | xj ∈ U, ΔB(xi, xj) ≤ δ }, ΔB is the distance computed over the attributes in B, and N denotes the number of equivalence classes into which the decision attribute D divides the universe U;
4) Select the attribute ak for which the positive region POSred∪{ak}(D) is largest;
5) Compute the attribute significance sig(ak, red, D) = γred∪{ak}(D) − γred(D), where γB(D) = |POSB(D)| / |U| denotes the dependency of the decision attribute D on the attribute subset B;
6) If sig(ak, red, D) is greater than the preset significance lower limit, record k, set red = red ∪ {ak}, update smp, and return to step 2) to continue; otherwise output the reduction result red and terminate.
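A minimal Python sketch of this neighborhood rough set stage, assuming the conventional forward greedy reduction (an attribute is added while its significance exceeds the 0.001 floor, and the loop stops when no attribute clears it); the Euclidean distance for ΔB and the neighborhood radius value are assumptions, and all names are illustrative:

```python
import numpy as np

def positive_region_size(X, y, attrs, delta):
    """Count samples whose delta-neighborhood over `attrs` is pure, i.e.
    lies entirely inside a single decision class (the positive region)."""
    sub = X[:, attrs]
    count = 0
    for i in range(len(sub)):
        dist = np.linalg.norm(sub - sub[i], axis=1)   # Delta_B(x_i, x_j), Euclidean
        neighborhood = dist <= delta                  # delta_B(x_i)
        if np.unique(y[neighborhood]).size == 1:      # neighborhood inside one class
            count += 1
    return count

def nrs_reduce(X, y, candidates, delta=0.15, sig_floor=0.001):
    """Forward greedy neighborhood rough set reduction over `candidates`."""
    n = len(y)
    red, remaining = [], list(candidates)
    gamma_red = 0.0
    while remaining:
        # attribute whose addition maximizes the positive region
        best = max(remaining,
                   key=lambda a: positive_region_size(X, y, red + [a], delta))
        gamma_new = positive_region_size(X, y, red + [best], delta) / n
        if gamma_new - gamma_red <= sig_floor:        # significance below the floor: stop
            break
        red.append(best)
        remaining.remove(best)
        gamma_red = gamma_new
    return red

# Usage sketch (red1 from the information gain stage):
# red = nrs_reduce(X_norm, np.asarray(y), list(red1), delta=0.15, sig_floor=0.001)
```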
It can be seen from the above technical scheme that, compared with the prior art, the present invention provides a high-dimensional feature selection method combining information gain and a neighborhood rough set and analyzes the feasibility of the two-stage reduction algorithm at a theoretical level. The algorithm improves accuracy and effectively reduces time complexity; a comprehensive comparison of high-dimensional feature selection algorithms built with different methods confirms the superiority of the proposed method, and the step-by-step selection of model components guarantees the soundness of the results. The method has reference value for the benign/malignant identification of lung tumors.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flowchart of the invention;
Fig. 2 is a schematic diagram of the data preprocessing of the invention;
Fig. 3 is a schematic diagram of the image segmentation of the invention;
Fig. 4 is a flowchart of the feature selection based on the neighborhood rough set of the invention;
Fig. 5 is a histogram comparing the reduction lengths of different algorithms in Experiment 1 of the invention;
Fig. 6 is a histogram comparing the classification accuracy of different algorithms in Experiment 2 of the invention;
Fig. 7 is a histogram comparing the classification time of different algorithms in Experiment 2 of the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a high-dimensional feature selection method combining information gain and a neighborhood rough set and analyzes the feasibility of the two-stage reduction algorithm at a theoretical level. Comparison with the no-reduction algorithm and the Pawlak RS, IG and NRS reduction algorithms shows that the algorithm improves accuracy and effectively reduces time complexity; a comprehensive comparison of high-dimensional feature selection algorithms built with different methods confirms the superiority of the proposed method, and the step-by-step selection of model components guarantees the soundness of the results. The method has reference value for the benign/malignant identification of lung tumors.
Embodiment:
(1) data acquisition
The data come from the General Hospital of Ningxia Medical University. Each case includes the clinical diagnosis, image data and examination findings; the clinical diagnosis is the reference standard for benign/malignant lung tumors. To avoid insufficient model training caused by too little data, this study is not limited to a particular type of lung tumor. In total, 3000 lung tumor data were obtained, including 1500 malignant lung tumor CT samples and 1500 benign lung tumor CT samples.
(2) data prediction
Benign and malignant lung tumor CT images are read from the DICOM files according to the examination conclusion in the medical order of each case. The images are numbered in sequence and converted from pseudo-color to grayscale. Centered on the lesion marked by the radiologist, a sub-image with strong discriminative power is intercepted from the grayscale image as the lung tumor ROI, and the ROI image is normalized to 50 × 50 pixels. The data preprocessing process is shown in Fig. 2.
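A sketch of this preprocessing using OpenCV, under the assumption that each slice has already been loaded as a pseudo-color array and that the ROI bounding box around the marked lesion is supplied separately; the function name is illustrative:

```python
import cv2

def preprocess_slice(pseudo_color_img, roi_box, size=(50, 50)):
    """Convert a pseudo-color slice to grayscale, crop the lesion-centered
    ROI, and normalize it to a fixed 50 x 50 pixel patch."""
    gray = cv2.cvtColor(pseudo_color_img, cv2.COLOR_BGR2GRAY)
    x, y, w, h = roi_box                          # bounding box around the marked lesion
    roi = gray[y:y + h, x:x + w]
    return cv2.resize(roi, size, interpolation=cv2.INTER_AREA)
```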
(3) image segmentation
In order to measure the shape, texture and gray-level features of the lung images accurately, the preprocessed ROI is segmented with the maximum between-class variance method (the Otsu algorithm), because the Otsu algorithm is one of the most effective and most stable automatic threshold selection methods and, under certain conditions, is not affected by changes in image contrast and brightness. Its basic principle is to split the ROI image at a certain threshold into two groups, one corresponding to the background and one corresponding to the target. Fig. 3 shows five example pairs before and after segmentation.
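A minimal OpenCV sketch of the Otsu thresholding described above, splitting an 8-bit grayscale ROI patch into background and target groups; the variable and function names are illustrative:

```python
import cv2

def otsu_segment(roi_patch):
    """Split an 8-bit grayscale ROI patch into background (0) and target (255)
    at the threshold chosen automatically by Otsu's method."""
    threshold, mask = cv2.threshold(roi_patch, 0, 255,
                                    cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    target = cv2.bitwise_and(roi_patch, roi_patch, mask=mask)  # keep target pixels only
    return threshold, mask, target
```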
(4) feature extraction
Feature extraction is carried out on the target region obtained after the segmentation of step (3). A total of 104-dimensional features are extracted, including shape features, texture features and gray-level features; the specific features are listed in Table 1. After feature extraction, the continuous decision information table S0 is constructed: it contains 3000 samples, each with 104-dimensional condition attributes and a 1-dimensional decision attribute.
Table 1 CT image features of lung tumors
(5) feature normalization
In order to obtain accurate processing results, the features extracted from the segmented target region of the ROI (i.e., the continuous feature set extracted in step (4)) are normalized to eliminate differences in magnitude and dimension. The invention uses the common min-max method so that the normalized data all fall within [0, 1]:
x' = (x − xmin) / (xmax − xmin)
where xmax and xmin denote the maximum and minimum of the sample array, respectively.
(6) Feature selection based on information gain
Input: the continuous decision information table S = (U, A, V, f), where U = (x1, x2, ..., xn) is the universe, i.e., the set of all samples; A = C ∪ D, where C is the set of condition attributes (the 104-dimensional feature set normalized in step (5)) and D is the set of decision attributes (the benign/malignant label of the lung tumor, with 1 denoting a malignant lung tumor and −1 a benign lung tumor); V is the union of the attribute value domains; f is the information function describing the mapping relation.
Output: the attribute set red1 after information gain reduction.
Steps: 1) Initialize the set red1 = ∅, compute the information gain Gain(Ci) of each condition attribute, and compute the average value average of the information gains of the condition attributes;
2) Select the attribute ci with the largest information gain, set red1 = red1 ∪ {ci}, and remove ci from the condition attribute set C;
3) If the information gain of the attribute ci with the largest information gain is less than the average value average, stop and output red1; otherwise return to step 2).
(7) Feature selection based on the neighborhood rough set
NRS attribute reduction deletes redundant attributes without affecting the decision-making capability of the decision system itself; the reduction algorithm uses a forward greedy strategy, as shown in Fig. 4, and its key steps are as follows:
Input: the attribute set red1 after information gain reduction, as the decision table (U, A', V, f), where A' = C' ∪ D and C' is the set of condition attributes from step (6) whose information gain is greater than or equal to the average information gain; determine the neighborhood radius δ; set the significance lower limit to 0.001.
Output: the two-stage reduction set red.
Steps: 1) Initialize the two-stage reduction set red = ∅ and the sample set smp = U;
2) For each candidate attribute a ∈ C' \ red, compute the positive region POSred∪{a}(D) = { xi ∈ U | δred∪{a}(xi) ⊆ Xj for some decision class Xj }, where the neighborhood δB(xi) = { xj | xj ∈ U, ΔB(xi, xj) ≤ δ }, ΔB is the distance computed over the attributes in B, and N denotes the number of equivalence classes into which the decision attribute D divides the universe U;
3) Select the attribute ak for which the positive region POSred∪{ak}(D) is largest;
4) Compute the attribute significance sig(ak, red, D) = γred∪{ak}(D) − γred(D), where γB(D) = |POSB(D)| / |U| denotes the dependency of the decision attribute D on the attribute subset B;
5) If sig(ak, red, D) is greater than the preset significance lower limit, record k, set red = red ∪ {ak}, update smp, and return to step 2) to continue; otherwise output the reduction result red and terminate.
(8) The support vector machine (SVM) is a machine learning method built on statistical learning theory. Based on the structural risk minimization principle, it handles small-sample, overfitting, high-dimensional and local-extremum problems well, has strong generalization and classification ability, and can effectively address the "nonlinear, high-dimensional" problems encountered in computer-aided diagnosis based on medical images. The SVM is used to classify the two-stage reduction result, where the radial basis function (Radial Basis Function, RBF) is chosen as the kernel and the SVM parameters C and g are optimized by the grid search (Grid Search, GS) algorithm.
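A scikit-learn sketch of this classification stage: an RBF-kernel SVM whose C and gamma are tuned by grid search with five-fold cross-validation on the reduced attribute set; the parameter grid values and names are illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

def train_svm_on_reduct(X, y, red):
    """Grid-search C and gamma of an RBF-kernel SVM on the reduced attributes."""
    param_grid = {"C": [2.0 ** k for k in range(-5, 6)],
                  "gamma": [2.0 ** k for k in range(-8, 3)]}
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")
    search.fit(X[:, red], y)
    return search.best_estimator_, search.best_params_

# Usage sketch:
# model, params = train_svm_on_reduct(X_norm, y, red)
```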
The performance evaluation of early-diagnosis accuracy includes the two main indices of sensitivity and specificity, but these two indices alone can hardly give a comprehensive interpretation of the overall performance of a classifier. Therefore, the evaluation index used for the reduction model is the reduction length, and the evaluation indices used for the classification model include: accuracy (Accuracy), sensitivity (Sensitivity), specificity (Specificity), F value (F-score), Matthews correlation coefficient (Matthews correlation coefficient, MCC), balanced F score (balanced F score, F1 score), Youden index (Youden index, YI) and algorithm time (Time).
Accuracy (Accuracy) is the most common evaluation index; the higher the accuracy, the better the classifier. It is calculated as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity (sensitivity) and specificity (specificity) measure the classifier's ability to recognize positive and negative examples, respectively; the larger the value, the better the recognition performance. They are calculated as:
Sensitivity = TP / (TP + FN), Specificity = TN / (TN + FP)
The F value is the weighted harmonic mean of recall and precision and is used to balance precision and recall.
MCC is the correlation coefficient between the actual and the predicted classification. It takes true positives, true negatives, false positives and false negatives into account and is a relatively balanced index. Its value ranges over [−1, 1]; the closer the value is to 1, the more accurate the prediction for the tested objects. It is calculated as:
MCC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
The F1 score is a comprehensive index used in statistics to measure the accuracy of a binary classification model; it is a weighted average of precision and recall. Its value ranges over [0, 1]; the closer the value is to 1, the higher the accuracy of the model. It is calculated as:
F1 = 2 · Precision · Recall / (Precision + Recall)
YI, also known as the correctness index, is the sum of sensitivity and specificity minus 1. Its value ranges over [0, 1]; the closer the value is to 1, the better the authenticity of the model prediction. It is calculated as:
YI=Sensitivity+Specificity-1
Algorithm time (Time) denotes the time taken by the algorithm from the start of its execution to its end.
Here TP denotes the number of samples correctly classified as positive, i.e., samples that are actually positive and classified as positive by the classifier; FP denotes the number of samples incorrectly classified as positive, i.e., samples that are actually negative but classified as positive; FN denotes the number of samples incorrectly classified as negative, i.e., samples that are actually positive but classified as negative; TN denotes the number of samples correctly classified as negative, i.e., samples that are actually negative and classified as negative.
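A small Python sketch computing the listed indices from the TP/FP/FN/TN counts defined above, using their standard formulas; the function name is illustrative:

```python
import math

def classification_indices(tp, fp, fn, tn):
    """Compute the evaluation indices above from the 2 x 2 confusion counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)                    # recall on positive (malignant) cases
    specificity = tn / (tn + fp)                    # recall on negative (benign) cases
    precision   = tp / (tp + fp)
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    mcc         = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    youden      = sensitivity + specificity - 1
    return {"Accuracy": accuracy, "Sensitivity": sensitivity,
            "Specificity": specificity, "F1": f1, "MCC": mcc, "Youden": youden}
```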
Experimental result and analysis
From a theoretical point of view, the feature subset obtained after the two-stage IG and NRS reduction effectively reduces the dimensionality of the original decision information table and lowers time and space complexity. The IG stage preliminarily screens out data noise and rejects attributes with low relevance, and the second NRS reduction effectively rejects highly redundant attributes. In order to further verify the feasibility and effectiveness of the proposed two-stage-reduction high-dimensional feature selection algorithm, 3000 lung tumor CT images (1500 benign and 1500 malignant) are taken as the research objects; after obtaining the ROI, shape, texture and gray-level features totalling 104 dimensions are extracted to construct the original feature set, two-stage reduction is carried out with IG and NRS, and the reduction result is classified with SVM.
Experiment 1: comparison of the reduction results of different algorithms
The original decision information table is reduced with different algorithms; the concrete results are shown in Fig. 5. As can be seen from Fig. 5, the dimensionality of the information table after reduction with any of the algorithms is greatly reduced compared with no reduction; the reduction length of the algorithm of the invention is only larger than that of the NRS algorithm and is 65 dimensions smaller than the dimensionality of the original information table.
Experiment 2: comparison of the classification results of different algorithms
The reduction results of the different algorithms in Experiment 1 are classified with SVM using five-fold cross-validation (each time 300 samples are taken from the 1500 benign and the 1500 malignant samples respectively as the test set, and the remaining 1200 of each serve as the training set). The superiority of the algorithms is evaluated with eight indices: accuracy, sensitivity, specificity, F value, MCC, F1 score, Youden index and total time; for each index, the average of the five-fold cross-validation results of each algorithm is taken as the final evaluation result. The concrete results are shown in Table 2:
Table 2 Comparison of the classification results of different algorithms
As can be seen from Table 2, the evaluation indices of the same algorithm differ across cross-validation folds; in order to measure the performance of an algorithm comprehensively, the average over the five folds is taken as its final classification result. Except in comparison with Pawlak RS-SVM, the sensitivity of the algorithm of the invention is slightly lower than that of the other algorithms, while all other indices are better: accuracy, specificity, F value, MCC, F1 score and Youden index are improved by 0.17%–0.84%, 0.67%–1.4%, 0.0015–0.0081, 0.0035–0.0169, 0.0017–0.0083 and 0.003–0.0167 respectively, and the time is reduced by 8.06 s–203.81 s. Since accuracy and time are the most common evaluation indices, the averages of these two indices are drawn as histograms in Fig. 6 and Fig. 7 to show the differences between the algorithms more clearly. As can be seen from Fig. 6 and Fig. 7, the accuracy of the algorithm of the invention is the highest and that of the Pawlak RS-SVM model is the lowest. This is because Pawlak RS is built on equivalence relations and can only handle nominal variables: numeric data must be discretized, which not only increases the time complexity of the algorithm but also loses important information, different discretization methods also affect the final result, and the discretized feature set cannot fully characterize the lung tumor ROI. The time complexity of the algorithm of the invention is 4.27 times lower than that without reduction and is also lower than that of the other algorithms. It can be seen that the algorithm of the invention can improve the accuracy of high-dimensional feature selection for lung tumors, effectively reduce the time complexity of the algorithm, and has certain promotion value.
In order to improve the performance of computer-aided diagnosis of lung tumors, the advantages and disadvantages of IG and NRS are analyzed, a high-dimensional feature selection algorithm for lung tumors combining IG and NRS is proposed, and the feasibility of the two-stage reduction algorithm is analyzed at a theoretical level. To verify the effectiveness of the algorithm, 104-dimensional features of 3000 lung tumor CT images are extracted to construct the decision information table, the optimal feature subset is obtained through two-stage attribute reduction with IG and NRS, and finally classification is carried out with SVM. Comparison with the no-reduction algorithm and the Pawlak RS, IG and NRS reduction algorithms shows that the algorithm improves accuracy and effectively reduces time complexity; a comprehensive comparison of lung tumor high-dimensional feature selection algorithms built with different methods confirms the superiority of the method of the invention, and the step-by-step selection of model components guarantees the soundness of the results. The method has reference value for the computer-aided diagnosis of lung tumors.
Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. The device disclosed in an embodiment corresponds to the method disclosed in the embodiment, so its description is relatively simple; for relevant details, reference is made to the description of the method part.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A high-dimensional feature selection method combining information gain and a neighborhood rough set, characterized in that the specific steps include the following:
Step 1: data preprocessing; the images are numbered in sequence, converted from pseudo-color to grayscale, the ROI region is delineated in the grayscale image, and the ROI image is normalized;
Step 2: image segmentation; the preprocessed ROI image is segmented with the maximum between-class variance method to obtain a background region image and a target region image;
Step 3: feature extraction; features are extracted from the segmented target region of the ROI, and a continuous decision information table S0 is constructed;
Step 4: feature normalization; the continuous decision information table S0 constructed in step 3 is normalized, wherein only the condition attributes of S0 are normalized, obtaining the continuous decision information table S;
Step 5: feature selection based on information gain; the continuous decision information table S from step 4 is taken as input, feature selection is carried out, and the attribute set red1 after information gain reduction is obtained;
Step 6: feature selection based on the neighborhood rough set; the attribute set red1 after information gain reduction is input and, through neighborhood rough set feature selection, the two-stage reduction result red is obtained;
Step 7: classification is performed on the two-stage reduction result.
2. The high-dimensional feature selection method combining information gain and a neighborhood rough set according to claim 1, characterized in that the features extracted in step 3 include: shape features, texture features and gray-level features.
3. The high-dimensional feature selection method combining information gain and a neighborhood rough set according to claim 1, characterized in that in step 4 the features extracted from the segmented target region of the ROI are normalized so that all normalized data fall within [0, 1], using the formula
x' = (x − xmin) / (xmax − xmin)
where xmax and xmin denote the maximum and minimum of the sample array, respectively.
4. The high-dimensional feature selection method combining information gain and a neighborhood rough set according to claim 1, characterized in that the specific steps of step 5 include:
1) Input the continuous decision information table S = (U, A, V, f), where U denotes the universe; A = C ∪ D, C denotes the condition attribute set, i.e., the normalized set of features extracted from the segmented target region of the ROI, and D denotes the set of decision attributes; V denotes the union of the attribute value domains; f denotes the information function describing the mapping relation;
2) Initialize the attribute set red1 = ∅, compute the information gain Gain(Ci) of each condition attribute, and compute the average value average of the information gains of the condition attributes;
3) Select the attribute ci with the largest information gain, set red1 = red1 ∪ {ci}, and remove ci from the condition attribute set C;
4) If the information gain of the attribute ci with the largest information gain is less than the average value average, stop and output the attribute set red1 of the information gain reduction; otherwise return to step 2).
5. The high-dimensional feature selection method combining information gain and a neighborhood rough set according to claim 4, characterized in that the specific steps of step 6 include:
1) Input the attribute set red1 of the information gain reduction as the decision table (U, A', V, f), where A' = C' ∪ D and C' denotes the set of condition attributes from step 5 whose information gain is greater than or equal to the average information gain; determine the neighborhood radius δ; set the significance lower limit to 0.001;
2) Initialize the two-stage reduction set red = ∅ and the sample set smp = U;
3) For each candidate attribute a ∈ C' \ red, compute the positive region POSred∪{a}(D) = { xi ∈ U | δred∪{a}(xi) ⊆ Xj for some decision class Xj }, where the neighborhood δB(xi) = { xj | xj ∈ U, ΔB(xi, xj) ≤ δ }, ΔB is the distance computed over the attributes in B, and N denotes the number of equivalence classes into which the decision attribute D divides the universe U;
4) Select the attribute ak for which the positive region POSred∪{ak}(D) is largest;
5) Compute the attribute significance sig(ak, red, D) = γred∪{ak}(D) − γred(D), where γB(D) = |POSB(D)| / |U| denotes the dependency of the decision attribute D on the attribute subset B;
6) If sig(ak, red, D) is greater than the preset significance lower limit, record k, set red = red ∪ {ak}, update smp, and return to step 2) to continue; otherwise output the reduction result red and terminate.
CN201910168981.3A 2019-03-06 2019-03-06 High-dimensionality feature selection method for information gain mixed neighborhood rough set Active CN109934278B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910168981.3A CN109934278B (en) 2019-03-06 2019-03-06 High-dimensionality feature selection method for information gain mixed neighborhood rough set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910168981.3A CN109934278B (en) 2019-03-06 2019-03-06 High-dimensionality feature selection method for information gain mixed neighborhood rough set

Publications (2)

Publication Number Publication Date
CN109934278A true CN109934278A (en) 2019-06-25
CN109934278B CN109934278B (en) 2023-06-27

Family

ID=66986458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910168981.3A Active CN109934278B (en) 2019-03-06 2019-03-06 High-dimensionality feature selection method for information gain mixed neighborhood rough set

Country Status (1)

Country Link
CN (1) CN109934278B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110464345A (en) * 2019-08-22 2019-11-19 北京航空航天大学 A kind of separate head bioelectrical power signal interference elimination method and system
CN110598192A (en) * 2019-06-28 2019-12-20 太原理工大学 Text feature reduction method based on neighborhood rough set
CN110988804A (en) * 2019-11-11 2020-04-10 浙江大学 Radar radiation source individual identification system based on radar pulse sequence
CN111476455A (en) * 2020-03-03 2020-07-31 中国南方电网有限责任公司 Power grid operation section feature selection and online generation method based on two-stage structure
CN111553127A (en) * 2020-04-03 2020-08-18 河南师范大学 Multi-label text data feature selection method and device
CN112200259A (en) * 2020-10-19 2021-01-08 哈尔滨理工大学 Information gain text feature selection method and classification device based on classification and screening
CN112365992A (en) * 2020-11-27 2021-02-12 安徽理工大学 Medical examination data identification and analysis method based on NRS-LDA

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004114650A2 (en) * 2003-06-16 2004-12-29 Hewlett Packard Development Company, L.P. Systems and methods for dot gain determination and dot gain based printing
CN101923604A (en) * 2010-07-23 2010-12-22 福建师范大学 Classification method for weighted KNN oncogene expression profiles based on neighborhood rough set
CN102510363A (en) * 2011-09-30 2012-06-20 哈尔滨工程大学 LFM (linear frequency modulation) signal detecting method under strong interference source environment
CN102755172A (en) * 2011-04-28 2012-10-31 株式会社东芝 Nuclear medical imaging method and device
CN103202714A (en) * 2012-01-16 2013-07-17 株式会社东芝 Ultrasonic Diagnostic Apparatus, Medical Image Processing Apparatus, And Medical Image Processing Method
CN103258204A (en) * 2012-02-21 2013-08-21 中国科学院心理研究所 Automatic micro-expression recognition method based on Gabor features and edge orientation histogram (EOH) features
CN103336790A (en) * 2013-06-06 2013-10-02 湖州师范学院 Hadoop-based fast neighborhood rough set attribute reduction method
CN103336791A (en) * 2013-06-06 2013-10-02 湖州师范学院 Hadoop-based fast rough set attribute reduction method
CN103744928A (en) * 2013-12-30 2014-04-23 北京理工大学 Network video classification method based on historical access records
US20140213466A1 (en) * 2010-11-19 2014-07-31 Rutgers, The State University Of New Jersey High-throughput assessment method for contact hypersensitivity
CN105758450A (en) * 2015-12-23 2016-07-13 西安石油大学 Fire protection pre-warning sensing system building method based on multiple sensor emergency robots
CN106202886A (en) * 2016-06-29 2016-12-07 中国铁路总公司 Track circuit red band Fault Locating Method based on fuzzy coarse central Yu decision tree
CN107194420A (en) * 2017-05-16 2017-09-22 浙江象立医疗科技有限公司 A kind of Fuzzy and Rough concentrates the attribute selection method based on information gain-ratio
CN107679368A (en) * 2017-09-11 2018-02-09 宁夏医科大学 PET/CT high dimensional feature level systems of selection based on genetic algorithm and varied precision rough set
CN108334859A (en) * 2018-02-28 2018-07-27 上海海洋大学 A kind of optical remote sensing Warships Model identification crowdsourcing system based on fine granularity feature
CN108389109A (en) * 2018-02-11 2018-08-10 中国民航信息网络股份有限公司 A kind of suspicious order feature extracting method of civil aviaton based on composite character selection algorithm

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004114650A2 (en) * 2003-06-16 2004-12-29 Hewlett Packard Development Company, L.P. Systems and methods for dot gain determination and dot gain based printing
CN101923604A (en) * 2010-07-23 2010-12-22 福建师范大学 Classification method for weighted KNN oncogene expression profiles based on neighborhood rough set
US20140213466A1 (en) * 2010-11-19 2014-07-31 Rutgers, The State University Of New Jersey High-throughput assessment method for contact hypersensitivity
CN102755172A (en) * 2011-04-28 2012-10-31 株式会社东芝 Nuclear medical imaging method and device
CN102510363A (en) * 2011-09-30 2012-06-20 哈尔滨工程大学 LFM (linear frequency modulation) signal detecting method under strong interference source environment
CN103202714A (en) * 2012-01-16 2013-07-17 株式会社东芝 Ultrasonic Diagnostic Apparatus, Medical Image Processing Apparatus, And Medical Image Processing Method
CN103258204A (en) * 2012-02-21 2013-08-21 中国科学院心理研究所 Automatic micro-expression recognition method based on Gabor features and edge orientation histogram (EOH) features
CN103336790A (en) * 2013-06-06 2013-10-02 湖州师范学院 Hadoop-based fast neighborhood rough set attribute reduction method
CN103336791A (en) * 2013-06-06 2013-10-02 湖州师范学院 Hadoop-based fast rough set attribute reduction method
CN103744928A (en) * 2013-12-30 2014-04-23 北京理工大学 Network video classification method based on historical access records
CN105758450A (en) * 2015-12-23 2016-07-13 西安石油大学 Fire protection pre-warning sensing system building method based on multiple sensor emergency robots
CN106202886A (en) * 2016-06-29 2016-12-07 中国铁路总公司 Track circuit red band Fault Locating Method based on fuzzy coarse central Yu decision tree
CN107194420A (en) * 2017-05-16 2017-09-22 浙江象立医疗科技有限公司 A kind of Fuzzy and Rough concentrates the attribute selection method based on information gain-ratio
CN107679368A (en) * 2017-09-11 2018-02-09 宁夏医科大学 PET/CT high dimensional feature level systems of selection based on genetic algorithm and varied precision rough set
CN108389109A (en) * 2018-02-11 2018-08-10 中国民航信息网络股份有限公司 A kind of suspicious order feature extracting method of civil aviaton based on composite character selection algorithm
CN108334859A (en) * 2018-02-28 2018-07-27 上海海洋大学 A kind of optical remote sensing Warships Model identification crowdsourcing system based on fine granularity feature

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LIU JINGHUA: "Online multi-label streaming feature selection based on neighborhood rough set", Pattern Recognition *
LIU CUICUI: "Research on a tumor feature gene selection algorithm based on an improved neighborhood rough set", Wireless Internet Technology *
WANG RONGRONG et al.: "Fault diagnosis method for hydro-generator units based on rough set and genetic algorithm", China Rural Water and Hydropower *
ZHAN RONG et al.: "Quantitative analysis of personalized requirement classification", Soft Science *
DENG DAYONG et al.: "Double-layer absolute reduction of multi-granulation rough sets", Pattern Recognition and Artificial Intelligence *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598192A (en) * 2019-06-28 2019-12-20 太原理工大学 Text feature reduction method based on neighborhood rough set
CN110464345A (en) * 2019-08-22 2019-11-19 北京航空航天大学 A kind of separate head bioelectrical power signal interference elimination method and system
CN110988804A (en) * 2019-11-11 2020-04-10 浙江大学 Radar radiation source individual identification system based on radar pulse sequence
CN110988804B (en) * 2019-11-11 2022-01-25 浙江大学 Radar radiation source individual identification system based on radar pulse sequence
CN111476455A (en) * 2020-03-03 2020-07-31 中国南方电网有限责任公司 Power grid operation section feature selection and online generation method based on two-stage structure
CN111553127A (en) * 2020-04-03 2020-08-18 河南师范大学 Multi-label text data feature selection method and device
CN111553127B (en) * 2020-04-03 2023-11-24 河南师范大学 Multi-label text data feature selection method and device
CN112200259A (en) * 2020-10-19 2021-01-08 哈尔滨理工大学 Information gain text feature selection method and classification device based on classification and screening
CN112365992A (en) * 2020-11-27 2021-02-12 安徽理工大学 Medical examination data identification and analysis method based on NRS-LDA

Also Published As

Publication number Publication date
CN109934278B (en) 2023-06-27

Similar Documents

Publication Publication Date Title
Arunkumar et al. Fully automatic model‐based segmentation and classification approach for MRI brain tumor using artificial neural networks
CN109934278A (en) A kind of high-dimensional feature selection method of information gain mixing neighborhood rough set
Carvalho et al. Breast cancer diagnosis from histopathological images using textural features and CBIR
Farid et al. A novel approach of CT images feature analysis and prediction to screen for corona virus disease (COVID-19)
CN108364006B (en) Medical image classification device based on multi-mode deep learning and construction method thereof
Lee et al. Random forest based lung nodule classification aided by clustering
de Carvalho Filho et al. Automatic detection of solitary lung nodules using quality threshold clustering, genetic algorithm and diversity index
Bridge et al. Introducing the GEV activation function for highly unbalanced data to develop COVID-19 diagnostic models
Orozco et al. Lung nodule classification in CT thorax images using support vector machines
Kundu et al. An automatic bleeding frame and region detection scheme for wireless capsule endoscopy videos based on interplane intensity variation profile in normalized RGB color space
de Sousa Costa et al. Classification of malignant and benign lung nodules using taxonomic diversity index and phylogenetic distance
CN109978880A (en) Lung tumors CT image is carried out sentencing method for distinguishing using high dimensional feature selection
Borkowski et al. Comparing artificial intelligence platforms for histopathologic cancer diagnosis
Dong et al. Cervical cell classification based on the CART feature selection algorithm
Buda et al. Deep radiogenomics of lower-grade gliomas: convolutional neural networks predict tumor genomic subtypes using MR images
Yuan et al. An efficient multi-path 3D convolutional neural network for false-positive reduction of pulmonary nodule detection
Sethanan et al. Double AMIS-ensemble deep learning for skin cancer classification
Diniz et al. An ensemble method for nuclei detection of overlapping cervical cells
Kumar et al. Recent advances in machine learning for diagnosis of lung disease: A broad view
Vogado et al. A ensemble methodology for automatic classification of chest X-rays using deep learning
Ganeshkumar et al. Two-stage deep learning model for automate detection and classification of lung diseases
Singh et al. Detection of Brain Tumors Through the Application of Deep Learning and Machine Learning Models
Grace John et al. Extreme learning machine algorithm‐based model for lung cancer classification from histopathological real‐time images
Kaur et al. A survey on medical image segmentation
Hu et al. Classification of malignant-benign pulmonary nodules in lung CT images using an improved random forest (Use style: Paper title)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant