CN111583194A - High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm - Google Patents
- Publication number
- CN111583194A (application CN202010322570.8A)
- Authority
- CN
- China
- Prior art keywords
- algorithm
- bird nest
- svm
- feature
- fitness
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
- G06T7/0012—Biomedical image inspection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2111—Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/12—Edge-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/136—Segmentation; Edge detection involving thresholding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10072—Tomographic images
- G06T2207/10081—Computed x-ray tomography [CT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30096—Tumor; Lesion
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Radiology & Medical Imaging (AREA)
- Medical Informatics (AREA)
- Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
- Physiology (AREA)
- Quality & Reliability (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a high-dimensional feature selection algorithm based on a Bayesian rough set and a cuckoo algorithm, which comprises the following steps: acquiring a lung tumor image, and performing target contour segmentation to obtain a segmented ROI image; extracting high-dimensional characteristic components of the segmented ROI image, and constructing a decision information table containing characteristic attributes based on the characteristic components; and reducing the original feature space by adopting a BRSGA algorithm to obtain an optimal feature subset, optimizing a penalty factor and a kernel function parameter of the SVM by utilizing a CS algorithm, and inputting the reduced feature subset into the optimized SVM to obtain a classification recognition result. According to the method, the optimal feature subset is generated through the genetic algorithm and the BRS, the feature dimension is reduced on the premise of not reducing the classification accuracy, the constraint of manual parameter setting is eliminated, and the time consumption is reduced. The CS is used for carrying out global optimization on SVM parameters, so that the method has the advantages of more effective exploration of search space, enriched population diversity, good robustness and stronger global search capability.
Description
Technical Field
The invention relates to the technical field of medical image recognition, in particular to a high-dimensional feature selection algorithm based on a Bayesian rough set and a Cuckoo algorithm.
Background
With the development of Computer Aided Diagnosis (CAD) research, medical image processing techniques have advanced rapidly. However, because medical images are multimodal, have blurred gray levels and carry uncertainty, the missed diagnosis rate and the misdiagnosis rate are high when diagnosis relies on single-modality medical images. Processing technologies for medical images of different modalities have therefore been developed; according to the level at which they operate, they are divided into pixel-level, feature-level and decision-level processing. Feature-level processing compresses the amount of information while keeping the important information, so its processing speed is higher. In feature-level processing of medical images, the redundancy and correlation among features turn the curse of dimensionality into an NP-hard problem. Feature selection is an effective measure for this problem: it reduces the dimension of the feature space and thereby the time complexity.
The problems of the high-dimensional feature selection process include how to generate an optimal feature subset, how to evaluate its effect, which classifier to use for the evaluation, and how to optimize the parameters of that classifier. In response to these problems, experts and scholars have proposed many algorithms in recent years. Firstly, the Variable Precision Rough Set (VPRS) overcomes the limitation that the Rough Set (RS) can only process exactly classified data: by introducing the classification error rate β, it relaxes the lower approximation of the RS from "complete inclusion" to "partial inclusion", which improves the robustness and generalization capability of the results on noisy data sets. The core of VPRS research is the selection of the classification error rate β, and the research covers three main aspects. First, without considering the details of β selection, various extended VPRS models are proposed, such as the variable precision fuzzy rough set, the variable precision multi-granularity rough set, the generalized VPRS, and the extended VPRS based on the β-tolerance relation and the Bhattacharyya distance. Second, the value of β is obtained by different calculation methods, such as taking the average inclusion degree as the threshold of the upper and lower approximations. Third, probability formulas are introduced to give a number of probabilistic RS models, such as the VPRS, the game-theoretic rough set, the decision-theoretic rough set, the Bayesian Rough Set (BRS) and the 0.5 probabilistic rough set. The various probabilistic rough set methods are related to one another; they differ in how the probability formula is calculated and how the parameters are designed.
The BRS introduces the prior probability on the basis of the VPRS and uses it in place of the classification error rate β, so no parameter needs to be set manually. This not only overcomes the completely exact division that the RS requires for the lower approximation, but also avoids the influence of the parameter β on the upper and lower approximations in the VPRS. Many studies on the BRS are still at the theoretical analysis stage, a mature independent model is lacking, and no work has been reported that combines the BRS with other algorithms for high-dimensional feature selection of medical images.
Secondly, the performance of the classifier is the basis for evaluating a high-dimensional feature selection algorithm. The Support Vector Machine (SVM) is a commonly used binary classification algorithm, and the introduction of kernel functions widens its range of application. Commonly used kernel functions include the polynomial kernel, the radial basis function (RBF) kernel and the Sigmoid kernel. The polynomial kernel is slow to compute, which seriously affects its effectiveness, so it is less widely applied. The RBF kernel has fewer parameters than the Sigmoid kernel, and only the kernel matrix needs to be calculated, so its time complexity is small. However, setting the parameters manually involves a large workload and takes a long time, and the parameters finally obtained are not necessarily optimal, so parameter selection needs to be converted into an optimization problem for analysis.
Therefore, how to provide a high-dimensional feature selection algorithm based on a bayesian rough set and a cuckoo algorithm with low time complexity and better robustness is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a high-dimensional feature selection algorithm based on a Bayesian rough set and a cuckoo algorithm: a feature selection algorithm with two-stage optimization, BRSGA followed by CS, that combines the BRS, GA, CS and SVM algorithms. In the first stage, the BRSGA algorithm reduces the original feature space to obtain an optimal feature subset. In the second stage, the CS algorithm optimizes the penalty factor and the kernel function parameter of the SVM, the optimal parameter combination is used to construct a CS-SVM classification model, and the lung tumor image is identified.
In order to achieve the above purpose, the invention provides the following technical scheme:
A high-dimensional feature selection algorithm based on a Bayesian rough set and a cuckoo algorithm comprises the following steps:
S1, obtaining a lung tumor image, and performing target contour segmentation to obtain a segmented ROI image;
S2, extracting high-dimensional feature components of the segmented ROI image, and constructing a decision information table containing feature attributes based on the high-dimensional feature components, wherein the feature attributes correspond to features of different dimensions in the high-dimensional feature components;
S3, based on a Bayesian rough set model, constructing a fitness objective function as the weighted sum of a global relative gain function, an attribute reduction length and a gene coding weight function, and reducing the feature attributes with a combination of genetic operators to obtain a reduced feature subset;
S4, optimizing the penalty factor and the kernel function parameter of the SVM by using a cuckoo algorithm, and inputting the reduced feature subset into the optimized SVM to obtain a classification recognition result.
Preferably, the high-dimensional feature components in S2 include shape features, texture features and gray scale features of the lung tumor image.
Preferably, the S3 specifically includes the following steps:
S31, constructing a fitness objective function:
the first objective function, target1, is the global relative gain function of the equivalence relation E with respect to the feature attribute D; the global relative gain is adopted to measure the attribute importance of the information system S;
the second objective function, target2, is the attribute reduction length, wherein |C| is the number of conditional attributes and L_r is the number of genes in chromosome r whose value is 1;
the third objective function, target3, is the gene coding weight function, wherein the numerator is the sum of |r_i × (r_i - 1)| over the genes r_i of chromosome r, which is non-zero only for gene values other than 0 and 1, and the denominator is the length of the chromosome;
constructing the fitness objective function F(x) = -ω1 × target1 - ω2 × target2 + ω3 × target3 to reduce the feature attributes;
S32, optimizing with the genetic operators according to the fitness objective function:
calculating the fitness value of the feature attributes according to the fitness objective function and judging whether a termination condition is met; if so, obtaining the reduced feature subset; if not, applying to the feature attributes, in sequence, the genetic operations of remainder stochastic sampling without replacement, uniform crossover and Gaussian mutation, and executing S32 again.
Preferably, the specific step of optimizing the SVM parameter by the cuckoo algorithm in S4 includes:
S41, initialization setting: including the discovery probability P_a, the number of iterations N, the number of bird nests n, the upper and lower limits, the penalty factor c of the SVM and the RBF kernel function parameter σ;
S42, initializing n bird nest positions, calculating the fitness values of all bird nests, and storing the current optimal position and its fitness value;
S43, updating the positions of the bird nests according to the position updating formula, comparing the fitness value of each new position with that of the bird nest at the corresponding position of the previous generation, and keeping the bird nest position with the minimum fitness value, together with that value, as the optimal bird nest;
S44, generating a random number r and comparing it with the given probability P_a of discarding a bad bird nest: if r > P_a, the bad bird nest is discarded; if not, the bird nest is updated;
S45, recalculating the fitness values of the bird nests, replacing each bird nest that has a high fitness value with the bird nest that has a low fitness value, and generating a group of new bird nest positions;
S46, judging whether the iterations are finished; if so, stopping the search to obtain the global optimal fitness value and the corresponding optimal bird nest; if not, jumping to S43 to continue the optimization;
S47, constructing an SVM prediction model according to the optimal parameters c and σ corresponding to the optimal bird nest position.
Compared with the prior art, the high-dimensional feature selection algorithm based on the Bayesian rough set and the Cuckoo algorithm has the advantages that:
The attribute importance is analyzed from the perspective of the global relative gain function, and a fitness function is constructed as the weighted sum of the global relative gain function, the attribute reduction length and the gene coding weight function. An optimal feature subset is generated through genetic operations such as selection, crossover and mutation, so the feature dimension is reduced without reducing the classification accuracy, the constraint of manual parameter setting is removed, and the time consumption is reduced to a great extent. The CS is used for global optimization of the Support Vector Machine (SVM) parameters. The global search of the CS algorithm uses step lengths with infinite mean and variance, so it can explore the search space more effectively than an algorithm using a standard Gaussian process; this widens the search range, enriches the diversity of the population, and gives the method good robustness and strong global search capability. Combining the BRS with an intelligent optimization algorithm for feature selection and using the CS to optimize the parameters of the SVM is therefore feasible and effective.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a high-dimensional feature selection algorithm based on a Bayesian rough set and a Cuckoo algorithm according to the present invention;
FIG. 2 is a comparison graph before and after segmenting an ROI by using an Otsu algorithm according to an embodiment of the present invention;
FIG. 3 is a flowchart of optimal feature subset generation according to an embodiment of the present invention;
FIG. 4 is a flow chart of a CS optimization SVM parameter provided by an embodiment of the present invention;
FIG. 5 is a diagram illustrating a change situation of a fitness function in a process of generating a subset of certain features according to an embodiment of the present invention;
FIG. 6 is a comparison diagram of the results of different classification algorithms based on the BRSGA selection algorithm according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a high-dimensional feature selection algorithm based on a Bayesian rough set and a cuckoo algorithm, wherein a flow chart is shown in figure 1 and comprises the steps of data acquisition, data preprocessing, image segmentation, feature extraction, attribute reduction, classification recognition and the like. And finally, classifying and identifying the lung tumor CT image by adopting a two-stage optimized high-dimensional feature selection algorithm. The specific implementation process is as follows:
S1, acquiring a lung tumor image, and performing target contour segmentation to obtain a segmented ROI image.
Before the target contour segmentation, the method specifically comprises the following image acquisition and preprocessing processes:
3000 lung tumor CT images with a definite diagnostic conclusion are collected, comprehensively considering the popularity of common lung tumor imaging examination methods, their acceptance by doctors and patients and their cost, and avoiding the influence of factors such as the specification and model of the examination equipment and the environment on the lung tumor CT images; 1500 of the tumors are malignant and 1500 are benign. Sub-images with strong distinguishing capability are cropped from the obtained images as ROI regions, and all ROI regions are normalized into experimental images of 50 × 50 pixels.
Target contour segmentation process:
Segmenting the target contour (including the lesion contour) from the cropped ROI region plays a crucial role in various clinical applications. However, current clinical practice still relies on manual labeling by radiologists, and a large amount of intensive manual work is error-prone, so accurate segmentation by computer has great practical value. This embodiment adopts the Otsu threshold segmentation method, whose core idea is to divide the image into two groups; the value at which the between-group variance of the two groups reaches its maximum is the optimal segmentation threshold. The basic principle of the Otsu algorithm is as follows:
Assume that the size of an image is m × n and that the image has l gray levels, so the gray level range is [0, l-1]. Let $n_i$ denote the number of occurrences of gray level i; the frequency of gray level i among all pixels is then $p_i = n_i / (m \times n)$. Suppose that the pixels whose gray level does not exceed q constitute class $A_1$, i.e. $A_1$ has the gray level range [0, q], and that the pixels in the gray level range [q+1, l-1] constitute class $A_2$. If $P_1(q)$ and $P_2(q)$ denote the probabilities of occurrence of classes $A_1$ and $A_2$, and $u_1(q)$ and $u_2(q)$ denote the average gray levels of classes $A_1$ and $A_2$, then:
$P_1(q) = \sum_{i=0}^{q} p_i, \quad P_2(q) = \sum_{i=q+1}^{l-1} p_i = 1 - P_1(q)$
$u_1(q) = \frac{1}{P_1(q)} \sum_{i=0}^{q} i \, p_i, \quad u_2(q) = \frac{1}{P_2(q)} \sum_{i=q+1}^{l-1} i \, p_i$
The between-group variance $\sigma_b^2(q)$ of the image is expressed as:
$\sigma_b^2(q) = P_1(q) \, P_2(q) \, [u_1(q) - u_2(q)]^2$
When the between-group variance of the two groups reaches its maximum, the value obtained is the optimal segmentation threshold, i.e. the pixel segmentation threshold is:
$q^* = \arg\max_{0 \le q \le l-1} \sigma_b^2(q)$
The ROI region is segmented with the Otsu algorithm. FIG. 2 gives an example of ROI images before and after segmentation, where FIG. 2(a) is the ROI image before segmentation and FIG. 2(b) is the ROI image after segmentation.
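The threshold selection described above can be sketched in a few lines of Python. This is a minimal illustration and not the patent's implementation: the function names `otsu_threshold` and `binarize`, the 4 × 4 toy image and the default of 256 gray levels are all assumptions made for the example.

```python
def otsu_threshold(image, levels=256):
    """Return the gray level q that maximizes the between-class variance."""
    # Histogram: counts[i] is the number of pixels with gray level i.
    counts = [0] * levels
    total = 0
    for row in image:
        for g in row:
            counts[g] += 1
            total += 1
    p = [c / total for c in counts]            # frequency p_i of each level
    mean_all = sum(i * pi for i, pi in enumerate(p))

    best_q, best_var = 0, -1.0
    p1 = 0.0   # probability of class A1 (gray levels 0..q)
    m1 = 0.0   # cumulative first moment of class A1
    for q in range(levels - 1):
        p1 += p[q]
        m1 += q * p[q]
        p2 = 1.0 - p1
        if p1 <= 0.0 or p2 <= 0.0:
            continue                           # one class is empty, skip
        u1 = m1 / p1                           # mean gray level of A1
        u2 = (mean_all - m1) / p2              # mean gray level of A2
        var_b = p1 * p2 * (u1 - u2) ** 2       # between-class variance
        if var_b > best_var:
            best_q, best_var = q, var_b
    return best_q


def binarize(image, q):
    """Label pixels above the threshold as 1 (class A2) and the rest as 0."""
    return [[1 if g > q else 0 for g in row] for row in image]


# Toy ROI (assumed data): a dark background with a bright region.
roi = [
    [10, 12, 11, 200],
    [13, 10, 210, 205],
    [11, 198, 202, 207],
    [12, 11, 10, 201],
]
q = otsu_threshold(roi)
mask = binarize(roi, q)
```

The running sums `p1` and `m1` avoid recomputing the class statistics for every candidate threshold, so the search is linear in the number of gray levels.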
S2, extracting high-dimensional feature components of the segmented ROI image; the high-dimensional feature components comprise 104-dimensional features, including shape features, texture features and gray features, and the specific features are shown in Table 1. A decision information table containing feature attributes is constructed based on the high-dimensional feature components, wherein the feature attributes correspond to the features of different dimensions in the high-dimensional feature components. The constructed decision information table has a size of 3000 × 105 and is discretized with the fuzzy C-means clustering algorithm. After discretization, a numerical label representing whether the tumor is benign or malignant is assigned to each sample as the decision attribute, which is located in the last column of the decision information table.
TABLE 1 pulmonary tumor CT image feature set
S3, based on a Bayesian rough set model, constructing a fitness objective function as the weighted sum of a global relative gain function, an attribute reduction length and a gene coding weight function, and reducing the feature attributes with a combination of genetic operators to obtain a reduced feature subset. This embodiment combines the BRS and GA algorithms for attribute reduction, which reduces the time and space complexity of the classifier and improves the classification performance.
As shown in fig. 3, the reduction specifically includes the following steps:
S31, establishing a BRS model:
1) setting parameters: a chromosome is a sequence of 0s and 1s whose length is equal to the number N of conditional attributes; the crossover probability is P_c, the mutation probability is P_m, the maximum number of iterations is K = 150, the initial population size is M = 20, and the fitness function is F(x);
2) encoding: binary coding is used, the length of the code is equal to the number of conditional attributes, 0 represents that the feature is not selected, and 1 represents that the feature is selected;
3) generation of the initial population of feature attributes: M chromosome strings whose length is equal to the number of conditional attributes are randomly generated to form the initial population;
4) genetic operators: the genetic operators comprise selection, crossover and mutation operators, combined as remainder stochastic sampling without replacement, uniform crossover and Gaussian mutation.
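The encoding and two of the operators listed above can be sketched as follows. This is an illustrative sketch, not the patented procedure: the function names, the values `pc = 0.7` and `pm = 0.01`, the noise scale `sigma` and the rounding of the Gaussian noise are assumptions, and the selection operator is not shown.

```python
import random


def init_population(m, n_attrs, rng):
    """M random binary chromosomes; 1 = feature selected, 0 = not selected."""
    return [[rng.randint(0, 1) for _ in range(n_attrs)] for _ in range(m)]


def uniform_crossover(a, b, pc, rng):
    """With probability pc, swap each gene between the parents with chance 0.5."""
    a, b = a[:], b[:]
    if rng.random() < pc:
        for i in range(len(a)):
            if rng.random() < 0.5:
                a[i], b[i] = b[i], a[i]
    return a, b


def gaussian_mutation(chrom, pm, rng, sigma=1.0):
    """With probability pm per gene, add rounded Gaussian noise to the gene.

    The mutated gene may fall outside {0, 1}; such values are what a
    penalty term on the gene coding is meant to discourage.
    """
    return [
        g + round(rng.gauss(0.0, sigma)) if rng.random() < pm else g
        for g in chrom
    ]


rng = random.Random(1)
population = init_population(20, 104, rng)      # M = 20, 104 conditional attributes
child_a, child_b = uniform_crossover(population[0], population[1], 0.7, rng)
mutant = gaussian_mutation(child_a, 0.01, rng)
```

Uniform crossover only exchanges genes between the two parents, so every offspring gene is still 0 or 1; only the Gaussian mutation step can produce out-of-range values.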
S32, constructing a fitness objective function: the global relative gain function, the attribute reduction length and the gene coding weight function are considered together, and a fitness function is constructed as their weighted sum to drive the optimization process of the genetic algorithm and find the feature subset with the most distinguishing capability.
The first objective function, target1, is the global relative gain function of the equivalence relation E with respect to the feature attribute D; the global relative gain is adopted to measure the attribute importance of the information system S;
in the BRS model, the attribute reduction algorithm process taking the global relative gain as heuristic information is as follows:
S321: calculating the core attribute set γ of the condition attributes in the information system S = (U, A, V, f), and calculating the dependency R_C(D) of the decision attribute on the condition attributes;
S322: calculating the dependency R_γ(D) of the decision attribute on the core attributes; if R_γ(D) = R_C(D), going to S324 to obtain the R reduction; otherwise, letting C = C - γ and computing the corresponding value for each attribute C_i in C, the values constituting a set M;
S323: sorting the elements in the set M in ascending order and adding the attribute with the maximum value to the set γ, namely γ = γ ∪ C_i, then going to S322 to continue the calculation;
S324: the result is an R reduction of the BRS.
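The control flow of steps S321 to S324 is a greedy forward search, which can be sketched as below. Important assumption: the text does not reproduce the global relative gain formula, so this sketch substitutes the classical rough-set dependency (the fraction of objects whose equivalence class is decision-consistent) as a stand-in measure. The function names and the toy decision table are also hypothetical.

```python
def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = {}
    for idx, row in enumerate(rows):
        blocks.setdefault(tuple(row[a] for a in attrs), []).append(idx)
    return list(blocks.values())


def dependency(rows, labels, attrs):
    """Stand-in measure: share of objects lying in decision-consistent blocks."""
    consistent = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            consistent += len(block)
    return consistent / len(rows)


def greedy_reduct(rows, labels, core=()):
    """Grow the attribute set from the core until it matches the full set."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)      # value for all attributes
    gamma = list(core)
    while dependency(rows, labels, gamma) < target:   # stopping test of S322
        remaining = [a for a in all_attrs if a not in gamma]
        # Score every remaining attribute by the measure after adding it.
        scores = {a: dependency(rows, labels, gamma + [a]) for a in remaining}
        best = max(remaining, key=lambda a: scores[a])
        gamma.append(best)                            # S323: add the best one
    return sorted(gamma)                              # S324: the reduct


# Toy discretized decision table (assumed data): the label equals attribute 0.
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1), (0, 1, 1), (1, 0, 1)]
labels = [0, 0, 1, 1, 0, 1]
reduct = greedy_reduct(rows, labels)
```

On this toy table the search stops after one step, because attribute 0 alone already separates the two decision classes.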
In the second objective function, target2 (the attribute reduction length), |C| is the number of conditional attributes and L_r is the number of genes in chromosome r whose value is 1.
The value of a gene position can only be 0 or 1; any other value is penalized. Because genes greater than 1 or less than 0 can appear in the chromosome, a gene coding weight function is constructed as target3 for this situation. The numerator is the sum of the absolute products r_i × (r_i - 1) over all gene positions: if gene position i is 0, then 0 × (0 - 1) = 0, and if it is 1, then 1 × (1 - 1) = 0, so only values other than 0 and 1 contribute to the sum. The denominator is the length of the chromosome.
Example: given chromosome r = [0 1 -2 3 1], (r - 1) = [-1 0 -3 2 0], then:
r × (r - 1) = [0 1 -2 3 1] × [-1 0 -3 2 0] = [0 0 6 6 0]
Σ abs(r × (r - 1)) = 12, and the chromosome length is 5, so target3 = 12/5 = 2.4.
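The worked example can be checked with a short Python function. The name `target3` follows the text; the function signature itself is an assumption made for illustration.

```python
def target3(chromosome):
    """Gene coding penalty: sum of |g * (g - 1)| divided by chromosome length.

    Genes equal to 0 or 1 contribute nothing, so a valid binary chromosome
    has a penalty of exactly 0.
    """
    return sum(abs(g * (g - 1)) for g in chromosome) / len(chromosome)


r = [0, 1, -2, 3, 1]
penalty = target3(r)              # (0 + 0 + 6 + 6 + 0) / 5
valid = target3([0, 1, 1, 0])     # no out-of-range genes
```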
The fitness objective function F(x) = -ω1 × target1 - ω2 × target2 + ω3 × target3 is then constructed to reduce the feature attributes.
S32, optimizing the genetic operator according to the fitness objective function:
The fitness value of the feature attributes is calculated according to the fitness objective function, and it is judged whether the termination condition, which is a preset fixed value, is met. If so, the reduced feature subset is obtained; if not, the genetic operations of remainder stochastic sampling without replacement, uniform crossover and Gaussian mutation are applied to the feature attributes in sequence, and S32 is executed again.
S4, optimizing the penalty factor and the kernel function of the SVM by using a cuckoo algorithm, and inputting the reduced feature subset into the optimized SVM to obtain a classification recognition result.
Referring to fig. 4, the specific steps of the cuckoo algorithm for optimizing SVM parameters include:
S41, initialization setting: including the discovery probability P_a, the number of iterations N, the number of bird nests n, the upper and lower limits, the penalty factor c of the SVM and the RBF kernel function parameter σ;
S42, initializing n bird nest positions, calculating the fitness values of all bird nests, and storing the current optimal position and its fitness value. Each bird nest is a feasible solution, and its fitness value is obtained by substituting the solution into the objective function. The n feasible solutions obtained by initialization are evaluated in this way and the optimal value is kept (the maximum or the minimum can be selected according to specific requirements), which gives the optimal bird nest position and its fitness value;
S43, updating the positions of the bird nests according to the position updating formula, comparing the fitness value of each new position with that of the bird nest at the corresponding position of the previous generation, and keeping the bird nest position with the minimum fitness value, together with that value, as the optimal bird nest;
S44, generating a random number r with a Gaussian random function and comparing it with the given probability P_a of discarding a bad bird nest: if r > P_a, the bad bird nest is discarded; if not, the bird nest is updated;
S45, recalculating the fitness values of the bird nests, replacing each bird nest that has a high fitness value with the bird nest that has a low fitness value, and generating a group of new bird nest positions;
S46, judging whether the iterations are finished; if so, stopping the search to obtain the global optimal fitness value and the corresponding optimal bird nest; if not, jumping to S43 to continue the optimization;
S47, constructing an SVM prediction model according to the optimal parameters c and σ corresponding to the optimal bird nest position.
The search path of the cuckoo search (CS) algorithm is a Lévy flight. In this form of walk, short-distance exploration alternates with occasional long-distance jumps, which expands the search range, increases the diversity of the population and helps to avoid local optima. The relevant definitions are as follows:
The formula with which the CS algorithm searches for the position of the bird nest is as follows:
$x_i^{(t+1)} = x_i^{(t)} + \alpha \oplus L(\lambda), \quad i = 1, 2, \ldots, n$
In the formula, $x_i^{(t)}$ is the position of the i-th bird nest in the t-th generation, and α is the step control quantity, generally taken as 0.1, which determines the random search range:
$\alpha = \alpha_0 \, (x_i^{(t)} - x_{best})$
wherein $\alpha_0$ is a constant ($\alpha_0 = 0.01$) and $x_{best}$ represents the current optimal solution.
In the formula for searching the position of the bird nest, ⊕ denotes the point-to-point product and L(λ) is the random search path, with $\text{Lévy} \sim u = t^{-\lambda}$, $1 < \lambda \le 3$, obeying the Lévy distribution; the corresponding position updating formula is then as follows:
wherein μ and ν both obey a normal distribution:
In the formula, Γ is the standard Gamma function.
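The search loop of S41 to S47 with Lévy-flight moves can be sketched as follows. Several points are assumptions rather than details from the text: the Lévy step is drawn with Mantegna's algorithm using an exponent of 1.5, the abandon step replaces a nest with a fresh random one only when the new fitness is lower, and the objective `surrogate_error` is a hypothetical stand-in for the error of an SVM trained with parameters (c, σ). The bounds, nest count and iteration count are also illustrative.

```python
import math
import random


def levy_step(rng, beta=1.5):
    """Draw one heavy-tailed step with Mantegna's algorithm (an assumption;
    beta = 1.5 is a common choice, not a value taken from the text)."""
    sigma_mu = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    mu = rng.gauss(0.0, sigma_mu)
    nu = rng.gauss(0.0, 1.0)
    while nu == 0.0:                      # avoid division by zero
        nu = rng.gauss(0.0, 1.0)
    return mu / abs(nu) ** (1 / beta)


def cuckoo_search(fitness, bounds, n=15, iters=50, pa=0.25, alpha0=0.01, seed=0):
    """Minimize fitness(position) over the box given by bounds."""
    rng = random.Random(seed)

    def random_nest():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def clip(x, lo, hi):
        return max(lo, min(hi, x))

    # S42: initialize n nests and remember the best one.
    nests = [random_nest() for _ in range(n)]
    scores = [fitness(x) for x in nests]
    best_i = min(range(n), key=scores.__getitem__)
    best, best_score = nests[best_i][:], scores[best_i]

    for _ in range(iters):                # S46: fixed iteration budget
        for i in range(n):
            # S43: Levy-flight move scaled by the distance to the best nest.
            cand = [
                clip(x + alpha0 * levy_step(rng) * (x - b), lo, hi)
                for x, b, (lo, hi) in zip(nests[i], best, bounds)
            ]
            s = fitness(cand)
            if s < scores[i]:
                nests[i], scores[i] = cand, s
            # S44/S45 (one reading): with probability pa, try a fresh random
            # nest and keep it only if its fitness value is lower.
            if rng.random() < pa:
                fresh = random_nest()
                s = fitness(fresh)
                if s < scores[i]:
                    nests[i], scores[i] = fresh, s
        best_i = min(range(n), key=scores.__getitem__)
        if scores[best_i] < best_score:
            best, best_score = nests[best_i][:], scores[best_i]
    return best, best_score


def surrogate_error(params):
    """Hypothetical stand-in for the SVM error at (c, sigma)."""
    c, sigma = params
    return (math.log10(c) - 1.0) ** 2 + (math.log10(sigma) + 1.0) ** 2


bounds = [(0.01, 100.0), (0.01, 100.0)]   # assumed ranges for c and sigma
(best_c, best_sigma), best_err = cuckoo_search(surrogate_error, bounds)
```

In a real run, `surrogate_error` would be replaced by a function that trains the SVM with the candidate (c, σ) on the reduced feature subset and returns its validation error, which corresponds to S47.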
The performance evaluation of a medical image recognition result usually uses two indexes, sensitivity and specificity, but these two indexes alone can hardly describe the overall performance of the classifier. Therefore, in this embodiment, evaluation indexes are set separately for the two stages of feature selection and classification recognition. The feature selection stage uses reduction length, attribute importance and time. The classification recognition stage uses Accuracy, Sensitivity, Specificity, F value, Matthews Correlation Coefficient (MCC), balanced F score (F1 Score), Youden index (YI) and Time; the calculation formulas are as follows:
YI=Sensitivity+Specificity-1
wherein TP represents the number of malignant tumor target contours that are correctly identified; FP represents the number of benign tumor target contours that are erroneously identified as malignant; TN represents the number of benign tumor target contours that are correctly identified; FN represents the number of malignant tumor target contours that are erroneously identified as benign.
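Most of the indexes named above have standard definitions in terms of TP, FP, TN and FN, with malignant as the positive class. The sketch below uses those textbook definitions, which are assumed to match the formulas lost from this text; the "F value" is left out because its definition cannot be recovered here. The counts in the example are hypothetical.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix indexes, with malignant as the positive class."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)        # balanced F score
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Accuracy": accuracy,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "F1Score": f1,
        "MCC": mcc,
        "YI": sensitivity + specificity - 1,  # Youden index
    }


# Hypothetical counts for one test fold of 300 malignant and 300 benign cases.
metrics = classification_metrics(tp=270, fp=15, tn=285, fn=30)
```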
In order to verify the feasibility and effectiveness of the technical scheme, two groups of comparison experiments are designed. Experiment one verifies the feasibility and effectiveness of the BRSGA feature selection algorithm: the SVM parameters are always optimized with the GS algorithm, and BRSGA is compared with VPRSGA under different β values in both stages. Experiment two fixes the feature selection algorithm on the basis of experiment one and compares CS-SVM, GS-SVM, GA-SVM and PSO-SVM in the classification stage.
Experiment one: experimental result comparison based on same classification algorithm and different feature selection algorithms
The classification and recognition algorithm is fixed as GS-SVM, and BRSGA is compared with the VPRSGA algorithm under different parameters in the two stages of feature selection and classification recognition, where the VPRS parameter β is set to 0.1, 0.2, 0.3 and 0.4 respectively. Specific results are shown in Table 3, Fig. 5 and Table 4. In the optimal feature subset generation stage, reduction is performed 5 times for each parameter setting, the reduction length, attribute importance and time are recorded each time, and the average of each index over the 5 reductions is taken as the experimental result for that parameter. In the classification and recognition stage, five-fold cross-validation is performed with LIBSVM on each reduction result of each parameter (that is, 300 cases of benign tumors and 300 cases of malignant tumors are selected as the test set each time, and the remaining data are used as the training set), giving five groups of recognition results for each reduction: accuracy, sensitivity, specificity, F value, MCC, F1Score, YI and time. The average of each index over the five folds is taken as the classification result for that reduction, and the average over the five reductions is taken as the reduction and classification result for that parameter setting.
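The protocol described above can be sketched as follows. This is a minimal illustration using only the standard library, not the LIBSVM tooling named in the text; the function names are hypothetical, and the usage example assumes 1,500 cases per class, which is what 300 test cases per class in each of five folds implies.

```python
import random
import statistics


def stratified_folds(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) with equal class proportions per fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class evenly across the folds
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j in range(k) if j != i for idx in folds[j])
        yield train, test


def summarize(scores_per_reduction):
    """Average over folds within each reduction, then over the reductions."""
    return statistics.fmean(statistics.fmean(folds) for folds in scores_per_reduction)


# Hypothetical data set: 0 = benign, 1 = malignant.
labels = [0] * 1500 + [1] * 1500
splits = list(stratified_folds(labels))
```

Each element of `splits` holds 2,400 training indices and 600 test indices, and `summarize` expects one list of fold scores per reduction.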
TABLE 3 comparison of the results of attribute reduction for different feature selection algorithms
As can be seen from Table 3, the reduction length of the present invention is 7.8 dimensions, obtained without manually setting the classification error rate β. This value lies within the range of reduction lengths produced by the VPRSGA algorithm under the different β values: it is significantly shorter than that for β = 0.1 and slightly longer than those for β = 0.2 and β = 0.4. The fitness value is only slightly higher than that of the VPRSGA algorithm with β = 0.4. The attribute importance is 0.0002 lower than that of VPRSGA with β = 0.4 and higher than that of the VPRSGA models under the other parameters. The reduction time is higher than that of the VPRSGA model with β = 0.2 but 16.54 to 419.35 seconds lower than that of the VPRSGA models with the other parameter values; compared with β = 0.1, the time is shortened by a factor of 2.7. As can be seen from Fig. 5, the algorithm herein shows no premature convergence in the reduction stage, whereas the VPRSGA algorithm shows premature convergence to varying degrees under the different β values; for example, in Fig. 5b, one reduction result of the VPRSGA algorithm with β = 0.1 shows fairly severe premature convergence. Therefore, compared with the VPRSGA algorithm, the present invention is not only free from the constraint of manual parameter setting but also achieves a relatively ideal effect.
TABLE 4 comparison of different feature selection algorithm classification recognition results
As can be seen from Table 4, compared with the VPRSGA algorithm with β = 0.1, the accuracy, specificity, MCC, F1Score and YI of the BRSGA algorithm are reduced by 0.07%, 0.43%, 0.0015%, 0.0006% and 0.0013% respectively, but the sensitivity is improved by 0.3%, and the classification time of the VPRSGA algorithm with β = 0.1 is 3.4 times that of the BRSGA algorithm. Although the accuracy of the BRSGA algorithm is reduced within an acceptable range, its time consumption is reduced to a great extent, so when accuracy and time consumption are considered together, its overall performance is better than that of the VPRSGA algorithm with β = 0.1. Compared with the VPRSGA algorithm with β = 0.2, 0.3 and 0.4, the BRSGA algorithm takes less time and improves the other indexes to different degrees, with the improvement being most evident relative to β = 0.2. The classification results thus show that, compared with the VPRSGA model, the BRSGA model is not bound by parameters and improves classification performance.
As experiment one shows, when the classification algorithm is fixed (that is, the SVM parameters are optimized with the grid search algorithm), the BRSGA feature selection algorithm is free from the constraint of manual parameter setting compared with the VPRSGA algorithm and achieves good results in both the attribute reduction and classification stages. BRSGA is therefore used as the fixed feature selection algorithm when verifying the effectiveness of the CS algorithm for optimizing the SVM parameters.
Experiment two: experimental result comparison of different classification algorithms based on same feature selection algorithm
The optimal feature subset generation algorithm is fixed as BRSGA, the SVM parameters are optimized with the CS algorithm, and the result is compared with GS-SVM, GA-SVM and PSO-SVM. The results of the 5 reductions obtained by the BRSGA algorithm in experiment one are classified and recognized. For each reduction, five-fold cross-validation yields classification results including accuracy, sensitivity, specificity, F value, MCC, F1Score, YI and time; the average of each index over the five folds is taken as the classification result for that reduction, and the average over the five reductions is taken as the final result of the classification model. To determine quantitatively whether the differences in recognition accuracy between the algorithm herein and the comparison algorithms are statistically significant, a paired t test is used for hypothesis testing. The test is based on the five comprehensive indexes accuracy, F value, MCC, F1Score and YI, and the significance level is set to p < 0.05. The null hypothesis is that the difference between the mean values of the same evaluation index for the present invention and the comparison algorithm is 0. The mean and standard deviation of the classification results over the five reductions are given for each evaluation index in Table 5, and the results of the five reductions for each index are plotted as line graphs in Fig. 6.
TABLE 5 comparison of results of different classification algorithms based on BRSGA selection algorithm
Marked results differ significantly from the corresponding index of the algorithm herein (CS-SVM) at the 0.05 significance level.
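The paired t test over five reductions can be sketched as below. This is an illustration under stated assumptions, not the authors' procedure: the function names and the sample values are hypothetical, and instead of computing a p value the statistic is compared with 2.776, the two-sided critical value of Student's t for 4 degrees of freedom at the 0.05 level.

```python
import math
import statistics

T_CRIT_DF4 = 2.776  # two-sided 0.05 critical value of Student's t, 4 degrees of freedom


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def significant_over_five_reductions(a, b):
    """Reject the null hypothesis of zero mean difference for five pairs."""
    if len(a) != 5:
        raise ValueError("the critical value used here assumes exactly five pairs")
    return abs(paired_t_statistic(a, b)) > T_CRIT_DF4


# Hypothetical accuracies over five reductions, for illustration only.
cs_svm = [0.95, 0.96, 0.94, 0.97, 0.95]
gs_svm = [0.93, 0.94, 0.93, 0.94, 0.93]
t_value = paired_t_statistic(cs_svm, gs_svm)
```

Pairing by reduction matters here because both classifiers are evaluated on the same five reduced feature subsets, so the between-reduction variation cancels out in the differences.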
As can be seen from Table 5, the quantitative analysis shows that the present invention is superior to the other three comparison algorithms on the five evaluation indexes, and the differences are statistically significant. Taking Fig. 6 as a whole, the classification results after the five BRSGA reductions fluctuate: most classification indexes of the 4th reduction are relatively good, while all indexes of the 3rd reduction except classification time are relatively low. Because the initial population of the genetic algorithm is generated randomly during optimal feature subset generation, the reduction results differ from run to run. For this reason, reduction is performed 5 times for each parameter, classification uses five-fold cross-validation, and the averages of the indexes of the reduction and classification results are used to evaluate the overall performance of the model, which effectively avoids a one-sided evaluation.
Fig. 6 is a comparison diagram of the classification results in the classification and recognition stage: (a) accuracy; (b) classification time; (c) F value; (d) sensitivity; (e) specificity; (f) MCC; (g) F1Score; (h) Youden index. Over the five reductions, the CS-SVM algorithm is higher than the GS-SVM, GA-SVM and PSO-SVM algorithms on 6 evaluation indexes, namely accuracy, F value, sensitivity, MCC, F1Score and Youden index, and its classification time is slightly higher than that of GS-SVM. The classification time of the PSO-SVM algorithm is far higher than that of the other three algorithms. In the 3rd reduction, its other 6 evaluation indexes, except sensitivity, are higher than those of the GS-SVM and GA-SVM algorithms; in the other 4 reductions, all of its indexes are lower than those of the GS-SVM, GA-SVM and CS-SVM algorithms. As can be seen from Figs. 6a, c, f, g and h, the CS-SVM algorithm is higher than GS-SVM, GA-SVM and PSO-SVM on all comprehensive evaluation indexes in the five reductions, which indicates a certain robustness and a high value for wider application.
The GS-SVM classification time is lower than that of CS-SVM for the following reasons. First, because medical image data are difficult to acquire, the test set contains only 600 cases, and at this scale the time cost of the GS algorithm is lower than that of the CS algorithm. Real clinical data, however, are massive and grow sharply every day, even exponentially, and as the number of samples increases, the time cost of the GS algorithm grows greatly, so it cannot meet the requirements of clinical application. Second, the search range of the GS algorithm is set empirically, which introduces a degree of arbitrariness, so the optimal parameters may not be obtained. The CS algorithm is a swarm intelligence search algorithm with both local and global search capabilities; it widens the search area, enriches the diversity of the population, has good robustness, and, compared with the GS algorithm, effectively avoids the arbitrariness caused by empirical settings.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (4)
1. A high-dimensional feature selection algorithm based on a Bayesian rough set and a cuckoo algorithm is characterized by comprising the following steps:
S1, obtaining a lung tumor image, and performing target contour segmentation to obtain a segmented ROI image;
S2, extracting high-dimensional feature components of the segmented ROI image, and constructing a decision information table containing feature attributes based on the high-dimensional feature components, wherein the feature attributes correspond to features of different dimensions in the high-dimensional feature components;
S3, based on a Bayesian rough set model, constructing a fitness objective function as the weighted sum of a global relative gain function, an attribute reduction length and a gene coding weight function, and reducing the feature attributes using a combination of genetic operators to obtain a reduced feature subset;
S4, optimizing the penalty factor and the kernel function parameter of the SVM by using a cuckoo algorithm, and inputting the reduced feature subset into the optimized SVM to obtain a classification recognition result.
2. The high-dimensional feature selection algorithm based on the Bayesian rough set and the cuckoo algorithm as claimed in claim 1, wherein the high-dimensional feature components in S2 comprise shape features, texture features and gray scale features of lung tumor images.
3. The high-dimensional feature selection algorithm based on bayesian rough set and cuckoo algorithm as claimed in claim 1, wherein said S3 specifically comprises the following steps:
S31, constructing a fitness objective function:
the first objective function is the global relative gain function of the equivalence relation E with respect to the feature attribute D, the global relative gain being used to measure the attribute importance of the information system S;
wherein |C| is the number of conditional attributes, and L_r is the number of genes equal to 1 in chromosome r;
wherein, the numerator is the product sum of genes with the length of non-0 and 1, and the denominator is the length of the chromosome;
constructing a fitness objective function F(x) = ω1 × target1 − ω2 × target2 + ω3 × target3 to perform feature attribute reduction on the feature attributes;
S32, optimizing the genetic operator according to the fitness objective function:
calculating the fitness value of the feature attributes according to the fitness objective function and judging whether a termination condition is met; if so, obtaining the reduced feature subset; if not, sequentially subjecting the feature attributes to genetic operations consisting of remainder stochastic selection without replacement, uniform crossover and Gaussian mutation, and executing S32 again.
4. The high-dimensional feature selection algorithm based on the bayesian rough set and the cuckoo algorithm as claimed in claim 1, wherein the step of optimizing the SVM parameters by the cuckoo algorithm in S4 comprises:
S41, initialization: setting the probability P_a, the number of iterations N, the number of bird nests n, the upper and lower bounds, the penalty factor c of the SVM and the RBF kernel function parameter σ;
S42, initializing n bird nest positions, calculating the fitness values of all bird nests, and storing the current optimal position and fitness value;
S43, updating the positions of the bird nests according to the formula, comparing the fitness value of each bird nest with that of the bird nest at the corresponding position in the previous generation, and retaining the position and fitness value of the bird nest with the smallest fitness value as the optimal bird nest;
S44, generating a random number r and comparing it with the given probability P_a of discarding a bad bird nest, whether the bird nest is updated being decided according to whether r > P_a;
S45, recalculating the fitness values of the bird nests, replacing bird nests with a high fitness value by bird nests with a low fitness value, and generating a group of new bird nest positions;
S46, judging whether the number of iterations has been reached; if so, stopping the search to obtain the global optimal fitness value and the corresponding optimal bird nest; if not, jumping to S43 to continue the optimization;
S47, constructing an SVM prediction model according to the optimal parameters c and σ corresponding to the optimal bird nest position.
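The search loop of S41 to S46 can be sketched as follows. This is a minimal sketch under several assumptions and not the claimed implementation: the Lévy step uses Mantegna's algorithm scaled by the width of each parameter range, a nest is rebuilt when r > P_a as in common cuckoo search implementations, and `toy_error` is a hypothetical stand-in for the cross-validated SVM error that would serve as the fitness of a (c, σ) pair.

```python
import math
import random


def levy(beta, rng):
    """One Lévy-distributed step length drawn with Mantegna's algorithm."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    return rng.gauss(0.0, sigma) / abs(rng.gauss(0.0, 1.0)) ** (1 / beta)


def clip(pos, bounds):
    """Keep every coordinate inside its [lower, upper] bound."""
    return [min(max(x, lo), hi) for x, (lo, hi) in zip(pos, bounds)]


def cuckoo_search(fitness, bounds, n=15, iters=100, pa=0.25,
                  alpha=0.01, beta=1.5, seed=0):
    """Minimize `fitness` over the box `bounds`; returns (best_nest, best_score)."""
    rng = random.Random(seed)
    # S42: random initial nests and their fitness values.
    nests = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]
    scores = [fitness(p) for p in nests]
    for _ in range(iters):  # S46: fixed iteration budget
        for i in range(n):
            # S43: Lévy-flight candidate; keep it only if its fitness is lower.
            cand = clip([x + alpha * (hi - lo) * levy(beta, rng)
                         for x, (lo, hi) in zip(nests[i], bounds)], bounds)
            score = fitness(cand)
            if score < scores[i]:
                nests[i], scores[i] = cand, score
        for i in range(n):
            # S44-S45: when r > pa, try a rebuilt nest from a random walk
            # between two randomly chosen nests (assumed rule, see lead-in).
            if rng.random() > pa:
                j, k = rng.randrange(n), rng.randrange(n)
                step = rng.random()
                cand = clip([x + step * (a - b)
                             for x, a, b in zip(nests[i], nests[j], nests[k])], bounds)
                score = fitness(cand)
                if score < scores[i]:
                    nests[i], scores[i] = cand, score
    best = min(range(n), key=scores.__getitem__)
    return nests[best], scores[best]


def toy_error(params):
    """Hypothetical smooth objective standing in for cross-validated SVM error."""
    c, sigma = params
    return (c - 10.0) ** 2 / 100.0 + (sigma - 0.5) ** 2


# The bounds for (c, sigma) are illustrative values, not taken from the patent.
best_params, best_score = cuckoo_search(toy_error, [(0.01, 100.0), (0.01, 10.0)])
```

For S47, the returned pair would be passed to the SVM trainer as the penalty factor and the RBF kernel parameter. Because candidates replace a nest only when they lower its fitness, the best score never gets worse from one iteration to the next.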
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010322570.8A CN111583194B (en) | 2020-04-22 | 2020-04-22 | High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111583194A true CN111583194A (en) | 2020-08-25 |
CN111583194B CN111583194B (en) | 2022-07-15 |
Family
ID=72111635
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010322570.8A Active CN111583194B (en) | 2020-04-22 | 2020-04-22 | High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111583194B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107784353A (en) * | 2016-08-29 | 2018-03-09 | 普天信息技术有限公司 | A kind of function optimization method based on cuckoo searching algorithm |
US20190253558A1 (en) * | 2018-02-13 | 2019-08-15 | Risto Haukioja | System and method to automatically monitor service level agreement compliance in call centers |
US20190318248A1 (en) * | 2018-04-13 | 2019-10-17 | NEC Laboratories Europe GmbH | Automated feature generation, selection and hyperparameter tuning from structured data for supervised learning problems |
CN109186971A (en) * | 2018-08-06 | 2019-01-11 | 江苏大学 | Hub motor mechanical breakdown inline diagnosis method based on dynamic bayesian network |
CN109325580A (en) * | 2018-09-05 | 2019-02-12 | 南京邮电大学 | A kind of adaptive cuckoo searching method for Services Composition global optimization |
CN109978880A (en) * | 2019-04-08 | 2019-07-05 | 哈尔滨理工大学 | Lung tumors CT image is carried out sentencing method for distinguishing using high dimensional feature selection |
Non-Patent Citations (1)
Title |
---|
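Zhang Feifei, et al.: "High-dimensional feature selection algorithm for lung tumor CT images based on Bayesian rough set", Journal of Biomedical Engineering Research * |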
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113111577A (en) * | 2021-04-01 | 2021-07-13 | 燕山大学 | Cement mill operation index decision method based on multi-target cuckoo search |
CN113111577B (en) * | 2021-04-01 | 2023-05-05 | 燕山大学 | Cement mill operation index decision method based on multi-target cuckoo search |
CN114627964A (en) * | 2021-09-13 | 2022-06-14 | 东北林业大学 | Prediction enhancer based on multi-core learning and intensity classification method and classification equipment thereof |
CN114595713A (en) * | 2022-01-19 | 2022-06-07 | 北京理工大学 | Optimization feature selection-based gas pressure regulating station state monitoring method |
CN118469733A (en) * | 2024-07-15 | 2024-08-09 | 山东乐谷信息科技有限公司 | Safe accounting account book system based on blockchain technology |
CN118672304A (en) * | 2024-08-21 | 2024-09-20 | 四川腾盾科技有限公司 | Fixed wing cluster unmanned aerial vehicle anti-collision method and system based on multi-element game |
Also Published As
Publication number | Publication date |
---|---|
CN111583194B (en) | 2022-07-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||