CN111583194B - High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm - Google Patents
- Publication number
- CN111583194B (application CN202010322570.8A)
- Authority
- CN
- China
- Prior art keywords
- algorithm
- svm
- feature
- bird nest
- fitness
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06T7/0012—Biomedical image inspection
- G06F18/2111—Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
- G06F18/2411—Classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06N3/006—Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
- G06T7/12—Edge-based segmentation
- G06T7/136—Segmentation; Edge detection involving thresholding
- G06T2207/10081—Computed x-ray tomography [CT]
- G06T2207/30096—Tumor; Lesion
Abstract
The invention discloses a high-dimensional feature selection algorithm based on a Bayesian rough set and the cuckoo search algorithm, comprising the following steps: acquiring a lung tumor image and performing target contour segmentation to obtain a segmented ROI image; extracting high-dimensional feature components of the segmented ROI image and constructing a decision information table of feature attributes from those components; reducing the original feature space with the BRSGA algorithm to obtain an optimal feature subset; optimizing the penalty factor and kernel function parameter of the SVM with the CS algorithm; and inputting the reduced feature subset into the optimized SVM to obtain the classification recognition result. The method generates the optimal feature subset through a genetic algorithm combined with the BRS, reducing the feature dimension without reducing classification accuracy, removing the constraint of manual parameter setting, and cutting time consumption. The CS performs global optimization of the SVM parameters, exploring the search space more effectively, enriching population diversity, and providing good robustness and strong global search capability.
Description
Technical Field
The invention relates to the technical field of medical image identification, in particular to a high-dimensional feature selection algorithm based on a Bayesian rough set and a Cuckoo algorithm.
Background
With the development of computer-aided diagnosis (CAD) research, medical image processing techniques have advanced rapidly. However, the multi-modality, gray-level ambiguity and uncertainty of medical images lead to high missed-diagnosis and misdiagnosis rates in single-modality diagnosis. Medical image processing technologies for different modalities have therefore been developed, divided by level into pixel-level, feature-level and decision-level processing. Feature-level processing can compress the amount of information while retaining the important information, and its processing speed is higher. In feature-level processing of medical images, redundancy and correlation among features turn the curse of dimensionality into an NP-hard problem; feature selection is an effective measure against this problem, since it can effectively reduce the dimension of the feature space and lower the time complexity.
The problems in high-dimensional feature selection include how to generate an optimal feature subset, how to evaluate its effect, which classifier to use for evaluation, and how to optimize that classifier's parameters; in recent years experts and scholars have proposed many algorithms for these problems. First, the variable precision rough set (VPRS) was proposed. It effectively overcomes the limitation that the rough set (RS) can only process exactly classifiable data: by introducing a classification error rate β, the lower approximation of the RS is relaxed from "complete inclusion" to "partial inclusion", improving the robustness and generalization ability of results on noisy data sets. The core of VPRS research is the selection of the classification error rate β, and the main research directions cover three aspects. First, ignoring the details of β selection, various extended VPRS models have been proposed, such as the variable precision fuzzy rough set, the variable precision multi-granularity rough set, the generalized VPRS, and the extended VPRS based on the β-tolerance relation and the Bhattacharyya distance. Second, the value of β is obtained through different calculations, for example taking the average inclusion degree as the threshold for selecting the upper and lower approximations. Third, probability formulas are introduced to propose various probabilistic RS models, such as the VPRS, the game rough set, the decision-theoretic rough set, the Bayesian rough set (BRS) and the 0.5-probability rough set. The methods within the probabilistic rough set family are related; their differences lie in how the probability formula is computed and how the parameters are designed.
The BRS introduces the prior probability on the basis of the VPRS and replaces the classification error rate β with it, so no parameter needs to be set manually; this both removes the requirement of completely exact division in the lower approximation of the RS and avoids the influence of the parameter β on the upper and lower approximations in the VPRS. However, much BRS research is still at the theoretical analysis stage, a mature independent model is lacking, and no work has yet combined the BRS with other algorithms to handle high-dimensional feature selection for medical images.
Secondly, classifier performance is the basis for evaluating a high-dimensional feature selection algorithm. The support vector machine (SVM) is a commonly used binary classification algorithm, and the introduction of kernel functions has widened its range of application. Commonly used kernels include the polynomial kernel, the radial basis function (RBF) kernel and the Sigmoid kernel. The polynomial kernel is slow to compute, seriously affects the effect and is rarely used. Compared with the Sigmoid kernel, the RBF kernel has fewer parameters and only requires computing the kernel matrix, so its time complexity is lower. Nevertheless, setting the parameters by hand involves a large workload and a long time, and the parameters finally obtained are not necessarily optimal; parameter selection therefore needs to be cast as an optimization problem.
Therefore, how to provide a high-dimensional feature selection algorithm based on a bayesian rough set and a cuckoo algorithm with low time complexity and better robustness is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a high-dimensional feature selection algorithm based on a bayesian rough set and a cuckoo algorithm, and provides a high-dimensional feature selection algorithm based on BRSGA and CS two-stage optimization by combining BRS, GA, CS and SVM algorithms. In the first stage, the BRSGA algorithm is adopted to reduce the original feature space to obtain an optimal feature subset, in the second stage, the CS algorithm is used to optimize the penalty factor and the kernel function parameter of the SVM, the optimal parameter combination is used to construct a CS-SVM classification model, and the lung tumor image is identified.
In order to achieve the above purpose, the invention provides the following technical scheme:
a high-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm comprises the following steps:
s1, obtaining a lung tumor image, and performing target contour segmentation to obtain a segmented ROI image;
s2, extracting high-dimensional feature components of the segmented ROI image, and constructing a decision information table containing feature attributes based on the high-dimensional feature components, wherein the feature attributes correspond to features of different dimensions in the high-dimensional feature components;
s3, based on a Bayesian rough set model, constructing a fitness objective function as the weighted sum of a global relative gain function, an attribute reduction length and a gene coding weight function, and reducing the feature attributes in combination with the genetic operators to obtain a reduced feature subset;
s4, optimizing the penalty factor and the kernel function parameter of the SVM by using a cuckoo algorithm, and inputting the reduced feature subset into the optimized SVM to obtain a classification recognition result.
Preferably, the high-dimensional feature components in S2 include shape features, texture features and gray scale features of the lung tumor image.
Preferably, the S3 specifically includes the following steps:
s31, constructing a fitness objective function:
the first objective function target1 is the global relative gain function of the equivalence relation E with respect to the feature attribute D; the global relative gain is used to measure the attribute importance of the information system S;
the second objective function target2 is the attribute reduction length, target2 = Lr/|C|, where |C| is the number of conditional attributes and Lr is the number of genes equal to 1 in chromosome r;
the third objective function target3 is a gene coding weight function whose numerator is the sum of the products |ri × (ri - 1)| over genes that are neither 0 nor 1 and whose denominator is the chromosome length;
constructing a fitness objective function F(x) = ω1 × target1 - ω2 × target2 + ω3 × target3 to perform feature attribute reduction on the feature attributes;
s32, optimizing the genetic operator according to the fitness objective function:
calculating the fitness value of each chromosome according to the fitness objective function and judging whether the termination condition is met; if so, the reduced feature subset is obtained; if not, the genetic algorithm operations of remainder stochastic sampling without replacement, uniform crossover and Gaussian mutation are applied to the chromosomes in turn, and S32 is executed again.
Preferably, the specific step of optimizing the SVM parameter by the cuckoo algorithm in S4 includes:
s41, initialization setting: including the discovery probability Pa, the number of iterations N, the number of bird nests n, the upper and lower limits, the penalty factor c of the SVM, and the RBF kernel function parameter σ;
s42, initializing n bird nest positions, calculating the fitness values of all bird nests and storing the current optimal positions and the fitness values;
s43, updating the bird nest positions according to the position update formula, comparing each new bird nest with the bird nest at the corresponding position of the previous generation, and retaining the bird nest position with the smaller fitness value, together with that value, as the optimal bird nest;
s44, generating a random number r and comparing it with the given discovery probability Pa: if r > Pa, the poor bird nest is discarded and a new one is generated; otherwise the bird nest is left unchanged;
s45, recalculating the bird nest fitness values, replacing bird nests with higher fitness values by bird nests with lower fitness values, and generating a group of new bird nest positions;
s46, judging whether the number of iterations has been reached: if so, the search stops and the global optimal fitness value and the corresponding optimal bird nest are obtained; if not, jumping to S43 to continue the optimization;
and S47, constructing an SVM prediction model according to the optimal parameters c and sigma corresponding to the optimal bird nest position.
Compared with the prior art, the high-dimensional feature selection algorithm based on the Bayesian rough set and the Cuckoo algorithm has the advantages that:
the attribute importance degree is analyzed from the perspective of a global relative gain function, an attribute reduction length and the weighting and construction fitness function of a gene coding weight function are combined, an optimal feature subset is generated through genetic operations such as selection, intersection and variation, the feature dimension is reduced on the premise of not reducing the classification accuracy, the constraint of manual parameter setting is eliminated, and the time consumption is reduced to a great extent. The CS is used for carrying out global optimization on Support Vector Machine (SVM) parameters, global search in the CS algorithm has infinite mean value and variance, a search space can be explored more effectively than an algorithm using a standard Gaussian process, the search field is widened, the diversity of population is enriched, and the method has good robustness and strong global search capability. The feature selection is carried out by combining the BRS and an intelligent optimization algorithm, and the parameter of the SVM is optimized by using the CS, so that certain feasibility and effectiveness are achieved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a high-dimensional feature selection algorithm based on Bayesian rough set and Cuckoo algorithm provided by the present invention;
FIG. 2 is a comparison of ROI before and after segmentation by Otsu algorithm according to an embodiment of the present invention;
FIG. 3 is a flowchart of an optimal feature subset generation process provided by an embodiment of the present invention;
FIG. 4 is a flow chart of a CS optimization SVM parameter provided by an embodiment of the present invention;
fig. 5 is a schematic diagram of a change situation of a fitness function in a process of generating a certain feature subset according to an embodiment of the present invention;
fig. 6 is a comparison diagram of results of different classification algorithms based on BRSGA selection algorithm according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a high-dimensional feature selection algorithm based on a Bayesian rough set and a cuckoo algorithm, wherein a flow chart is shown in figure 1 and comprises the steps of data acquisition, data preprocessing, image segmentation, feature extraction, attribute reduction, classification recognition and the like. And finally, classifying and identifying the lung tumor CT image by adopting a two-stage optimized high-dimensional feature selection algorithm. The specific implementation process is as follows:
and S1, acquiring a lung tumor image, and performing target contour segmentation to obtain a segmented ROI image.
Before the target contour segmentation, the method specifically comprises the following image acquisition and preprocessing processes:
3000 lung tumor CT images with a definite diagnostic conclusion were collected, comprehensively considering the popularity of common lung tumor imaging methods, the acceptance and cost for doctors and patients, and avoiding the influence of factors such as the specification, model and environment of the examination equipment; 1500 of the tumors are malignant and 1500 benign. Subimages with strong discriminative power are cropped from the obtained images as ROI regions, and all ROI regions are normalized to experimental images of 50 × 50 pixels.
Target contour segmentation process:
the segmentation of the target contour (including the lesion contour) from the truncated ROI region plays a crucial role in various clinical applications. However, in the current clinical practical application, a manual labeling method for radiologists is still adopted, and a large amount of intensive manual operations are easy to make mistakes, so that the accurate segmentation by using the computer technology has a very great practical value. In the embodiment, an Otsu threshold segmentation method is adopted, and the core idea is to segment an image into two groups, and when the interclass variance between the two groups reaches the maximum, the obtained value is the optimal segmentation threshold. The basic principle of Otsu's algorithm is as follows:
Assume the image has size m × n and l gray levels, so the gray range is [0, l - 1]. Let n_i be the number of occurrences of gray level i; then the frequency of gray level i over all pixels is p_i = n_i/(m × n). Suppose the pixels with gray level not exceeding q form class A1, i.e. A1 has gray range [0, q], and the pixels in the range [q + 1, l - 1] form class A2. Let P1(q) and P2(q) denote the probabilities of classes A1 and A2, and u1(q) and u2(q) their mean gray levels; then:

P1(q) = Σ_{i=0..q} p_i,  P2(q) = 1 - P1(q)
u1(q) = Σ_{i=0..q} i·p_i / P1(q),  u2(q) = Σ_{i=q+1..l-1} i·p_i / P2(q)

The between-class variance σ_b²(q) of the image is:

σ_b²(q) = P1(q) · P2(q) · [u1(q) - u2(q)]²

When the between-class variance of the two groups reaches its maximum, the corresponding value is the optimal segmentation threshold, i.e. the pixel segmentation threshold is:

q* = arg max_{0 ≤ q ≤ l-1} σ_b²(q)
the ROI region is segmented by Otsu, and as shown in fig. 2, an example of ROI images before and after segmentation by Otsu algorithm is given, where fig. 2(a) is ROI image before segmentation and fig. 2(b) is ROI image after segmentation.
S2, extracting high-dimensional feature components of the segmented ROI image. The high-dimensional feature components comprise 104 dimensions of shape, texture and gray-level features; the specific features are listed in Table 1. A decision information table containing the feature attributes is constructed from the high-dimensional feature components, where the feature attributes correspond to the features of different dimensions; the constructed decision table has size 3000 × 105. The decision table is discretized with the fuzzy C-means clustering algorithm, and after discretization each sample is given a numerical label indicating whether the tumor is benign or malignant, i.e. the decision attribute, which occupies the last column of the decision table.
TABLE 1 pulmonary tumor CT image feature set
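As a sketch of the fuzzy C-means discretization applied to the decision table, here is a minimal 1-D version in pure NumPy; the cluster count c = 3, the fuzzifier m = 2 and the quantile seeding are illustrative assumptions for the example, not values from the patent:

```python
import numpy as np

def fcm_1d(x, c=3, m=2.0, iters=50):
    """Minimal 1-D fuzzy C-means used to discretize one continuous feature
    column; centers are seeded from quantiles so the sketch is deterministic."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    centers = np.quantile(x, np.linspace(0.1, 0.9, c)).reshape(-1, 1)
    for _ in range(iters):
        d = np.abs(x - centers.T) + 1e-12              # sample-to-center distances
        u = d ** (-2.0 / (m - 1.0))                    # membership update
        u /= u.sum(axis=1, keepdims=True)
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0).reshape(-1, 1)
    return centers.ravel(), u

# one hypothetical continuous feature with three natural value groups
feature = np.array([0.1, 0.2, 0.15, 5.0, 5.1, 4.9, 10.0, 10.2, 9.8])
centers, u = fcm_1d(feature)
labels = u.argmax(axis=1)   # discrete attribute values for the decision table
```

In the full pipeline each of the 104 feature columns would be discretized this way before attribute reduction.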
S3, based on the Bayesian rough set model, constructing a fitness objective function as the weighted sum of the global relative gain function, the attribute reduction length and the gene coding weight function, and reducing the feature attributes in combination with the genetic operators to obtain a reduced feature subset. This embodiment combines the BRS algorithm with the GA to perform attribute reduction, which lowers the time and space complexity of the classifier and improves the classification performance.
As shown in fig. 3, the reduction specifically includes the following steps:
s31, establishing a BRS model:
1) setting parameters: a chromosome is a sequence of 0s and 1s whose length equals the number N of conditional attributes; the crossover probability is Pc and the mutation probability is Pm; the maximum number of iterations is K = 150; the initial population size is M = 20; and the fitness function is F(x);
2) and (3) encoding: coding in a binary mode, wherein the length of the coding is equal to the number of condition attributes, 0 represents that the characteristic is not selected, and 1 represents that the characteristic is selected;
3) generation of the initial population of feature attributes: randomly generate M chromosome strings whose length equals the number of conditional attributes to form the initial population;
4) genetic operators: the genetic operators comprise selection, crossover and mutation; the combination used here is remainder stochastic sampling without replacement, uniform crossover and Gaussian mutation.
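The three operators can be sketched as follows; the mutation parameters pm and sigma are illustrative assumptions, and note that Gaussian mutation can push genes outside {0, 1}, which is exactly what the gene coding weight function later penalizes:

```python
import numpy as np

def remainder_stochastic_sampling(fitness, rng):
    """Remainder stochastic sampling without replacement: individual i is
    selected floor(e_i) times, e_i = f_i / mean(f); remaining slots are
    filled by Bernoulli trials on the fractional remainders."""
    expected = fitness / fitness.mean()
    counts = np.floor(expected).astype(int)
    frac = expected - counts
    while counts.sum() < len(fitness):
        i = int(rng.integers(len(fitness)))
        if rng.random() < frac[i]:
            counts[i] += 1
            frac[i] = 0.0        # without replacement: each remainder is spent once
    return np.repeat(np.arange(len(fitness)), counts)[: len(fitness)]

def uniform_crossover(p1, p2, rng):
    """Uniform crossover: each gene position is exchanged independently
    with probability 0.5."""
    mask = rng.random(len(p1)) < 0.5
    c1, c2 = p1.copy(), p2.copy()
    c1[mask], c2[mask] = p2[mask], p1[mask]
    return c1, c2

def gaussian_mutation(chrom, pm, rng, sigma=0.5):
    """Gaussian mutation: perturb each gene with probability pm by N(0, sigma)
    noise; the resulting off-{0,1} values are later penalized by target3."""
    chrom = chrom.astype(float).copy()
    mask = rng.random(len(chrom)) < pm
    chrom[mask] += rng.normal(0.0, sigma, int(mask.sum()))
    return chrom

rng = np.random.default_rng(7)
pop_fitness = np.array([1.0, 2.0, 3.0, 4.0])
selected = remainder_stochastic_sampling(pop_fitness, rng)
```

Uniform crossover only permutes gene values between the two parents, so the multiset of genes is preserved; only mutation introduces new values.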
S32, constructing a fitness objective function: comprehensively considering the global relative gain function, the attribute reduction length and the gene coding weight function, and carrying out the optimization process of the genetic algorithm by weighting and constructing a fitness function frame to find the feature subset with the most distinguishing capability.
The global relative gain function of the equivalence relation E with respect to the feature attribute D is taken as target1; the global relative gain is used to measure the attribute importance of the information system S.
in the BRS model, the attribute reduction algorithm process taking the global relative gain as heuristic information is as follows:
S321: calculate the core attribute set γ of the conditional attributes in the information system S = (U, A, V, f), and calculate the dependency RC(D) of the decision attribute on the conditional attributes;
S322: calculate the dependency Rγ(D) of the decision attribute on the core attributes; if Rγ(D) = RC(D), go to S324, an R reduction has been found; otherwise let C = C - γ and, for each remaining attribute ci in C, compute its global relative gain; these values constitute a set M;
S323: sort the elements of M in ascending order and add the attribute with the maximum value to the set γ, i.e. γ = γ ∪ {ci}; go to S322 to continue the calculation;
S324: the result γ is an R reduction of the BRS.
The second objective function target2 is the attribute reduction length, target2 = Lr/|C|, where |C| is the number of conditional attributes and Lr is the number of genes equal to 1 in chromosome r.
During evolution a gene position may take a value other than 0 or 1 (for example a value greater than 1 or a negative value), and such chromosomes must be penalized. For this purpose a gene coding weight function is constructed as target3, whose numerator is the sum of the products |ri × (ri - 1)| and whose denominator is the chromosome length. If gene position i is 0 then 0 × (0 - 1) = 0, and if it is 1 then 1 × (1 - 1) = 0, so only genes that are neither 0 nor 1 contribute to the numerator.
Example: if chromosome r = [0 1 -2 3 1], then r - 1 = [-1 0 -3 2 0], and:
r × (r - 1) = [0 1 -2 3 1] × [-1 0 -3 2 0] = [0 0 6 6 0]
Σ abs(r × (r - 1)) = 12 and the chromosome length is 5, so target3 = 12/5 = 2.4.
A fitness objective function F(x) = ω1 × target1 - ω2 × target2 + ω3 × target3 is then constructed to perform attribute reduction on the feature attributes.
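The fitness frame can be written down directly. Since target1 (the global relative gain) depends on the decision table, it is passed in as a precomputed number here; the weights ω are illustrative assumptions:

```python
import numpy as np

def target2(chrom):
    """Attribute reduction length: fraction of genes equal to 1 (Lr / |C|)."""
    r = np.asarray(chrom)
    return float(np.sum(r == 1)) / len(r)

def target3(chrom):
    """Gene coding weight: sum(|r_i * (r_i - 1)|) / len(r); genes equal to
    0 or 1 contribute nothing, all other values are penalized."""
    r = np.asarray(chrom, dtype=float)
    return float(np.abs(r * (r - 1)).sum()) / len(r)

def fitness(chrom, t1, w=(0.6, 0.2, 0.2)):
    """F(x) = w1*target1 - w2*target2 + w3*target3, following the text;
    the weights w are assumed values for the example."""
    w1, w2, w3 = w
    return w1 * t1 - w2 * target2(chrom) + w3 * target3(chrom)

r = [0, 1, -2, 3, 1]   # the worked example from the text: target3 = 12/5 = 2.4
```

This keeps the three terms separable, so each weight can be tuned independently of the decision-table computation behind target1.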
S33, optimizing the genetic operators according to the fitness objective function:
calculating the fitness value of each chromosome according to the fitness objective function and judging whether the termination condition (a preset fixed value) is met; if so, the reduced feature subset is obtained; if not, the genetic algorithm operations of remainder stochastic sampling without replacement, uniform crossover and Gaussian mutation are applied to the chromosomes in turn, and S33 is executed again.
S4, optimizing the penalty factor and the kernel function parameter of the SVM by using a cuckoo algorithm, and inputting the reduced feature subset into the optimized SVM to obtain a classification recognition result.
Referring to fig. 4, the specific steps of the cuckoo algorithm for optimizing SVM parameters include:
s41, initialization setting: including the discovery probability Pa, the number of iterations N, the number of bird nests n, the upper and lower limits, the penalty factor c of the SVM, and the RBF kernel function parameter σ;
s42, initializing the n bird nest positions, calculating the fitness values of all bird nests, and storing the current optimal position and fitness value; each bird nest is one feasible solution, and its fitness value is the value obtained by substituting it into the objective function; the n initialized feasible solutions are evaluated in this way and the best value is kept (the maximum or the minimum may be chosen according to the specific requirement), giving the optimal bird nest position and fitness value;
s43, updating the bird nest positions according to the position update formula, comparing each new bird nest with the bird nest at the corresponding position of the previous generation, and retaining the bird nest position with the smaller fitness value, together with that value, as the optimal bird nest;
s44, generating a random number r with a Gaussian random function and comparing it with the given discovery probability Pa: if r > Pa, the poor bird nest is discarded and a new one is generated; otherwise the bird nest is left unchanged;
s45, recalculating the bird nest fitness values and replacing bird nests with higher fitness values to generate a new group of bird nest positions;
s46, judging whether the number of iterations has been reached: if so, the search stops and the global optimal fitness value and the corresponding optimal bird nest are obtained; if not, jumping to S43 to continue the optimization;
and S47, constructing an SVM prediction model according to the optimal parameters c and sigma corresponding to the optimal bird nest position.
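Steps S41 to S47 can be sketched end to end. To keep the example self-contained, the SVM cross-validation fitness is replaced by a toy quadratic objective over (c, σ); in the real method each bird nest would be evaluated by training an RBF-SVM with those parameters and measuring its cross-validation error. The population size, iteration count, Pa and α0 below are illustrative:

```python
import math
import numpy as np

def toy_objective(params):
    """Stand-in for the SVM cross-validation error; minimum at c=3, sigma=0.5."""
    c, sigma = params
    return (c - 3.0) ** 2 + (sigma - 0.5) ** 2

def levy(shape, rng, beta=1.5):
    """Levy-distributed steps via Mantegna's algorithm."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    return rng.normal(0, sigma_u, shape) / np.abs(rng.normal(0, 1, shape)) ** (1 / beta)

def cuckoo_search(obj, lb, ub, n=15, n_iter=200, pa=0.25, alpha0=0.01, seed=1):
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    nests = rng.uniform(lb, ub, (n, len(lb)))               # S41-S42: init nests
    fit = np.array([obj(x) for x in nests])
    best, best_val = nests[fit.argmin()].copy(), float(fit.min())
    for _ in range(n_iter):
        # S43: Levy-flight position update; keep the better nest elementwise
        cand = np.clip(nests + alpha0 * (nests - best) * levy(nests.shape, rng), lb, ub)
        cfit = np.array([obj(x) for x in cand])
        improved = cfit < fit
        nests[improved], fit[improved] = cand[improved], cfit[improved]
        # S44: abandon poor nests with probability pa and rebuild them randomly
        drop = rng.random(n) < pa
        if drop.any():
            nests[drop] = rng.uniform(lb, ub, (int(drop.sum()), len(lb)))
            fit[drop] = np.array([obj(x) for x in nests[drop]])
        # S45-S46: keep the global best nest and its fitness
        if fit.min() < best_val:
            best, best_val = nests[fit.argmin()].copy(), float(fit.min())
    return best, best_val       # S47: (c, sigma) for the final SVM model

best_params, best_err = cuckoo_search(toy_objective, lb=[0.01, 0.01], ub=[10.0, 10.0])
```

Tracking the global best separately from the nest population ensures the optimum found so far survives the abandonment step.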
The search path of the cuckoo search (CS) algorithm is the Lévy flight, a walk in which short-distance exploration alternates with occasional long-distance jumps; this expands the search range, increases population diversity, and helps avoid local optima. The relevant definitions are as follows:
The formula by which the CS algorithm searches for a new bird nest position is:

x_i^(t+1) = x_i^(t) + α ⊕ L(λ)

where x_i^(t) is the position of the i-th bird nest in the t-th generation and α is the step control quantity, generally taken as 0.1, which determines the random search range:

α = α0 × (x_i^(t) - x_best)

where α0 is a constant (α0 = 0.01) and x_best represents the current optimal solution.
In the position-search formula, ⊕ denotes the point-to-point (entrywise) product and L(λ) is a random search path obeying a Lévy distribution, Levy ~ u = t^(-λ) with 1 < λ ≤ 3; the corresponding position update formula is:

x_i^(t+1) = x_i^(t) + α0 × (x_i^(t) - x_best) ⊕ Levy(λ)

The Lévy step is generated as s = μ / |ν|^(1/β), where μ and ν both obey normal distributions:

μ ~ N(0, σ_μ²), ν ~ N(0, σ_ν²), with σ_μ = { Γ(1 + β) × sin(πβ/2) / [ Γ((1 + β)/2) × β × 2^((β-1)/2) ] }^(1/β) and σ_ν = 1

where Γ is the standard Gamma function.
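The normal-distribution construction of the Lévy step can be realized with Mantegna's method; treating the exact sampling scheme as an assumption (the patent's formula images are not reproduced here), a sketch is:

```python
import math
import numpy as np

def levy_step(beta, size, rng):
    """s = mu / |nu|^(1/beta), mu ~ N(0, sigma_u^2), nu ~ N(0, 1), where
    sigma_u is the Gamma-function expression of Mantegna's algorithm."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    mu = rng.normal(0.0, sigma_u, size)
    nu = rng.normal(0.0, 1.0, size)
    return mu / np.abs(nu) ** (1 / beta)

rng = np.random.default_rng(0)
steps = levy_step(1.5, 10000, rng)
# heavy tails: many small steps mixed with occasional very large jumps
```

The heavy-tailed step distribution is what gives the CS search its mixture of local exploration and rare long-range jumps.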
Performance evaluation of medical image recognition commonly uses the two indices sensitivity and specificity, but these two alone can hardly describe the overall performance of a classifier. This embodiment therefore sets evaluation indices for the two stages of feature selection and classification separately. The feature selection stage uses reduction length, attribute importance and time. The classification stage uses Accuracy, Sensitivity, Specificity, F value, Matthews correlation coefficient (MCC), balanced F score (F1 Score), Youden index (YI) and Time, with the calculation formulas:
YI=Sensitivity+Specificity-1
where TP is the number of malignant target contours correctly identified; FP is the number of benign target contours falsely identified as malignant; TN is the number of benign target contours correctly identified; and FN is the number of malignant target contours falsely identified as benign.
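A straightforward way to compute these indices from the four counts (the F value is taken here to coincide with the F1 formulation; treat that as an assumption):

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, F1, MCC and Youden index (YI)
    computed from the confusion-matrix counts defined above."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sens = tp / (tp + fn)                     # true positive rate
    spec = tn / (tn + fp)                     # true negative rate
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    yi = sens + spec - 1                      # YI = Sensitivity + Specificity - 1
    return {"accuracy": acc, "sensitivity": sens, "specificity": spec,
            "f1": f1, "mcc": mcc, "yi": yi}

m = classification_metrics(tp=90, fp=20, tn=80, fn=10)
```

The MCC denominator is zero when any marginal count is zero; a production implementation would guard that case.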
To verify the feasibility and effectiveness of the technical scheme, two groups of comparison experiments were designed. In experiment one, to verify the feasibility and effectiveness of the BRSGA feature selection algorithm, the GS algorithm is fixed for optimizing the SVM parameters, and the performance of the BRSGA algorithm is compared with that of VPRSGA under different β values at the different stages. In experiment two, the feature selection algorithm is fixed on the basis of experiment one, and the CS-SVM, GS-SVM, GA-SVM and PSO-SVM are compared in the classification stage.
Experiment one: experimental result comparison based on same classification algorithm and different feature selection algorithms
The classification recognition algorithm is fixed as GS-SVM, and the merits of the BRSGA and VPRSGA algorithms under different parameters are compared in the two stages of feature selection and classification recognition, with the VPRS parameter β set to 0.1, 0.2, 0.3 and 0.4 respectively. Specific results are shown in Table 3, Fig. 5 and Table 4. In the optimal-feature-subset generation stage, each parameter combination is reduced 5 times to obtain the reduction length, attribute importance and time, and the average of the indexes over the 5 reduction results of each parameter is taken as the experimental result under that parameter. In the classification recognition stage, five-fold cross-validation is performed with LIBSVM on each parameter's reduction results (that is, 300 benign and 300 malignant tumor cases are selected as the test set each time, with the remaining data as the training set), yielding five groups of recognition results per parameter: accuracy, sensitivity, specificity, F value, MCC, F1 Score, YI and time. The average of each index over the five folds is taken as the post-reduction classification result under that parameter, and finally the average of each index is taken as the reduction and classification result under the parameter combination.
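The five-fold protocol above (300 benign plus 300 malignant test cases per fold, remainder for training) can be sketched as follows; the total class sizes of 1500 cases each are an assumption inferred from five disjoint folds of 300 cases per class, and the function name is illustrative:

```python
import random
from statistics import mean

def five_fold_indices(n_benign=1500, n_malignant=1500, fold_size=300, seed=0):
    """Build five (train, test) index splits with `fold_size` test cases
    drawn from each class per fold, mirroring the protocol in the text."""
    rng = random.Random(seed)
    benign = list(range(n_benign))
    malignant = list(range(n_benign, n_benign + n_malignant))
    rng.shuffle(benign)
    rng.shuffle(malignant)
    folds = []
    for k in range(5):
        test = (benign[k * fold_size:(k + 1) * fold_size]
                + malignant[k * fold_size:(k + 1) * fold_size])
        ts = set(test)
        train = [i for i in benign + malignant if i not in ts]
        folds.append((train, test))
    return folds

def average_index(per_fold_scores):
    """Average one evaluation index over the five folds."""
    return mean(per_fold_scores)
```

Each index is then averaged over the five folds, and the per-parameter result is the average over the five reductions, as described above.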
TABLE 3 comparison of attribute reduction results for different feature selection algorithms
As can be seen from Table 3, the reduction length of the present invention is 7.8 dimensions without manually setting the classification error rate β, which lies between the reduction lengths of the VPRSGA algorithm under different β values: it is significantly shorter than at β = 0.1 and slightly longer than at β = 0.2 and β = 0.4. The fitness value is only slightly higher than that of the VPRSGA algorithm at β = 0.4. The attribute importance is 0.0002 lower than that of VPRSGA at β = 0.4 but higher than the VPRSGA model under the other parameters. The reduction time is higher than that of the VPRSGA model at β = 0.2 but 16.54–419.35 seconds lower than under the other parameter values; in particular, it is 2.7 times shorter than at β = 0.1. As can be seen from Fig. 5, the proposed algorithm shows no premature convergence in the reduction stage, whereas the VPRSGA algorithm exhibits premature convergence to varying degrees under different β values; for example, in Fig. 5b the VPRSGA algorithm at β = 0.1 shows a relatively severe premature convergence phenomenon in one reduction result. Therefore, compared with the VPRSGA algorithm, the proposed method is free from the constraint of manual parameter setting and achieves a relatively ideal effect.
TABLE 4 comparison of different feature selection algorithm classification recognition results
As can be seen from Table 4: compared with the VPRSGA algorithm at β = 0.1, the accuracy, specificity, MCC, F1 Score and YI of the proposed algorithm decrease by 0.07%, 0.43%, 0.0015, 0.0006 and 0.0013 respectively, while the sensitivity improves by 0.3%, and the classification time of the VPRSGA algorithm at β = 0.1 is 3.4 times that of the BRSGA algorithm. Although the accuracy of the BRS algorithm decreases within an acceptable range, the time consumption drops considerably; weighing accuracy against time, the overall performance of the BRS algorithm is better than that of the VPRS algorithm at β = 0.1. Compared with the VPRSGA algorithm at β = 0.2, 0.3 and 0.4, the BRSGA algorithm reduces the time while improving the other indexes to varying degrees, with the most obvious improvement over the indexes of the VPRSGA algorithm at β = 0.2. The classification results show that, compared with the VPRSGA model, the BRSGA model is not only free from the parameter constraint but also improves the classification performance of the model.
The experiments show that, when the SVM parameters are optimized by the grid search algorithm, the BRSGA feature selection algorithm is free from the constraint of manual parameter setting compared with the VPRSGA algorithm and achieves a relatively ideal effect in both the attribute reduction and classification stages. BRSGA is therefore used as the fixed feature selection algorithm when verifying the effectiveness of the CS algorithm for SVM parameter optimization.
Experiment two: experimental result comparison of different classification algorithms based on same feature selection algorithm
The optimal-feature-subset generation algorithm is fixed as BRSGA, the SVM parameters are optimized by the CS algorithm, and the result is compared with GS-SVM, GA-SVM and PSO-SVM. Classification recognition is performed on the results of the 5 reductions of the BRSGA algorithm in Experiment 1; each is verified by five-fold cross-validation to obtain the classification results, including accuracy, sensitivity, specificity, F value, MCC, F1 Score, YI and time. The average of each index over the five folds is taken as the classification result of that reduction, and the average over the five reductions is the final result of the classification model. To quantitatively determine whether the differences in recognition performance between the proposed algorithm and the comparison algorithms are statistically significant, a paired t-test is used for hypothesis testing. The statistical test is based on the five indexes that comprehensively describe performance — accuracy, F value, MCC, F1 Score and YI — with the significance level set to p < 0.05. The null hypothesis is that the difference between the mean values of the same evaluation index for the present invention and for the comparison algorithm is 0. For each evaluation index, the mean and standard deviation of the recognition results over the five reduction-classification runs are given in Table 5, and the mean of each index over the five reductions is plotted as a line graph in Fig. 6.
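The paired t-test used here compares matched per-reduction scores of two algorithms. A minimal sketch of the test statistic follows (the p-value is then read from the t-distribution with the returned degrees of freedom; function name and sample values are illustrative):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Paired t-test statistic for two matched samples, e.g. the per-reduction
    accuracy of CS-SVM versus a comparison algorithm.
    Null hypothesis: the mean of the pairwise differences is 0.
    Returns (t, dof); significance is then judged from the t-distribution."""
    d = [x - y for x, y in zip(a, b)]   # pairwise differences
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))
    return t, n - 1
```

With five reductions per algorithm, the test has 4 degrees of freedom, so a two-sided |t| above roughly 2.776 corresponds to p < 0.05.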
TABLE 5 comparison of results of different classification algorithms based on BRSGA selection algorithm
* indicates that the marked result differs significantly from the corresponding index of the algorithm herein (CS-SVM) at the 0.05 significance level.
As can be seen from Table 5, the quantitative analysis shows that the present invention is superior to the other three comparison algorithms on all five evaluation indexes, with statistically significant differences in every case. As a whole, Fig. 6 shows that the classification results after the five BRSGA reductions fluctuate: most classification indexes of the 4th reduction are relatively better, while all indexes of the 3rd reduction except the classification time are relatively lower. Because the initial population of the genetic algorithm is generated randomly during optimal-feature-subset generation, each reduction result differs; each parameter combination is therefore reduced 5 times and classified by five-fold cross-validation, and the overall performance of the model is finally evaluated by the average of each index over the reduction and classification results, which effectively avoids a one-sided evaluation.
Fig. 6 compares the classification results in the classification recognition stage: (a) accuracy; (b) classification time; (c) F value; (d) sensitivity; (e) specificity; (f) MCC; (g) F1 Score; (h) Youden index. Over the five reductions, the CS-SVM algorithm is higher than the GS-SVM, GA-SVM and PSO-SVM algorithms on the 6 evaluation indexes of accuracy, F value, sensitivity, MCC, F1 Score and Youden index, while its classification time is slightly higher than that of GS-SVM. The classification time of the PSO-SVM algorithm is far longer than that of the other three algorithms; in the 3rd reduction its 6 evaluation indexes other than sensitivity are higher than those of the GS-SVM and GA-SVM algorithms, while in the other 4 reductions all of its indexes are lower than those of GS-SVM, GA-SVM and CS-SVM. As can be seen from Figs. 6a, 6c, 6f, 6g and 6h, the CS-SVM algorithm is higher than GS-SVM, GA-SVM and PSO-SVM on all comprehensive evaluation indexes over the five reductions, showing a certain robustness and a high popularization value.
The classification time of GS-SVM is lower than that of CS-SVM for two reasons. First, owing to the difficulty of acquiring medical image data, the test set contains only 600 cases, so its time complexity is lower than that of the CS algorithm; but real clinical data are massive and grow sharply — even exponentially — every day, and as the number of samples increases the time complexity of the GS algorithm rises steeply and cannot meet the requirements of clinical application. Second, the GS algorithm searches within an empirically given range, which introduces a certain randomness, and the optimal parameters may not be obtained. The CS algorithm is a swarm-intelligence search algorithm with both local and global search capability; it widens the search field, enriches the diversity of the population, has good robustness, and, compared with the GS algorithm, effectively avoids the randomness introduced by empirical settings.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (4)
1. A high-dimensional feature selection algorithm based on a Bayesian rough set and a cuckoo algorithm is characterized by comprising the following steps:
s1, obtaining a lung tumor image, and performing target contour segmentation to obtain a segmented ROI image;
s2, extracting high-dimensional feature components of the segmented ROI image, and constructing a decision information table containing feature attributes based on the high-dimensional feature components, wherein the feature attributes correspond to features of different dimensions in the high-dimensional feature components;
s3, based on a Bayes rough set model, constructing a fitness objective function by utilizing the weighted summation of a global relative gain function, an attribute reduction length and a gene coding weight function, and reducing the feature attributes by combining genetic operator combination to obtain a reduced feature subset; the method comprises the following steps: s31, constructing the fitness objective function:
the first objective function is a global relative gain function of the equivalence relation E with respect to the feature attribute D:measuring the attribute importance of the information system S by adopting global relative gain;
wherein | C | is the number of conditional attributes, LrThe number of genes in chromosome r is 1;
wherein, the numerator is the product sum of genes with the length of non-0 and 1, and the denominator is the length of the chromosome;
constructing a fitness objective function F (x), wherein the fitness objective function F (x) is-omega 1 × target 1-omega 2 × target2+ omega 3 × target3, and carrying out feature attribute reduction on the feature attributes;
s4, optimizing the penalty factor and the kernel function of the SVM by using a cuckoo algorithm, and inputting the reduced feature subset into the optimized SVM to obtain a classification recognition result.
2. The high-dimensional feature selection algorithm based on the Bayesian rough set and the cuckoo algorithm as claimed in claim 1, wherein the high-dimensional feature components in S2 comprise shape features, texture features and gray-scale features of the lung tumor image.
3. The high-dimensional feature selection algorithm based on bayesian rough set and cuckoo algorithm as claimed in claim 1, wherein said S3 further comprises the steps of:
s32, optimizing the genetic operator according to the fitness objective function:
calculating the fitness values of the feature attributes according to the fitness objective function and judging whether a termination condition is met; if so, obtaining the reduced feature subset; if not, sequentially applying to the feature attributes a genetic algorithm operation consisting of remainder stochastic sampling without replacement, uniform crossover and Gaussian mutation, and executing S32 again.
4. The high-dimensional feature selection algorithm based on the bayesian rough set and the cuckoo algorithm as claimed in claim 1, wherein the step of optimizing the SVM parameters by the cuckoo algorithm in S4 comprises:
s41, initialization setting: including probability PaIteration times N, the number N of bird nests, upper and lower limits, a penalty factor c of the SVM and a RBF kernel function parameter sigma;
s42, initializing n bird nest positions, calculating the fitness value of all bird nests, and storing the current optimal position and the fitness value;
s43, updating the position of the bird nest according to a formula, comparing the position with the adaptability value of the bird nest at the corresponding position of the previous generation, and keeping the position of the bird nest with the minimum adaptability value and the adaptability value as the optimal bird nest;
s44, generating a random number r with a given probability PaDiscarding bad bird nest if r > PaIf not, updating the bird nest;
s45, recalculating the fitness value of the bird nest, replacing the bird nest with a high fitness value to generate a new bird nest position;
s46, judging whether iteration times are finished, if so, stopping searching to obtain a global optimal fitness value and a corresponding optimal bird nest, and if not, jumping to S43 to continue optimizing;
and S47, constructing an SVM prediction model according to the optimal parameters c and sigma corresponding to the optimal bird nest position.
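Steps S41–S47 can be sketched as a generic cuckoo-search minimiser. In this sketch the SVM cross-validation error is abstracted into an arbitrary `fitness` callable over the parameter vector (c, σ), and all numeric settings (nest count n, iteration count, P_a, step scale) are illustrative assumptions:

```python
import math
import numpy as np

def levy(lam, size, rng):
    """One Levy-distributed random step (Mantegna's method)."""
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
             / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    return rng.normal(0, sigma, size) / np.abs(rng.normal(0, 1, size)) ** (1 / lam)

def cuckoo_search(fitness, lower, upper, n=20, iters=150, pa=0.25, seed=0):
    """Minimise `fitness` over a box, following steps S41-S47."""
    rng = np.random.default_rng(seed)                       # S41: settings
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    nests = lower + rng.random((n, dim)) * (upper - lower)  # S42: init nests
    fit = np.array([fitness(x) for x in nests])
    best = nests[fit.argmin()].copy()
    for _ in range(iters):                                  # S46: iterate
        # S43: Levy-flight update; keep the better of old and new nest
        for i in range(n):
            cand = np.clip(nests[i] + 0.01 * levy(1.5, dim, rng) * (nests[i] - best),
                           lower, upper)
            fc = fitness(cand)
            if fc < fit[i]:
                nests[i], fit[i] = cand, fc
        # S44-S45: with probability pa, rebuild a nest at random; keep if better
        for i in range(n):
            if rng.random() < pa:
                cand = lower + rng.random(dim) * (upper - lower)
                fc = fitness(cand)
                if fc < fit[i]:
                    nests[i], fit[i] = cand, fc
        best = nests[fit.argmin()].copy()
    return best, float(fit.min())                           # S47: best (c, sigma)
```

In the actual method, `fitness` would train an SVM with the candidate (c, σ) and return its cross-validation error; here the search itself is the point of the sketch.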
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010322570.8A CN111583194B (en) | 2020-04-22 | 2020-04-22 | High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111583194A CN111583194A (en) | 2020-08-25 |
CN111583194B true CN111583194B (en) | 2022-07-15 |
Family
ID=72111635
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010322570.8A Active CN111583194B (en) | 2020-04-22 | 2020-04-22 | High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111583194B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113111577B (en) * | 2021-04-01 | 2023-05-05 | 燕山大学 | Cement mill operation index decision method based on multi-target cuckoo search |
CN114627964B (en) * | 2021-09-13 | 2023-03-24 | 东北林业大学 | Prediction enhancer based on multi-core learning and intensity classification method and classification equipment thereof |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107784353A (en) * | 2016-08-29 | 2018-03-09 | 普天信息技术有限公司 | A kind of function optimization method based on cuckoo searching algorithm |
CN109186971A (en) * | 2018-08-06 | 2019-01-11 | 江苏大学 | Hub motor mechanical breakdown inline diagnosis method based on dynamic bayesian network |
CN109325580A (en) * | 2018-09-05 | 2019-02-12 | 南京邮电大学 | A kind of adaptive cuckoo searching method for Services Composition global optimization |
CN109978880A (en) * | 2019-04-08 | 2019-07-05 | 哈尔滨理工大学 | Lung tumors CT image is carried out sentencing method for distinguishing using high dimensional feature selection |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190253558A1 (en) * | 2018-02-13 | 2019-08-15 | Risto Haukioja | System and method to automatically monitor service level agreement compliance in call centers |
US20190318248A1 (en) * | 2018-04-13 | 2019-10-17 | NEC Laboratories Europe GmbH | Automated feature generation, selection and hyperparameter tuning from structured data for supervised learning problems |
Non-Patent Citations (1)
Title |
---|
"High-dimensional feature selection algorithm for lung tumor CT images based on Bayesian rough set"; Zhang Feifei, et al.; Journal of Biomedical Engineering Research (生物医学工程研究); 30 Apr. 2018; full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Xia et al. | Complete random forest based class noise filtering learning for improving the generalizability of classifiers | |
JP2022538866A (en) | System and method for image preprocessing | |
Xu et al. | Texture-specific bag of visual words model and spatial cone matching-based method for the retrieval of focal liver lesions using multiphase contrast-enhanced CT images | |
CN112464005B (en) | Depth-enhanced image clustering method | |
CN110969626A (en) | Method for extracting hippocampus of human brain nuclear magnetic resonance image based on 3D neural network | |
CN111553127A (en) | Multi-label text data feature selection method and device | |
CN111340135B (en) | Renal mass classification method based on random projection | |
CN111583194B (en) | High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm | |
CN112149717A (en) | Confidence weighting-based graph neural network training method and device | |
CN114596467A (en) | Multimode image classification method based on evidence deep learning | |
CN111110192A (en) | Skin abnormal symptom auxiliary diagnosis system | |
KR20230029004A (en) | System and method for prediction of lung cancer final stage using chest automatic segmentation image | |
CN110910325B (en) | Medical image processing method and device based on artificial butterfly optimization algorithm | |
Somase et al. | Develop and implement unsupervised learning through hybrid FFPA clustering in large-scale datasets | |
CN117036894B (en) | Multi-mode data classification method and device based on deep learning and computer equipment | |
CN117195027A (en) | Cluster weighted clustering integration method based on member selection | |
US20240144474A1 (en) | Medical-image-based lesion analysis method | |
AU2021102593A4 (en) | A Method for Detection of a Disease | |
CN115310491A (en) | Class-imbalance magnetic resonance whole brain data classification method based on deep learning | |
CN114821157A (en) | Multi-modal image classification method based on hybrid model network | |
CN113177608A (en) | Neighbor model feature selection method and device for incomplete data | |
CN112735596A (en) | Similar patient determination method and device, electronic equipment and storage medium | |
Mehta et al. | Soft-computing based diagnostic tool for analyzing demyelination in magnetic resonance images | |
Hadavi et al. | Classification of normal and abnormal lung ct-scan images using cellular learning automata | |
CN113096828B (en) | Diagnosis, prediction and major health management platform based on cancer genome big data core algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||