CN111583194B - High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm - Google Patents


Info

Publication number: CN111583194B (application CN202010322570.8A, China)
Prior art keywords: algorithm, SVM, feature, bird nest, fitness
Legal status: Active (granted)
Other versions: CN111583194A (the application as published; original language Chinese)
Inventors: 周涛, 陆惠玲, 张飞飞, 韩强, 贺钧, 田金琴, 董雅丽
Current and original assignee: North Minzu University
Application filed by North Minzu University; granted and published as CN111583194B.


Classifications

    • G06T7/0012 — Biomedical image inspection
    • G06T7/12 — Edge-based segmentation
    • G06T7/136 — Segmentation; edge detection involving thresholding
    • G06F18/2111 — Selection of the most significant subset of features by evolutionary computational techniques, e.g. genetic algorithms
    • G06F18/2411 — Classification based on the proximity to a decision surface, e.g. support vector machines
    • G06N3/006 — Artificial life based on simulated virtual individual or collective life forms, e.g. particle swarm optimisation [PSO]
    • G06T2207/10081 — Computed x-ray tomography [CT]
    • G06T2207/30096 — Tumor; Lesion


Abstract

The invention discloses a high-dimensional feature selection algorithm based on a Bayesian rough set (BRS) and the cuckoo search (CS) algorithm, comprising the following steps: acquire a lung tumor image and perform target contour segmentation to obtain a segmented ROI image; extract high-dimensional feature components from the segmented ROI image and construct a decision information table of feature attributes from these components; reduce the original feature space with the BRSGA algorithm to obtain an optimal feature subset, optimize the penalty factor and kernel function parameter of the SVM with the CS algorithm, and input the reduced feature subset into the optimized SVM to obtain the classification recognition result. The method generates the optimal feature subset through a genetic algorithm and the BRS, reducing the feature dimension without lowering classification accuracy, removing the constraint of manual parameter setting, and cutting time consumption. The CS performs global optimization of the SVM parameters, exploring the search space more effectively, enriching population diversity, and offering good robustness and strong global search capability.

Description

High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm
Technical Field
The invention relates to the technical field of medical image identification, in particular to a high-dimensional feature selection algorithm based on a Bayesian rough set and a Cuckoo algorithm.
Background
With the development of Computer Aided Diagnosis (CAD) research, medical image processing techniques have advanced rapidly. However, the multi-modality, gray-level ambiguity and uncertainty of medical images lead to high missed-diagnosis and misdiagnosis rates in single-modality diagnosis. Medical image processing technologies for different modalities have therefore been developed, divided by level into pixel-level, feature-level and decision-level processing. Feature-level processing compresses the amount of information while retaining the important content, and processes faster. In feature-level processing of medical images, the redundancy and correlation among features make the curse of dimensionality an NP-hard problem; feature selection is an effective countermeasure that reduces the dimension of the feature space and lowers the time complexity.
The problems of the high-dimensional feature selection process include how to generate an optimal feature subset, how to evaluate its effect, which classifier to use for evaluation, and how to optimize that classifier's parameters; in response, experts and scholars have proposed many algorithms in recent years. First, the Variable Precision Rough Set (VPRS) was proposed. It overcomes the limitation that the Rough Set (RS) can only process exactly classifiable data: by introducing a classification error rate β it relaxes the lower approximation of the RS from 'complete inclusion' to 'partial inclusion', improving the robustness and generalization on noisy data sets. The core of VPRS research is the selection of the classification error rate β, studied along three lines: first, setting aside how β is selected, various extended VPRS models have been proposed, such as the variable precision fuzzy rough set, the variable precision multi-granularity rough set, the generalized VPRS, and extended VPRS based on the β-tolerance relation and the Bhattacharyya distance; second, obtaining β through different calculations, such as taking the average inclusion degree as the threshold for selecting the upper and lower approximations; third, introducing probability formulas to obtain several probabilistic RS models, such as the VPRS, the game rough set, the decision-theoretic rough set, the Bayesian Rough Set (BRS), and the 0.5-probability rough set. The methods within the probabilistic rough set family are related; they differ in how the probability formula is computed and how the parameters are designed.
The BRS introduces a prior probability on the basis of the VPRS, replacing the classification error rate β, so no parameter needs to be set manually; it both overcomes the RS requirement of completely exact division for the lower approximation and avoids the influence of the parameter β on the upper and lower approximations in the VPRS. Much BRS research currently remains at the theoretical-analysis stage, a mature standalone model is lacking, and no work has yet combined it with other algorithms to handle high-dimensional feature selection for medical images.
Secondly, classifier performance is the basis for evaluating a high-dimensional feature selection algorithm. The Support Vector Machine (SVM) is a commonly used binary classification algorithm whose kernel function broadens its range of application; common kernels include the polynomial kernel, the radial basis function (RBF) kernel, and the Sigmoid kernel. The polynomial kernel is slow to compute, which seriously affects its usefulness, so it sees little application. Compared with the Sigmoid kernel, the RBF kernel has fewer parameters and requires only the kernel matrix to be computed, giving lower time complexity; however, setting its parameters manually is laborious and slow, and the parameters finally obtained are not necessarily optimal, so parameter selection should be recast as an optimization problem.
Therefore, how to provide a high-dimensional feature selection algorithm based on a bayesian rough set and a cuckoo algorithm with low time complexity and better robustness is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides a high-dimensional feature selection algorithm based on a bayesian rough set and a cuckoo algorithm, and provides a high-dimensional feature selection algorithm based on BRSGA and CS two-stage optimization by combining BRS, GA, CS and SVM algorithms. In the first stage, the BRSGA algorithm is adopted to reduce the original feature space to obtain an optimal feature subset, in the second stage, the CS algorithm is used to optimize the penalty factor and the kernel function parameter of the SVM, the optimal parameter combination is used to construct a CS-SVM classification model, and the lung tumor image is identified.
In order to achieve the above purpose, the invention provides the following technical scheme:
a high-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm comprises the following steps:
s1, obtaining a lung tumor image, and performing target contour segmentation to obtain a segmented ROI image;
s2, extracting high-dimensional feature components of the segmented ROI image, and constructing a decision information table containing feature attributes based on the high-dimensional feature components, wherein the feature attributes correspond to features of different dimensions in the high-dimensional feature components;
s3, based on a Bayes rough set model, constructing a fitness objective function by utilizing the weighted summation of a global relative gain function, an attribute reduction length and a gene coding weight function, and reducing the feature attributes by combining genetic operator combination to obtain a reduced feature subset;
S4, optimizing the penalty factor and the kernel function parameter of the SVM by using a cuckoo algorithm, and inputting the reduced feature subset into the optimized SVM to obtain a classification recognition result.
Preferably, the high-dimensional feature components in S2 include shape features, texture features and gray scale features of the lung tumor image.
Preferably, the S3 specifically includes the following steps:
s31, constructing a fitness objective function:
the first objective function is a global relative gain function of the equivalence relation E with respect to the feature attribute D:
(The global relative gain formula is given only as an image in the original document and is not reproduced here.)
measuring the attribute importance of the information system S by adopting global relative gain;
the second objective function is attribute reduction length:
target2 = L_r / |C|
where |C| is the number of condition attributes and L_r is the number of genes in chromosome r that are equal to 1;
the third objective function is a gene coding weight function:
target3 = ( Σ_i | r_i × (r_i − 1) | ) / |r|
where the numerator is the sum of the products r_i × (r_i − 1), which are non-zero only for genes that are neither 0 nor 1, and the denominator is the length of the chromosome;
constructing a fitness objective function F(x) = ω1 × target1 − ω2 × target2 + ω3 × target3 to perform feature attribute reduction on the feature attributes;
s32, optimizing the genetic operator according to the fitness objective function:
calculating the fitness value of the feature attributes according to the fitness objective function and judging whether a termination condition is met; if so, the reduced feature subset is obtained; if not, the feature attributes are sequentially subjected to the genetic operations of remainder stochastic sampling without replacement, uniform crossover and Gaussian mutation, and S32 is executed again.
Preferably, the specific step of optimizing the SVM parameter by the cuckoo algorithm in S4 includes:
S41, initialization settings: including the discovery probability P_a, the number of iterations N, the number n of bird nests, the upper and lower limits, the penalty factor c of the SVM and the RBF kernel function parameter σ;
S42, initializing n bird nest positions, calculating the fitness values of all bird nests and storing the current optimal position and fitness value;
S43, updating the position of each bird nest according to the position-update formula, comparing it with the fitness value of the nest at the corresponding position in the previous generation, and retaining the position with the smaller fitness value, together with that fitness value, as the optimal bird nest;
S44, generating a random number r and comparing it with the given discovery probability P_a: if r > P_a, the bad bird nest is discarded; otherwise, the bird nest is updated;
S45, recalculating the fitness value of each bird nest, replacing nests with high fitness values by nests with low fitness values, and generating a group of new bird nest positions;
S46, judging whether the iteration count is finished; if so, stopping the search to obtain the global optimal fitness value and the corresponding optimal bird nest; if not, jumping to S43 to continue the optimization;
and S47, constructing an SVM prediction model according to the optimal parameters c and sigma corresponding to the optimal bird nest position.
Compared with the prior art, the high-dimensional feature selection algorithm based on the Bayesian rough set and the cuckoo algorithm provided by the invention has the following advantages:
the attribute importance degree is analyzed from the perspective of a global relative gain function, an attribute reduction length and the weighting and construction fitness function of a gene coding weight function are combined, an optimal feature subset is generated through genetic operations such as selection, intersection and variation, the feature dimension is reduced on the premise of not reducing the classification accuracy, the constraint of manual parameter setting is eliminated, and the time consumption is reduced to a great extent. The CS is used for carrying out global optimization on Support Vector Machine (SVM) parameters, global search in the CS algorithm has infinite mean value and variance, a search space can be explored more effectively than an algorithm using a standard Gaussian process, the search field is widened, the diversity of population is enriched, and the method has good robustness and strong global search capability. The feature selection is carried out by combining the BRS and an intelligent optimization algorithm, and the parameter of the SVM is optimized by using the CS, so that certain feasibility and effectiveness are achieved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a high-dimensional feature selection algorithm based on Bayesian rough set and Cuckoo algorithm provided by the present invention;
FIG. 2 is a comparison of ROI before and after segmentation by Otsu algorithm according to an embodiment of the present invention;
FIG. 3 is a flowchart of an optimal feature subset generation process provided by an embodiment of the present invention;
FIG. 4 is a flow chart of a CS optimization SVM parameter provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of the change of the fitness function during the generation of one feature subset according to an embodiment of the present invention;
FIG. 6 is a comparison diagram of results of different classification algorithms based on the BRSGA selection algorithm according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a high-dimensional feature selection algorithm based on a Bayesian rough set and the cuckoo algorithm; as shown in the flow chart of FIG. 1, it comprises data acquisition, data preprocessing, image segmentation, feature extraction, attribute reduction and classification recognition. Finally, the two-stage optimized high-dimensional feature selection algorithm is used to classify and identify lung tumor CT images. The specific implementation process is as follows:
and S1, acquiring a lung tumor image, and performing target contour segmentation to obtain a segmented ROI image.
Before the target contour segmentation, the method specifically comprises the following image acquisition and preprocessing processes:
Taking into account the popularity of common lung tumor imaging examinations, their acceptance and cost for doctors and patients, and avoiding the influence of factors such as the specification, model and environment of the examination equipment, 3000 lung tumor CT images with a definite diagnostic conclusion were collected, 1500 malignant and 1500 benign. Subgraphs with strong discriminative capability were cropped from the obtained images as ROI regions, and all ROI regions were normalized to experimental images of 50 × 50 pixels.
And (3) target contour segmentation process:
the segmentation of the target contour (including the lesion contour) from the truncated ROI region plays a crucial role in various clinical applications. However, in the current clinical practical application, a manual labeling method for radiologists is still adopted, and a large amount of intensive manual operations are easy to make mistakes, so that the accurate segmentation by using the computer technology has a very great practical value. In the embodiment, an Otsu threshold segmentation method is adopted, and the core idea is to segment an image into two groups, and when the interclass variance between the two groups reaches the maximum, the obtained value is the optimal segmentation threshold. The basic principle of Otsu's algorithm is as follows:
Assume the image has size m × n and l gray levels, so the gray-level range is [0, l−1]. Let n_i denote the number of pixels with gray level i; then the frequency of gray level i among all pixels is p_i = n_i / (m × n). Suppose pixels with gray level not exceeding q form class A1, i.e. A1 covers the gray range [0, q], while pixels in the range [q+1, l−1] form class A2. Let P1(q) and P2(q) denote the probabilities of occurrence of classes A1 and A2, and u1(q) and u2(q) their average gray levels; then:
P1(q) = Σ_{i=0}^{q} p_i

P2(q) = Σ_{i=q+1}^{l−1} p_i = 1 − P1(q)

u1(q) = ( Σ_{i=0}^{q} i · p_i ) / P1(q)

u2(q) = ( Σ_{i=q+1}^{l−1} i · p_i ) / P2(q)
The between-class variance σ_b²(q) of the image is expressed as:
σ_b²(q) = P1(q) · P2(q) · [u1(q) − u2(q)]²
when the interclass variance between two groups reaches the maximum, the obtained value is the optimal segmentation threshold, i.e. the pixel segmentation threshold is:
q* = arg max_{0 ≤ q ≤ l−1} σ_b²(q)
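As a concrete illustration of the Otsu rule described above, the following NumPy sketch (a hedged reimplementation; the function name and the synthetic image are illustrative assumptions, not material from the patent) computes the between-class variance for every candidate threshold and takes the arg max:

```python
import numpy as np

def otsu_threshold(img, levels=256):
    """Exhaustive search for the q maximizing the between-class variance,
    using the equivalent histogram form
    sigma_b^2(q) = (u*P1(q) - mu(q))^2 / (P1(q)*(1 - P1(q)))."""
    hist = np.bincount(img.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()                     # p_i = n_i / (m*n)
    P1 = np.cumsum(p)                         # class-A1 probability
    mu = np.cumsum(p * np.arange(levels))     # partial first-moment sum
    u = mu[-1]                                # global mean gray level
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (u * P1 - mu) ** 2 / (P1 * (1.0 - P1))
    sigma_b2[~np.isfinite(sigma_b2)] = 0.0    # thresholds with an empty class
    return int(np.argmax(sigma_b2))

# synthetic 50x50 "ROI": dark background plus a bright lesion block
img = np.full((50, 50), 40, dtype=np.int64)
img[20:35, 20:35] = 200
q = otsu_threshold(img)
```

Any threshold between the two gray levels separates background from lesion; since argmax returns the first maximizing index, this synthetic case yields q = 40.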
the ROI region is segmented by Otsu, and as shown in fig. 2, an example of ROI images before and after segmentation by Otsu algorithm is given, where fig. 2(a) is ROI image before segmentation and fig. 2(b) is ROI image after segmentation.
S2, extracting the high-dimensional feature components of the segmented ROI image. The components comprise 104 dimensions of shape, texture and gray-level features; the specific features are listed in Table 1. A decision information table containing the feature attributes is constructed from the high-dimensional feature components, the feature attributes corresponding to the features of different dimensions, so the constructed table has size 3000 × 105. The decision information table is discretized with the fuzzy C-means clustering algorithm, and after discretization each sample is given a numerical label representing whether the tumor is benign or malignant; this decision attribute occupies the last column of the table.
TABLE 1 pulmonary tumor CT image feature set
(Table 1 is provided as an image in the original document and is not reproduced here.)
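The fuzzy C-means discretization mentioned in S2 can be sketched as follows. This is a minimal, hedged illustration on one synthetic feature column; the function, cluster count and data are assumptions, not the patent's implementation:

```python
import numpy as np

def fcm_discretize(x, c=3, m=2.0, iters=100, seed=0):
    """Discretize a 1-D feature column by fuzzy C-means: each value is
    replaced by the index of the cluster with the highest membership."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    u = rng.random((len(x), c))
    u /= u.sum(axis=1, keepdims=True)          # memberships sum to 1
    for _ in range(iters):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0).reshape(-1, 1)
        d = np.abs(x - centers.T) + 1e-12      # distances to each center
        inv = d ** (-2.0 / (m - 1.0))          # standard membership update
        u = inv / inv.sum(axis=1, keepdims=True)
    return u.argmax(axis=1)

# one synthetic feature column with three well-separated value groups
col = np.array([0.1, 0.12, 0.11, 5.0, 5.1, 4.9, 9.8, 10.0, 10.2])
labels = fcm_discretize(col, c=3)
```

Applied column by column, this turns the continuous 3000 × 104 feature block into the discrete condition attributes of the decision information table.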
S3, based on a Bayes rough set model, constructing a fitness objective function by utilizing the weighted summation of a global relative gain function, an attribute reduction length and a gene coding weight function, and reducing the feature attributes by combining genetic operator combination to obtain a reduced feature subset; the embodiment combines the BRS algorithm and the GA algorithm to carry out attribute reduction, reduces the time complexity and the space complexity of the classifier and improves the classification performance.
As shown in fig. 3, the reduction specifically includes the following steps:
s31, establishing a BRS model:
1) parameter settings: a chromosome is a sequence of 0s and 1s whose length equals the number N of condition attributes; the crossover probability is P_c and the mutation probability is P_m; the maximum number of iterations is K = 150, the initial population size is M = 20, and the fitness function is f(x);
2) encoding: binary coding whose length equals the number of condition attributes; 0 indicates the feature is not selected and 1 indicates it is selected;
3) characteristic attributes, namely generation of initial population: randomly generating M chromosome strings with the length equal to the number of the conditional attributes to form an initial population;
4) genetic operators: the genetic operators comprise selection, crossover and mutation; the combination adopted is remainder stochastic sampling without replacement, uniform crossover and Gaussian mutation.
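The operator combination named in 4) can be sketched as follows; this is a hedged, simplified interpretation (the remainder-selection bookkeeping follows the textbook description of remainder stochastic sampling, since the patent publishes no code):

```python
import numpy as np

rng = np.random.default_rng(0)

def remainder_stochastic_sampling(fits):
    """Each individual is copied floor(e_i) times, where e_i is its
    expected count fits_i / mean(fits); the fractional remainders are
    then used once each as selection probabilities for the open slots."""
    e = fits / fits.mean()
    counts = np.floor(e).astype(int)
    frac = e - counts
    picks = np.repeat(np.arange(len(fits)), counts).tolist()
    for i in np.argsort(-frac):               # largest remainders first
        if len(picks) == len(fits):
            break
        if rng.random() < frac[i]:
            picks.append(int(i))
    while len(picks) < len(fits):             # deterministic top-up
        picks.append(int(np.argmax(frac)))
    return np.array(picks[: len(fits)])

def uniform_crossover(a, b):
    """Swap each gene between the two parents with probability 0.5."""
    mask = rng.random(a.size) < 0.5
    c1, c2 = a.copy(), b.copy()
    c1[mask], c2[mask] = b[mask], a[mask]
    return c1, c2

def gaussian_mutation(r, pm=0.1, sigma=0.5):
    """Add N(0, sigma^2) noise to genes with probability pm; this can push
    genes outside {0, 1}, which the target3 penalty term detects."""
    m = rng.random(r.size) < pm
    out = r.astype(float).copy()
    out[m] += rng.normal(0.0, sigma, m.sum())
    return out

sel = remainder_stochastic_sampling(np.array([1.0, 2.0, 3.0, 2.0]))
c1, c2 = uniform_crossover(np.zeros(8, dtype=int), np.ones(8, dtype=int))
mut = gaussian_mutation(np.array([0.0, 1.0, 0.0, 1.0]), pm=1.0)
```

Gaussian mutation is the reason chromosomes can contain genes other than 0 and 1, which is exactly the situation the gene-coding weight function of S32 penalizes.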
S32, constructing a fitness objective function: comprehensively considering the global relative gain function, the attribute reduction length and the gene coding weight function, and carrying out the optimization process of the genetic algorithm by weighting and constructing a fitness function frame to find the feature subset with the most distinguishing capability.
The global relative gain function of the equivalence relation E with respect to the feature attribute D is
(The global relative gain formula is given only as an image in the original document and is not reproduced here.)
Measuring the attribute importance of the information system S by adopting global relative gain;
in the BRS model, the attribute reduction algorithm process taking the global relative gain as heuristic information is as follows:
S321: calculate the core attribute set γ of the condition attributes in the information system S = (U, A, V, f), and calculate the dependency degree R_C(D) of the decision attribute D on the condition attribute set C;
S322: calculate the dependency degree R_γ(D) of the decision attribute on the core attributes; if R_γ(D) = R_C(D), go to S324, a reduct having been found; otherwise let C = C − γ and, for each remaining condition attribute c_i ∈ C, compute its global relative gain (the formula is given only as an image in the original document); these values form the set M;
S323: sort the elements of the set M and add the attribute with the maximum value to γ, i.e. γ = γ ∪ {c_i}; go to S322 and continue the calculation;
S324: the resulting γ is an R-reduct of the BRS.
Attribute reduction length of
target2 = L_r / |C|
Wherein | C | is the number of conditional attributes, LrThe number of genes in chromosome r is 1;
the weight function of the gene code is
target3 = ( Σ_i | r_i × (r_i − 1) | ) / |r|
A gene may take only the values 0 and 1; any other value is penalized. Because genes greater than 1 or less than 0 can appear in a chromosome after mutation, the gene-coding weight function target3 is constructed for this situation: the numerator accumulates the products r_i × (r_i − 1), which are non-zero exactly for genes that are neither 0 nor 1 (since 0 × (0 − 1) = 0 and 1 × (1 − 1) = 0), and the denominator is the length of the chromosome.
Example: suppose chromosome r = [0 1 −2 3 1], so r − 1 = [−1 0 −3 2 0]; then:
r×(r-1)=[0 1 -2 3 1]×[-1 0 -3 2 0]=[0 0 6 6 0]
Σ abs(r × (r − 1)) = 12, and the chromosome length is 5, so target3 = 12/5 = 2.4.
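Putting the computable pieces together: target2 and target3 translate directly into code, while target1 (the global relative gain, published only as a formula image) must be supplied by the caller; the weights below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def target2(r, n_cond):
    """Attribute-reduction-length term: fraction of the |C| condition
    attributes whose gene equals 1 (assumed normalized form)."""
    return np.sum(np.asarray(r) == 1) / n_cond

def target3(r):
    """Gene-coding weight term: sum of |r_i * (r_i - 1)| divided by the
    chromosome length; zero iff every gene is 0 or 1."""
    r = np.asarray(r, dtype=float)
    return np.sum(np.abs(r * (r - 1.0))) / r.size

def fitness(r, gain_fn, w1=1.0, w2=0.5, w3=1.0):
    """F(x) = w1*target1 - w2*target2 + w3*target3, signs as printed in
    the patent; gain_fn stands in for the global relative gain."""
    return w1 * gain_fn(r) - w2 * target2(r, len(r)) + w3 * target3(r)

t3 = target3([0, 1, -2, 3, 1])          # the worked example: 12/5 = 2.4
t2 = target2([1, 0, 1, 1, 0], 5)        # three of five attributes selected
F = fitness([1, 0, 1, 1, 0], lambda r: 0.8)
```

With an all-binary chromosome the penalty term vanishes, so F = 0.8 − 0.5 × 0.6 + 0 = 0.5 in this toy call.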
And the fitness objective function F(x) = ω1 × target1 − ω2 × target2 + ω3 × target3 is constructed to perform feature attribute reduction on the feature attributes.
S32, optimizing the genetic operator according to the fitness objective function:
calculating the fitness value of the feature attributes according to the fitness objective function and judging whether the termination condition, a preset fixed value, is met; if so, the reduced feature subset is obtained; if not, the genetic operations of remainder stochastic sampling without replacement, uniform crossover and Gaussian mutation are applied to the feature attributes in turn, and S32 is executed again.
S4, optimizing the penalty factor and the kernel function parameter of the SVM by using the cuckoo algorithm, and inputting the reduced feature subset into the optimized SVM to obtain the classification recognition result.
Referring to fig. 4, the specific steps of the cuckoo algorithm for optimizing SVM parameters include:
S41, initialization settings: including the discovery probability P_a, the number of iterations N, the number n of bird nests, the upper and lower limits, the penalty factor c of the SVM and the RBF kernel function parameter σ;
S42, initializing n bird nest positions, calculating the fitness values of all bird nests, and storing the current optimal position and fitness value. Each bird nest is a feasible solution whose fitness value is obtained by substituting it into the objective function; among the n initialized feasible solutions the best value (maximum or minimum, chosen according to the specific requirement) is kept, giving the current optimal nest position and fitness value;
S43, updating the nest positions according to the position-update formula, comparing each with the fitness value of the nest at the corresponding position in the previous generation, and retaining the position with the smaller fitness value, together with that fitness value, as the optimal bird nest;
S44, automatically generating a random number r from a Gaussian random function and comparing it with the given discovery probability P_a: if r > P_a, the bad bird nest is discarded; otherwise, the bird nest is updated;
S45, recalculating the fitness values of the nests and replacing nests having high fitness values to generate a group of new nest positions;
S46, judging whether the iteration count is finished; if so, stopping the search to obtain the global optimal fitness value and the corresponding optimal bird nest; if not, jumping to S43 to continue the optimization;
and S47, constructing an SVM prediction model according to the optimal parameters c and sigma corresponding to the optimal bird nest position.
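Steps S41–S47 can be sketched end-to-end. The sketch below is a hedged reimplementation: it follows the loop structure above but, to stay self-contained, minimizes a synthetic stand-in for the SVM cross-validation error over (c, σ) instead of training an actual SVM; pseudo_cv_error, the log-uniform initialization, the bounds and all constants are assumptions:

```python
import math
import numpy as np

rng = np.random.default_rng(7)

def levy_step(dim, beta=1.5):
    """Mantegna's algorithm: s = mu / |v|**(1/beta) yields Levy-like steps."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (num / den) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def cuckoo_search(obj, lo, hi, n=15, iters=200, pa=0.25, alpha0=0.01):
    dim = len(lo)
    # S41/S42: initialize n nests (log-uniform, a common choice for c, sigma)
    nests = 10 ** rng.uniform(np.log10(lo), np.log10(hi), (n, dim))
    fit = np.array([obj(x) for x in nests])
    best = nests[fit.argmin()].copy()
    for _ in range(iters):                       # S46: fixed iteration budget
        for i in range(n):                       # S43: Levy-flight update
            step = alpha0 * (nests[i] - best) * levy_step(dim)
            cand = np.clip(nests[i] + step, lo, hi)
            f = obj(cand)
            if f < fit[i]:                       # keep the better nest
                nests[i], fit[i] = cand, f
        for i in range(n):                       # S44/S45: abandon w.p. pa
            if rng.random() < pa:
                a, b = rng.integers(n), rng.integers(n)
                cand = np.clip(nests[i] + rng.uniform(-1, 1, dim)
                               * (nests[a] - nests[b]), lo, hi)
                f = obj(cand)
                if f < fit[i]:
                    nests[i], fit[i] = cand, f
        best = nests[fit.argmin()].copy()        # S47: best (c, sigma) so far
    return best, float(fit.min())

def pseudo_cv_error(x):
    """Stand-in for the SVM cross-validation error over (c, sigma);
    its minimum sits at c = 10, sigma = 10**-0.5."""
    c, s = x
    return (math.log10(c) - 1.0) ** 2 + (math.log10(s) + 0.5) ** 2

lo, hi = np.array([1e-2, 1e-3]), np.array([1e3, 1e2])
best, err = cuckoo_search(pseudo_cv_error, lo, hi)
```

In the actual method the returned pair would be used as the penalty factor c and RBF parameter σ of the CS-SVM model; here the objective is synthetic, so only the search behavior is demonstrated.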
The search path of the cuckoo search (CS) algorithm is the Levy flight, a walk that alternates short-range exploration with occasional long-distance jumps; this expands the search range, increases the diversity of the population, and helps avoid local optima. The relevant definitions are as follows:
the formula for searching the nest position by the CS algorithm is as follows:
x_i^(t+1) = x_i^t + α ⊕ L(λ),  i = 1, 2, …, n
where x_i^t denotes the position of the i-th bird nest in the t-th generation, and α is a step-size control quantity (generally taken as 0.1) used to determine the random search range:
α = α_0 (x_i^t − x_best)
wherein α_0 is a constant (α_0 = 0.01) and x_best represents the current optimal solution.
In the nest-position search formula, ⊕ denotes point-to-point (entry-wise) multiplication, and L(λ) is a random search path obeying a Lévy distribution, Lévy ~ u = t^(−λ), 1 < λ ≤ 3. The corresponding step-update formula is:

s = μ / |ν|^(1/λ)
wherein μ and ν both obey normal distributions, μ ~ N(0, σ_μ²) and ν ~ N(0, σ_ν²), with

σ_μ = {Γ(1 + λ) · sin(πλ/2) / [Γ((1 + λ)/2) · λ · 2^((λ−1)/2)]}^(1/λ),  σ_ν = 1,
where Γ is the standard Gamma function.
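The Lévy step defined by the formulas above (Mantegna's method: s = μ/|ν|^(1/λ) with the Gamma-function expression for σ_μ) can be sketched as follows; λ = 1.5 is an illustrative choice within the stated range 1 < λ ≤ 3.

```python
import math
import random

def mantegna_sigma(lam):
    # sigma_mu from the Gamma-function formula; sigma_nu = 1
    return (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
            / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)

def levy_sample(lam=1.5):
    mu = random.gauss(0.0, mantegna_sigma(lam))  # mu ~ N(0, sigma_mu^2)
    nu = random.gauss(0.0, 1.0)                  # nu ~ N(0, sigma_nu^2), sigma_nu = 1
    return mu / abs(nu) ** (1 / lam)             # s = mu / |nu|^(1/lam)
```

The heavy tail of the resulting step distribution is what produces the occasional long-distance jumps described in the text.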
The performance evaluation of medical image recognition results commonly uses the two indexes of sensitivity and specificity, but these alone can hardly describe the overall performance of a classifier. Therefore, this embodiment sets evaluation indexes for the feature selection and classification recognition stages separately. The feature selection stage uses reduction length, attribute importance and time. The classification recognition stage uses Accuracy, Sensitivity, Specificity, F value, Matthews correlation coefficient (MCC), balanced F score (F1 Score), Youden index (YI) and Time, calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
F = 2 × Precision × Sensitivity / (Precision + Sensitivity),  Precision = TP / (TP + FP)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
F1 Score = 2TP / (2TP + FP + FN)
YI=Sensitivity+Specificity-1
wherein TP represents the number of correctly identified malignant target contours; FP represents the number of benign tumor target contours misidentified as malignant; TN represents the number of correctly identified benign tumor target contours; and FN represents the number of malignant tumor target contours misidentified as benign.
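For illustration, the evaluation indexes can be computed directly from the four confusion-matrix counts. This is a sketch based on the standard definitions; the exact F-value form used by the embodiment is an assumption (here the harmonic mean of precision and sensitivity).

```python
import math

def classification_indexes(tp, fp, tn, fn):
    """Indexes named in the text, computed from the confusion-matrix counts."""
    precision = tp / (tp + fp)
    sens = tp / (tp + fn)                 # Sensitivity (recall on malignant cases)
    spec = tn / (tn + fp)                 # Specificity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sens,
        "specificity": spec,
        # assumed F-value form (harmonic mean of precision and sensitivity)
        "f": 2 * precision * sens / (precision + sens),
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "youden": sens + spec - 1,        # YI = Sensitivity + Specificity - 1
    }
```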
In order to verify the feasibility and effectiveness of the technical scheme, two groups of comparison experiments are designed. Experiment 1 verifies the feasibility and effectiveness of the BRSGA feature selection algorithm: the SVM parameters are fixed to be optimized by the GS algorithm, and the BRSGA algorithm is compared with the VPRSGA algorithm at different stages under different β values. Experiment 2 fixes the feature selection algorithm on the basis of Experiment 1 and compares the CS-SVM, GS-SVM, GA-SVM and PSO-SVM algorithms at the classification stage.
Experiment one: experimental result comparison based on same classification algorithm and different feature selection algorithms
The fixed classification recognition algorithm is GS-SVM. The advantages and disadvantages of the BRSGA and VPRSGA algorithms under different parameters are compared in the two stages of feature selection and classification recognition, with the VPRS parameter β set to 0.1, 0.2, 0.3 and 0.4 respectively. Specific results are shown in Table 3, Fig. 5 and Table 4. In the optimal feature subset generation stage, each parameter combination is reduced 5 times to obtain the reduction length, attribute importance and time, and the average of each index over the 5 reduction results is taken as the experimental result under that parameter. In the classification recognition stage, five-fold cross-validation is performed with LIBSVM on each reduction result (i.e., 300 benign and 300 malignant tumor cases are selected as the test set each time, with the remaining data as the training set), yielding five groups of recognition results per parameter, including accuracy, sensitivity, specificity, F value, MCC, F1 Score, YI and time. The average of each index over the five folds is taken as the classification result after that reduction, and finally the average of each index is taken as the reduction and classification result under the parameter combination.
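The five-fold protocol described above (each fold once as test set, the rest as training set, each index averaged over the folds) can be sketched as follows. The splitting and scoring functions here are illustrative stand-ins, not the LIBSVM pipeline of the experiment.

```python
import random
from statistics import mean

def five_fold_split(n, seed=0):
    """Split n sample indices into 5 disjoint folds; each fold serves once as
    the test set (e.g. 300 benign + 300 malignant cases) with the rest as
    the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[k::5]) for k in range(5)]

def five_fold_average(n, train_and_score):
    """Average a score over the five folds, as done per evaluation index."""
    scores = []
    for test in five_fold_split(n):
        held = set(test)
        train = [i for i in range(n) if i not in held]
        scores.append(train_and_score(train, test))
    return mean(scores)
```

`train_and_score` would train the classifier on the training indices and return one evaluation index on the test indices; averaging that index over the folds gives the per-reduction result described in the text.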
TABLE 3 comparison of attribute reduction results for different feature selection algorithms
Figure BDA0002461998630000131
As can be seen from Table 3, the reduction length of the present invention is 7.8 dimensions without manually setting the classification error rate β, falling between the reduction lengths of the VPRSGA algorithm at different β values: it is significantly shorter than at β = 0.1 and slightly longer than at β = 0.2 and β = 0.4. The fitness value is only slightly higher than that of the VPRSGA algorithm at β = 0.4. The importance is 0.0002 lower than that of VPRSGA at β = 0.4 but higher than the VPRSGA model under the other parameters. The reduction time is higher than that of the VPRSGA model at β = 0.2 but 16.54 to 419.35 seconds lower than VPRSGA at the other parameter values; in particular, the time is shortened by a factor of 2.7 compared with β = 0.1. As can be seen from Fig. 5, the present algorithm shows no premature convergence in the reduction stage, whereas the VPRSGA algorithm exhibits premature convergence to different degrees under different β values; for example, in Fig. 5b the VPRSGA algorithm at β = 0.1 shows a rather severe premature convergence in one reduction result. Therefore, compared with the VPRSGA algorithm, the method is free from the constraint of manual parameter setting and achieves a relatively ideal effect.
TABLE 4 comparison of different feature selection algorithm classification recognition results
Figure BDA0002461998630000132
Figure BDA0002461998630000141
As can be seen from Table 4: compared with the VPRSGA algorithm at β = 0.1, the accuracy, specificity, MCC, F1 Score and YI of the invention are reduced by 0.07%, 0.43%, 0.0015, 0.0006 and 0.0013 respectively, but the sensitivity is improved by 0.3%, and the classification time of the VPRSGA algorithm at β = 0.1 is 3.4 times that of the BRSGA algorithm. Although the accuracy of the BRS algorithm decreases within an acceptable range, the time consumption is greatly reduced; considering accuracy and time consumption together, the overall performance of the BRS algorithm is better than that of the VPRS algorithm at β = 0.1. Compared with the VPRSGA algorithm at β = 0.2, 0.3 and 0.4, the BRSGA algorithm reduces the time while improving the other indexes to different degrees, with the improvement most obvious relative to the indexes of the VPRSGA algorithm at β = 0.2. From the classification results, compared with the VPRSGA model, the BRSGA model is not only free from the constraint of parameters but also improves the classification performance of the model.
As can be seen from the experiments, when the SVM parameters are optimized by the grid search algorithm, the BRSGA feature selection algorithm, compared with the VPRSGA algorithm, is free from the constraint of manual parameter setting and shows a relatively ideal effect in both the attribute reduction and classification stages; therefore, BRSGA is used as the fixed feature selection algorithm when verifying the effectiveness of the CS algorithm for SVM parameter optimization.
Experiment two: experimental result comparison of different classification algorithms based on same feature selection algorithm
The optimal feature subset generation algorithm is fixed as BRSGA, the SVM parameters are optimized by the CS algorithm, and the result is compared with GS-SVM, GA-SVM and PSO-SVM. Classification recognition is performed on the results of the 5 reductions by the BRSGA algorithm in Experiment 1; each time, five-fold cross-validation yields classification results including accuracy, sensitivity, specificity, F value, MCC, F1 Score, YI and time. The average of each index over the five folds is taken as the classification result after that reduction, and the average over the five reductions is the final result of the classification model. To quantitatively describe whether the differences in recognition accuracy between this algorithm and the comparison algorithms are statistically significant, hypothesis testing is performed with a paired t-test. The statistical hypothesis test is based on five indexes that comprehensively describe performance: accuracy, F value, MCC, F1 Score and YI, with the significance level set at p < 0.05. The null hypothesis is that the difference between the means of the same evaluation index for the present invention and a comparison algorithm is 0. The mean and standard deviation of the recognition results of the five reduction-classification runs are given for each evaluation index, as shown in Table 5, and the means of the five reduction results for each index are plotted as line graphs in Fig. 6.
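The paired t-test used for the significance analysis can be sketched as follows; the sample values in the usage example are hypothetical, and in the experiment the paired samples would be the per-reduction index values of CS-SVM versus a comparison algorithm.

```python
import math

def paired_t(a, b):
    """Two-sided paired t statistic on matched samples; returns (t, df).
    |t| is then compared against the critical value for df at alpha = 0.05."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    m = sum(d) / n                                   # mean difference (null hypothesis: 0)
    var = sum((x - m) ** 2 for x in d) / (n - 1)     # unbiased sample variance
    return m / math.sqrt(var / n), n - 1
```

For df = 4 the two-sided 5% critical value is about 2.776, so a t statistic well above that rejects the null hypothesis of equal means.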
TABLE 5 comparison of results of different classification algorithms based on BRSGA selection algorithm
Figure BDA0002461998630000151
* Indicates that the marked result has a significant difference from the corresponding index of the present algorithm (CS-SVM) at a significance level of 0.05.
As can be seen from Table 5, the quantitative analysis results show that the invention is superior to the other three comparison algorithms on all five evaluation indexes, with statistically significant differences throughout. Viewing Fig. 6 as a whole, the classification results after the five BRSGA reductions show a fluctuating trend: most classification indexes of the 4th reduction are relatively better, while all indexes of the 3rd reduction except classification time are relatively lower. Because the initial population of the genetic algorithm is randomly generated during optimal feature subset generation, each reduction result differs; each parameter combination is therefore reduced 5 times and classified by five-fold cross-validation, and the overall performance of the model is finally evaluated by the average of each index over the reduction and classification results, which effectively avoids one-sided evaluation.
Fig. 6 compares the classification results at the classification recognition stage: (a) accuracy; (b) classification time; (c) F value; (d) sensitivity; (e) specificity; (f) MCC; (g) F1 Score; (h) Youden index. Over the five reductions, the CS-SVM algorithm is higher than the GS-SVM, GA-SVM and PSO-SVM algorithms on 6 evaluation indexes (accuracy, F value, sensitivity, MCC, F1 Score and Youden index), while its classification time is slightly higher than that of GS-SVM. The classification time of the PSO-SVM algorithm is far longer than that of the other three algorithms; in the 3rd reduction, its 6 evaluation indexes other than sensitivity are higher than those of the GS-SVM and GA-SVM algorithms, while in the other 4 reductions all of its indexes are lower than those of GS-SVM, GA-SVM and CS-SVM. As can be seen from Figs. 6a, c, f, g and h, the CS-SVM algorithm in the fifth reduction is higher than GS-SVM, GA-SVM and PSO-SVM on all comprehensive evaluation indexes, showing a certain robustness and high popularization value.
The classification time of GS-SVM is lower than that of CS-SVM for two reasons. First, because medical image data are difficult to acquire, the test set contains only 600 cases, so the time complexity is lower than that of the CS algorithm; but data in real clinical practice are massive and grow sharply, even exponentially, every day, and as the number of samples increases, the time complexity of the GS algorithm rises greatly and cannot meet the requirements of clinical application. Second, the search range of the GS algorithm is given by experience and thus carries a certain randomness, so the optimal parameters may not be obtained. The CS algorithm is a swarm intelligence search algorithm with both local and global search capabilities; it widens the search field, enriches the diversity of the population, has good robustness, and, compared with the GS algorithm, can effectively avoid the randomness introduced by experience.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A high-dimensional feature selection algorithm based on a Bayesian rough set and a cuckoo algorithm is characterized by comprising the following steps:
s1, obtaining a lung tumor image, and performing target contour segmentation to obtain a segmented ROI image;
s2, extracting high-dimensional feature components of the segmented ROI image, and constructing a decision information table containing feature attributes based on the high-dimensional feature components, wherein the feature attributes correspond to features of different dimensions in the high-dimensional feature components;
s3, based on a Bayes rough set model, constructing a fitness objective function by utilizing the weighted summation of a global relative gain function, an attribute reduction length and a gene coding weight function, and reducing the feature attributes by combining genetic operator combination to obtain a reduced feature subset; the method comprises the following steps: s31, constructing the fitness objective function:
the first objective function is a global relative gain function of the equivalence relation E with respect to the feature attribute D:
Figure FDA0003657514110000011
measuring the attribute importance of the information system S by adopting global relative gain;
the second objective function is attribute reduction length:
Figure FDA0003657514110000012
wherein |C| is the number of conditional attributes and L_r is the number of genes equal to 1 in chromosome r;
the third objective function is a gene coding weight function:
Figure FDA0003657514110000013
wherein the numerator is the sum of products over the genes whose values are neither 0 nor 1, and the denominator is the length of the chromosome;
constructing the fitness objective function F(x) = −ω1 × target1 − ω2 × target2 + ω3 × target3, and performing feature attribute reduction on the feature attributes;
s4, optimizing the penalty factor and the kernel function of the SVM by using a cuckoo algorithm, and inputting the reduced feature subset into the optimized SVM to obtain a classification recognition result.
2. The high-dimensional feature selection algorithm based on the Bayesian rough set and the cuckoo algorithm as claimed in claim 1, wherein the high-dimensional feature components in S2 comprise shape features, texture features and gray-scale features of the lung tumor images.
3. The high-dimensional feature selection algorithm based on bayesian rough set and cuckoo algorithm as claimed in claim 1, wherein said S3 further comprises the steps of:
s32, optimizing the genetic operator according to the fitness objective function:
calculating the fitness values of the feature attributes according to the fitness objective function and judging whether the termination condition is satisfied; if so, obtaining the reduced feature subset; if not, sequentially applying to the feature attributes the genetic algorithm operations consisting of remainder stochastic selection without replacement, uniform crossover and Gaussian mutation, and executing S32 again.
4. The high-dimensional feature selection algorithm based on the bayesian rough set and the cuckoo algorithm as claimed in claim 1, wherein the step of optimizing the SVM parameters by the cuckoo algorithm in S4 comprises:
S41, initialization setting: including the discovery probability P_a, the number of iterations N, the number of bird nests n, the upper and lower bounds, the penalty factor c of the SVM and the RBF kernel function parameter σ;
s42, initializing n bird nest positions, calculating the fitness value of all bird nests, and storing the current optimal position and the fitness value;
S43, updating the position of the bird nest according to a formula, comparing it with the fitness value of the bird nest at the corresponding position of the previous generation, and keeping the bird nest position with the minimum fitness value, together with its fitness value, as the optimal bird nest;
S44, generating a random number r and comparing it with the given probability P_a: if r > P_a, discarding the bad bird nest; otherwise, updating the bird nest;
S45, recalculating the fitness values of the bird nests, and replacing the bird nests with high fitness values to generate new bird nest positions;
s46, judging whether iteration times are finished, if so, stopping searching to obtain a global optimal fitness value and a corresponding optimal bird nest, and if not, jumping to S43 to continue optimizing;
and S47, constructing an SVM prediction model according to the optimal parameters c and sigma corresponding to the optimal bird nest position.
CN202010322570.8A 2020-04-22 2020-04-22 High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm Active CN111583194B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010322570.8A CN111583194B (en) 2020-04-22 2020-04-22 High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010322570.8A CN111583194B (en) 2020-04-22 2020-04-22 High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm

Publications (2)

Publication Number Publication Date
CN111583194A CN111583194A (en) 2020-08-25
CN111583194B true CN111583194B (en) 2022-07-15

Family

ID=72111635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010322570.8A Active CN111583194B (en) 2020-04-22 2020-04-22 High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm

Country Status (1)

Country Link
CN (1) CN111583194B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111577B (en) * 2021-04-01 2023-05-05 燕山大学 Cement mill operation index decision method based on multi-target cuckoo search
CN114627964B (en) * 2021-09-13 2023-03-24 东北林业大学 Prediction enhancer based on multi-core learning and intensity classification method and classification equipment thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784353A (en) * 2016-08-29 2018-03-09 普天信息技术有限公司 A kind of function optimization method based on cuckoo searching algorithm
CN109186971A (en) * 2018-08-06 2019-01-11 江苏大学 Hub motor mechanical breakdown inline diagnosis method based on dynamic bayesian network
CN109325580A (en) * 2018-09-05 2019-02-12 南京邮电大学 A kind of adaptive cuckoo searching method for Services Composition global optimization
CN109978880A (en) * 2019-04-08 2019-07-05 哈尔滨理工大学 Lung tumors CT image is carried out sentencing method for distinguishing using high dimensional feature selection

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190253558A1 (en) * 2018-02-13 2019-08-15 Risto Haukioja System and method to automatically monitor service level agreement compliance in call centers
US20190318248A1 (en) * 2018-04-13 2019-10-17 NEC Laboratories Europe GmbH Automated feature generation, selection and hyperparameter tuning from structured data for supervised learning problems

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784353A (en) * 2016-08-29 2018-03-09 普天信息技术有限公司 A kind of function optimization method based on cuckoo searching algorithm
CN109186971A (en) * 2018-08-06 2019-01-11 江苏大学 Hub motor mechanical breakdown inline diagnosis method based on dynamic bayesian network
CN109325580A (en) * 2018-09-05 2019-02-12 南京邮电大学 A kind of adaptive cuckoo searching method for Services Composition global optimization
CN109978880A (en) * 2019-04-08 2019-07-05 哈尔滨理工大学 Lung tumors CT image is carried out sentencing method for distinguishing using high dimensional feature selection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"High-dimensional feature selection algorithm for lung tumor CT images based on Bayesian rough set"; Zhang Feifei, et al.; 《生物医学工程研究》 (Biomedical Engineering Research); April 2018; full text *

Also Published As

Publication number Publication date
CN111583194A (en) 2020-08-25

Similar Documents

Publication Publication Date Title
Xia et al. Complete random forest based class noise filtering learning for improving the generalizability of classifiers
JP2022538866A (en) System and method for image preprocessing
Xu et al. Texture-specific bag of visual words model and spatial cone matching-based method for the retrieval of focal liver lesions using multiphase contrast-enhanced CT images
CN112464005B (en) Depth-enhanced image clustering method
CN110969626A (en) Method for extracting hippocampus of human brain nuclear magnetic resonance image based on 3D neural network
CN111553127A (en) Multi-label text data feature selection method and device
CN111340135B (en) Renal mass classification method based on random projection
CN111583194B (en) High-dimensional feature selection algorithm based on Bayesian rough set and cuckoo algorithm
CN112149717A (en) Confidence weighting-based graph neural network training method and device
CN114596467A (en) Multimode image classification method based on evidence deep learning
CN111110192A (en) Skin abnormal symptom auxiliary diagnosis system
KR20230029004A (en) System and method for prediction of lung cancer final stage using chest automatic segmentation image
CN110910325B (en) Medical image processing method and device based on artificial butterfly optimization algorithm
Somase et al. Develop and implement unsupervised learning through hybrid FFPA clustering in large-scale datasets
CN117036894B (en) Multi-mode data classification method and device based on deep learning and computer equipment
CN117195027A (en) Cluster weighted clustering integration method based on member selection
US20240144474A1 (en) Medical-image-based lesion analysis method
AU2021102593A4 (en) A Method for Detection of a Disease
CN115310491A (en) Class-imbalance magnetic resonance whole brain data classification method based on deep learning
CN114821157A (en) Multi-modal image classification method based on hybrid model network
CN113177608A (en) Neighbor model feature selection method and device for incomplete data
CN112735596A (en) Similar patient determination method and device, electronic equipment and storage medium
Mehta et al. Soft-computing based diagnostic tool for analyzing demyelination in magnetic resonance images
Hadavi et al. Classification of normal and abnormal lung ct-scan images using cellular learning automata
CN113096828B (en) Diagnosis, prediction and major health management platform based on cancer genome big data core algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant