CN111583194A - High-Dimensional Feature Selection Algorithm Based on Bayesian Rough Sets and Cuckoo Algorithm - Google Patents
- Publication number
- CN111583194A (application number CN202010322570.8A)
- Authority
- CN
- China
- Prior art keywords
- algorithm
- feature
- svm
- fitness
- bird nest
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
- G06T7/0012—Biomedical image inspection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2111—Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/12—Edge-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/136—Segmentation; Edge detection involving thresholding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10072—Tomographic images
- G06T2207/10081—Computed x-ray tomography [CT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30096—Tumor; Lesion
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Radiology & Medical Imaging (AREA)
- Medical Informatics (AREA)
- Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
- Physiology (AREA)
- Quality & Reliability (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical Field
The present invention relates to the technical field of medical image recognition, and more particularly to a high-dimensional feature selection algorithm based on Bayesian rough sets and the cuckoo algorithm.
Background
With the development of computer-aided diagnosis (CAD) research, medical image processing technology has advanced rapidly. However, the multimodality, grayscale ambiguity, and uncertainty of medical images keep the rates of missed diagnosis and misdiagnosis high when a single imaging modality is used. Multimodal medical image processing techniques have therefore emerged, and they are divided by level into pixel-level, feature-level, and decision-level processing. Feature-level processing compresses the amount of information while retaining the important information, and it is faster. In feature-level processing of medical images, redundancy and correlation among features turn the "curse of dimensionality" into an NP-hard problem. Feature selection is an effective remedy: it reduces the dimensionality of the feature space and lowers time complexity.
The high-dimensional feature selection process raises several questions: how to generate the optimal feature subset, how to evaluate the result, which classifier to use for evaluation, and how to optimize the classifier parameters. Many algorithms have been proposed in recent years to address them. First, the variable precision rough set (VPRS) overcomes the limitation that the rough set (RS) can only handle exactly classified data. By introducing a classification error rate β, it relaxes the lower approximation of RS from "complete inclusion" to "partial inclusion", which improves the robustness and generalization of results on noisy data sets. The core of VPRS research is the choice of β, and the work falls into three areas. First, setting aside the details of choosing β, various extended VPRS models have been proposed, such as variable precision fuzzy rough sets, variable precision multi-granulation rough sets, generalized VPRS, and extended VPRS based on the β-tolerance relation and the Bhattacharyya distance. Second, β is obtained by different computational means, for example by using the average inclusion degree as the threshold for selecting the upper and lower approximations. Third, probability formulas are introduced to build probabilistic RS models, such as VPRS, game-theoretic rough sets, decision-theoretic rough sets, Bayesian rough sets (BRS), and 0.5-probabilistic rough sets. The probabilistic rough set methods are related to one another and differ in how the probability formula is computed and how the parameters are designed. BRS builds on VPRS by introducing the prior probability and using it in place of the classification error rate β, so no parameter has to be set manually. This overcomes the fully exact partition of the lower approximation in RS and avoids the influence of the parameter β on the upper and lower approximations in VPRS. Much of the research on BRS is still at the theoretical analysis stage; mature, independent models are lacking, and BRS has not yet been combined with other algorithms to handle high-dimensional feature selection for medical images.
Second, classifier performance is the basis for evaluating a high-dimensional feature selection algorithm. The support vector machine (SVM) is a widely used binary classification algorithm, and the introduction of kernel functions has broadened its range of application. Commonly used kernels include the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. The polynomial kernel is slow to compute, which seriously affects its performance, so it is used less often. The RBF kernel has fewer parameters than the sigmoid kernel, only the kernel matrix needs to be computed, and its time complexity is low. Setting the parameters manually, however, is laborious and time-consuming, and the resulting parameters are not necessarily optimal, so parameter selection needs to be treated as an optimization problem.
Therefore, how to provide a high-dimensional feature selection algorithm based on Bayesian rough sets and the cuckoo algorithm that has low time complexity and good robustness is a problem that those skilled in the art urgently need to solve.
Summary of the Invention
In view of this, the present invention provides a high-dimensional feature selection algorithm based on Bayesian rough sets and the cuckoo algorithm. It combines the BRS, genetic algorithm (GA), cuckoo algorithm (CS), and SVM algorithms into a high-dimensional feature selection algorithm with two-stage optimization based on BRSGA and CS. The first stage uses the BRSGA algorithm to reduce the original feature space and obtain the optimal feature subset. The second stage uses the CS algorithm to optimize the penalty factor and the kernel function parameter of the SVM, and the optimal parameter combination is used to build a CS-SVM classification model that recognizes lung tumor images.
To achieve the above object, the present invention provides the following technical solution:
A high-dimensional feature selection algorithm based on Bayesian rough sets and the cuckoo algorithm, comprising the following steps:
S1. Acquire a lung tumor image and segment the target contour to obtain a segmented ROI image;
S2. Extract high-dimensional feature components from the segmented ROI image, and build a decision information table containing feature attributes based on the high-dimensional feature components, where the feature attributes correspond to the features of the different dimensions in the high-dimensional feature components;
S3. Based on the Bayesian rough set model, construct a fitness objective function as the weighted sum of the global relative gain function, the attribute reduction length, and the gene encoding weight function, and reduce the feature attributes using a combination of genetic operators to obtain a reduced feature subset;
S4. Optimize the penalty factor and the kernel function of the SVM with the cuckoo algorithm, and input the reduced feature subset into the optimized SVM to obtain the classification result.
Preferably, the high-dimensional feature components in S2 comprise shape features, texture features, and grayscale features of the lung tumor image.
Preferably, S3 specifically comprises the following steps:
S31. Construct the fitness objective function:
Objective function 1 (target1) is the global relative gain function of the equivalence relation E with respect to the feature attribute D; the global relative gain measures attribute importance in the information system S;
Objective function 2 (target2) is the attribute reduction length,
where |C| is the number of condition attributes and L_r is the number of genes equal to 1 in chromosome r;
Objective function 3 (target3) is the gene encoding weight function,
where the numerator is the sum of products contributed by genes other than 0 and 1, and the denominator is the length of the chromosome;
The fitness objective function F(x) = -ω1×target1 - ω2×target2 + ω3×target3 is constructed to reduce the feature attributes;
S32. Search for the optimum with the genetic operators according to the fitness objective function:
Compute the fitness value of the feature attributes according to the fitness objective function and check whether the termination condition is met. If so, the reduced feature subset is obtained. Otherwise, apply to the feature attributes, in order, the genetic algorithm operations of remainder stochastic selection without replacement, uniform crossover, and Gaussian mutation, and execute S32 again.
Preferably, the specific steps by which the cuckoo algorithm optimizes the SVM parameters in S4 comprise:
S41. Initialization: set the probability P_a, the number of iterations N, the number of nests n, the upper and lower bounds, the SVM penalty factor c, and the RBF kernel parameter σ;
S42. Initialize n nest positions, compute the fitness values of all nests, and save the current best position and fitness value;
S43. Update the nest positions according to the formula, compare each with the fitness value of the nest at the corresponding position in the previous generation, and keep the nest position with the smallest fitness value, together with that fitness value, as the best nest;
S44. Generate a random number r and abandon poor nests with the given probability P_a: if r > P_a, update the nest; otherwise leave it unchanged;
S45. Recompute the fitness values of the nests, replace nests of low fitness with nests of high fitness, and generate a new set of nest positions;
S46. Check whether the number of iterations has been reached. If so, stop the search and obtain the global best fitness value and the corresponding best nest; if the stopping condition is not met, jump to S43 and continue the search;
S47. Build the SVM prediction model from the optimal parameters c and σ corresponding to the best nest position.
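Steps S41 to S47 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not give the position update formula, so a Lévy step drawn with Mantegna's algorithm is assumed for S43 and a random walk between two randomly chosen nests is assumed for S44 and S45. The names `levy_step`, `cuckoo_search`, `alpha`, `beta`, and `stand_in_error` are hypothetical, and `stand_in_error` is only a placeholder for the SVM classification error that a real run would compute from (c, σ).

```python
import math
import random


def levy_step(rng, beta=1.5):
    """One heavy-tailed step via Mantegna's algorithm (beta is an assumed exponent)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (num / den) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # guard against division by zero
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def cuckoo_search(objective, bounds, n_nests=15, n_iter=100, pa=0.25,
                  alpha=0.01, seed=0):
    """Minimise objective(position) within bounds; returns (best_position, best_fitness)."""
    rng = random.Random(seed)

    def clip(x, lo, hi):
        return max(lo, min(hi, x))

    # S42: random initial nests and their fitness values
    nests = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_nests)]
    fitness = [objective(nest) for nest in nests]

    for _ in range(n_iter):  # S46: fixed iteration budget
        # S43: Levy-flight move; keep the candidate only if its fitness is smaller
        for i in range(n_nests):
            cand = [clip(x + alpha * (hi - lo) * levy_step(rng), lo, hi)
                    for x, (lo, hi) in zip(nests[i], bounds)]
            f = objective(cand)
            if f < fitness[i]:
                nests[i], fitness[i] = cand, f
        # S44-S45: when r > pa, try a random-walk replacement and keep it if better
        for i in range(n_nests):
            if rng.random() > pa:
                a, b = rng.sample(range(n_nests), 2)
                step = rng.random()
                cand = [clip(x + step * (xa - xb), lo, hi)
                        for x, xa, xb, (lo, hi)
                        in zip(nests[i], nests[a], nests[b], bounds)]
                f = objective(cand)
                if f < fitness[i]:
                    nests[i], fitness[i] = cand, f

    best = min(range(n_nests), key=lambda i: fitness[i])
    return nests[best], fitness[best]


def stand_in_error(position):
    """Hypothetical smooth objective standing in for the SVM error at (c, sigma)."""
    c, sigma = position
    return (math.log10(c) - 1.0) ** 2 + (math.log10(sigma) + 1.0) ** 2


if __name__ == "__main__":
    # S41: assumed search bounds for c and sigma
    print(cuckoo_search(stand_in_error, [(0.01, 100.0), (0.01, 10.0)]))
```

For S47, the returned position would supply c and σ to whatever SVM library is in use; replacing `stand_in_error` with a cross-validated error of that SVM is the only change needed.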
Compared with the prior art, the high-dimensional feature selection algorithm based on Bayesian rough sets and the cuckoo algorithm designed by the present invention has the following advantages:
Attribute importance is analyzed in terms of the global relative gain function, and the fitness function is constructed as a weighted sum that also includes the attribute reduction length and the gene encoding weight function. The optimal feature subset is generated through the genetic operations of selection, crossover, and mutation. This reduces the feature dimensionality without lowering classification accuracy, removes the need for manual parameter setting, and greatly reduces time consumption. CS is used for global optimization of the support vector machine (SVM) parameters. The global search in the CS algorithm has infinite mean and variance, so it explores the search space more effectively than algorithms that use a standard Gaussian process; this widens the search region, enriches population diversity, and gives good robustness and strong global search capability. Combining BRS with an intelligent optimization algorithm for feature selection and using CS to optimize the SVM parameters is feasible and effective.
Description of Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the high-dimensional feature selection algorithm based on Bayesian rough sets and the cuckoo algorithm provided by the present invention;
Fig. 2 compares an ROI region before and after segmentation with the Otsu algorithm, as provided by an embodiment of the present invention;
Fig. 3 is a flowchart of optimal feature subset generation provided by an embodiment of the present invention;
Fig. 4 is a flowchart of SVM parameter optimization with CS provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of how the fitness function changes during one feature subset generation run, as provided by an embodiment of the present invention;
Fig. 6 compares the results of different classification algorithms based on the BRSGA selection algorithm, as provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
An embodiment of the present invention discloses a high-dimensional feature selection algorithm based on Bayesian rough sets and the cuckoo algorithm. As shown in the flowchart of Fig. 1, it comprises data acquisition, data preprocessing, image segmentation, feature extraction, attribute reduction, and classification. During feature reduction, the GA combined with the BRS algorithm searches the original feature set to obtain the reduced optimal feature subset. During classification, CS optimizes the kernel function parameter and the penalty parameter of the SVM to obtain the parameters with the best model fit, and the model is built from these parameters. Finally, the two-stage optimized high-dimensional feature selection algorithm is used to classify lung tumor CT images. The specific procedure is as follows:
S1. Acquire a lung tumor image and segment the target contour to obtain a segmented ROI image.
Before target contour segmentation, the images are acquired and preprocessed:
Taking into account the prevalence of common imaging methods for lung tumor examination, their acceptance by doctors and patients, and their cost, and to avoid the influence of factors such as the specification and model of the examination equipment and the environment on lung tumor CT images, 3000 lung tumor CT images with definite diagnoses were collected, 1500 malignant and 1500 benign. From each image, a sub-image with strong discriminative power was cropped as the ROI region, and all ROI regions were normalized to experimental images of 50×50 pixels.
Target contour segmentation:
Segmenting the target contour (including the lesion contour) from the cropped ROI region plays a vital role in many clinical applications. In current clinical practice, however, radiologists still annotate contours manually, and large amounts of intensive manual work are error-prone, so accurate segmentation by computer has great practical value. This embodiment uses Otsu threshold segmentation, whose core idea is to split the image into two groups; the value at which the between-group variance of the two groups is largest is the optimal segmentation threshold. The basic principle of the Otsu algorithm is as follows:
Suppose an image has size m×n and l gray levels, so the gray-level range is [0, l-1]. Let n_i denote the number of pixels with gray level i; the frequency of gray level i among all pixels is then p_i = n_i/(m×n). Suppose the pixels whose gray level does not exceed q form class A_1, whose gray-level range is [0, q], and the pixels with gray levels in [q+1, l-1] form class A_2. If P_1(q) and P_2(q) denote the probabilities of classes A_1 and A_2, and u_1(q) and u_2(q) denote their mean gray levels, then:
$$P_1(q)=\sum_{i=0}^{q}p_i,\qquad P_2(q)=\sum_{i=q+1}^{l-1}p_i=1-P_1(q)$$
$$u_1(q)=\frac{1}{P_1(q)}\sum_{i=0}^{q}i\,p_i,\qquad u_2(q)=\frac{1}{P_2(q)}\sum_{i=q+1}^{l-1}i\,p_i$$
The between-group variance $\sigma_b^2(q)$ of the image is:
$$\sigma_b^2(q)=P_1(q)\,P_2(q)\,\bigl(u_1(q)-u_2(q)\bigr)^2$$
The optimal segmentation threshold is the value at which the between-group variance is largest, that is, the pixel segmentation threshold is:
$$q^{*}=\arg\max_{0\le q<l-1}\sigma_b^2(q)$$
Here the ROI regions are segmented with the Otsu algorithm. Fig. 2 gives an example of an ROI image before and after Otsu segmentation: Fig. 2(a) is the ROI image before segmentation and Fig. 2(b) is the ROI image after segmentation.
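The Otsu principle described above can be sketched in plain Python as follows. The function name `otsu_threshold` and the flat list input are illustrative assumptions; the patent does not specify an implementation. The code scans every candidate threshold q, accumulates the class probability and the partial mean incrementally, and keeps the q that maximizes the between-group variance.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level q that maximises the between-group variance.

    pixels: flat iterable of integer gray levels in [0, levels - 1].
    Pixels with gray level <= q form class A1, the rest form class A2.
    """
    pixels = list(pixels)
    total = len(pixels)
    hist = [0] * levels
    for g in pixels:
        hist[g] += 1
    p = [h / total for h in hist]                    # p_i = n_i / (m * n)
    mean_all = sum(i * p[i] for i in range(levels))  # overall mean gray level

    best_q, best_var = 0, -1.0
    p1 = 0.0    # P1(q), updated incrementally
    cum = 0.0   # sum of i * p_i for i <= q
    for q in range(levels - 1):
        p1 += p[q]
        cum += q * p[q]
        p2 = 1.0 - p1
        if p1 <= 0.0 or p2 <= 0.0:
            continue                                 # one class is empty
        u1 = cum / p1
        u2 = (mean_all - cum) / p2
        var_b = p1 * p2 * (u1 - u2) ** 2             # between-group variance
        if var_b > best_var:
            best_q, best_var = q, var_b
    return best_q


if __name__ == "__main__":
    roi = [10] * 50 + [200] * 50   # toy bimodal "image"
    q = otsu_threshold(roi)
    mask = [1 if g > q else 0 for g in roi]
    print(q, sum(mask))
```

For a 50×50 ROI the image would simply be flattened into a list of 2500 gray levels before the call.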
S2. Extract the high-dimensional feature components of the segmented ROI image, comprising shape, texture, and grayscale features, 104 dimensions in total; the specific features are listed in Table 1. Based on these components, build a decision information table containing feature attributes, which correspond to the features of the different dimensions in the high-dimensional feature components. The resulting decision information table has size 3000×105 and is discretized with the fuzzy C-means clustering algorithm. After discretization, a numeric label representing whether the tumor is benign or malignant is assigned as the feature attribute and placed in the last column of the decision information table.
Table 1. Feature set of lung tumor CT images
S3. Based on the Bayesian rough set model, construct a fitness objective function as the weighted sum of the global relative gain function, the attribute reduction length, and the gene encoding weight function, and reduce the feature attributes using a combination of genetic operators to obtain a reduced feature subset. This embodiment combines the BRS and GA algorithms for attribute reduction, which lowers the time and space complexity of the classifier and improves classification performance.
As shown in Fig. 3, the reduction specifically comprises the following steps:
S31. Establish the BRS model:
1) Parameter settings: a chromosome is a sequence of 0s and 1s whose length equals the number N of condition attributes; the crossover probability is P_c and the mutation probability is P_m; the maximum number of iterations is K = 150; the initial population size is M = 20; the fitness function is F(x);
2) Encoding: binary encoding is used, with length equal to the number of condition attributes; "0" means the feature is not selected and "1" means the feature is selected;
3) Feature attributes, i.e. generation of the initial population: randomly generate M chromosome strings whose length equals the number of condition attributes to form the initial population;
4) Genetic operators: the genetic operators comprise selection, crossover, and mutation operators; the combination used is remainder stochastic selection without replacement, uniform crossover, and Gaussian mutation.
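The operator combination named in item 4 can be sketched as below. This is an illustrative reading, not the patent's code: the function names, the default values of `pc`, `pm`, and `sigma`, and the convention that larger selection scores are better are all assumptions, since the text gives no values for P_c or P_m. Note that Gaussian mutation produces real-valued genes, which is why a penalty on genes other than 0 and 1 is useful in the fitness function.

```python
import random


def remainder_selection(population, scores, rng):
    """Remainder stochastic selection without replacement (larger score = better)."""
    m = len(population)
    total = sum(scores)
    expected = [m * s / total for s in scores]
    # Integer part of the expected count: guaranteed copies.
    selected = [population[i][:] for i, e in enumerate(expected)
                for _ in range(int(e))]
    # Fractional part: success probability of one extra copy, used at most once.
    frac = [e - int(e) for e in expected]
    while len(selected) < m:
        if max(frac) <= 0.0:  # numerical safety net
            selected.append(rng.choice(population)[:])
            continue
        for i in range(m):
            if len(selected) < m and rng.random() < frac[i]:
                selected.append(population[i][:])
                frac[i] = 0.0
    return selected[:m]


def uniform_crossover(a, b, rng, pc=0.7):
    """With probability pc, swap each gene between the parents with probability 0.5."""
    c1, c2 = a[:], b[:]
    if rng.random() < pc:
        for i in range(len(a)):
            if rng.random() < 0.5:
                c1[i], c2[i] = c2[i], c1[i]
    return c1, c2


def gaussian_mutation(chrom, rng, pm=0.05, sigma=0.5):
    """Add zero-mean Gaussian noise to each gene with probability pm."""
    return [g + rng.gauss(0.0, sigma) if rng.random() < pm else g for g in chrom]


if __name__ == "__main__":
    rng = random.Random(0)
    # M = 20 random binary chromosomes over 104 condition attributes
    pop = [[rng.randint(0, 1) for _ in range(104)] for _ in range(20)]
    scores = [1.0 + sum(ch) for ch in pop]  # placeholder scores, not F(x)
    parents = remainder_selection(pop, scores, rng)
    child1, child2 = uniform_crossover(parents[0], parents[1], rng)
    print(len(parents), len(gaussian_mutation(child1, rng)))
```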
S32. Construct the fitness objective function: the global relative gain function, the attribute reduction length, and the gene encoding weight function are considered together and combined by weighted sum into the fitness function framework, which drives the genetic algorithm's search for the most discriminative feature subset.
The global relative gain function of the equivalence relation E with respect to the feature attribute D is used to measure attribute importance in the information system S;
In the BRS model, the attribute reduction algorithm that uses the global relative gain as heuristic information proceeds as follows:
S321:计算信息系统S=(U,A,V,f)中条件属性的核属性集合γ,并计算是决策属性对条件属性的依赖度RC(D);S321: Calculate the core attribute set γ of the conditional attribute in the information system S=(U, A, V, f), and calculate the dependence degree R C (D) of the decision attribute on the condition attribute;
S322:计算决策属性对核属性的依赖度Rγ(D),若Rγ(D)=RC(D)成立,则转到S324,求得R约简,否则令C=C-γ,对于计算的值,构成集合M;S322: Calculate the degree of dependence R γ (D) of the decision attribute on the core attribute, if R γ (D)=R C (D) is established, then go to S324 to obtain R reduction, otherwise let C=C-γ, for calculate The value of , constitutes a set M;
S323:对集合M中的元素按照升序排序,将其最大值添加到集合γ中,即γ=γ∪Ci,转到S322继续计算;S323: Sort the elements in the set M in ascending order, and add the maximum value to the set γ, that is, γ=γ∪C i , Go to S322 to continue the calculation;
S324:最后所得的就是BRS的一个R约简。S324: The final result is an R reduction of BRS.
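The loop S321 to S324 is a greedy forward selection, and its control flow can be sketched as below. The function `dependency` is a hypothetical stand-in supplied by the caller: it is used here both for the stopping test and for ranking candidates, whereas the text ranks candidates by a global-relative-gain value whose expression is not given.

```python
def greedy_reduct(core, attributes, dependency):
    """Grow the core set one attribute at a time until it scores as high as the full set."""
    gamma = set(core)
    remaining = set(attributes) - gamma
    target = dependency(frozenset(attributes))            # dependency on all condition attributes
    while remaining and dependency(frozenset(gamma)) < target:
        # Pick the remaining attribute whose addition gives the highest score.
        best = max(sorted(remaining), key=lambda a: dependency(frozenset(gamma | {a})))
        gamma.add(best)
        remaining.remove(best)
    return gamma

# Toy stand-in: only attributes "a" and "c" carry information about the decision.
informative = {"a", "c"}
toy_dependency = lambda subset: len(subset & informative) / len(informative)

reduct = greedy_reduct(core={"a"}, attributes={"a", "b", "c", "d"}, dependency=toy_dependency)
print(sorted(reduct))  # ['a', 'c']
```

Sorting the candidates before taking the maximum only makes tie-breaking deterministic; it does not change which score wins.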
The attribute reduction length is defined in terms of |C| and Lr, where |C| is the number of condition attributes and Lr is the number of genes equal to 1 in chromosome r;
The gene-coding weight function is target3=∑abs(ri×(ri-1))/n, where ri is the gene at locus i of chromosome r and n is the chromosome length.
A gene may only take the value 0 or 1; any other value is penalized. Because genes greater than 1 or less than 0 can appear in a chromosome, the gene-coding weight function is constructed as target3. Its numerator sums the products for the genes that are neither 0 nor 1: if the gene at locus i is 0, then 0×(0-1)=0, and if it is 1, then 1×(1-1)=0, so only genes other than 0 and 1 contribute. The denominator is the chromosome length.
Example: let chromosome r=[0 1 -2 3 1], so that (r-1)=[-1 0 -3 2 0]; then:
r×(r-1)=[0 1 -2 3 1]×[-1 0 -3 2 0]=[0 0 6 6 0]
∑abs(r×(r-1))=12 and the chromosome length is 5, so target3=12/5=2.4.
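The worked example can be checked with a few lines of Python. The function name `target3` follows the label used in the text; everything else is illustrative.

```python
def target3(chromosome):
    """Mean of |g * (g - 1)| over the genes; zero when every gene is 0 or 1."""
    return sum(abs(g * (g - 1)) for g in chromosome) / len(chromosome)

print(target3([0, 1, -2, 3, 1]))  # 2.4, matching the worked example
print(target3([0, 1, 1, 0, 1]))   # 0.0 for a valid binary chromosome
```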
Construct the fitness objective function F(x)=-ω1×target1-ω2×target2+ω3×target3 and use it to reduce the feature attributes.
S33. Optimize with the genetic operators according to the fitness objective function:
Compute the fitness value of the feature attributes using the fitness objective function and check whether the termination condition, which is a preset fixed value, is met. If it is, the reduced feature subset is obtained; otherwise apply to the feature attributes, in turn, the genetic algorithm operations of remainder stochastic selection without replacement, uniform crossover, and Gaussian mutation, and execute S33 again.
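The weighted sum and the evaluate-then-breed loop can be sketched as follows. Two points are assumptions rather than statements from the text: that a smaller F is better (suggested by the signs, since the gain and reduction terms enter negatively and the penalty positively), and that the termination condition is the iteration budget K=150. The names `run_ga`, `evaluate`, and `next_generation` are hypothetical.

```python
def fitness(target1, target2, target3, w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum F = -w1*target1 - w2*target2 + w3*target3 (weights are placeholders)."""
    return -w1 * target1 - w2 * target2 + w3 * target3

def run_ga(population, evaluate, next_generation, max_iter=150):
    """Evaluate the population, then breed and re-evaluate until the iteration budget is spent."""
    best = min(population, key=evaluate)
    for _ in range(max_iter):
        population = next_generation(population)   # selection, crossover, mutation
        candidate = min(population, key=evaluate)
        if evaluate(candidate) < evaluate(best):
            best = candidate
    return best

# Toy run: "chromosomes" are integers drifting toward the optimum at 3.
best = run_ga([10, 9], evaluate=lambda x: abs(x - 3),
              next_generation=lambda pop: [x - 1 for x in pop])
print(best)  # 3
```

In a real run, `evaluate` would compute the three targets for a chromosome and pass them to `fitness`, and `next_generation` would chain the three genetic operators.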
S4. Optimize the penalty factor and the kernel function parameter of the SVM with the cuckoo search algorithm, and input the reduced feature subset into the optimized SVM to obtain the classification and recognition result.
Referring to Figure 4, the cuckoo search algorithm optimizes the SVM parameters through the following steps:
S41. Initialization: set the probability Pa, the number of iterations N, the number of nests n, the upper and lower bounds, the SVM penalty factor c, and the RBF kernel parameter σ;
S42. Initialize n nest positions, compute the fitness values of all nests, and save the current best position and fitness value. Each nest is a feasible solution, and its fitness value is obtained by substituting that solution into the objective function. The solution with the best value (maximum or minimum, depending on the specific requirement) is retained, which gives the best nest position and fitness value;
S43. Update the nest positions using the position update formula, compare each with the fitness value of the nest at the corresponding position in the previous generation, and retain the nest position with the smallest fitness value, together with that value, as the best nest;
S44. Generate a random number r from a Gaussian random function and abandon poor nests with the given probability Pa: if r>Pa, update the nest; otherwise leave it unchanged;
S45. Recompute the fitness values of the nests, replace nests of low fitness with nests of high fitness, and generate a new set of nest positions;
S46. Check whether the iteration count has been reached. If so, stop the search and output the global best fitness value and the corresponding best nest; if the stopping condition is not met, return to S43 and continue the search;
S47. Build the SVM prediction model from the optimal parameters c and σ corresponding to the best nest position.
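Steps S41 to S47 can be sketched in plain Python as below. Several parts are assumptions and not taken from the text: the objective is a toy quadratic standing in for the cross-validated SVM error at (c, σ); the Lévy step uses Mantegna's algorithm with an assumed exponent `beta`; the random number r is drawn uniformly on [0, 1) rather than from a Gaussian; and the values of `n_nests`, `n_iter`, and `pa` are placeholders.

```python
import math
import random

def levy_step(rng, beta=1.5):
    """One Levy-distributed step via Mantegna's algorithm (beta is an assumed exponent)."""
    sigma_mu = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
                / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    mu = rng.gauss(0.0, sigma_mu)
    nu = rng.gauss(0.0, 1.0)
    return mu / abs(nu) ** (1 / beta)

def clip(value, low, high):
    return max(low, min(high, value))

def cuckoo_search(objective, bounds, n_nests=15, n_iter=100, pa=0.25, alpha0=0.01, seed=0):
    """Minimise `objective` over the box `bounds`; returns (best_position, best_value)."""
    rng = random.Random(seed)
    dims = range(len(bounds))
    # S42: random initial nests and their fitness values.
    nests = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_nests)]
    scores = [objective(x) for x in nests]
    best_value = min(scores)
    best = list(nests[scores.index(best_value)])

    def keep_if_better(i, trial):
        value = objective(trial)
        if value < scores[i]:
            nests[i], scores[i] = trial, value

    for _ in range(n_iter):                              # S46: fixed iteration budget
        for i in range(n_nests):
            # S43: Levy-flight move scaled by the distance to the current best nest.
            keep_if_better(i, [clip(nests[i][d] + alpha0 * levy_step(rng) * (nests[i][d] - best[d]),
                                    *bounds[d]) for d in dims])
            # S44-S45: when r > pa, try a random-walk replacement built from two other nests.
            if rng.random() > pa:
                a, b = rng.sample(nests, 2)
                keep_if_better(i, [clip(nests[i][d] + rng.random() * (a[d] - b[d]), *bounds[d])
                                   for d in dims])
        value = min(scores)
        if value < best_value:
            best_value, best = value, list(nests[scores.index(value)])
    return best, best_value

def stand_in_error(params):
    """Toy surface with its minimum at c=10, sigma=0.5; replace with a real validation error."""
    c, sigma = params
    return (c - 10.0) ** 2 + (sigma - 0.5) ** 2

bounds = [(0.1, 100.0), (0.01, 10.0)]                    # assumed search box for (c, sigma)
best_params, best_error = cuckoo_search(stand_in_error, bounds)
```

For S47, the returned `best_params` would be passed to the SVM trainer as the penalty factor and RBF kernel parameter.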
The search path of the cuckoo search (CS) algorithm is a Lévy flight. In this kind of walk, short exploratory steps alternate with occasional longer jumps, which enlarges the search range, increases the diversity of the population, and helps avoid local optima. The relevant definitions are as follows:
The CS algorithm searches for nest positions with the formula x_i^(t+1) = x_i^(t) + α⊕L(λ),
where x_i^(t) is the position of the i-th nest in generation t and α is the step-size control quantity, generally taken as 0.1. α determines the random search range: α = α0×(x_i^(t) - x_best),
where α0 is a constant (α0=0.01) and x_best is the current best solution.
In the nest position formula, ⊕ denotes the point-to-point (entry-wise) product and L(λ) is the random search path, which follows the Lévy distribution: Levy~u=t^(-λ), 1<λ≤3. The corresponding position update draws its step from two random variables μ and ν,
both of which follow normal distributions,
with a variance expressed in terms of the standard Gamma function Γ.
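A concrete form of this Lévy step that is widely used in the cuckoo search literature is Mantegna's algorithm. The block below is a sketch under the assumption that the embodiment follows that form; the exponent β (often set to 1.5) and the symbols σ_μ and σ_ν are introduced here for illustration and are unrelated to the VPRS parameter β or the RBF parameter σ used elsewhere in this description.

```latex
\begin{align}
x_i^{(t+1)} &= x_i^{(t)} + \alpha_0 \,\frac{\mu}{|\nu|^{1/\beta}}\,\bigl(x_i^{(t)} - x_{best}\bigr),
\qquad \mu \sim N(0,\sigma_\mu^{2}),\quad \nu \sim N(0,\sigma_\nu^{2}), \\
\sigma_\mu &= \left\{ \frac{\Gamma(1+\beta)\,\sin(\pi\beta/2)}
{\Gamma\!\left(\frac{1+\beta}{2}\right)\beta\,2^{(\beta-1)/2}} \right\}^{1/\beta},
\qquad \sigma_\nu = 1 .
\end{align}
```

For β = 1.5 this gives σ_μ ≈ 0.697, so most steps are small while the heavy tail of μ/|ν|^{1/β} produces the occasional long jump described above.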
Performance evaluation of medical image recognition results relies on two main indicators, sensitivity and specificity, but these two alone can hardly describe the overall performance of a classifier. This embodiment therefore sets evaluation indicators separately for the feature selection stage and the classification and recognition stage. The feature selection stage uses reduction length, attribute importance, and time. The classification and recognition stage uses accuracy, sensitivity, specificity, F value, Matthews correlation coefficient (MCC), balanced F score (F1Score), Youden index (YI), and time, calculated as follows:
YI=Sensitivity+Specificity-1
where TP is the number of successfully identified malignant tumor target contours, FP is the number of incorrectly identified malignant tumor target contours, TN is the number of successfully identified benign tumor target contours, and FN is the number of incorrectly identified benign tumor target contours.
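The indicators can be computed from the four counts as sketched below. The formulas used are the standard textbook definitions, which are assumed here to match the embodiment; the separate "F value" is omitted because its definition is not given in the text, and the counts in the usage line are made-up illustration values, not results from the experiments.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix indicators; no guard against empty classes."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "youden": sensitivity + specificity - 1,
    }

# Illustrative counts for a 300 malignant / 300 benign test set.
metrics = classification_metrics(tp=280, fp=30, tn=270, fn=20)
```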
To verify the feasibility and effectiveness of the technical solution of the present invention, two groups of comparative experiments were designed. Experiment 1 verifies the feasibility and effectiveness of the BRSGA feature selection algorithm: with the grid search (GS) algorithm fixed as the optimizer of the SVM parameters, BRSGA is compared with VPRSGA under different values of β at each stage. Experiment 2 builds on Experiment 1 by fixing the feature selection algorithm and comparing CS-SVM with GS-SVM, GA-SVM, and PSO-SVM at the classification stage.
Experiment 1: Comparison of results for different feature selection algorithms with the same classification algorithm
The classification and recognition algorithm is fixed as GS-SVM, and BRSGA is compared with the VPRS algorithm under different parameters in both the feature selection stage and the classification and recognition stage, with the VPRS parameter β set to 0.1, 0.2, 0.3, and 0.4. The results are shown in Table 3, Figure 5, and Table 4. In the optimal feature subset generation stage, each parameter setting is reduced 5 times, yielding the reduction length, attribute importance, and time for each run; the mean of each indicator over the 5 reductions is taken as the experimental result for that parameter. In the classification and recognition stage, every reduction result for every parameter is classified with LIBSVM using five-fold cross-validation (each time, 300 benign and 300 malignant tumor cases are selected as the test set and the remaining data are used as the training set). Each parameter thus yields five groups of recognition results, comprising accuracy, sensitivity, specificity, F value, MCC, F1Score, YI, and time. The mean of each indicator over the five folds is taken as the classification result for that reduction, and the mean over the five reductions is taken as the reduction and classification result for that parameter setting.
Table 3. Comparison of attribute reduction results for different feature selection algorithms
Table 3 shows that, without any need to set the classification error rate β manually, the present invention yields a reduction length of 7.8 dimensions, which lies within the range of reduction lengths obtained by VPRSGA under different β values: markedly shorter than at β=0.1 and slightly longer than at β=0.2 and β=0.4. Its fitness value is only slightly higher than that of VPRSGA with β=0.4. Its importance is 0.0002 lower than that of VPRSGA with β=0.4 and higher than that of VPRSGA under the other parameters. Its reduction time is longer than that of VPRSGA with β=0.2 but 16.54 to 419.35 seconds shorter than that of VPRSGA under the other parameter values; compared with β=0.1, the time is shortened by a factor of 2.7. Figure 5 shows that the proposed algorithm does not converge prematurely during reduction, whereas VPRSGA converges prematurely to varying degrees under different β values; for example, in Figure 5b, one reduction run of VPRSGA at β=0.1 exhibits fairly severe premature convergence. Compared with VPRSGA, the present invention therefore both removes the need for manual parameter setting and achieves a relatively good result.
Table 4. Comparison of classification and recognition results for different feature selection algorithms
Table 4 shows that, compared with VPRSGA at β=0.1, the accuracy, specificity, MCC, F1Score, and YI of the present invention are lower by 0.07%, 0.43%, 0.0015, 0.0006, and 0.0013, respectively, while sensitivity is higher by 0.3%, and the classification time of VPRSGA at β=0.1 is 3.4 times that of BRSGA. Although the BRS algorithm lowers accuracy within an acceptable range, it greatly reduces time consumption; weighing accuracy against time, the overall performance of the BRS algorithm is better than that of the VPRS algorithm with β=0.1. Compared with VPRSGA at β=0.2, 0.3, and 0.4, BRSGA takes less time and improves every other indicator to varying degrees, with the improvement over β=0.2 being marked. The classification results show that, compared with the VPRSGA model, the BRSGA model both removes the dependence on the parameter and improves the classification performance of the model.
Experiment 1 shows that, with the classification algorithm fixed (that is, with grid search used to optimize the SVM parameters), the BRSGA feature selection algorithm removes the need for manual parameter setting that VPRSGA has and performs relatively well in both the attribute reduction and classification stages. The feature selection algorithm is therefore fixed as BRSGA when verifying the effectiveness of the CS algorithm for SVM parameter optimization.
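Experiment 2: Comparison of results for different classification algorithms with the same feature selection algorithm
The optimal feature subset generation algorithm is fixed as BRSGA, the CS algorithm is used to optimize the SVM parameters, and the result is compared with GS-SVM, GA-SVM, and PSO-SVM. That is, the results of the 5 BRSGA reductions from Experiment 1 are classified; each reduction is evaluated with five-fold cross-validation to obtain accuracy, sensitivity, specificity, F value, MCC, F1Score, YI, and time. The mean of each indicator over the five folds is taken as the classification result for that reduction, and the mean over the five reductions is taken as the final result of the classification model. To determine quantitatively whether the difference in recognition accuracy between the present invention and the comparison algorithms is statistically significant, a paired t-test is used for hypothesis testing, based on five indicators that comprehensively describe correctness: accuracy, F value, MCC, F1Score, and YI. The significance level is set at p<0.05. The null hypothesis is that the difference between the means of the same evaluation indicator for the present invention and for the comparison algorithm is 0. For each evaluation indicator, the mean and standard deviation of the classification results over the five reductions are given in Table 5, and the per-reduction means of each indicator are plotted as line charts in Figure 6.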
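The paired t-test over five reductions can be sketched with the standard library alone. The two score lists below are made-up placeholders, not values from Table 5, and the function name `paired_t` is hypothetical; the critical value 2.776 is the two-sided 5% point of Student's t distribution with 4 degrees of freedom.

```python
import math
import statistics

T_CRIT_DF4 = 2.776  # two-sided critical value at alpha = 0.05 for 5 pairs (4 degrees of freedom)

def paired_t(a, b):
    """t statistic for the mean of the pairwise differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

# Placeholder accuracies for five reductions (illustration only).
cs_svm = [0.93, 0.94, 0.92, 0.95, 0.93]
gs_svm = [0.91, 0.92, 0.91, 0.92, 0.91]

t_stat = paired_t(cs_svm, gs_svm)
significant = abs(t_stat) > T_CRIT_DF4   # reject the null hypothesis of zero mean difference
```

The statistic is undefined when all five differences are identical, because the sample standard deviation is then zero.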
Table 5. Comparison of results for different classification algorithms based on the BRSGA selection algorithm
* indicates that the marked result differs significantly from the corresponding indicator of the proposed algorithm (CS-SVM) at the 0.05 significance level.
Table 5 shows that, in the quantitative analysis, the present invention outperforms the other three comparison algorithms on all five evaluation indicators, with statistically significant differences in each case. Looking at Figure 6 as a whole, the classification results after the five BRSGA reductions fluctuate: most classification indicators are relatively good for the 4th reduction, while for the 3rd reduction every indicator except classification time is relatively low. Because the initial population of the genetic algorithm is generated randomly during optimal feature subset generation, the result differs from one reduction to the next. For this reason each parameter setting is reduced 5 times, classification uses five-fold cross-validation, and the overall performance of the model is evaluated by the mean of each indicator over the reduction and classification results, which effectively avoids a one-sided evaluation.
Figure 6 compares the classification results in the classification and recognition stage: (a) accuracy; (b) classification time; (c) F value; (d) sensitivity; (e) specificity; (f) MCC; (g) F1Score; (h) Youden index. Across the five reductions, CS-SVM is higher than GS-SVM, GA-SVM, and PSO-SVM on six evaluation indicators (accuracy, F value, sensitivity, MCC, F1Score, and Youden index), and its classification time is slightly higher than that of GS-SVM. The classification time of PSO-SVM is far higher than that of the other three algorithms; in the 3rd reduction its other six evaluation indicators, all except sensitivity, are higher than those of GS-SVM and GA-SVM, while in the remaining four reductions all of its indicators are lower than those of GS-SVM, GA-SVM, and CS-SVM. Figures 6a, c, f, g, and h show that, across the five reductions, CS-SVM is higher than GS-SVM, GA-SVM, and PSO-SVM on every comprehensive evaluation indicator, which indicates a degree of robustness and greater value for wider application.
GS-SVM has a lower classification time than CS-SVM for the following reasons. First, because medical image data are hard to obtain, the test set here contains only 600 cases, for which the time complexity of GS is lower than that of CS. Real clinical data, however, are massive and grow sharply every day, even exponentially; as the number of samples increases, the time complexity of GS rises substantially and can no longer meet the needs of clinical application. Second, the search range that GS sets from experience is somewhat arbitrary, so there is no guarantee that the parameters found are optimal. CS, by contrast, is a swarm intelligence search algorithm with both local and global search capability; it widens the search region, enriches the diversity of the population, and is robust, and compared with GS it effectively avoids the arbitrariness introduced by experience.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be cross-referenced. Because the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. The present invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010322570.8A CN111583194B (en) | 2020-04-22 | 2020-04-22 | High-Dimensional Feature Selection Algorithm Based on Bayesian Rough Sets and Cuckoo Algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111583194A true CN111583194A (en) | 2020-08-25 |
CN111583194B CN111583194B (en) | 2022-07-15 |
Family
ID=72111635
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010322570.8A Active CN111583194B (en) | 2020-04-22 | 2020-04-22 | High-Dimensional Feature Selection Algorithm Based on Bayesian Rough Sets and Cuckoo Algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111583194B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107784353A (en) * | 2016-08-29 | 2018-03-09 | 普天信息技术有限公司 | A kind of function optimization method based on cuckoo searching algorithm |
CN109186971A (en) * | 2018-08-06 | 2019-01-11 | 江苏大学 | Hub motor mechanical breakdown inline diagnosis method based on dynamic bayesian network |
CN109325580A (en) * | 2018-09-05 | 2019-02-12 | 南京邮电大学 | An Adaptive Cuckoo Search Method for Global Optimization of Service Composition |
CN109978880A (en) * | 2019-04-08 | 2019-07-05 | 哈尔滨理工大学 | Lung tumors CT image is carried out sentencing method for distinguishing using high dimensional feature selection |
US20190253558A1 (en) * | 2018-02-13 | 2019-08-15 | Risto Haukioja | System and method to automatically monitor service level agreement compliance in call centers |
US20190318248A1 (en) * | 2018-04-13 | 2019-10-17 | NEC Laboratories Europe GmbH | Automated feature generation, selection and hyperparameter tuning from structured data for supervised learning problems |
Non-Patent Citations (1)
Title |
---|
张飞飞,等: ""基于贝叶斯粗糙集的肺部肿瘤CT图像高维特征选择算法"", 《生物医学工程研究》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113111577A (en) * | 2021-04-01 | 2021-07-13 | 燕山大学 | Cement mill operation index decision method based on multi-target cuckoo search |
CN113111577B (en) * | 2021-04-01 | 2023-05-05 | 燕山大学 | Cement mill operation index decision method based on multi-target cuckoo search |
CN114627964A (en) * | 2021-09-13 | 2022-06-14 | 东北林业大学 | A method for predicting enhancers and their strengths based on multi-kernel learning, and a classification device |
CN114595713A (en) * | 2022-01-19 | 2022-06-07 | 北京理工大学 | Optimization feature selection-based gas pressure regulating station state monitoring method |
CN115238728A (en) * | 2022-04-25 | 2022-10-25 | 中国人民解放军陆军工程大学 | Radar Signal Recognition Algorithm Based on PCA-ICS-SVM |
CN118469733A (en) * | 2024-07-15 | 2024-08-09 | 山东乐谷信息科技有限公司 | A secure accounting book system based on blockchain technology |
CN118672304A (en) * | 2024-08-21 | 2024-09-20 | 四川腾盾科技有限公司 | Fixed wing cluster unmanned aerial vehicle anti-collision method and system based on multi-element game |
CN118672304B (en) * | 2024-08-21 | 2024-12-13 | 四川腾盾科技有限公司 | Anti-collision method and system for fixed-wing cluster UAVs based on multi-element game |
Also Published As
Publication number | Publication date |
---|---|
CN111583194B (en) | 2022-07-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||