CN107193993A - Medical data classification method and device based on local learning feature weight selection - Google Patents
- Publication number
- CN107193993A (application CN201710419357.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical field
The present invention relates to the field of medical diagnosis, and more specifically to a medical data classification method and device based on local learning feature weight selection.
Background
With the development of artificial intelligence, computer technology has come to play an important role in medicine. Medical diagnosis systems, which combine computer technology with the authoritative knowledge and experience of human medical experts in many fields, can effectively address a range of clinical problems and assist doctors in diagnosis.
DNA microarray technology, that is, the gene chip, has been introduced into medical diagnosis systems. A gene chip makes it possible to quantify the expression levels of a large number of genes at the same time, and these data can be used to study the underlying biology. However, the development of DNA microarray technology has led to explosive growth in gene expression data, and selecting the important genes from such large volumes of data poses a new challenge for existing techniques.
The local hyperplane (Local Hyperplane, LH-Relief) algorithm can reduce the dimensionality of large gene expression data sets: it filters out uninformative gene expression data, selects the important genes, and reduces redundancy. However, when the algorithm is applied to noisy data or high-dimensional data, its convergence cannot be guaranteed, which leads to high computational complexity.
Therefore, how to reduce the dimensionality of large volumes of gene data while also reducing the computational complexity of the algorithm is a problem to be solved by those skilled in the art.
Summary of the invention
The object of the present invention is to provide a medical data classification method based on local learning feature weight selection that reduces the dimensionality of large volumes of gene data while reducing the computational complexity of the algorithm.
To achieve the above object, the embodiments of the present invention provide the following technical solutions:
A medical data classification method based on local learning feature weight selection, comprising:
S101: acquiring a first sample set of medical data and obtaining first sample attributes;
S102: setting an initial weight vector for the first sample attributes and taking the initial weight vector as the current weight vector;
S103: updating the current weight vector by a gradient descent update to obtain the next weight vector after one iteration;
S104: judging whether a determination rule holds; if so, taking the next weight vector as the final weight vector and executing S105; if not, taking the next weight vector as the current weight vector and returning to S103; wherein the determination rule is $\|w_{t+1} - w_t\| \le \theta$, $w_t$ is the current weight vector, $w_{t+1}$ is the next weight vector, and $\theta$ is the stopping criterion;
S105: performing feature selection according to the final weight vector to obtain a feature index subset;
S106: performing feature selection on the first sample set according to the feature index subset to obtain a second sample set;
S107: acquiring first data to be evaluated and performing feature selection on it according to the feature index subset to obtain second data to be evaluated;
S108: classifying the second data to be evaluated on the second sample set to obtain a classification result.
Preferably, acquiring the first sample set of medical data and obtaining the first sample attributes comprises:
acquiring the first sample set of medical data, obtaining the first sample attributes, and applying min-max normalization (dispersion standardization) to the first sample set.
Preferably, updating the current weight vector by a gradient descent update to obtain the next weight vector after one iteration comprises:
updating the current weight vector by the rule [formula missing] to obtain the next weight vector $w_{t+1}$ after one iteration, where $J(w)$ is the optimization objective function, obtained by maximizing $J(w) = (z_i^{t+1})^T w_{t+1}$.
Preferably, acquiring the first data to be evaluated and performing feature selection according to the feature index subset to obtain the second data to be evaluated comprises:
acquiring the first data to be evaluated, applying min-max normalization to it, and performing feature selection according to the feature index subset to obtain the second data to be evaluated.
Preferably, classifying the second data to be evaluated on the second sample set to obtain a classification result comprises:
classifying the second data to be evaluated on the second sample set with a K-nearest-neighbor classifier to obtain the classification result.
A medical data classification device based on local learning feature weight selection, comprising:
a first sample set acquisition module, configured to acquire a first sample set of medical data and obtain first sample attributes;
an initial weight vector setting module, configured to set an initial weight vector for the first sample attributes and take the initial weight vector as the current weight vector;
a next weight vector acquisition module, configured to update the current weight vector by a gradient descent update to obtain the next weight vector after one iteration;
a judging module, configured to judge whether a determination rule holds; if so, to take the next weight vector as the final weight vector and invoke a feature index subset acquisition module; if not, to take the next weight vector as the current weight vector and invoke the next weight vector acquisition module; wherein the determination rule is $\|w_{t+1} - w_t\| \le \theta$, $w_t$ is the current weight vector, $w_{t+1}$ is the next weight vector, and $\theta$ is the stopping criterion;
the feature index subset acquisition module, configured to perform feature selection according to the final weight vector to obtain a feature index subset;
a second sample set acquisition module, configured to perform feature selection on the first sample set according to the feature index subset to obtain a second sample set;
a second to-be-evaluated data acquisition module, configured to acquire first data to be evaluated and perform feature selection on it according to the feature index subset to obtain second data to be evaluated;
a classification module, configured to classify the second data to be evaluated on the second sample set to obtain a classification result.
Preferably, the first sample set acquisition module is specifically configured to:
acquire the first sample set of medical data, obtain the first sample attributes, and apply min-max normalization (dispersion standardization) to the first sample set.
Preferably, the next weight vector acquisition module is specifically configured to:
update the current weight vector by the rule [formula missing] to obtain the next weight vector $w_{t+1}$ after one iteration, where $J(w)$ is the optimization objective function, obtained by maximizing $J(w) = (z_i^{t+1})^T w_{t+1}$.
Preferably, the second to-be-evaluated data acquisition module is specifically configured to:
acquire the first data to be evaluated, apply min-max normalization to it, and perform feature selection according to the feature index subset to obtain the second data to be evaluated.
Preferably, the classification module is specifically configured to:
classify the second data to be evaluated on the second sample set with a K-nearest-neighbor classifier to obtain the classification result.
As the above solutions show, the medical data classification method based on local learning feature weight selection provided by the embodiments of the present invention first obtains the attribute values of the samples from the training sample set and computes the weight vector of the attributes using a gradient descent weight update. Convergence can therefore be guaranteed, the stopping criterion of the algorithm is reached sooner, and both computation time and computational complexity are reduced. Feature selection based on the computed weight vector yields the optimal feature subset; the data sample to be evaluated is normalized, restricted to the optimal feature subset, and then classified, so that the data samples are reduced in dimensionality. The method provided by the embodiments thus achieves dimensionality reduction while lowering computational complexity and computation time. The present invention also provides a medical data classification device based on local learning feature weight selection, which achieves the same technical effects.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a medical data classification method disclosed in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a medical data classification device disclosed in an embodiment of the present invention;
Fig. 3 compares the convergence of a medical data classification method disclosed in an embodiment of the present invention with that of LH-Relief;
Fig. 4 compares the average performance of a medical data classification method disclosed in an embodiment of the present invention with that of LH-Relief.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, an embodiment of the present invention discloses a medical data classification method based on local learning feature weight selection. Specifically:
S101: Acquire a first sample set of medical data and obtain first sample attributes.
Specifically, a first sample set of medical data $\{(x_i, y_i)\}_{i=1}^{N}$ is acquired, and the sample attributes of the first sample set are taken as the first sample attributes. Here $x_i \in \mathbb{R}^I$; $y_i \in \{1, 2, \ldots, C\}$ is the label of $x_i$ and indicates its class; $N$ is the number of training samples; $I$ is the dimensionality of the samples; and $C$ is the total number of classes.
S102: Set an initial weight vector for the first sample attributes and take the initial weight vector as the current weight vector.
Specifically, the initial weight vector is set to $w_0 = [1/I, 1/I, \ldots, 1/I]^T$. With $t$ denoting the iteration count, $t = 0$ at this point, that is, iteration has not yet started, and the initial weight vector $w_0$ is taken as the current weight vector $w_t$.
S103: Update the current weight vector by a gradient descent update to obtain the next weight vector after one iteration.
Specifically, one iteration is performed: the current weight vector $w_t$ is updated once by the gradient descent update to obtain the next weight vector $w_{t+1}$.
S104: Judge whether the determination rule holds; if so, take the next weight vector as the final weight vector and execute S105; if not, take the next weight vector as the current weight vector and return to S103. The determination rule is $\|w_{t+1} - w_t\| \le \theta$, where $w_t$ is the current weight vector, $w_{t+1}$ is the next weight vector, and $\theta$ is the stopping criterion.
Specifically, a stopping criterion $\theta$ is set and it is judged whether $\|w_{t+1} - w_t\| \le \theta$ holds. If it holds, the next weight vector $w_{t+1}$ is taken as the final weight vector $w = [w_1, w_2, \ldots, w_I]^T \in \mathbb{R}^I$ and S105 is executed; if not, $w_{t+1}$ is taken as the current weight vector $w_t$ and the method returns to S103 for a new iteration.
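The loop formed by S102 to S104 can be sketched in Python as follows. This is only an illustration of the control flow, not the update rule of the invention: `update_step` is a hypothetical stand-in for the S103 update, the Euclidean norm is assumed for $\|\cdot\|$, and `max_iter` is an added safety cap that the text does not mention.

```python
import math
from typing import Callable, List

Vector = List[float]


def iterate_weights(
    num_features: int,
    update_step: Callable[[Vector], Vector],
    theta: float = 1e-6,
    max_iter: int = 1000,
) -> Vector:
    """Run S102-S104: start from uniform weights and stop once
    ||w_{t+1} - w_t|| <= theta (Euclidean norm assumed)."""
    w_t = [1.0 / num_features] * num_features  # S102: w_0 = [1/I, ..., 1/I]
    for _ in range(max_iter):  # max_iter is a safety cap (assumption)
        w_next = update_step(w_t)  # S103: hypothetical update rule
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_next, w_t)))
        if diff <= theta:  # S104: determination rule holds
            return w_next
        w_t = w_next  # otherwise the next vector becomes the current one
    return w_t


# Toy update (assumption): move the weights halfway toward a fixed target.
target = [0.7, 0.2, 0.1]


def toy_update(w: Vector) -> Vector:
    return [wi + 0.5 * (ti - wi) for wi, ti in zip(w, target)]


final_w = iterate_weights(3, toy_update, theta=1e-9)
```

With the toy update, the distance to `target` halves on every pass, so the loop stops after a few dozen iterations with `final_w` close to `target`.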
S105: Perform feature selection according to the final weight vector to obtain a feature index subset.
Specifically, feature selection is performed according to the final weight vector $w$, using classification accuracy, to obtain the corresponding feature index subset $F$. This reduces the feature dimensionality of the first sample set and thereby the amount and time of computation.
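The text does not spell out how classification accuracy drives the selection in S105. One plausible reading, sketched below purely as an assumption, ranks the features by weight and keeps the top-ranked prefix that scores best. The callback `score_subset` is hypothetical; in practice it might return a cross-validated classifier accuracy, while the toy scorer here only makes the example runnable.

```python
from typing import Callable, List, Sequence


def select_feature_indices(
    weights: Sequence[float],
    score_subset: Callable[[List[int]], float],
) -> List[int]:
    """Pick the top-m features by weight, with m chosen by a scoring function.

    score_subset is a hypothetical callback that returns the classification
    accuracy obtained when only the given feature indices are used.
    """
    # Rank feature indices from largest to smallest weight.
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    best_subset: List[int] = []
    best_score = float("-inf")
    for m in range(1, len(ranked) + 1):
        candidate = ranked[:m]
        score = score_subset(candidate)
        if score > best_score:  # strict comparison keeps the smaller subset on ties
            best_subset, best_score = candidate, score
    return sorted(best_subset)


def toy_score(indices: List[int]) -> float:
    """Stand-in scorer: features 0 and 2 help, every selected feature costs 0.1."""
    informative = {0, 2}
    return len(informative.intersection(indices)) - 0.1 * len(indices)


F = select_feature_indices([0.5, 0.05, 0.4, 0.05], toy_score)
print(F)  # [0, 2]
```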
S106: Perform feature selection on the first sample set according to the feature index subset to obtain a second sample set.
Specifically, feature selection is applied to the first sample set $\{(x_i, y_i)\}_{i=1}^{N}$ according to the feature index subset $F$ to obtain the second sample set, in which each sample $x_i \in \mathbb{R}^{|F|}$ with $|F| < I$.
S107: Acquire first data to be evaluated and perform feature selection on it according to the feature index subset to obtain second data to be evaluated.
Specifically, a first data sample to be evaluated, $x \in \mathbb{R}^I$, is acquired. At this point $x$ has not undergone dimensionality reduction and has dimensionality $I$. Feature selection is applied to the data sample according to the feature index subset $F$ to obtain the second data to be evaluated, $x'$.
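S106 and S107 apply the same projection: keep only the coordinates listed in the feature index subset. A minimal Python sketch, with made-up numbers and a hypothetical helper name `project`:

```python
from typing import List, Sequence


def project(sample: Sequence[float], feature_indices: Sequence[int]) -> List[float]:
    """Keep only the attributes whose indices are in the feature index subset."""
    return [sample[j] for j in feature_indices]


# Illustrative values only: two training samples with I = 4 attributes.
first_sample_set = [[0.1, 0.9, 0.3, 0.5], [0.8, 0.2, 0.7, 0.4]]
F = [0, 2]  # feature index subset, |F| = 2 < I = 4

second_sample_set = [project(x, F) for x in first_sample_set]  # S106
x_new = [0.6, 0.1, 0.2, 0.9]
x_prime = project(x_new, F)  # S107
```

Using one helper for both steps keeps the training samples and the sample to be evaluated in the same reduced feature space, which the classification step relies on.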
S108: Classify the second data to be evaluated on the second sample set to obtain a classification result.
Specifically, the second data to be evaluated, $x'$, is classified on the second sample set to obtain a classification result. This classification result can be used to diagnose the first data sample to be evaluated, $x$.
Thus, the medical data classification method based on local learning feature weight selection provided by this embodiment first obtains the attribute values of the samples from the training sample set and computes the weight vector of the attributes using a gradient descent weight update. Convergence can therefore be guaranteed, the stopping criterion of the algorithm is reached sooner, and both computation time and computational complexity are reduced. Feature selection based on the computed weight vector yields the optimal feature subset; the data sample to be evaluated is normalized, restricted to the optimal feature subset, and then classified, so that the data samples are reduced in dimensionality. The method provided by this embodiment thus achieves dimensionality reduction while lowering computational complexity and computation time.
An embodiment of the present invention discloses a specific medical data classification method based on local learning feature weight selection. It differs from the previous embodiment in that S101 is further specified; the other steps are substantially the same as in the previous embodiment and are not repeated here. Specifically, S101 comprises:
acquiring the first sample set of medical data, obtaining the first sample attributes, and applying min-max normalization (dispersion standardization) to the first sample set.
Specifically, a first sample set of medical data $\{(x_i, y_i)\}_{i=1}^{N}$ is acquired, and the sample attributes of the first sample set are taken as the first sample attributes. Here $x_i \in \mathbb{R}^I$; $y_i \in \{1, 2, \ldots, C\}$ is the label of $x_i$ and indicates its class; $N$ is the number of training samples; $I$ is the dimensionality of the samples; and $C$ is the total number of classes.
It should be noted that different feature attributes often have different dimensions and units, which affects the results of data analysis. To eliminate this effect and make the feature attribute data comparable, the first sample set is min-max normalized. The transformation function is $x_{ij}^{*} = \dfrac{x_{ij} - \min_{n} x_{nj}}{\max_{n} x_{nj} - \min_{n} x_{nj}}$, where $x_{ij}$ is the $j$-th attribute of the $i$-th sample, $\max_{n} x_{nj}$ is the maximum of attribute $j$ over all training samples, and $\min_{n} x_{nj}$ is the minimum of attribute $j$ over all data. After normalization, all indicators of the feature data are on the same order of magnitude, which makes comprehensive comparison and evaluation easier. All feature data used in the embodiments of the present invention are data after min-max normalization.
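The normalization just described can be sketched in Python as below. The function names are hypothetical, and mapping a constant attribute (maximum equal to minimum) to 0.0 is an added assumption, since the text does not cover that case.

```python
from typing import List, Sequence, Tuple


def fit_min_max(samples: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return the per-attribute minima and maxima over the training samples."""
    columns = list(zip(*samples))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(
    sample: Sequence[float], mins: Sequence[float], maxs: Sequence[float]
) -> List[float]:
    """Scale one sample with (x - min) / (max - min), attribute by attribute."""
    scaled = []
    for value, lo, hi in zip(sample, mins, maxs):
        span = hi - lo
        # A constant attribute has span 0; mapping it to 0.0 is an assumption.
        scaled.append((value - lo) / span if span else 0.0)
    return scaled


# Illustrative values only: three samples, two attributes on different scales.
training = [[1.0, 200.0], [3.0, 400.0], [2.0, 300.0]]
mins, maxs = fit_min_max(training)
normalized = [apply_min_max(x, mins, maxs) for x in training]
print(normalized)  # [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
```

Keeping `mins` and `maxs` from the training set allows a new sample to be scaled with the same statistics; a new sample whose value lies outside the training range will then map outside [0, 1].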
An embodiment of the present invention discloses a specific medical data classification method based on local learning feature weight selection. It differs from the previous embodiment in that S103 is further specified; the other steps are substantially the same as in the previous embodiment and are not repeated here. Specifically, S103 comprises:
updating the current weight vector by the rule [formula missing] to obtain the next weight vector $w_{t+1}$ after one iteration, where $J(w)$ is the optimization objective function, obtained by maximizing $J(w) = (z_i^{t+1})^T w_{t+1}$.
Specifically, $J(w)$ is solved by maximizing [formula missing], and the next weight vector $w_{t+1}$ is updated.
Here [symbol missing] and [symbol missing] are the nearest-neighbor sample matrices of sample $x_i$ among the samples of other classes and among the samples of the same class, respectively, and $k$ is the number of neighbors, set a priori. $\alpha_i$ and $\beta_i$ are the coefficient vectors of $x_i$ with respect to [symbol missing] for the other-class samples and the same-class samples, respectively. $\alpha_i$ is obtained by solving the optimization problem [formula missing], and $\beta_i$ is obtained by solving the optimization problem [formula missing].
Thus, by optimizing the objective function $J(w)$, the current weight vector $w_t$ is updated by the formula [formula missing] to obtain the next weight vector $w_{t+1}$ after one iteration.
A gradient descent weight update guarantees convergence; with convergence guaranteed, the stopping criterion of the algorithm is reached sooner, which lowers computational complexity and computation time.
An embodiment of the present invention discloses a specific medical data classification method based on local learning feature weight selection. It differs from the previous embodiment in that S107 is further specified; the other steps are substantially the same as in the previous embodiment and are not repeated here. Specifically, S107 comprises:
acquiring the first data to be evaluated, applying min-max normalization (dispersion standardization) to it, and performing feature selection according to the feature index subset to obtain the second data to be evaluated.
Specifically, a data sample to be evaluated, $x \in \mathbb{R}^I$, is acquired as the first data to be evaluated and is normalized using the min-max normalization method introduced in the embodiment above, that is, [formula missing].
It should be noted that all first data to be evaluated used in the present invention are data that have undergone min-max normalization. Normalizing the first data to be evaluated likewise prevents differences in dimensions and units among the features from affecting the analysis results; after normalization, all indicators of the data to be evaluated are on the same order of magnitude and are suitable for comprehensive comparison and evaluation.
An embodiment of the present invention discloses a specific medical data classification method based on local learning feature weight selection. It differs from the previous embodiment in that S108 is further specified; the other steps are substantially the same as in the previous embodiment and are not repeated here. Specifically, S108 comprises:
classifying the second data to be evaluated on the second sample set with a K-nearest-neighbor classifier to obtain the classification result.
Specifically, on the basis of the second sample set, a K-nearest-neighbor classifier is used to classify the second data to be evaluated, $x'$, and the classification result is obtained. This classification result can be used to diagnose the first data sample to be evaluated, $x$.
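A K-nearest-neighbor classification of this kind can be sketched in plain Python as follows. The Euclidean distance, the value `k=3`, and the sample values are assumptions made for illustration; the text only names the classifier.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_classify(
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[int],
    query: Sequence[float],
    k: int = 3,
) -> int:
    """Return the majority label among the k training samples nearest to query."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Sort the (sample, label) pairs by Euclidean distance to the query.
    by_distance = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Illustrative second sample set (already normalized and feature-selected).
train_x: List[List[float]] = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
train_y = [1, 1, 2, 2]

print(knn_classify(train_x, train_y, [0.2, 0.1]))  # 1
print(knn_classify(train_x, train_y, [0.8, 0.9]))  # 2
```

An odd `k` reduces the chance of tied votes in two-class problems; with more classes, ties are resolved here by whichever label was counted first.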
下面对本发明实施例提供的一种基于局部学习特征权重选择的医疗数据分类装置进行介绍,下文描述的一种医疗数据分类装置与上文描述的一种医疗数据分类方法可以相互参照。A medical data classification device based on local learning feature weight selection provided by an embodiment of the present invention is introduced below. The medical data classification device described below and the medical data classification method described above can be referred to each other.
参见图2,本发明实施例提供的一种基于局部学习特征权重选择的医疗数据分类装置,具体包括:Referring to Fig. 2, an embodiment of the present invention provides a medical data classification device based on local learning feature weight selection, which specifically includes:
第一样本集获取模块201,用于获取医疗数据的第一样本集,得到第一样本属性。The first sample set acquiring module 201 is configured to acquire a first sample set of medical data to obtain a first sample attribute.
具体地,第一样本集获取模块201获取医疗数据的第一样本集得到第一样本集的样本属性,作为第一样本属性。其中xi∈RI,yi∈{1,2,...,C}是xi的标签,表明xi的类别,N是训练样本的个数,I是样本的维数,C是类别总数。Specifically, the first sample set acquisition module 201 acquires the first sample set of medical data The sample attribute of the first sample set is obtained as the first sample attribute. Where xi ∈ R I , y i ∈ {1,2,...,C} is the label of xi , indicating the category of xi , N is the number of training samples, I is the dimension of the sample, C is total number of categories.
初始权重限量设置模块202,用于设置所述第一样本属性的初始权重向量,将所述初始权重向量作为本次权重向量。The initial weight limit setting module 202 is configured to set the initial weight vector of the first sample attribute, and use the initial weight vector as the current weight vector.
具体地,初始权重限量设置模块202对初始权重向量设置,即初始权重向量为w0=[1/I,1/I,...,1/I]t,其中t为迭代次数,当前t=0,即没有开始迭代,将初始权重向量w0作为本次权重向量wt。Specifically, the initial weight limit setting module 202 sets the initial weight vector, that is, the initial weight vector is w 0 =[1/I,1/I,...,1/I] t , where t is the number of iterations, and the current t =0, that is, the iteration is not started, and the initial weight vector w 0 is used as the current weight vector w t .
A next weight vector acquisition module 203, configured to update the current weight vector by a gradient-descent update, obtaining the next weight vector after one iteration.
Specifically, the next weight vector acquisition module 203 performs one iteration on the current weight vector: the current weight vector $w_t$ is updated once by gradient descent to obtain the next weight vector $w_{t+1}$.
A judging module 204, configured to judge whether the determination rule holds. If it does, the next weight vector is taken as the final weight vector and the feature index subset acquisition module is invoked; if it does not, the next weight vector is taken as the current weight vector and the next weight vector acquisition module is invoked. The determination rule is $\|w_{t+1} - w_t\| \le \theta$, where $w_t$ is the current weight vector, $w_{t+1}$ is the next weight vector, and $\theta$ is the stopping criterion.
Specifically, a stopping criterion $\theta$ is set in the judging module 204, and the module judges whether $\|w_{t+1} - w_t\| \le \theta$ holds. If it does, the next weight vector $w_{t+1}$ is taken as the final weight vector $w = [w_1, w_2, \ldots, w_I]^T \in \mathbb{R}^I$, and the feature index subset acquisition module 205 is invoked. If it does not, the next weight vector $w_{t+1}$ is taken as the current weight vector $w_t$, and the next weight vector acquisition module 203 is invoked again for a new iteration.
The feature index subset acquisition module 205, configured to perform feature selection according to the final weight vector, obtaining a feature index subset.
Specifically, the feature index subset acquisition module 205 performs feature selection from the final weight vector $w$ according to classification accuracy, obtaining the corresponding feature index subset $F$. This reduces the feature dimension of the first sample set and thereby the amount and time of computation.
A second sample set acquisition module 206, configured to perform feature selection on the first sample set according to the feature index subset, obtaining a feature-selected second sample set.
Specifically, in the second sample set acquisition module 206, feature selection is performed on the first sample set according to the feature index subset $F$ to obtain the second sample set, in which each sample $x_i \in \mathbb{R}^{|F|}$, with $|F| < I$.
A second data-to-be-evaluated acquisition module 207, configured to acquire the first data to be evaluated and to perform feature selection on it according to the feature index subset, obtaining the second data to be evaluated.
Specifically, the second data-to-be-evaluated acquisition module 207 acquires the first data sample to be evaluated $x \in \mathbb{R}^I$. At this point $x$ has not undergone dimensionality reduction, and its dimension is $I$. Feature selection is performed on the sample according to the feature index subset $F$ to obtain the second data to be evaluated $x'$.
A classification module 208, configured to classify the second data to be evaluated on the second sample set, obtaining a classification result.
Specifically, the classification module 208 classifies the second data to be evaluated $x'$ on the second sample set, yielding a classification result. This result can then be used to diagnose the first data sample to be evaluated $x$.
Thus, in the medical data classification method based on local-learning feature weight selection provided by this embodiment, the first sample set acquisition module 201 first obtains the attribute values of the samples. From these attribute values, the next weight vector acquisition module 203 computes the weight vector of the attributes with a gradient-descent weight update. This guarantees convergence, so the stopping criterion of the algorithm is reached relatively quickly, which reduces computation time and computational complexity. The second sample set acquisition module 206 performs feature selection according to the computed weight vector to obtain the optimal feature set. The second data-to-be-evaluated acquisition module 207 normalizes the data sample to be evaluated and then performs feature selection on it using the optimal feature subset, so that the subsequent classification operates on dimension-reduced data. The method provided by this embodiment therefore achieves dimensionality reduction while also reducing computational complexity and computation time.
An embodiment of the present invention discloses a specific medical data classification apparatus based on local-learning feature weight selection. Unlike the previous embodiment, this embodiment specifically limits the first sample set acquisition module 201; the remaining content is substantially the same as in the previous embodiment, to which reference may be made for details, and is not repeated here. The first sample set acquisition module 201 is specifically configured to:
Acquire a first sample set of medical data, obtain the first sample attributes, and apply min-max normalization to the first sample set.
Specifically, the first sample set acquisition module 201 acquires the first sample set of medical data $\{(x_i, y_i)\}_{i=1}^{N}$ and takes the sample attributes of that set as the first sample attributes. Here $x_i \in \mathbb{R}^I$; $y_i \in \{1, 2, \ldots, C\}$ is the label of $x_i$ and indicates its class; $N$ is the number of training samples; $I$ is the sample dimension; and $C$ is the total number of classes.
Note that different feature attributes often have different dimensions and units, which affects the results of data analysis. To eliminate this effect and make the feature attributes comparable, min-max normalization is applied to the first sample set. The transformation is $x_{ij} \leftarrow \dfrac{x_{ij} - x_j^{\min}}{x_j^{\max} - x_j^{\min}}$, where $x_{ij}$ is the $j$-th attribute of the $i$-th sample, $x_j^{\max}$ is the maximum of attribute $j$ over all training sample data, and $x_j^{\min}$ is the minimum of attribute $j$ over all data. After normalization, all indicators of the feature data are on the same order of magnitude, which makes comprehensive comparison and evaluation easier. All feature data used in the embodiments of the present invention have undergone min-max normalization.
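The min-max normalization described above can be sketched in Python with NumPy as follows. This is a minimal illustration, not the patented implementation: the function names `fit_min_max` and `min_max_transform` are hypothetical, and mapping constant attributes (where the maximum equals the minimum) to 0 is an added assumption that the source does not address.

```python
import numpy as np

def fit_min_max(X_train):
    """Per-attribute minimum and maximum over the training samples."""
    return X_train.min(axis=0), X_train.max(axis=0)

def min_max_transform(X, col_min, col_max):
    """Scale each attribute with (x - min) / (max - min)."""
    span = col_max - col_min
    # Assumption: constant attributes (span == 0) are mapped to 0
    safe_span = np.where(span == 0, 1.0, span)
    return (X - col_min) / safe_span

# Toy data: 3 samples, 3 attributes on very different scales
X_train = np.array([[1.0, 200.0, 5.0],
                    [3.0, 400.0, 5.0],
                    [2.0, 300.0, 5.0]])
col_min, col_max = fit_min_max(X_train)
X_scaled = min_max_transform(X_train, col_min, col_max)
```

Keeping the fit and the transform separate makes it possible to reuse the training minima and maxima on other data; values outside the training range then fall outside [0, 1].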
An embodiment of the present invention discloses a specific medical data classification apparatus based on local-learning feature weight selection. Unlike the previous embodiment, this embodiment specifically limits the next weight vector acquisition module 203; the remaining content is substantially the same as in the previous embodiment, to which reference may be made for details, and is not repeated here. The next weight vector acquisition module 203 is specifically configured to:
Update the current weight vector according to the update rule, obtaining the next weight vector $w_{t+1}$ after one iteration, where $J(w)$ is obtained by maximizing the optimization objective function $J(w) = (z_i^{t+1})^T w_{t+1}$.
Specifically, in the next weight vector acquisition module 203, $J(w)$ is first solved by maximization, and the next weight vector $w_{t+1}$ is updated.
Here the two neighbor matrices are the nearest-neighbor sample matrices of sample $x_i$ among the samples of other classes and among the samples of the same class, respectively, and $k$ is the number of neighbors, set a priori. $\alpha_i$ and $\beta_i$ are the coefficient vectors of $x_i$ with respect to the other-class and same-class neighbors, respectively; each is obtained by solving an optimization problem.
Therefore, using $J(w)$, the current weight vector $w_t$ is updated by the update formula to obtain the next weight vector $w_{t+1}$ after one iteration. The optimization objective function $J(w)$ is obtained by maximizing $J(w) = (z_i^{t+1})^T w_{t+1}$.
The gradient-descent weight update guarantees convergence. With convergence guaranteed, the stopping criterion of the algorithm is reached relatively quickly, which reduces both computational complexity and computation time.
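The interplay between the weight update and the stopping criterion can be sketched as a simple loop. This is a minimal Python sketch under stated assumptions, not the patented implementation: `iterate_weights` and the `update` callback are hypothetical names, the callback merely stands in for one gradient-descent step (its formula is not assumed here), the norm is taken to be Euclidean, and the `max_iter` guard is added so the loop always terminates.

```python
import numpy as np

def iterate_weights(update, n_features, theta=1e-3, max_iter=30):
    """Iterate w_{t+1} = update(w_t) until ||w_{t+1} - w_t|| <= theta.

    `update` is a hypothetical callback standing in for one
    gradient-descent step; its formula is not specified here.
    """
    # Initial weight vector w_0 = [1/I, ..., 1/I]^T
    w_t = np.full(n_features, 1.0 / n_features)
    for t in range(max_iter):
        w_next = update(w_t)
        # Stopping rule: Euclidean norm of the change is at most theta
        if np.linalg.norm(w_next - w_t) <= theta:
            return w_next, t + 1
        w_t = w_next  # the next weight vector becomes the current one
    return w_t, max_iter

# Toy update (assumption, for illustration only): move halfway toward a target
target = np.array([0.7, 0.2, 0.1])
w_final, n_iter = iterate_weights(lambda w: w + 0.5 * (target - w), 3)
```

Passing the update step as a callback keeps the loop independent of the particular objective, so the stopping logic can be checked on a toy update before the real one is plugged in.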
An embodiment of the present invention discloses a specific medical data classification apparatus based on local-learning feature weight selection. Unlike the previous embodiment, this embodiment specifically limits the second data-to-be-evaluated acquisition module 207; the remaining content is substantially the same as in the previous embodiment, to which reference may be made for details, and is not repeated here. The second data-to-be-evaluated acquisition module 207 is specifically configured to:
Acquire the first data to be evaluated, apply min-max normalization to it, and perform feature selection according to the feature index subset to obtain the second data to be evaluated.
Specifically, the second data-to-be-evaluated acquisition module 207 obtains the data sample to be evaluated $x \in \mathbb{R}^I$ as the first data to be evaluated and normalizes it with the min-max normalization introduced in the embodiment above, i.e. $x_j \leftarrow \dfrac{x_j - x_j^{\min}}{x_j^{\max} - x_j^{\min}}$ for $j = 1, \ldots, I$, where $x_j^{\min}$ and $x_j^{\max}$ are the minimum and maximum of attribute $j$.
Note that all first data to be evaluated used in the present invention have undergone min-max normalization. Normalizing the first data to be evaluated likewise prevents differences in dimensions and units between feature data from affecting the analysis results, and puts all indicators of the data to be evaluated on the same order of magnitude, which suits comprehensive comparison and evaluation.
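Preparing a sample for evaluation, that is, normalizing it and then keeping only the selected features, can be sketched as below. This is a hedged Python illustration: `prepare_eval_sample` is a hypothetical name, the use of the training minima and maxima for the evaluation sample is an assumption, and so is the guard for constant attributes.

```python
import numpy as np

def prepare_eval_sample(x, col_min, col_max, F):
    """Min-max normalise x with training statistics, then keep features F."""
    span = col_max - col_min
    safe_span = np.where(span == 0, 1.0, span)  # assumption: guard constant attributes
    x_norm = (x - col_min) / safe_span          # normalised first data, length I
    return x_norm[np.asarray(F, dtype=int)]     # second data x', length |F|

col_min = np.array([0.0, 10.0, 5.0, 1.0])   # toy training minima (I = 4)
col_max = np.array([2.0, 30.0, 5.0, 3.0])   # toy training maxima
x = np.array([1.0, 25.0, 5.0, 2.0])
x_prime = prepare_eval_sample(x, col_min, col_max, F=[0, 1])
```

Normalizing before indexing means the full-length minima and maxima can be applied directly, without first reordering them to match the selected subset.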
An embodiment of the present invention discloses a specific medical data classification apparatus based on local-learning feature weight selection. Unlike the previous embodiment, this embodiment specifically limits the classification module 208; the remaining content is substantially the same as in the previous embodiment, to which reference may be made for details, and is not repeated here. The classification module 208 is specifically configured to:
Classify the second data to be evaluated on the second sample set using a K-nearest-neighbor classifier, obtaining a classification result.
Specifically, on the basis of the second sample set, the classification module 208 uses a K-nearest-neighbor classifier to classify the second data to be evaluated $x'$, yielding a classification result. This result can then be used to diagnose the first data sample to be evaluated $x$.
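A K-nearest-neighbor classification of a single sample can be sketched as follows. This Python sketch makes several assumptions the source leaves open: the distance is Euclidean, the value of `k` is arbitrary, and ties in the vote go to the class of the closest neighbor. The function name `knn_classify` is hypothetical.

```python
import numpy as np
from collections import Counter

def knn_classify(X2, y, x2, k=3):
    """Classify x2 by majority vote among its k nearest samples in X2."""
    dists = np.linalg.norm(X2 - x2, axis=1)          # Euclidean distances
    nearest = np.argsort(dists, kind="stable")[:k]   # indices of the k closest
    votes = Counter(y[nearest].tolist())
    # Equal counts are ordered by first appearance, i.e. by distance
    return votes.most_common(1)[0][0]

# Toy second sample set with |F| = 2 and two classes
X2 = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]])
y = np.array([1, 1, 2, 2])
label = knn_classify(X2, y, np.array([0.05, 0.05]), k=3)
```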
An embodiment of the present invention discloses a medical data classification method based on local-learning feature weights, which specifically includes:
This embodiment was tested on the embryonal tumor data set (CNS), which contains samples from 34 patients, each with 7129 genes. The 34 samples comprise 25 classic medulloblastomas (C) and 9 desmoplastic medulloblastomas (D), giving 2 classes. The CNS data set is divided into two subsets: 23 training samples (6 C, 17 D), used to select genes and tune the classifier weights, and 11 test samples (3 C, 8 D), used to evaluate the performance of the results. Each sample has 7129 features. C is treated as the first class and D as the second class. The implementation is divided into two modules, as follows:
Model training module:
S301: Input the medical data sample set $\{(x_i, y_i)\}_{i=1}^{N}$ as the first sample set, where $x_i \in \mathbb{R}^I$; $y_i \in \{1, 2, \ldots, C\}$ is the label of $x_i$ and indicates its class; $N$ is the number of training samples; $I$ is the sample dimension; and $C$ is the total number of classes. Here $N = 23$, $I = 7129$, and $C = 2$.
S302: Apply min-max normalization to the first sample set. The transformation is $x_{ij} \leftarrow \dfrac{x_{ij} - x_j^{\min}}{x_j^{\max} - x_j^{\min}}$, where $x_{ij}$ is the $j$-th attribute of the $i$-th sample, $x_j^{\max}$ is the maximum of attribute $j$ over all training sample data, and $x_j^{\min}$ is the minimum of attribute $j$ over all data.
S303: Set the initial weight vector of the first sample attributes to $w_0 = [1/I, 1/I, \ldots, 1/I]^T$ and use it as the current weight vector. With $t$ denoting the iteration count, $t = 0$ at this point, i.e. no iteration has started, and the initial weight vector $w_0$ is used as the current weight vector $w_t$. A total of 30 iterations are performed.
S304: Update the current weight vector by a gradient-descent update, obtaining the next weight vector after one iteration.
Specifically, the optimization objective function $J(w)$ is solved by maximization, and the next weight vector $w_{t+1}$ is updated.
Here the two neighbor matrices are the nearest-neighbor sample matrices of sample $x_i$ among the samples of other classes and among the samples of the same class, respectively, and $k$ is the number of neighbors, set a priori. $\alpha_i$ and $\beta_i$ are the coefficient vectors of $x_i$ with respect to the other-class and same-class neighbors, respectively; each is obtained by solving an optimization problem.
Therefore, using $J(w)$, the current weight vector $w_t$ is updated by the update formula to obtain the next weight vector $w_{t+1}$ after one iteration.
S305: Judge whether the determination rule holds. If it does, take the next weight vector as the final weight vector and execute S306; if it does not, take the next weight vector as the current weight vector and return to S304. The determination rule is $\|w_{t+1} - w_t\| \le \theta$, where $w_t$ is the current weight vector, $w_{t+1}$ is the next weight vector, and $\theta$ is the stopping criterion.
Specifically, a stopping criterion $\theta = 0.001$ is set, and it is judged whether $\|w_{t+1} - w_t\| \le \theta$ holds. If it does, the next weight vector $w_{t+1}$ is taken as the final weight vector $w = [w_1, w_2, \ldots, w_I]^T \in \mathbb{R}^{7129}$, and S306 is executed. If it does not, the next weight vector $w_{t+1}$ is taken as the current weight vector $w_t$, and the method returns to S304 for a new iteration.
S306: Perform feature selection according to the final weight vector, obtaining a feature index subset.
Specifically, feature selection is performed from the final weight vector $w$ according to classification accuracy, obtaining the corresponding feature index subset $F$. This reduces the feature dimension of the first sample set and thereby the amount and time of computation.
S307: Perform feature selection on the first sample set according to the feature index subset, obtaining a feature-selected second sample set.
Specifically, feature selection is performed on the first sample set according to the feature index subset $F$ to obtain the second sample set, in which each sample $x_i \in \mathbb{R}^{|F|}$, with $|F| < 7129$.
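One way to turn a final weight vector into a feature index subset by classification accuracy is sketched below. The source does not spell out the selection procedure, so everything here is an assumption: features are ranked by descending weight, each prefix of the ranking is scored with leave-one-out 1-nearest-neighbor accuracy on the training set, and the best-scoring prefix is returned. The names `loo_1nn_accuracy` and `select_feature_subset` are hypothetical.

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances between all samples
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a sample may not be its own neighbor
    nearest = d.argmin(axis=1)
    return float(np.mean(y[nearest] == y))

def select_feature_subset(X, y, w, max_size=None):
    """Return the top-weighted feature indices with the best accuracy."""
    order = np.argsort(-w)  # feature indices, largest weight first
    max_size = len(order) if max_size is None else max_size
    best_acc, best_F = -1.0, order[:1]
    for m in range(1, max_size + 1):
        F = order[:m]
        acc = loo_1nn_accuracy(X[:, F], y)
        if acc > best_acc:  # strict comparison: ties keep the smaller subset
            best_acc, best_F = acc, F
    return np.sort(best_F), best_acc

# Toy data: feature 0 separates the classes, feature 1 is noise
X = np.array([[0.0, 0.9], [0.1, 0.1], [1.0, 0.8], [0.9, 0.2]])
y = np.array([0, 0, 1, 1])
w = np.array([0.9, 0.1])
F, acc = select_feature_subset(X, y, w)
X_second = X[:, F]  # feature-selected second sample set
```

Scoring only prefixes of the weight ranking keeps the search linear in the number of features; the `max_size` argument can cap it further when the dimension is large.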
Evaluation module:
S308: Acquire the first data to be evaluated.
Specifically, the data sample to be evaluated $x \in \mathbb{R}^I$ is input as the first data sample to be evaluated.
S309: Apply min-max normalization to the first data to be evaluated.
Specifically, the data sample to be evaluated $x \in \mathbb{R}^I$ is obtained as the first data to be evaluated and normalized with the min-max normalization introduced in the embodiment above, i.e. $x_j \leftarrow \dfrac{x_j - x_j^{\min}}{x_j^{\max} - x_j^{\min}}$ for $j = 1, \ldots, I$, where $x_j^{\min}$ and $x_j^{\max}$ are the minimum and maximum of attribute $j$.
S310: Perform feature selection on the first data to be evaluated according to the feature index subset $F$, obtaining the second data to be evaluated $x'$.
S311: Classify the second data to be evaluated on the second sample set using a K-nearest-neighbor classifier, obtaining a classification result.
Specifically, on the basis of the second sample set, a K-nearest-neighbor classifier is used to classify the second data to be evaluated $x'$, yielding a classification result. This result can then be used to diagnose the first data sample to be evaluated $x$.
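Steps S309 to S311 can be chained and scored on a held-out test set, as in the sketch below. It is a hedged Python illustration rather than the patented implementation: `knn_predict` and `evaluate` are hypothetical names, the training minima and maxima are reused for the test samples, the distance is Euclidean, and vote ties go to the smaller label, all of which are assumptions.

```python
import numpy as np

def knn_predict(X2, y, Q, k):
    """Predict a label for each row of Q by k-NN majority vote over X2."""
    preds = []
    for q in Q:
        nearest = np.argsort(np.linalg.norm(X2 - q, axis=1), kind="stable")[:k]
        labels, counts = np.unique(y[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])  # ties go to the smaller label
    return np.array(preds)

def evaluate(X_train, y_train, X_test, y_test, F, k=3):
    """Normalise (S309), select features (S310), classify (S311), score."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # assumption: guard constant attributes
    X2 = ((X_train - lo) / span)[:, F]      # second sample set
    Q = ((X_test - lo) / span)[:, F]        # second data to be evaluated
    return float(np.mean(knn_predict(X2, y_train, Q, k) == y_test))

# Toy data: attribute 0 separates the two classes
X_train = np.array([[1.0, 50.0], [2.0, 60.0], [9.0, 55.0], [10.0, 52.0]])
y_train = np.array([1, 1, 2, 2])
X_test = np.array([[1.5, 58.0], [9.5, 51.0]])
y_test = np.array([1, 2])
accuracy = evaluate(X_train, y_train, X_test, y_test, F=[0], k=3)
```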
The present invention proposes a medical data classification method based on local-learning feature weights that improves the feature selection method of LH-RELIEF: a feature combination $F$, with $1 \le \mathrm{length}(F) \le 7129$, is extracted from the 23 training samples of dimension 7129, and the 11 test samples of dimension 7129 are classified. The proposed method was compared with the LH-RELIEF algorithm on the same data set, with 78 training samples drawn at random 10 times. The average convergence results are shown in Fig. 3 and the average performance results in Fig. 4. The present invention converges faster than the MSVM-RFE algorithm and, for the same number of selected genes, achieves better classification performance.
Table 1 compares the best average classification performance obtained by each of the two methods. The present invention improves on the LH-RELIEF method by about 2 percentage points.
Table 1. Comparison of the best classification performance of LH-RELIEF and the present invention
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and identical or similar parts may be understood by cross-reference between embodiments.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. The present invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710419357.7A CN107193993A (en) | 2017-06-06 | 2017-06-06 | The medical data sorting technique and device selected based on local learning characteristic weight |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107193993A true CN107193993A (en) | 2017-09-22 |
Family
ID=59877175
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710419357.7A Pending CN107193993A (en) | 2017-06-06 | 2017-06-06 | The medical data sorting technique and device selected based on local learning characteristic weight |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107193993A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763873A (en) * | 2018-05-28 | 2018-11-06 | 苏州大学 | A kind of gene sorting method and relevant device |
CN109243561A (en) * | 2018-08-10 | 2019-01-18 | 上海交通大学 | Model optimization method and system of treatment scheme recommendation system |
CN109243561B (en) * | 2018-08-10 | 2020-07-28 | 上海交通大学 | Model optimization method and system of treatment scheme recommendation system |
JP2022500798A (en) * | 2019-01-29 | 2022-01-04 | 深▲せん▼市商▲湯▼科技有限公司Shenzhen Sensetime Technology Co., Ltd. | Image processing methods and equipment, computer equipment and computer storage media |
JP7076648B2 (en) | 2019-01-29 | 2022-05-27 | 深▲セン▼市商▲湯▼科技有限公司 | Image processing methods and equipment, computer equipment and computer storage media |
CN113971604A (en) * | 2020-07-22 | 2022-01-25 | 中移(苏州)软件技术有限公司 | Data processing method, device and storage medium |
CN113657499A (en) * | 2021-08-17 | 2021-11-16 | 中国平安财产保险股份有限公司 | Rights and interests allocation method and device based on feature selection, electronic equipment and medium |
CN113657499B (en) * | 2021-08-17 | 2023-08-11 | 中国平安财产保险股份有限公司 | Rights and interests distribution method and device based on feature selection, electronic equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zeebaree et al. | Gene selection and classification of microarray data using convolutional neural network | |
CN109543763B (en) | Raman spectrum analysis method based on convolutional neural network | |
CN107193993A (en) | The medical data sorting technique and device selected based on local learning characteristic weight | |
CN102402690B (en) | The data classification method integrated based on intuitionistic fuzzy and system | |
CN104966106B (en) | A kind of biological age substep Forecasting Methodology based on support vector machines | |
CN108416364A (en) | Integrated study data classification method is merged in subpackage | |
Wang et al. | imDC: an ensemble learning method for imbalanced classification with miRNA data | |
CN111061700A (en) | Hospitalizing migration scheme recommendation method and system based on similarity learning | |
CN110827923B (en) | Prediction method of semen protein based on convolutional neural network | |
CN106651574A (en) | Personal credit assessment method and apparatus | |
CN114819056B (en) | A single-cell data integration method based on domain adversarial and variational inference | |
Cengil et al. | A hybrid approach for efficient multi‐classification of white blood cells based on transfer learning techniques and traditional machine learning methods | |
WO2023217290A1 (en) | Genophenotypic prediction based on graph neural network | |
CN116226629B (en) | Multi-model feature selection method and system based on feature contribution | |
CN107403188A (en) | A kind of quality evaluation method and device | |
CN104598774A (en) | Feature gene selection method based on logistic and relevant information entropy | |
Shoohi et al. | DCGAN for Handling Imbalanced Malaria Dataset based on Over-Sampling Technique and using CNN. | |
CN115064217A (en) | Protein immunogenicity classifier construction method, prediction method, device and medium | |
CN105046236A (en) | Iterative tag noise recognition algorithm based on multiple voting | |
CN109409231B (en) | Multi-feature fusion sign language recognition method based on adaptive hidden Markov | |
CN111414930B (en) | Deep learning model training method and device, electronic equipment and storage medium | |
CN111144296A (en) | Retina fundus picture classification method based on improved CNN model | |
CN106951728A (en) | A kind of tumour key gene recognition methods based on particle group optimizing and marking criterion | |
CN107563287A (en) | Face identification method and device | |
CN117877744A (en) | Construction method and system of auxiliary reproductive children tumor onset risk prediction model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170922 |