CN106228034A - A hybrid optimization method for tumor-related gene search

Info

Publication number: CN106228034A
Application number: CN201610555700.6A
Authority: CN
Inventors: 李小波; 田中娟; 叶晓平; 叶振
Original assignee: Lishui University
Current assignee: Lishui University
Priority date: 2016-07-12
Filing date: 2016-07-12
Publication date: 2016-12-14

Abstract

The invention discloses a hybrid optimization method for tumor-related gene search, comprising the following steps: step 1, use the support vector machine recursive feature elimination algorithm to obtain a "ranked gene set"; step 2, establish the candidate gene set Ω_k; step 3, search the solution space of the candidate gene set Ω_k with a genetic algorithm; step 4, determine the optimal gene set, whose genes are regarded as tumor-related genes. The method has a small computational load, its feasibility and effectiveness have been confirmed, and both working efficiency and prediction accuracy are significantly improved.

Description

A hybrid optimization method for tumor-associated gene search

Technical field

The invention belongs to the technical field of gene search and relates to a hybrid optimization method for tumor-related gene search.

Background

Recent advances in cancer genomics research offer opportunities for personalized cancer medicine [1]. Tumors are highly heterogeneous, systemic, and complex diseases, and this remains a major obstacle to accurate cancer diagnosis and treatment. Different tumor patients have different pathogenic pathways; if a single type of treatment is applied to all tumors of a given class, overtreatment or ineffective treatment easily results. A typical example is the anticancer drug trastuzumab, an antibody that interferes with the human epidermal growth factor receptor 2 (HER2) and is effective only in patients who overexpress HER2 [2]. Personalized tumor medicine therefore requires molecular classification of tumors and the identification of reliable tumor biomarkers that predict tumor subtypes.

Today, many high-throughput technologies, including microarrays, can monitor the expression values of thousands of genes simultaneously and have therefore been applied successfully to tumor molecular classification and tumor biomarker identification [3]. However, microarray data usually have a small sample size (fewer than 100) and a very large number of genes (generally more than 10,000). The key problem is how to select, from these thousands of genes, a small set that can then be used to classify tumor samples accurately [4,5].

Summary of the invention

The purpose of the invention is to provide a hybrid optimization method for tumor-related gene search that addresses the heavy computation, low accuracy, and time-consuming, labor-intensive nature of the prior art.

The technical solution adopted by the invention is a hybrid optimization method for tumor-related gene search, implemented according to the following steps:

Step 1. Use the support vector machine recursive feature elimination algorithm to obtain the "ranked gene set"

For a linear SVM classifier there is an optimal hyperplane, whose weight vector and classification margin are given by:

$w = \sum_{i=1}^{n} \alpha_i c_i x_i, \qquad (1)$

$\text{margin width} = 2/\lVert w \rVert, \qquad (2)$

where w is the normal vector of the optimal hyperplane;

x_i is the gene expression vector of sample i in the training set, i = 1, 2, …, k, where k is the number of support vectors;

c_i ∈ {-1, +1} is the class label of sample i;

the weights α_i are computed from the training set. For most training vectors α_i is zero; a training vector with non-zero α_i is a support vector. "margin width" denotes the classification margin;

SVM-RFE uses backward elimination, repeatedly deleting the gene that contributes least to the SVM classifier. The objective function J of SVM-RFE is defined as:

$J = \frac{1}{2}\lVert w \rVert^{2}, \qquad (3)$

Expanding J as a Taylor series to second order, as in the Optimal Brain Damage algorithm, approximates the change in J caused by removing each gene:

$\Delta J(i) = \frac{\partial J}{\partial w_i} \Delta w_i + \frac{\partial^{2} J}{\partial w_i^{2}} (\Delta w_i)^{2}, \qquad (4)$

In the optimization of J the first-order term is neglected, so the expansion reduces to its second-order term:

$\Delta J(i) = (\Delta w_i)^{2}, \qquad (5)$

Removing the i-th feature from the classifier corresponds to a weight change Δw_i = w_i, so (w_i)² is used as the ranking score of SVM-RFE; at each iteration the feature with the smallest score (w_i)² is eliminated;

Step 2. Establish the candidate gene set Ω_k

Select the top n ranked genes as the candidate gene set; the parameter n depends on the microarray gene expression data set;

Step 3. Search the solution space of the candidate gene set Ω_k with a genetic algorithm

Following the principle of survival of the fittest, successive generations produce increasingly good approximate solutions. In each generation, every individual is evaluated by the fitness function of the problem domain and the fitter individuals are retained; the genetic operations of crossover and mutation then produce a new set of solutions. This process is repeated until a predetermined termination condition is met;

Step 4. Determine the optimal gene set

Compare the gene sets obtained in step 3 by the prediction accuracy and average gene subset size of each model. Where prediction accuracy is equal, the n that gives the smallest average gene subset size is taken as the optimal parameter n. Running the method with this optimal n yields the gene subset with the fewest genes and the highest prediction accuracy, the "optimal gene set", whose genes are regarded as tumor-related genes.

The beneficial effect of the invention is that the method combines the respective advantages of the genetic algorithm (GA) and the support vector machine recursive feature elimination algorithm (SVM-RFE) [5-11]; its feasibility and effectiveness have been confirmed, and working efficiency and prediction accuracy are significantly improved.

Description of the drawings

Fig. 1 shows the 10-fold cross-validation prediction accuracy of the classifier on the prostate and NCI60 data sets as the number of genes is reduced from 100 to 1 by the method of the invention.

Detailed description

The method of the invention (hereinafter SVM-RFE/GA) is implemented according to the following steps:

Step 1. Use the support vector machine recursive feature elimination algorithm (SVM-RFE) to obtain the "ranked gene set"

The support vector machine (SVM) is an effective method for sparse classification problems such as the classification of microarray gene expression data. For a linear SVM classifier there is an optimal hyperplane, whose classification margin is defined as:

$\text{margin width} = 2/\lVert w \rVert, \qquad (2)$

SVM-RFE is an embedded feature gene selection method [4]. It uses backward elimination, repeatedly deleting the gene that contributes least to the SVM classifier. The objective function J of SVM-RFE is defined as:

$J = \frac{1}{2}\lVert w \rVert^{2}, \qquad (3)$

Expanding J as a Taylor series to second order, as in the Optimal Brain Damage (OBD) algorithm [12], approximates the change in J caused by removing each gene:

$\Delta J(i) = (\Delta w_i)^{2}, \qquad (5)$

Removing the i-th feature from the classifier corresponds to a weight change Δw_i = w_i, so (w_i)² is used as the ranking score of SVM-RFE. At each iteration the feature with the smallest score (w_i)² is eliminated, because it has the least influence on the classifier.
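The step from the objective (3) to the ranking score can be written out explicitly. The following is a short worked derivation using only the definitions given in this document; like the text, it drops the first-order term rather than evaluating it.

```latex
\begin{align*}
J &= \tfrac{1}{2}\lVert w \rVert^{2} = \tfrac{1}{2}\sum_{j} w_j^{2}
  \quad\Longrightarrow\quad \frac{\partial^{2} J}{\partial w_i^{2}} = 1, \\
\Delta J(i) &\approx \frac{\partial^{2} J}{\partial w_i^{2}}\,(\Delta w_i)^{2} = (\Delta w_i)^{2}, \\
\Delta w_i &= w_i \ \text{(feature $i$ removed)}
  \quad\Longrightarrow\quad \Delta J(i) = (w_i)^{2} = r_i .
\end{align*}
```

A gene with a small (w_i)² therefore changes J least when removed, which is why it is the one eliminated.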

The specific steps of the SVM-RFE algorithm are:

Input: initial gene set I = {1, 2, …, n}, ranked gene set O = {};

Output: ranked gene set O;

Repeat steps 1.1-1.4 until the initial gene set I is empty:

1.1) Train a linear support vector machine on the training data set, using the genes in I as input variables;

1.2) For every gene in I, compute the ranking score r_i = (w_i)²;

1.3) Select the gene with the smallest ranking score: g = argmin{r_i};

1.4) Update the ranked gene set O and the initial gene set I: O = O ∪ {g}, I = I \ {g}; that is, remove gene g from I and add it to O. When the loop ends, output the ranked gene set O;
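Steps 1.1-1.4 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: `train_linear_svm` is a hypothetical stand-in that fits a linear SVM by full-batch sub-gradient descent on the regularised hinge loss (the experiments in this document use WEKA's SMO classifier), and its hyperparameters `epochs`, `lam`, and `lr` are assumptions. Only the elimination loop and the score r_i = (w_i)² follow the steps above.

```python
def train_linear_svm(X, y, epochs=200, lam=0.01, lr=0.1):
    """Hypothetical stand-in trainer: sub-gradient descent on the
    L2-regularised hinge loss. Returns the weight vector w."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # this sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w


def svm_rfe(X, y):
    """Return gene indices ranked from most to least important."""
    remaining = list(range(len(X[0])))  # initial gene set I
    eliminated = []                     # ranked gene set O, worst gene first
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]        # step 1.1
        w = train_linear_svm(X_sub, y)
        scores = [wj * wj for wj in w]                            # step 1.2
        worst = min(range(len(scores)), key=scores.__getitem__)   # step 1.3
        eliminated.append(remaining.pop(worst))                   # step 1.4
    return eliminated[::-1]  # reverse so the last gene removed comes first


# Toy data (assumed): gene 0 separates the classes, genes 1 and 2 do not.
X = [[2.0, 0.1, 0.0], [1.5, -0.1, 0.0], [-2.0, 0.1, 0.0], [-1.5, -0.1, 0.0]]
y = [1, 1, -1, -1]
ranking = svm_rfe(X, y)
print(ranking[0])  # prints 0
```

The function returns the ranking best-first, so the top n genes needed in step 2 are simply the first n entries.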

Step 2. Establish the candidate gene set Ω_k

At each step the SVM-RFE algorithm eliminates the "worst" gene, producing a ranking of genes by their importance for classification. In step 1 of the invention, SVM-RFE is applied to the initial gene set I to produce the ranked gene set O. This acts as a pre-filtering stage whose purpose is to remove irrelevant and noisy genes while retaining informative genes.

However, SVM-RFE ignores interactions between genes, which is one of its shortcomings. The invention therefore selects the top n ranked genes to build candidate gene sets Ω_k of different sizes and searches each Ω_k with a genetic algorithm (GA), with the aim of removing redundant genes and arriving at a smaller set of tumor-related genes.

When the top n ranked genes are selected as the candidate gene set, the choice of n is a key issue for the subsequent GA optimization. If n is too small, the classifier cannot reach the highest prediction accuracy; if n is too large, the GA may become trapped in a local optimum and select a larger number of genes. The parameter n is preferably between 5 and 100, depending on the microarray gene expression data set; for example, n may be set to 10, 20, 30, and 50 in turn.
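Building the candidate sets from a best-first ranking is a simple slicing operation. The sketch below assumes the ranking is already available as a list; the function name `build_candidate_sets` and the gene identifiers are hypothetical, while the sizes 10, 20, 30, 50 and the preferred range 5 to 100 come from the text above.

```python
def build_candidate_sets(ranked_genes, sizes=(10, 20, 30, 50)):
    """Map each n to the top-n genes of a best-first ranking."""
    for n in sizes:
        if not 5 <= n <= 100:
            raise ValueError(f"n={n} is outside the preferred range 5..100")
        if n > len(ranked_genes):
            raise ValueError(f"n={n} exceeds the {len(ranked_genes)} ranked genes")
    return {n: list(ranked_genes[:n]) for n in sizes}


# Hypothetical identifiers standing in for the top 100 ranked genes.
ranked = [f"gene_{i}" for i in range(1, 101)]
omega = build_candidate_sets(ranked)
print({n: len(genes) for n, genes in omega.items()})
```

Each candidate set is a prefix of the next larger one, so the GA runs for different n search nested spaces.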

Step 3. Search the solution space of the candidate gene set Ω_k with a genetic algorithm

The genetic algorithm (GA) [13-15] is a global adaptive probabilistic search algorithm based on the principles of natural selection and genetics. It imitates the biological mechanisms of evolution and natural selection (survival of the fittest) and the genetic mechanisms of recombination and mutation. A GA starts from a randomly generated initial population containing a fixed number of encoded individuals.

Following the principle of survival of the fittest, successive generations produce increasingly good approximate solutions. In each generation, every individual is evaluated by the fitness function of the problem domain and the fitter individuals are retained; the genetic operations of crossover and mutation then produce a new set of solutions. This process is repeated until a predetermined termination condition is met. The specific steps of the GA in this method are:

3.1) Representation of individuals: each individual is encoded as an N-bit binary vector, where N is the size of the gene space; a bit value of "1" means the gene is selected and a bit value of "0" means it is not;

3.2) Fitness function: each individual is evaluated by a support vector machine (SVM) classifier, for example the SMO classifier of the WEKA platform [16]; the objective is to minimize the classification error rate of the classifier;

3.3) Genetic operators: selection uses roulette wheel selection, and variation uses single-point crossover and bit-flip mutation. The preferred GA parameters are: crossover probability = 1, mutation probability = 0.02, maximum number of generations = 50, and population size = 30.
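Steps 3.1-3.3 can be sketched as a small Python routine. This is an illustration under stated assumptions, not the implementation used in the experiments: the function `run_ga`, the best-so-far bookkeeping, and the toy fitness are hypothetical. In the method described here the fitness would come from the SVM classifier's accuracy on the selected genes; the toy function below only stands in for it. The default parameters are the preferred values from step 3.3.

```python
import random


def run_ga(n_genes, fitness, pop_size=30, generations=50,
           p_cross=1.0, p_mut=0.02, seed=0):
    """Search N-bit gene masks; `fitness` must return a non-negative score.
    pop_size is assumed to be even so that parents can be paired."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best, best_fit, history = None, float("-inf"), []
    for _ in range(generations):
        fits = [fitness(ind) for ind in pop]
        for ind, f in zip(pop, fits):        # remember the best mask seen so far
            if f > best_fit:
                best, best_fit = ind[:], f
        history.append(best_fit)
        # Roulette wheel selection: parents drawn with probability ~ fitness.
        weights = [f + 1e-9 for f in fits]   # epsilon avoids an all-zero wheel
        parents = rng.choices(pop, weights=weights, k=pop_size)
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if n_genes > 1 and rng.random() < p_cross:   # single-point crossover
                cut = rng.randint(1, n_genes - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            children += [a, b]
        # Bit-flip mutation: each bit flips independently with probability p_mut.
        pop = [[bit ^ 1 if rng.random() < p_mut else bit for bit in ind]
               for ind in children]
    return best, best_fit, history


# Toy fitness (assumed): fraction of three "informative" genes that are selected.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    return sum(mask[i] for i in INFORMATIVE) / len(INFORMATIVE)


best_mask, best_score, history = run_ga(10, toy_fitness)
print(best_mask, best_score)
```

Because roulette wheel selection needs non-negative weights, a fitness defined as accuracy (rather than error rate) is convenient here; minimizing the error rate and maximizing accuracy select the same individuals.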

The classification model is validated by 10-fold cross-validation accuracy. Because the genetic algorithm is a stochastic search, five trials are run on each candidate gene set Ω_k, and the search yields the "gene set" with the highest classification accuracy.
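For readers unfamiliar with the validation scheme, 10-fold cross-validation accuracy can be sketched as below. The helper names are hypothetical, and the nearest-centroid classifier is only a stand-in so that the example is self-contained; the experiments in this document use WEKA's SMO classifier. The folds here are not stratified by class, which is a simplifying assumption.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Fraction of samples classified correctly when each fold is held out once."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


def fit_centroids(X, y):
    """Stand-in classifier (assumed): store the mean vector of each class."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}


def predict_centroid(model, x):
    """Assign x to the class whose centroid is nearest."""
    return min(model, key=lambda c: sum((m - v) ** 2 for m, v in zip(model[c], x)))


# Toy data (assumed): 20 samples, one gene, two well-separated classes.
X = [[5.0 + 0.1 * i] for i in range(10)] + [[-5.0 - 0.1 * i] for i in range(10)]
y = [1] * 10 + [-1] * 10
acc = cross_val_accuracy(X, y, fit_centroids, predict_centroid)
print(acc)
```

A function of this shape, restricted to the genes switched on in a GA individual, is what would serve as the fitness evaluation in step 3.2.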

Step 4. Determine the optimal gene set

Compare the gene sets obtained in step 3 by the prediction accuracy and average gene subset size of each model. Where prediction accuracy is equal, the n that gives the smallest average gene subset size is taken as the optimal parameter n (that is, the n of the "top n ranked genes" in step 2). Running the method with this optimal n yields the gene subset with the fewest genes and the highest prediction accuracy, defined as the "minimum gene subset" or "optimal gene set"; its genes are regarded as tumor-related genes.
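The selection rule in step 4 amounts to sorting by accuracy first and by average subset size second. The sketch below applies it to the averages reported in Tables 2 and 3 of this document; the function name `pick_best_n` and the dictionary layout are assumptions made for illustration.

```python
def pick_best_n(results):
    """results maps n -> (mean accuracy in %, mean gene subset size).
    Prefer the highest accuracy; break ties by the smallest subset size."""
    return min(results, key=lambda n: (-results[n][0], results[n][1]))


# Averages over five GA trials, taken from Tables 2 and 3.
prostate = {10: (100.0, 5.4), 20: (100.0, 7.0), 30: (100.0, 8.0), 50: (100.0, 13.2)}
nci60 = {10: (65.8, 6.0), 20: (84.6, 13.8), 30: (94.1, 20.0), 50: (100.0, 28.0)}

print(pick_best_n(prostate))  # prints 10
print(pick_best_n(nci60))     # prints 50
```

On the prostate data every n reaches 100%, so the tie-break on subset size decides; on NCI60 only n = 50 reaches 100%, so accuracy alone decides.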

Among the tens of thousands of genes measured by microarray technology, four types of genes can be distinguished [10]: 1) informative genes, which are important for the molecular classification of cancer and play a significant role in tumor development and progression; 2) redundant genes, which resemble informative genes and may be cancer-related but contribute little to molecular classification; 3) irrelevant genes, which are unrelated to cancer and have no effect on classification; 4) noisy genes, which have a negative effect and may degrade classification performance. The aim of gene selection is therefore to retain the first type, the informative genes, and remove the other three.

The advantage of the invention is that step 1 effectively removes irrelevant and noisy genes, so that step 2 can pass a small number of genes to the subsequent optimization search. The optimization search in step 3 then removes the redundant genes that step 1 cannot. Step 4 finally yields the "optimal gene set", with the highest classification accuracy and the fewest genes, which constitutes the tumor-related genes. The method has simple steps and a small computational workload and is easy to carry out. It overcomes the redundancy in gene sets produced by the existing SVM-RFE algorithm, as well as the heavy computation, tendency toward local optima, and impracticality of standalone use of existing genetic algorithms. By combining the respective advantages of SVM-RFE and the genetic algorithm, it produces an optimal gene set that is small, classifies accurately, is closely related to tumors, and is convenient for later experimental validation.

Experimental verification

1) Data sets

The performance of the SVM-RFE/GA model was validated on one binary and one multiclass microarray gene expression data set. Table 1 summarizes the data sets.

Table 1. One binary and one multiclass microarray gene expression data set

The Prostate data set is a binary gene expression data set containing 52 tumor samples and 50 normal prostate tissue samples; it was downloaded from http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi.

The NCI60 data set is a multiclass gene expression data set containing 9 tumor types and 60 samples; it was downloaded from http://www.broadinstitute.org/mpr/NCI60/.

2) Experimental platform

The experiments were run on the WEKA platform [16] (http://www.cs.waikato.ac.nz/ml/weka/). The SMO classifier with a polynomial kernel (PolyKernel) performed the classification. The penalty parameter C of the classifier was set to 100, and classifier performance was evaluated by 10-fold cross-validation. The GA parameters were: crossover probability = 1, mutation probability = 0.02, maximum number of generations = 50, and population size = 30.

Preprocessing of the experimental data consisted of removing housekeeping genes, which left 12533 gene expression values in the prostate data set and 7071 in the NCI60 data set, and standardizing the gene expression values to mean 0 and standard deviation 1.
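The standardization step can be sketched as a per-gene z-score. The text does not say whether the population or the sample standard deviation was used, nor whether scaling was per gene or per sample; the sketch assumes per-gene scaling with the population standard deviation, and the function name `standardize_genes` is hypothetical.

```python
from statistics import mean, pstdev


def standardize_genes(matrix):
    """Scale each gene (column) to mean 0 and standard deviation 1.
    Rows are samples, columns are genes. Constant genes are mapped to 0."""
    scaled_columns = []
    for column in zip(*matrix):
        mu, sigma = mean(column), pstdev(column)
        scaled_columns.append(
            [(v - mu) / sigma if sigma > 0 else 0.0 for v in column]
        )
    return [list(row) for row in zip(*scaled_columns)]


# Toy expression matrix (assumed): 3 samples x 2 genes.
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
scaled = standardize_genes(raw)
print(scaled)
```

Putting genes on a common scale matters for SVM-RFE in particular, because the ranking score (w_i)² is only comparable across genes when their expression values have comparable ranges.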

3) Experimental results

The SVM-RFE algorithm was first used to generate the ranked gene set, with genes sorted in descending order of rank. Previous studies typically retain a subset of 50-100 genes; the method of the invention retains the top 100. To test SVM-RFE, the number of genes was reduced from 100 to 1, eliminating the lowest-scoring gene at each step, and the classifier was evaluated by 10-fold cross-validation. With the initial 100 genes, the classifier reached 100% prediction accuracy on both data sets. As Fig. 1 shows, the smallest numbers of genes with which the classifier still reached 100% accuracy were 9 for the prostate data set and 80 for the NCI60 data set. This indicates that a binary data set can be classified satisfactorily with fewer genes than a multiclass data set. On the NCI60 data set, classification accuracy did not exceed 90% when fewer than 36 genes were used, whereas on the prostate data set 9 genes sufficed for 100% 10-fold cross-validation accuracy.

From the SVM-RFE ranking, the top n genes were taken as the candidate gene set, with n set to 10, 20, 30, and 50 in turn. Because the genetic algorithm is a stochastic search, five trials were run on each candidate gene set and the results were averaged.

On the prostate cancer data set, the genetic algorithm found the smallest gene subsets with 100% classification accuracy when the top 10 genes were retained (see Table 2). The average subset size was 5.4 genes, well below the 9 genes that SVM-RFE needs for the same accuracy.

Table 2. 10-fold cross-validation accuracy of the SVM-RFE/GA model on the prostate cancer data set

Top n genes | Average prediction accuracy (%) | Average gene subset size
10 | 100 | 5.4
20 | 100 | 7.0
30 | 100 | 8.0
50 | 100 | 13.2

On the NCI60 data set, the genetic algorithm found the smallest gene subset with 100% classification accuracy when the top 50 genes were retained (see Table 3). The average subset size of 28 genes is far smaller than the 80 genes that SVM-RFE needs for the same accuracy.

Table 3. 10-fold cross-validation accuracy of the SVM-RFE/GA model on the NCI60 data set

Top n genes | Average prediction accuracy (%) | Average gene subset size
10 | 65.8 | 6
20 | 84.6 | 13.8
30 | 94.1 | 20
50 | 100 | 28

These results show that the choice of n is a key issue for the GA. If n is too small, the classifier cannot reach the highest prediction accuracy; if n is too large, the GA may become trapped in a local optimum and select a larger number of genes.

The gene subset that achieves the highest prediction accuracy with the fewest genes is defined as the "optimal gene set". On the prostate cancer data set, the search over the top 10 ranked genes produced the smallest gene subset (n = 5) with 100% prediction accuracy (see Table 4). On the NCI60 data set, the search over the top 50 ranked genes produced the smallest gene subset (n = 26) with 100% prediction accuracy (see Table 5).

Table 4. Optimal gene set obtained on the prostate cancer data set

Table 5. Optimal gene set obtained on the NCI60 data set

The results of the SVM-RFE/GA model were compared with those of other algorithms in terms of prediction accuracy and number of selected genes. On the prostate cancer data set (see Table 6), only the SVM-RFE/GA model and the SVM-RFE algorithm reach 100% prediction accuracy, and SVM-RFE/GA selects fewer genes. On the NCI60 data set (see Table 7) the advantage is more pronounced: at the same 100% prediction accuracy, SVM-RFE/GA uses far fewer genes (n = 26) than SVM-RFE (n = 80).

Table 6. Comparison of SVM-RFE/GA with other algorithms on the prostate cancer data set

Table 7. Comparison of SVM-RFE/GA with other algorithms on the NCI60 data set

Gene selection is an important research topic in microarray data analysis. Gene selection methods aim to eliminate noisy, irrelevant, and redundant genes, which both reduces the computational burden of the classifier and improves its classification accuracy. In addition, a selected subset of informative genes that contains fewer genes is easier to verify in subsequent molecular biology experiments.

In summary, the invention proposes a model that combines the GA with the SVM-RFE algorithm and thereby draws on the respective advantages of embedded and wrapper methods. The method was validated on one binary and one multiclass microarray gene expression data set. The results show that, compared with other algorithms, the proposed feature gene selection method reaches the highest classification accuracy with fewer informative genes. Some of the genes in the optimal gene sets obtained in these experiments (Tables 4 and 5) have been reported in the literature to be closely related to tumor development and progression; the remaining genes can be verified in later molecular biology experiments, with a view to discovering new tumor gene markers.

参考文献：references:

[1]Chin L,Andersen JN,Futreal PA(2011).Cancer genomics:from discoveryscience to personalized medicine.Nat Med 17(3):297-303.[1] Chin L, Andersen JN, Futreal PA (2011). Cancer genomics: from discovery science to personalized medicine. Nat Med 17(3): 297-303.

[2]Ong FS,Das K,Wang J,Vakil H,Kuo JZ,Blackwell WL,Lim SW,Goodarzi MO,Bernstein KE,Rotter JI,Grody WW(2012).Personalized medicine andpharmacogenetic biomarkers:progress in molecular oncology testing.Expert RevMol Diagn 12(6):593-602.[2] Ong FS, Das K, Wang J, Vakil H, Kuo JZ, Blackwell WL, Lim SW, Goodarzi MO, Bernstein KE, Rotter JI, Grody WW (2012). Personalized medicine and pharmacogenetic biomarkers: progress in molecular oncology testing. Expert Rev Mol Diagn 12(6):593-602.

[3]Golub TR,Slonim DK,Tamayo P,Huard C,Gaasenbeek M,Mesirov JP,Coller H,Loh ML,Downing JR,Caligiuri MA,Bloomfield CD,Lander ES (1999).Molecularclassification of cancer:class discovery and class prediction by geneexpression monitoring.Science 286(5439):531-7.[3] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531-7.

[4]Saeys Y,Inza I,Larranaga P(2007).A review of feature selectiontechniques in bioinformatics.Bioinformatics 23(19):2507-17.[4]Saeys Y, Inza I, Larranaga P(2007). A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507-17.

[5]Li X,Peng S,Chen J,Lu B,Zhang H,Lai M(2012).SVM-T-RFE:A novel geneselection algorithm for identifying metastasis-related genes in colorectalcancer using gene expression profiles.Biochemical and Biophysical ResearchCommunications 419(2):148-53.[5] Li X, Peng S, Chen J, Lu B, Zhang H, Lai M(2012). SVM-T-RFE: A novel gene selection algorithm for identifying metastasis-related genes in colorectal cancer using gene expression profiles. Biochemical and Biophysical Research Communications 419(2):148-53.

[6]Guyon I,Weston J,Barnhill S,Vapnik V(2002).Gene selection for cancerclassification using support vector machines.Machine Learning 46(1-3):389-422.[6]Guyon I, Weston J, Barnhill S, Vapnik V(2002). Gene selection for cancer classification using support vector machines. Machine Learning 46(1-3):389-422.

[7]Duan KB,Rajapakse JC,Wang HY,Azuaje F(2005).Multiple SVM-RFE for geneselection in cancer classification with expression data.Ieee Transactions onNanobioscience 4(3):228-34.[7] Duan KB, Rajapakse JC, Wang HY, Azuaje F(2005). Multiple SVM-RFE for gene selection in cancer classification with expression data. Ieee Transactions on Nanobioscience 4(3):228-34.

[8]Zhang XG,Lu X,Shi Q,Xu XQ,Leung HCE,Harris LN,D Iglehart J,Miron A,LiuJS,Wong WH(2006).Recursive SVM feature selection and sample classificationfor mass-spectrometry and microarray data.BMC Bioinformatics 7:-.[8] Zhang XG, Lu X, Shi Q, Xu XQ, Leung HCE, Harris LN, D Iglehart J, Miron A, LiuJS, Wong WH (2006). Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics 7:-.

[9]Zhou X,Tuck DP(2007).MSVM-RFE:extensions of SVM-RFE for multiclassgene selection on DNA microarray data(vol 23,pg 1106,2007).Bioinformatics 23(15):2029-.[9] Zhou X, Tuck DP (2007). MSVM-RFE: extensions of SVM-RFE for multiclassgene selection on DNA microarray data (vol 23, pg 1106, 2007). Bioinformatics 23(15): 2029-.

[10]Tang YC,Zhang YQ,Huang Z(2007).Development of two-stage SVM-RFE geneselection strategy for microarray expression data analysis.Ieee-AcmTransactions on Computational Biology and Bioinformatics 4(3):365-81.[10]Tang YC, Zhang YQ, Huang Z(2007).Development of two-stage SVM-RFE geneselection strategy for microarray expression data analysis.Ieee-AcmTransactions on Computational Biology and Bioinformatics 4(3):365-81.

[11] Mundra PA, Rajapakse JC (2010). SVM-RFE With MRMR Filter for Gene Selection. IEEE Transactions on Nanobioscience 9(1):31-7.

[12] Le Cun Y, Denker J, Solla S, Touretzky DS. Optimal brain damage. Advances in Neural Information Processing Systems: Morgan Kaufmann; 1990. p. 598-605.

[13] Tan F, Fu X, Zhang Y, Bourgeois A (2008). A genetic algorithm-based method for feature subset selection. Soft Computing 12(2):111-20.

[14] Nicoletta D, Barbara P (2009). An evolutionary method for combining different feature selection criteria in microarray data classification. 2009:1-10.

[15] Cannas L, Dessi N, Pes B. A Hybrid Model to Favor the Selection of High Quality Features in High Dimensional Domains. Intelligent Data Engineering and Automated Learning - IDEAL 2011: Springer Berlin Heidelberg; 2011. p. 228-35.

[16] Mark H, Eibe F, Geoffrey H, Bernhard P, Peter R, Ian HW (2009). The WEKA data mining software: an update. SIGKDD Explor Newsl 11(1):10-8.

[17] Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2):203-9.

[18] Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES, Golub TR (2001). Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci U S A 98(19):10787-92.

[19] Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D (2005). Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 21(20):3896-904.

[20] Peng SH, Xu QH, Ling XB, Peng XN, Du W, Chen LB (2003). Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Letters 555(2):358-62.

[21] Ooi CH, Tan P (2003). Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics 19(1):37-44.

Claims

1. A hybrid optimization method for tumor-related gene search, characterized in that it is specifically implemented according to the following steps:

Step 1. Use the support vector machine recursive feature elimination (SVM-RFE) algorithm to obtain the "ranked gene set"

For a linear SVM classifier, there is an optimal hyperplane whose normal vector and classification interval (margin width) are defined as:

w = \sum_{i=1}^{n} \alpha_i c_i x_i, (1)

margin width=2/||w||, (2)

where w is the normal vector of the optimal hyperplane;

x_i is the gene expression vector of sample i in the training set, i = 1, 2, ..., k, and k is the number of support vectors;

c_i ∈ {-1, +1} is the class label of sample i;

The weights α_i are calculated from the training set. The weight α_i of most training vectors is zero; a training vector whose weight α_i is non-zero is a support vector. The margin width is the classification interval;
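Equations (1) and (2) can be checked numerically with a minimal sketch. The function names `hyperplane_normal` and `margin_width` and the two-sample toy data are assumptions made for illustration only; they are not part of the claimed method.

```python
import math


def hyperplane_normal(alphas, labels, vectors):
    """Equation (1): w = sum_i alpha_i * c_i * x_i."""
    dim = len(vectors[0])
    w = [0.0] * dim
    for alpha, c, x in zip(alphas, labels, vectors):
        for j in range(dim):
            w[j] += alpha * c * x[j]
    return w


def margin_width(w):
    """Equation (2): margin width = 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


# Hypothetical toy problem: one support vector per class.
alphas = [0.25, 0.25]
labels = [+1, -1]
vectors = [[1.0, 1.0], [-1.0, -1.0]]

w = hyperplane_normal(alphas, labels, vectors)
width = margin_width(w)
```

In this toy case the margin width equals the distance between the two support vectors, which is what a maximum-margin hyperplane between two points should give.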

SVM-RFE uses backward elimination: it repeatedly deletes the gene that contributes the least to the SVM classifier. The objective function J of SVM-RFE is defined as:

J = (1/2)||w||², (3)

J is expanded as a Taylor series to second order, and the Optimal Brain Damage algorithm is used to approximate the change in J caused by removing each gene:

\Delta J(i) = \frac{\partial J}{\partial w_i} \Delta w_i + \frac{\partial^2 J}{\partial w_i^2} (\Delta w_i)^2, (4)

At the optimum of J, the first-order term of the Taylor series can be ignored, so only the second-order term remains:

ΔJ(i) = (Δw_i)², (5)

Since removing the i-th feature from the classifier corresponds to the weight change Δw_i = w_i, (w_i)² is used as the ranking criterion of SVM-RFE, and in each iteration the feature with the smallest score (w_i)² is eliminated;
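The step from equation (4) to the ranking criterion can be written out as a short worked derivation. It assumes only the definition of J in equation (3) and the neglect of the first-order term stated above.

```latex
J = \tfrac{1}{2}\lVert w \rVert^{2} = \tfrac{1}{2}\sum_{j} w_j^{2}
\;\Rightarrow\;
\frac{\partial^{2} J}{\partial w_i^{2}} = 1,
\qquad
\Delta J(i) \approx \frac{\partial^{2} J}{\partial w_i^{2}}\,(\Delta w_i)^{2}
= (\Delta w_i)^{2} = (w_i)^{2}
\quad \text{for } \Delta w_i = w_i .
```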

Step 2. Establish the candidate gene set Ω_k

Select the top n genes of the ranked gene set as the candidate gene set; the parameter n depends on the microarray gene expression data set in use;

Step 3. For the candidate gene set Ω_k, use a genetic algorithm to search the solution space

Following the principle of survival of the fittest, each successive generation produces better approximate solutions. In each generation, every individual is evaluated by the fitness function of the problem domain, and the fitter individuals are retained; the genetic operations of crossover and mutation are then applied to produce a new solution set; this process is repeated until a predetermined termination condition is met;

Step 4. Determine the optimal gene set

Compare the groups of gene sets obtained in step 3 with respect to the prediction accuracy and the average gene subset size of each model; when the prediction accuracy is the same, select the value giving the smallest average gene subset size as the optimal parameter n; run the method with this optimal parameter n to obtain the gene subset with the fewest genes and the highest prediction accuracy, that is, the "optimal gene set"; the genes in the "optimal gene set" are considered to be tumor-related genes.
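The selection rule of step 4 can be sketched as a small helper. The name `choose_optimal_n`, the tuple layout, and the numbers in `runs` are made up for illustration; the sketch also assumes that higher accuracy is preferred first and that subset size is used only to break ties between exactly equal accuracies.

```python
def choose_optimal_n(results):
    """Pick n from (n, accuracy, avg_subset_size) tuples.

    Higher accuracy wins; among equal accuracies, the smaller
    average gene subset size wins.
    """
    best = min(results, key=lambda r: (-r[1], r[2]))
    return best[0]


# Hypothetical results for three candidate-set sizes.
runs = [
    (20, 0.95, 8.2),
    (40, 0.97, 11.5),
    (60, 0.97, 9.1),
]
optimal_n = choose_optimal_n(runs)
```

In practice accuracies estimated by cross-validation rarely tie exactly, so a tolerance on the accuracy comparison may be a reasonable refinement.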

2. The hybrid optimization method for tumor-related gene search according to claim 1, characterized in that, in step 1, the specific steps of the SVM-RFE algorithm are:

Input the initial gene set I = {1, 2, ..., n} and the ranked gene set O = {};

Repeat the following steps 1.1-1.4 until the initial gene set I is empty:

1.1) With the initial gene set I as an input variable, use the training data set to train the linear support vector machine;

1.2) For every gene in the initial gene set I, calculate its ranking score r_i = (w_i)²;

1.3) Select the gene with the smallest ranking score: g = arg min{r_i};

1.4) Update the ranked gene set O and the initial gene set I: O = O ∪ {g}, I = I − {g}; that is, remove gene g from the initial gene set I and add it to the ranked gene set O; finally, output the ranked gene set O.
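Steps 1.1 to 1.4 can be sketched in pure Python as follows. The trainer `fit_linear_svm` is a deliberately simple stand-in (full-batch subgradient descent on the hinge loss, no bias term) assumed here only so that the example is self-contained; any linear SVM trainer that returns one weight per gene could be substituted. All function names and the toy data are hypothetical.

```python
def fit_linear_svm(X, y, lam=0.01, epochs=200):
    """Stand-in linear SVM: subgradient descent on hinge loss + (lam/2)||w||^2."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for t in range(1, epochs + 1):
        eta = 1.0 / (lam * t)
        grad = [lam * wj for wj in w]
        for xi, ci in zip(X, y):
            # Only samples violating the margin contribute to the hinge term.
            if ci * sum(wj * xj for wj, xj in zip(w, xi)) < 1:
                for j in range(d):
                    grad[j] -= ci * xi[j] / n
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w


def svm_rfe_ranking(X, y, fit=fit_linear_svm):
    """Return gene indices ordered from most to least important."""
    remaining = list(range(len(X[0])))  # initial gene set I
    eliminated = []                     # ranked gene set O, in removal order
    while remaining:
        sub = [[row[j] for j in remaining] for row in X]
        w = fit(sub, y)                               # step 1.1
        scores = [wi * wi for wi in w]                # step 1.2: r_i = (w_i)^2
        worst = min(range(len(scores)), key=scores.__getitem__)  # step 1.3
        eliminated.append(remaining.pop(worst))       # step 1.4
    # Genes removed last contributed most, so reverse the removal order.
    return eliminated[::-1]


# Toy data: gene 0 is strongly informative, gene 1 weakly, gene 2 not at all.
X = [[2.0, 0.5, 0.0], [2.0, 0.5, 0.0], [-2.0, -0.5, 0.0], [-2.0, 0.5, 0.0]]
y = [1, 1, -1, -1]
ranking = svm_rfe_ranking(X, y)
```

Retraining after every single removal follows the claim literally; on real microarray data with thousands of genes, this makes the number of SVM fits equal to the number of genes.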

3. The hybrid optimization method for tumor-related gene search according to claim 1, characterized in that, in step 3, the specific steps of the genetic algorithm (GA) are:

3.1) Individual representation: each individual is encoded as an N-bit binary vector, where N is the size of the gene space; a bit value of "1" indicates that the gene is selected, and a bit value of "0" indicates that the gene is not selected;

3.2) Set the fitness function: each individual is evaluated by a support vector machine classifier, such as the SMO classifier of the WEKA platform, and the objective is to minimize the classification error rate of the classifier;

3.3) Set the genetic operators: selection is performed by roulette-wheel selection, and the genetic operations are implemented by single-point crossover and bit-flip mutation.
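Steps 3.1 to 3.3 can be sketched as a small genetic algorithm. This is an illustration under stated assumptions, not the claimed implementation: the SVM-based fitness of step 3.2 is replaced by a caller-supplied `fitness` function returning a non-negative score (for example, 1 minus the classification error rate), the best individual is carried over unchanged to each new generation (elitism), and `toy_fitness`, the population size, the rates, and all function names are hypothetical.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    acc = 0.0
    for ind, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return ind
    return population[-1]


def single_point_crossover(a, b, rng):
    """Swap the tails of two parents at one random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def bit_flip_mutation(ind, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in ind]


def run_ga(n_genes, fitness, pop_size=20, generations=30,
           mutation_rate=0.05, seed=0):
    """Search N-bit gene masks (1 = gene selected); n_genes must be >= 2."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        fitnesses = [fitness(ind) for ind in population]
        children = [best[:]]  # elitism: keep the fittest individual
        while len(children) < pop_size:
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            for child in single_point_crossover(p1, p2, rng):
                if len(children) < pop_size:
                    children.append(bit_flip_mutation(child, mutation_rate, rng))
        population = children
        best = max(population, key=fitness)
    return best


# Hypothetical stand-in for classifier accuracy: agreement with a hidden mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_fitness(ind):
    return sum(1 for a, b in zip(ind, TARGET) if a == b) / len(TARGET)


best_mask = run_ga(len(TARGET), toy_fitness)
```

With a real classifier in place of `toy_fitness`, each fitness call would train and evaluate an SVM on the genes whose bits are 1, so caching fitness values per mask is worth considering.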