CN107577918A

CN107577918A - CpG Island Identification Method and Device Based on Genetic Algorithm and Hidden Markov Model

Info

Publication number: CN107577918A
Application number: CN201710725585.7A
Authority: CN
Inventors: 刘弘; 何演林; 郑元杰; 赵丹丹; 陆佃杰; 吕晨
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2017-08-22
Filing date: 2017-08-22
Publication date: 2018-01-12

Abstract

The invention relates to a CpG island identification method based on a genetic algorithm and a hidden Markov model, comprising the following steps: 1) obtaining a plurality of chromosomes comprising gene elements, each gene element being represented by a real number, the encoded chromosomes forming a set of hidden Markov model parameters; 2) determining a fitness value for each chromosome using a fitness function, the fitness value representing the quality of the chromosome; 3) using a genetic algorithm to perform an optimization process on the chromosomes according to the fitness values, and then re-determining the fitness values of the optimized chromosomes; 4) applying step 3) iteratively and, once a set termination condition is met, outputting the optimal hidden Markov model parameters; 5) using the output optimal hidden Markov model parameters, determining, for a given observation sequence, the maximum-probability hidden state sequence that generates the observation sequence, which indicates the positions of the CpG islands.

Description

CpG Island Identification Method and Device Based on Genetic Algorithm and Hidden Markov Model

Technical Field

The invention belongs to the field of bioinformatics, and in particular relates to a method and a device for identifying CpG islands based on a genetic algorithm and a hidden Markov model.

Background Art

With the completion of genome sequencing projects, gene sequence identification faces many problems and challenges. In many genomes the rarest dinucleotide is CG: the C in a CG pair is the base most easily methylated, and methylation tends to mutate that C into a T. Methylation, however, is often suppressed in certain regions associated with genes, and these regions are CpG islands. A CpG island is a special DNA sequence a few hundred bp long in which CG dinucleotides occur at very high frequency. Each CpG island found suggests that its sequence may contain the promoter of a gene and its first exon, so identifying CpG islands helps to locate regions of interest in a genome sequence. CpG islands are therefore of great importance for gene sequence identification.

CpG island identification mainly faces two problems: 1. Given a short genomic sequence, how to decide whether it comes from a CpG island. 2. Given a long sequence, how to locate the CpG islands it contains, if any.

Current research focuses mainly on the second problem. Researchers define a CpG island as a region longer than 200 bp with a GC content above 50% and a ratio of observed to expected CpG content greater than 0.6. The traditional identification algorithm defines a sliding window and computes, for the sequence inside the window, the GC content and the observed-to-expected CpG ratio. The choice of window size strongly affects the identification result, and the computational cost is high. Moreover, the criteria are defined by hand, so the CpG islands identified this way have limited biological significance. To find criteria with more biological meaning, some researchers have proposed identifying the positions of CpG islands with a hidden Markov model (HMM). An HMM is a probabilistic model consisting of a sequence of hidden state changes and a sequence of observable symbols generated by those hidden states.
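
The three thresholds quoted above can be checked for a single window with a few lines of Python. This is a minimal sketch, not the patent's method: the function names are illustrative, and the observed-to-expected ratio is assumed to be the usual count(CG) × N / (count(C) × count(G)) for a window of N bases.

```python
def gc_content(window: str) -> float:
    """Fraction of bases in the window that are G or C."""
    return (window.count("G") + window.count("C")) / len(window)


def cpg_obs_exp(window: str) -> float:
    """Observed/expected CpG ratio (assumed form): count(CG) * N / (count(C) * count(G))."""
    c, g = window.count("C"), window.count("G")
    if c == 0 or g == 0:
        return 0.0
    return window.count("CG") * len(window) / (c * g)


def is_cpg_island(window: str, min_len: int = 200,
                  min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the three thresholds quoted in the text to one window."""
    return (len(window) > min_len
            and gc_content(window) > min_gc
            and cpg_obs_exp(window) > min_ratio)
```

A sliding-window scan would call `is_cpg_island` on every window of a chosen size, which is where the sensitivity to window size described above comes from.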

A hidden Markov model is defined by an alphabet Σ, a state set Q, a state transition probability matrix A, and an emission probability matrix B, where:

● Σ is the alphabet of observable symbols;

● Q is the set of states, each of which emits symbols from the alphabet;

● A gives the probability that the HMM moves from its state at time t to a given state at time t+1;

● B gives the probability that the HMM, in its state at time t, emits the symbol s.

Once a system can be described as an HMM, three basic problems can be addressed.

· Decoding problem: given a model and a symbol sequence, find an optimal path through the model. The path starts from the initial state, and each state on the path emits one symbol, which realizes the decoding operation.

· Evaluation problem: for a given model, compute the probability of generating a symbol sequence. The forward algorithm is generally used to compute the probability of an observation sequence under a given HMM, so that the best-fitting HMM can be selected.

· Learning problem: build an HMM from observation sequences.
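
As an illustration of the evaluation problem, the forward algorithm can be sketched as follows. The dictionary-based layout for the initial, transition, and emission probabilities (`start_p`, `trans_p`, `emit_p`) is an assumption made for this example and does not come from the patent.

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """Return P(obs | model) by summing over all hidden paths (forward algorithm)."""
    # alpha[s]: probability of the observed prefix with the chain currently in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())
```

Each step costs on the order of the number of states squared, so the whole sequence is evaluated without enumerating the hidden paths one by one.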

The first two are pattern recognition problems: computing the probability of an observation sequence under a given HMM (evaluation), and searching for the hidden state sequence most likely to have generated an observation sequence (decoding). The third problem, learning, is the hardest of the HMM-related problems: given an observation sequence (drawn from a known set) and an associated set of hidden states, estimate the best-fitting hidden Markov model. The HMM used here has eight states in total: {A+, G+, C+, T+, A-, G-, C-, T-}, where A+ means the state lies inside a CpG island and A- means it lies outside. Each base therefore corresponds to two states in the model, and for a given base sequence it cannot be determined directly which state each base is in. Transitions between states are allowed in the model. The hidden Markov model is used as follows:
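
The eight-state space described above can be written down directly. In this sketch the names `BASES`, `STATES`, `emission_probability`, and `candidate_states` are illustrative. It also assumes that a state such as C+ emits its own base with probability 1, which is what makes the inside/outside label the only hidden part.

```python
BASES = "ACGT"
# Hidden states: each base paired with '+' (inside an island) or '-' (outside).
STATES = [base + sign for sign in "+-" for base in BASES]


def emission_probability(state: str, base: str) -> float:
    """A state such as 'C+' emits its own base with probability 1 (assumption)."""
    return 1.0 if state[0] == base else 0.0


def candidate_states(base: str):
    """The two hidden states that could have produced an observed base."""
    return [s for s in STATES if s[0] == base]
```

For an observed C, for example, `candidate_states("C")` yields `C+` and `C-`, and decoding has to choose between them at every position.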

First, a number of DNA sequences from already confirmed CpG islands are collected, and these real data are used to train the model parameters; this is the learning problem of the hidden Markov model. The model parameters are obtained from the training data by building a hidden Markov model, and the trained model is then used to identify CpG islands.

Given an HMM and a corresponding observation sequence, we want to find the hidden state sequence most likely to have generated it. One could list every possible hidden state sequence and compute the probability of the observation sequence for each of them, but the computational complexity of this approach is very high.

The hidden Markov model is a probabilistic model of sequential data that depends on an initial state probability vector, a transition probability matrix, and an observation probability matrix. Research shows that although the hidden Markov model handles overfitting fairly well, many problems remain. It relies on the strong assumption that the next state is influenced only by the previous state. This assumption is an oversimplification, so the hidden Markov model can only make effective and accurate identifications from maximum likelihood estimates when the assumption agrees with the actual data. In practice, the data usually depend on more than the previous state. As a result, the HMM easily falls into local optima, and its computational complexity is high. To improve the ability of the HMM to identify CpG islands, the HMM parameters need to be optimized.

Summary of the Invention

In view of the deficiencies of the prior art, the invention provides a CpG island identification method based on a genetic algorithm and a hidden Markov model. The method considers solutions across the HMM parameter space in order to reach a global optimum, optimizes the HMM parameters more effectively, and thereby improves the ability to identify CpG islands.

The technical solution of the invention is as follows:

A CpG island identification method based on a genetic algorithm and a hidden Markov model, comprising the following steps:

1) obtaining a plurality of chromosomes comprising gene elements, each gene element being represented by a real number, the plurality of chromosomes forming a set of hidden Markov model parameters;

2) determining a fitness value for each chromosome using a fitness function, the fitness value representing the quality of the chromosome;

3) using a genetic algorithm to perform an optimization process on the chromosomes according to the fitness values, and then re-determining the fitness values of the optimized chromosomes;

4) applying step 3) iteratively and, once the set termination condition is met, outputting the optimal hidden Markov model parameters;

5) using the output optimal hidden Markov model parameters, determining, for a given observation sequence, the maximum-probability hidden state sequence that generates the observation sequence, which indicates the positions of the CpG islands.

The Viterbi algorithm is used to determine the maximum-probability hidden state sequence that generates the observation sequence, as follows:

Based on the local probability and the local best path corresponding to each base state in the observation sequence, the maximum local probability at the current moment and its corresponding local best path are selected by means of the product of the initial probability of the hidden state and the corresponding observation probability; backtracking along the local best path at the current moment then yields the position identification result for the CpG islands.

The specific formulas of the Viterbi algorithm are:

For each state i, i = 1, ..., n, the Viterbi algorithm is defined over:

$$X_i = (X_{i1}, X_{i2}, \ldots, X_{iT})$$

The local probability at time t = 1 is the product of the initial probability of the hidden state and the corresponding observation probability. For every later time, the Viterbi algorithm keeps for each state a back pointer $\varphi_t(i)$ and a local probability $\delta$. With observed symbol $k_t$, observation probability $b$, initial probability $\pi$, and state transition probability $a$, the probability of the best local path reaching state i is $\delta_t(i)$:

$$\delta_1(i) = \pi_i \, b_i(k_1)$$

$$\delta_t(i) = \max_j \left[ \delta_{t-1}(j) \, a_{ji} \right] b_i(k_t), \qquad \varphi_t(i) = \arg\max_j \left[ \delta_{t-1}(j) \, a_{ji} \right]$$

These formulas determine the best path into the next state. To determine the most probable hidden state at time t = T, let:

$$i_T = \arg\max_i \delta_T(i)$$

For every earlier time t:

$$i_t = \varphi_{t+1}(i_{t+1})$$

The Viterbi algorithm reduces the computational complexity through recursion and gives the best explanation of the observation sequence in its full context.
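
The recursion and backtracking described above can be sketched in Python as follows. The dictionary-based parameter layout is an assumption for illustration, and the sketch works with log-probabilities (a common safeguard against underflow on long sequences) rather than the raw products used in the text.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path for obs (log-space Viterbi)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # delta[t][s]: best log-probability of any path ending in state s at time t
    # back[t][s]: predecessor of s on that best path (the back pointer)
    delta = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: delta[t - 1][r] + log(trans_p[r][s]))
            delta[t][s] = (delta[t - 1][prev] + log(trans_p[prev][s])
                           + log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Termination: pick the best final state, then follow the back pointers.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

With n states and a sequence of length T, the work grows as n² × T instead of the nᵀ paths that exhaustive enumeration would have to score.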

Further, the genetic algorithm comprises a selection operation, a crossover operation, and a mutation operation, and the chromosomes are optimized by applying the selection, crossover, and mutation operations in sequence.

The selection operation comprises: according to the fitness value of each chromosome, selecting the chromosomes whose fitness values meet the inheritance requirement to be passed on, and deleting the chromosomes that are not selected.

If the population size is N, a chromosome is $x_i$, and the fitness function is $f(x_i)$, then the probability that $x_i$ is selected is:

$$P(x_i) = \frac{f(x_i)}{\sum_{j=1}^{N} f(x_j)}$$

$q_i$ denotes the cumulative probability of chromosome $x_i$ (i = 1, 2, 3, ..., N):

$$q_i = \sum_{j=1}^{i} P(x_j)$$

Based on these cumulative probabilities, chromosomes are selected by the roulette-wheel method to obtain the samples that meet the inheritance requirement.
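
Roulette-wheel selection over the cumulative probabilities can be sketched as below. The function names are illustrative, and the sketch assumes non-negative fitness values with a positive total.

```python
import bisect
import itertools
import random


def selection_probabilities(fitness):
    """P(x_i): fitness of chromosome i divided by the total fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]


def roulette_select(fitness, n_picks, rng=random):
    """Pick n_picks chromosome indices with probability proportional to fitness."""
    # cumulative[i] is q_i, the running sum of the selection probabilities
    cumulative = list(itertools.accumulate(selection_probabilities(fitness)))
    picks = []
    for _ in range(n_picks):
        r = rng.random()
        # first index whose q_i exceeds r; min() guards against q_N rounding below 1
        picks.append(min(bisect.bisect_right(cumulative, r), len(fitness) - 1))
    return picks
```

A chromosome with a larger fitness value owns a wider slice of the interval [0, 1), so it is drawn more often, while a chromosome with zero fitness is never drawn.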

The crossover operation comprises: among the chromosomes whose fitness values meet the inheritance requirement, selecting some chromosomes with better fitness values as parents, and performing crossover between each two adjacent parent chromosomes to produce offspring chromosomes.

The roulette-wheel method is likewise used to select the parent chromosomes.
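
The text does not state which crossover variant is used, so the sketch below assumes single-point crossover on real-valued chromosomes and pairs the parents in list order. Both choices, and the function names, are assumptions for illustration.

```python
import random


def single_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length real-valued chromosomes at a random cut."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have equal length >= 2")
    cut = rng.randrange(1, len(parent_a))  # cut strictly inside the chromosome
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def crossover_adjacent(parents, rng=random):
    """Cross adjacent parents pairwise: (0, 1), (2, 3), and so on."""
    children = []
    for a, b in zip(parents[0::2], parents[1::2]):
        children.extend(single_point_crossover(a, b, rng))
    return children
```

Because the genes encode probabilities, a row of a decoded probability matrix may no longer sum to 1 after a cut falls inside it, so a practical implementation would probably renormalize rows after crossover.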

The mutation operation comprises: in the offspring chromosomes, first determining the gene mutation site, and then changing the gene value at that site according to the set mutation rate.

Specifically, let p be the value at the selected gene mutation site, r a random number in [0, 1], ct the current generation number, mt the total number of generations, b = 2, and C the gene value after mutation. The value at the mutation site is changed as follows:

The mutation operation changes chromosome values with a very small probability, and the resulting HMM parameters are very close to the HMM parameters before mutation.
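
The mutation formula itself is not reproduced in this text. The variables it names (r, ct, mt, b = 2) match the non-uniform mutation operator commonly attributed to Michalewicz, so the sketch below assumes that operator with gene values bounded to [0, 1]. It should be read as one plausible reading, not as the patent's formula.

```python
import random


def non_uniform_mutation(p, ct, mt, b=2.0, low=0.0, high=1.0, rng=random):
    """Return a mutated value C for gene value p at generation ct of mt (assumed operator)."""
    r = rng.random()
    # The step shrinks toward 0 as ct approaches mt, so late mutations are small.
    shrink = 1.0 - r ** ((1.0 - ct / mt) ** b)
    if rng.random() < 0.5:
        return p + (high - p) * shrink  # move toward the upper bound
    return p - (p - low) * shrink       # move toward the lower bound


def mutate(chromosome, mutation_rate, ct, mt, rng=random):
    """Pick one site at random and mutate it if a random draw is within the mutation rate."""
    child = list(chromosome)
    site = rng.randrange(len(child))
    if rng.random() <= mutation_rate:
        child[site] = non_uniform_mutation(child[site], ct, mt, rng=rng)
    return child
```

Under this assumed operator the mutated value always stays inside [0, 1], and the size of the change decays over the generations, which is consistent with the statement that mutated parameters stay close to the originals.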

Further, the fitness function is:

Further, step 3) is applied iteratively using the Baum-Welch algorithm, as follows:

The re-estimated HMM is defined by its three re-estimated parameters, which appear on the left-hand side of the re-estimation formulas. In these formulas, γ_t(j) is the probability of being in hidden state S_j at time t, ξ_t(i, j) is the probability of being in hidden state S_i at time t and in hidden state S_j at time t+1, and O is the observation sequence. After multiple iterations, a maximum likelihood estimate of the HMM is obtained. The output solution is the optimal set of hidden Markov model parameters.
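
The re-estimation formulas are not reproduced in this text. The sketch below assumes the standard Baum-Welch updates written with the γ and ξ defined above: the new initial probability of state i is γ_1(i), the new transition probability from i to j is Σ_t ξ_t(i, j) / Σ_t γ_t(i) over t = 1, ..., T-1, and the new probability of emitting symbol k from state j is the sum of γ_t(j) over the times where O_t = k, divided by Σ_t γ_t(j). Observations are integer-coded, and no scaling is applied, so the sketch is only suitable for short sequences.

```python
def forward(obs, pi, A, B):
    """alpha[t][i]: probability of the first t+1 observations, ending in state i."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[j][o] * sum(prev[i] * A[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """beta[t][i]: probability of the observations after time t, given state i at t."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, pi, A, B):
    """One re-estimation of (pi, A, B); also returns the likelihood before the update."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k)
              / sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]
    return new_pi, new_A, new_B, likelihood
```

Calling `baum_welch_step` repeatedly, feeding each result back in, never decreases the likelihood of the training sequence, but it can stall at a local optimum. That limitation is what the genetic search over parameter sets is meant to address.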

The invention also provides a computer storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor and to perform the following processing:

1) obtaining a plurality of chromosomes comprising gene elements, each gene element being represented by a real number, the plurality of encoded chromosomes forming a set of hidden Markov model parameters;

The invention further provides a CpG island identification device based on a genetic algorithm and a hidden Markov model, comprising a processor for executing instructions and a computer storage medium for storing a plurality of instructions, the instructions being adapted to be loaded by the processor and to perform the following processing:

Beneficial Effects of the Invention:

First, the invention optimizes the HMM parameters to improve the ability to identify CpG islands. Second, the HMM parameters are estimated with a genetic algorithm. Finally, the method obtains a maximum likelihood estimate of the HMM, and the resulting model is very useful for locating CpG islands (CGIs). Experiments show that the CpG island identification method combining a genetic algorithm with a hidden Markov model improves precision and recall.

Description of Drawings

Fig. 1 is a flow chart of the CpG island identification method combining the genetic algorithm and the hidden Markov model;

Fig. 2 shows the crossover operation in the genetic algorithm;

Fig. 3 shows the mutation operation in the genetic algorithm;

Fig. 4 shows the control parameters of the genetic algorithm;

Fig. 5 shows the fitness value as the number of iterations increases;

Fig. 6 shows the results of the HMM combined with the genetic algorithm;

Fig. 7 compares the CpG island identification ability of the genetic-algorithm-optimized HMM with that of the plain HMM.

Detailed Description:

The invention is further described below with reference to the drawings and embodiments:

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which this application belongs.

It should also be noted that the terminology used here is only for describing specific embodiments and is not intended to limit the exemplary embodiments of the present application. As used herein, unless the context clearly indicates otherwise, the singular form is intended to include the plural form. In addition, when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

As mentioned in the Background Art, the hidden Markov model easily falls into local optima and has high computational complexity. To improve the ability of the HMM to identify CpG islands, the invention proposes a CpG island identification method based on a genetic algorithm and a hidden Markov model, comprising the following steps, as shown in Fig. 1:

1) Model parameter initialization: obtaining a plurality of chromosomes comprising gene elements, each gene element being represented by a real number, the plurality of chromosomes forming a set of hidden Markov model parameters;

A chromosome is usually a string of elements, each of which is called a gene. Because the HMM parameters A, B, and Pi are three real-valued matrices of high dimension, binary encoding is difficult to apply. To reflect changes in the model parameters directly and faithfully, a real-valued chromosome string is used instead.
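
One way to realize the real-valued encoding is to flatten Pi, A, and B into a single list of genes and rebuild them when the fitness is evaluated. The gene order (Pi, then the rows of A, then the rows of B) and the row renormalization on decoding are assumptions of this sketch, not details given in the text.

```python
def encode(pi, A, B):
    """Flatten (pi, A, B) into one list of real-valued genes."""
    return list(pi) + [x for row in A for x in row] + [x for row in B for x in row]


def decode(chromosome, n_states, n_symbols):
    """Rebuild (pi, A, B) from a chromosome, renormalizing each probability row."""
    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    genes = list(chromosome)
    pi = normalize(genes[:n_states])
    a_end = n_states + n_states * n_states  # index where the genes of B begin
    A = [normalize(genes[n_states + i * n_states: n_states + (i + 1) * n_states])
         for i in range(n_states)]
    B = [normalize(genes[a_end + j * n_symbols: a_end + (j + 1) * n_symbols])
         for j in range(n_states)]
    return pi, A, B
```

Renormalizing on decode keeps every row a valid probability distribution even after crossover or mutation has changed individual genes.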

5) applying the output optimal hidden Markov model parameters to the test set and performing the decoding operation on a given observation sequence, to determine the maximum-probability hidden state sequence that generates the observation sequence, which indicates the positions of the CpG islands.

The genetic algorithm of the invention comprises three genetic operators: the selection operation, the crossover operation, and the mutation operation.

The selection operation picks the best individuals from the set of chromosomes; the selection mechanism imitates survival of the fittest in natural selection. Chromosomes with high fitness values are more likely to survive than chromosomes with low fitness values, and chromosomes that are not selected are deleted in the subsequent evolution. The roulette-wheel method decides whether a chromosome is passed on to the next generation: each chromosome is assigned a sector whose area is proportional to its fitness value. If the population size is N, a chromosome is $x_i$, and the fitness function is $f(x_i)$, then the probability that $x_i$ is selected is:

$$P(x_i) = \frac{f(x_i)}{\sum_{j=1}^{N} f(x_j)}$$

$q_i$ denotes the cumulative probability of chromosome $x_i$ (i = 1, 2, 3, ..., N):

$$q_i = \sum_{j=1}^{i} P(x_j)$$

The crossover operation recombines parent chromosomes. The selected parents are the chromosomes with the highest fitness values, as shown in Fig. 2. The operation therefore crosses the best parents in order to produce better offspring. The parents are selected with the roulette-wheel mechanism.

Chromosome mutation increases the variation of the model parameters. By changing the values of genes in a chromosome, it gives the genetic algorithm the ability to search globally. It lets the genetic algorithm recover information lost during the initialization stage and the parameter evolution stage, so that the algorithm can find the optimal model parameters. Before a model parameter is changed, the mutation rate is tested against a randomly generated probability: if the mutation rate is greater than or equal to the random probability, the model parameter is modified. Single-point random mutation is used, as shown in Fig. 3. Here p is the original value of the gene selected for mutation, r is a random number in [0, 1], ct is the current generation number, mt is the total number of generations, b = 2, and C is the value after mutation. The value at the mutation site is changed as follows:

Note that before the mutation operation, the gene mutation site is determined first, and the original gene at that site is then changed with a certain probability. The mutation operation changes chromosome values with a very small probability, and the resulting HMM parameters should be very close to the HMM parameters before mutation.

In this embodiment, the fitness value reflects the performance of the genetic algorithm and is used to evaluate how well a chromosome is adapted. Every individual has a fitness score, which determines whether it is selected.

In general, if the objective function is to be maximized, the fitness function is defined as follows:

If the objective function is to be minimized, the fitness function is defined as follows:

where c_min (c_max) is the associated coefficient.
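
The two general definitions are not reproduced in this text. A widely used pair of definitions that is consistent with the coefficients c_min and c_max named above is given here as an assumption, with f(x) denoting the objective value and Fit the fitness:

```latex
% Maximization problem (assumed standard form)
\mathrm{Fit}(f(x)) =
\begin{cases}
f(x) - c_{\min}, & f(x) > c_{\min} \\
0, & \text{otherwise}
\end{cases}

% Minimization problem (assumed standard form)
\mathrm{Fit}(f(x)) =
\begin{cases}
c_{\max} - f(x), & f(x) < c_{\max} \\
0, & \text{otherwise}
\end{cases}
```

In both cases the coefficient shifts the objective so that the fitness is never negative, which roulette-wheel selection requires.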

To reduce the complexity of the fitness function, this method uses the following fitness function:

The complexity of the fitness function can be adjusted by adjusting the number of CpG islands in the training data set. The method therefore effectively reduces the complexity of the fitness function.

To obtain the optimal hidden Markov model parameters, the Baum-Welch algorithm is used for the iteration, implemented as follows:

After the optimal model parameters are obtained, they are applied to the test set for verification. Given the HMM and a corresponding observation sequence, the aim is to find the hidden state sequence most likely to have generated it, which is the decoding operation. Each base state in the DNA sequence corresponds to a local probability and a local best path, and the global best path is determined by selecting, at the current moment, the state with the maximum local probability together with its local best path. Using the product of the state transition probability and the corresponding observation probability, the maximum probability is selected, and the path that attains it is the most probable hidden state sequence. The Viterbi algorithm is used for decoding in this method. The Viterbi algorithm can be defined as:

$$X_i = (X_{i1}, X_{i2}, \ldots, X_{iT})$$

The local probability at time t = 1 is the product of the initial probability of the hidden state and the corresponding observation probability, $\delta_1(i) = \pi_i \, b_i(k_1)$. For every later time, with back pointer $\varphi_t(i)$:

$$\delta_t(i) = \max_j \left[ \delta_{t-1}(j) \, a_{ji} \right] b_i(k_t), \qquad \varphi_t(i) = \arg\max_j \left[ \delta_{t-1}(j) \, a_{ji} \right]$$

These formulas determine the most probable path into the next state. To determine the most probable hidden state at time t = T, let:

$$i_T = \arg\max_i \delta_T(i)$$

For every earlier time t:

$$i_t = \varphi_{t+1}(i_{t+1})$$

Backtracking follows the most probable path; once it is complete, the "+" states on the path correspond to a CpG island. The final CpG island identification results are shown in Fig. 6 and Fig. 7.

The Viterbi algorithm reduces the computational complexity through recursion and gives the best explanation of the observation sequence in its full context. Suppose the test sequence is T = ATTAGCGAT and the optimal path found by the Viterbi algorithm is the state sequence {A-, T-, T+, A+, G+, C+, G+, A-, T-}; TAGCG can then be identified as a CpG island. This shows that the HMM is already of great value in the field of CpG island identification.
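
Reading islands off a decoded path amounts to collecting the maximal runs of "+" states. A minimal sketch, with an illustrative function name and half-open (start, end) coordinates chosen for this example:

```python
def islands_from_path(sequence, path):
    """Return (start, end, subsequence) for each maximal run of '+' states; end is exclusive."""
    islands, start = [], None
    for i, state in enumerate(path):
        inside = state.endswith("+")
        if inside and start is None:
            start = i  # a new island begins here
        elif not inside and start is not None:
            islands.append((start, i, sequence[start:i]))
            start = None
    if start is not None:  # the path ends inside an island
        islands.append((start, len(path), sequence[start:]))
    return islands
```

Applied to the example above, the path {A-, T-, T+, A+, G+, C+, G+, A-, T-} over ATTAGCGAT yields a single island covering positions 2 to 6 (zero-based), which is the subsequence TAGCG.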

The invention also provides a computer storage medium storing a plurality of instructions, the instructions being loaded by a processor to perform the following processing:

Further, the invention also provides a CpG island identification device based on a genetic algorithm and a hidden Markov model, comprising a processor for executing instructions and a computer storage medium for storing a plurality of instructions, the instructions being loaded by the processor to perform the following processing:

The above are only preferred embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application may be modified and varied in many ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Claims

1. A CpG island identification method based on a genetic algorithm and a hidden Markov model, characterized by comprising the following steps:

1) obtaining a plurality of chromosomes comprising gene elements, each gene element being represented by a real number, the plurality of chromosomes forming a set of hidden Markov model parameters;

2) determining a fitness value for each chromosome using a fitness function, the fitness value representing the quality of the chromosome;

3) using a genetic algorithm to perform an optimization process on the chromosomes according to the fitness values, and then re-determining the fitness values of the optimized chromosomes;

4) applying step 3) iteratively and, once the set termination condition is met, outputting the optimal hidden Markov model parameters;

5) using the output optimal hidden Markov model parameters, determining, for a given observation sequence, the maximum-probability hidden state sequence that generates the observation sequence, which indicates the positions of the CpG islands.

2. The method according to claim 1, wherein a Viterbi algorithm is used to determine the maximum probability hidden state sequence generating the observation sequence.

3. The method according to claim 2, wherein using the Viterbi algorithm to determine the maximum-probability hidden state sequence that generates the observation sequence comprises:

based on the local probability and the local best path corresponding to each base state in the observation sequence, selecting the maximum local probability at the current moment and its corresponding local best path by means of the product of the initial probability of the hidden state and the corresponding observation probability, and backtracking along the local best path at the current moment to obtain the position identification result for the CpG islands.

4. The method according to claim 1, wherein the genetic algorithm comprises a selection operation, a crossover operation and a mutation operation, and the chromosome is optimized by sequentially adopting the selection operation, the crossover operation and the mutation operation.

5. The method according to claim 4, wherein the selecting operation comprises: according to the fitness value of each chromosome, selecting a chromosome whose fitness value meets the genetic requirement for inheritance, and deleting unselected chromosomes.

6. The method according to claim 5, wherein the crossover operation comprises: among the chromosomes whose fitness values meet the inheritance requirement, selecting some chromosomes with better fitness values as parents, and performing a crossover operation between two parent chromosomes to generate offspring chromosomes.

7. The method according to claim 6, wherein the mutation operation comprises: in the offspring chromosomes, first determining the gene mutation site, and then changing the gene value at the gene mutation site according to the set mutation rate.

8. The method according to claim 1, wherein the fitness function is:

By adjusting the number of CpG islands in the training data set, the complexity of the fitness function can be adjusted.

9. A computer storage medium storing a plurality of instructions, wherein the instructions are adapted to be loaded by a processor and perform the following processing:

1) Obtain a plurality of chromosomes including gene elements, each gene element is represented by a real number, and a plurality of coding chromosomes constitute a set of Hidden Markov Model parameters;

10. A CpG island identification device based on a genetic algorithm and a hidden Markov model, comprising a processor for executing instructions and a computer storage medium for storing a plurality of instructions, characterized in that the instructions are adapted to be loaded by the processor and to perform the following processing:

4) Step 3) is applied iteratively, and when the set termination condition is satisfied, the optimal hidden Markov model parameters are output;