CN106446604A

CN106446604A - Ab initio protein structure prediction method based on the firefly algorithm

Info

Publication number: CN106446604A
Application number: CN201610908691.4A
Authority: CN
Inventors: 张贵军; 郝小虎; 周晓根; 王柳静; 李章维
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2016-10-19
Filing date: 2016-10-19
Publication date: 2017-02-22

Abstract

The invention discloses an ab initio protein structure prediction method based on the firefly algorithm. Within the framework of the basic firefly algorithm, a coarse-grained energy model is adopted to effectively reduce the dimension of the conformational space; the population-based nature of the firefly algorithm is used to maintain the diversity of protein conformations; fragment assembly is used to initialize the conformation population; following the coarse-grained representation model of protein conformations, a set of dihedral angles represents the position of a conformation in space; energy ranking determines the brightest individual; the position of each conformation is updated by calculating the attractiveness between individuals; and a near-native conformation with the lowest energy is finally obtained by searching the conformational space. Applied to protein structure prediction, the method yields conformations with higher prediction accuracy and lower complexity.

Description

An Ab Initio Method for Protein Structure Prediction Based on the Firefly Algorithm

Technical Field

The invention relates to the fields of bioinformatics and computer applications, and in particular to an ab initio protein structure prediction method based on the firefly algorithm.

Background

Bioinformatics is an active research area at the intersection of life science and computer science. Its results are already widely used in gene discovery and prediction, storage and management of genetic data, data retrieval and mining, gene expression data analysis, protein structure prediction, prediction of gene and protein homology, and sequence analysis and alignment. The genome specifies all the proteins that make up an organism, and genes specify the amino acid sequences of those proteins. Although a protein consists of a linear sequence of amino acids, it acquires its activity and biological function only after folding into a specific spatial structure. Knowing the spatial structure of a protein helps in understanding not only what the protein does but also how it does it, so determining protein structure is very important. Protein sequence databases are currently growing very quickly, yet comparatively few proteins have known structures. Although structure determination techniques have made notable progress, determining a protein structure experimentally is still complicated and expensive. Consequently, far fewer protein structures have been determined experimentally than protein sequences are known.

On the other hand, as DNA sequencing technology develops, the human genome and the genomes of more model organisms have been or will be completely sequenced, and the number of DNA sequences will increase sharply. Thanks to advances in DNA sequence analysis and gene identification methods, large numbers of protein sequences can be deduced from DNA. The gap between the number of proteins with known sequences and the number with determined structures (for example, the entries in the Protein Data Bank, PDB) will therefore keep widening. It is hoped that the rate at which protein structures are produced can keep up with the rate at which protein sequences are produced, or at least that the gap between the two can be narrowed.

The main technical bottlenecks at present lie in two aspects. The first is the sampling method: existing techniques have limited ability to sample the conformational space. The second is the conformation update method: the accuracy with which existing techniques update conformations is still insufficient. Existing conformational space search methods are therefore deficient and need to be improved.

Summary of the Invention

To overcome the low sampling efficiency, high complexity, and low prediction accuracy of existing conformational space optimization methods for protein structure prediction, the present invention proposes an ab initio protein structure prediction method based on the firefly algorithm. Within the framework of the basic firefly algorithm, a coarse-grained energy model is used to effectively reduce the dimension of the conformational space, the population-based nature of the firefly algorithm is used to maintain the diversity of protein conformations, and fragment assembly is used to initialize the conformation population. Following the coarse-grained representation model of protein conformations, a set of dihedral angles represents the position of a conformation in space, energy ranking determines the brightest individual, and the position of each conformation is updated by calculating the attractiveness between individuals. The search of the conformational space finally yields a near-native conformation with the minimum energy.

The technical solution adopted by the present invention to solve its technical problem is:

An ab initio protein structure prediction method based on the firefly algorithm, the method comprising the following steps:

1) Input sequence information is given;

2) Parameter initialization: set the population size popSize, the number of iterations generation, the light-intensity attraction factor γ, and the position-update step factor α;

3) Population conformation initialization: from the given input sequence, randomly generate popSize individuals, perform length fragment assemblies on each individual in the population, and calculate its fluorescence brightness Io, where length is the sequence length, Io = -E, and E is the protein conformation energy calculated with the Rosetta score3 energy function;

4) Sort the fluorescence brightness values calculated in step 3) in descending order, and let the individual with the greatest fluorescence brightness be p_g;

5) Start iterating:

5.1) For each individual in the population, calculate the attractiveness β that p_g exerts on it;

5.2) Update the position of each individual in space according to x_i(t+1) = x_i(t) + β(x_j(t) - x_i(t)) + α(rand - 0.5), where x_i(t+1) and x_i(t) denote the updated and current positions of individual p_i, x_j(t) denotes the current position of individual p_g, the attractiveness β depends on the maximum attractiveness factor β₀ and on r_ij, the distance between individuals p_i and p_g, and rand is a random number between 0 and 1; the position x_i(t) of each individual in the population is expressed by the dihedral angles ψ of the amino acid residues of the input sequence, and L is the fragment length;

5.3) Perform L random fragment assemblies on each individual in the population to complete the random movement of the population;

5.4) Recalculate the fluorescence brightness of each individual and update p_g;

6) Determine whether the maximum number of iterations generation has been reached;

6.1) If the current iteration count is less than generation, return to step 5.1);

6.2) If the current iteration count equals generation, stop.
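As a rough illustration of steps 2) to 6), the following Python sketch runs the update loop on plain lists of angles. It is not the patented implementation: `toy_energy` is a hypothetical stand-in for the Rosetta score3 energy, the initial population is drawn uniformly at random instead of by fragment assembly, the random fragment assembly of step 5.3) is left out, `rand` is drawn once per angle, and the attractiveness is assumed to take the standard firefly form β = β₀·exp(-γ·r²), which the text above does not state explicitly.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical stand-in for the Rosetta score3 energy; not the real function."""
    return sum((a - 1.0) ** 2 for a in angles)


def attractiveness(r, beta0, gamma):
    """Assumed standard firefly form: beta = beta0 * exp(-gamma * r^2)."""
    return beta0 * math.exp(-gamma * r * r)


def move(x, leader, beta, alpha, rng):
    """Step 5.2: x_i(t+1) = x_i(t) + beta*(x_j(t) - x_i(t)) + alpha*(rand - 0.5)."""
    return [xi + beta * (gi - xi) + alpha * (rng.random() - 0.5)
            for xi, gi in zip(x, leader)]


def firefly_search(n_angles, pop_size, generations, gamma, alpha,
                   beta0=1.0, seed=0):
    rng = random.Random(seed)
    # Step 3 (simplified): uniform random angles in radians replace fragment assembly.
    pop = [[rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
           for _ in range(pop_size)]
    # Step 4: brightness Io = -E, so the brightest individual has the lowest energy.
    leader = min(pop, key=toy_energy)
    for _ in range(generations):
        new_pop = []
        for x in pop:
            r = math.dist(x, leader)                # Step 5.1: distance to p_g
            beta = attractiveness(r, beta0, gamma)  # Step 5.1: attractiveness
            new_pop.append(move(x, leader, beta, alpha, rng))  # Step 5.2
        pop = new_pop
        # Step 5.3 (random fragment assembly) is omitted in this sketch.
        leader = min(pop, key=toy_energy)           # Step 5.4: update p_g
    return leader, toy_energy(leader)
```

Calling `firefly_search(6, 10, 20, gamma=0.5, alpha=0.7)` returns the final p_g and its toy energy. With β = 1 and α = 0 an individual lands exactly on p_g, which is a convenient sanity check for the update rule.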

The technical idea of the present invention is as follows: within the framework of the basic firefly algorithm, a coarse-grained energy model is used to effectively reduce the dimension of the conformational space, the population-based nature of the firefly algorithm is used to maintain the diversity of protein conformations, and fragment assembly is used to initialize the conformation population. Following the coarse-grained representation model of protein conformations, a set of dihedral angles represents the position of a conformation in space, energy ranking determines the brightest individual, and the position of each conformation is updated by calculating the attractiveness between individuals. The search of the conformational space finally yields a near-native conformation with the minimum energy.
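The energy ranking mentioned above can be sketched in a few lines of Python. The brightness definition Io = -E comes from step 3); the function name `rank_by_brightness` and the energy values are made up for illustration only.

```python
def rank_by_brightness(energies):
    """Return indices ordered from brightest to dimmest, using Io = -E."""
    brightness = [-e for e in energies]
    return sorted(range(len(energies)), key=lambda i: brightness[i], reverse=True)


# Made-up energies for four conformations (lower energy means brighter).
energies = [-12.5, 3.0, -40.2, -7.1]
order = rank_by_brightness(energies)
p_g = order[0]  # index of the brightest individual
print(order, p_g)  # prints [2, 0, 3, 1] 2
```

Sorting by brightness in descending order is equivalent to sorting by energy in ascending order, so the brightest individual p_g is simply the lowest-energy conformation in the population.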

The beneficial effect of the present invention is that, when applied to protein structure prediction, it yields conformations with higher prediction accuracy and lower complexity.

Brief Description of the Drawings

Figure 1 is a three-dimensional schematic diagram of the predicted conformation of protein 2L0G that is closest to the experimentally determined structure.

Detailed Description

The present invention is further described below with reference to the accompanying drawing.

Referring to Figure 1, an ab initio protein structure prediction method based on the firefly algorithm comprises the following steps:

1) Input sequence information is given;

5) Start iterating:

This embodiment takes the test protein with PDB ID 2L0G as an example. An ab initio protein structure prediction method based on the firefly algorithm comprises the following steps:

1) Input sequence information is given;

2) Parameter initialization: set the population size popSize = 50, the number of iterations generation = 10000, the light-intensity attraction factor γ = 0.5, and the position-update step factor α = 0.7;

5) Start iterating:

5.2) Update the position of each individual in space according to x_i(t+1) = x_i(t) + β(x_j(t) - x_i(t)) + α(rand - 0.5), where x_i(t+1) and x_i(t) denote the updated and current positions of individual p_i, x_j(t) denotes the current position of individual p_g, the attractiveness β depends on the maximum attractiveness factor β₀ and on r_ij, the distance between individuals p_i and p_g, and rand is a random number between 0 and 1; the position x_i(t) of each individual in the population is expressed by the dihedral angles ψ of the amino acid residues of the input sequence, and L = 3 is the fragment length;
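To get a feel for the parameter values of this embodiment, the sketch below collects them in a small configuration object and evaluates two quantities that follow from them. The class and function names are hypothetical, β₀ = 1.0 is an assumption (the embodiment does not give its value), and the attractiveness is again assumed to have the standard firefly form β = β₀·exp(-γ·r²).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FireflyParams:
    pop_size: int = 50        # popSize
    generations: int = 10000  # generation
    gamma: float = 0.5        # light-intensity attraction factor
    alpha: float = 0.7        # position-update step factor
    fragment_length: int = 3  # L
    beta0: float = 1.0        # assumption: not specified in the embodiment


def attractiveness(params, r):
    """Assumed standard form: beta = beta0 * exp(-gamma * r^2)."""
    return params.beta0 * math.exp(-params.gamma * r * r)


def noise_bounds(params):
    """Range of the random term alpha*(rand - 0.5) for rand in [0, 1)."""
    return (-0.5 * params.alpha, 0.5 * params.alpha)


params = FireflyParams()
for r in (0.0, 1.0, 2.0):
    print(r, round(attractiveness(params, r), 4))
# prints:
# 0.0 1.0
# 1.0 0.6065
# 2.0 0.1353
```

Under these assumptions the pull toward p_g falls off quickly with distance, while the random term stays within about ±0.35 per coordinate for α = 0.7.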

The above describes the good results shown by one embodiment of the present invention. The invention is clearly not limited to this embodiment; various changes may be made in implementing it without departing from its basic spirit or going beyond its essential content.

Claims

1. An ab initio protein structure prediction method based on the firefly algorithm, characterized in that the method comprises the following steps:

1) Input sequence information is given;

2) Parameter initialization: set the population size popSize, the number of iterations generation, the light-intensity attraction factor γ, and the position-update step factor α;

3) Population conformation initialization: from the given input sequence, randomly generate popSize individuals, perform length fragment assemblies on each individual in the population, and calculate its fluorescence brightness Io, where length is the sequence length, Io = -E, and E is the protein conformation energy calculated with the Rosetta score3 energy function;

4) Sort the fluorescence brightness values calculated in step 3) in descending order, and let the individual with the greatest fluorescence brightness be p_g;

5) Start iterating:

5.1) For each individual in the population, calculate the attractiveness β that p_g exerts on it;

5.2) Update the position of each individual in space according to x_i(t+1) = x_i(t) + β(x_j(t) - x_i(t)) + α(rand - 0.5), where x_i(t+1) and x_i(t) denote the updated and current positions of individual p_i, x_j(t) denotes the current position of individual p_g, the attractiveness β depends on the maximum attractiveness factor β₀ and on r_ij, the distance between individuals p_i and p_g, and rand is a random number between 0 and 1; the position x_i(t) of each individual in the population is expressed by the dihedral angles ψ of the amino acid residues of the input sequence, and L is the fragment length;

5.3) Perform L random fragment assemblies on each individual in the population to complete the random movement of the population;

5.4) Recalculate the fluorescence brightness of each individual and update p_g;

6) Determine whether the maximum number of iterations generation has been reached;

6.1) If the current iteration count is less than generation, return to step 5.1);

6.2) If the current iteration count equals generation, stop.