CN107609342A - A protein conformation search method based on secondary structure spatial distance constraints

Info

Publication number: CN107609342A
Application number: CN201710683896.1A
Authority: CN
Inventors: 张贵军; 王小奇; 马来发; 周晓根; 谢腾宇; 王柳静; 孙科
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2017-08-11
Filing date: 2017-08-11
Publication date: 2018-01-19
Anticipated expiration: 2037-08-11
Also published as: CN107609342B

Abstract

A protein conformation search method based on secondary structure spatial distance constraints. Within the basic framework of a genetic algorithm, the spatial length of each secondary structure in the target protein and the spatial distance between the central residues of each pair of adjacent secondary structures form a feature vector that serves as a spatial constraint. For a given energy function, the search is thereby confined to a smaller conformational space. The spatial distance information is also added to the selection operator to compensate for the inaccuracy of the energy function, which effectively improves the accuracy of structural modeling. The invention provides a protein conformation search method based on secondary structure spatial distance constraints with high sampling efficiency, high prediction accuracy, and low computational cost.

Description

A Protein Conformation Search Method Based on Space Distance Constraints of Secondary Structure

Technical field

The invention relates to the fields of bioinformatics, artificial intelligence optimization, and computer applications, and in particular to a protein conformation search method based on secondary structure spatial distance constraints.

Background

Proteins are biological macromolecules formed by the dehydration condensation of amino acids, and they play a decisive role in human health. An accurate understanding of protein structure and function is of great significance for disease research, biopharmaceuticals, and related fields. There are currently two main approaches to obtaining protein structures: experimental determination and theoretical prediction. Experimental methods include X-ray crystallography, nuclear magnetic resonance spectroscopy, and electron microscopy. Although these methods can accurately determine the three-dimensional structures of some proteins, they are time-consuming and expensive, and the structures of some proteins cannot be obtained experimentally at all. Computational prediction of protein structure has therefore become a focus of bioinformatics research. Theoretical prediction methods mainly use computer technology and intelligent optimization algorithms to predict the three-dimensional structure of a protein from its primary amino acid sequence, which reduces both the cost and the time of prediction, so these methods can be applied more widely than experimental methods. However, because of the complexity of protein structure itself, predicting the three-dimensional structure of proteins remains an unsolved problem.

Among ab initio protein structure prediction methods, evolutionary algorithms such as genetic algorithms and differential evolution are important tools for optimizing protein conformations; they offer fast convergence, a simple structure, and strong robustness. However, when the protein sequence is long, the conformational space is very large, and because the energy function is inaccurate, a search guided by a given energy function cannot guarantee that the lowest-energy conformation found is the one closest to the native structure, so the correct fold often cannot be formed.

Existing conformational space search methods are therefore deficient in prediction accuracy and sampling efficiency and need to be improved.

Summary of the invention

To overcome the low sampling efficiency and low prediction accuracy of existing conformational space search methods for protein structure prediction, the present invention proposes a protein conformation search method based on secondary structure spatial distance constraints with higher sampling efficiency and higher prediction accuracy.

The technical solution adopted by the present invention to solve this technical problem is as follows:

A protein conformation search method based on secondary structure spatial distance constraints, the method comprising the following steps:

1) Provide the input sequence information;

2) Parameter initialization: set the population size NP, the maximum number of generations G_max, the crossover probability P_c, the number of iterations iteration used to generate the initial population, the crossover fragment length frag_length, the assembly counter reject_number, the maximum number of assembly attempts reject_max, the feature vector D＝{d₁,…,d_m,d_1,2,…,d_k,k+1} built from prior knowledge of the spatial length of each secondary structure and the spatial distance between the central residues of adjacent secondary structures, where d_m is the length of the m-th secondary structure block of the target protein and d_k,k+1 is the spatial distance between the central residues of the k-th and (k+1)-th secondary structure blocks, the maximum distance constraint range δ, and the selection probability P_s;

3) Population initialization: start NP Monte Carlo trajectories and run each trajectory for iteration search steps, generating NP initial individuals;

4) For each target individual x_i and a randomly selected individual x_j, where i, j∈{1,…,NP} and j≠i, perform the following operations:

4.1) With probability P_c, perform crossover on individuals x_i and x_j as follows:

4.1.1) Randomly select the crossover start point begin_position within the allowed range [1, total_residue-frag_length] and compute the crossover end point end_position＝begin_position+frag_length, where total_residue is the total number of residues;

4.1.2) At each crossover site position∈[begin_position,end_position], exchange the torsion angles of the two individuals to generate the new individuals x′_i and x′_j, referred to as the crossover individuals;
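The crossover in steps 4.1.1) and 4.1.2) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the representation of an individual as a list of per-residue (phi, psi, omega) tuples and the function names `swap_window` and `crossover` are assumptions made for the example. Note that the closed interval [begin_position, end_position] covers frag_length + 1 residues when read literally, and the sketch follows that literal reading.

```python
import random
from typing import List, Tuple

# Assumed encoding: one (phi, psi, omega) torsion triple per residue.
Torsion = Tuple[float, float, float]


def swap_window(x_i: List[Torsion], x_j: List[Torsion],
                begin_position: int, end_position: int):
    """Exchange torsion angles at 1-based positions begin..end (inclusive)."""
    child_i, child_j = list(x_i), list(x_j)
    for position in range(begin_position, end_position + 1):
        idx = position - 1  # convert the 1-based residue position to a list index
        child_i[idx], child_j[idx] = x_j[idx], x_i[idx]
    return child_i, child_j


def crossover(x_i: List[Torsion], x_j: List[Torsion], frag_length: int,
              p_c: float, rng: random.Random):
    """With probability p_c, swap one randomly placed window between two parents."""
    total_residue = len(x_i)
    if len(x_j) != total_residue:
        raise ValueError("both individuals must have the same number of residues")
    if not 0 < frag_length < total_residue:
        raise ValueError("frag_length must be in (0, total_residue)")
    if rng.random() >= p_c:
        return list(x_i), list(x_j)  # no crossover this time: return copies
    # Step 4.1.1: start point in [1, total_residue - frag_length], both ends inclusive.
    begin_position = rng.randint(1, total_residue - frag_length)
    end_position = begin_position + frag_length
    # Step 4.1.2: exchange torsion angles at every site in the window.
    return swap_window(x_i, x_j, begin_position, end_position)
```

Returning copies rather than mutating the parents keeps the target individual x_i intact for the later selection step.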

4.2) Perform mutation on the crossover individuals x′_i and x′_j as follows:

4.2.1) Use fragment assembly to perform a conformational search on the crossover individual x′_i. After fragment assembly, compute the length of each secondary structure and the spatial distance between the central residues of each pair of adjacent secondary structures in x′_i, and arrange them into a distance vector with the same layout as D, whose entries are the length of the m-th secondary structure block in x′_i and the spatial distance between the central residues of the k-th and (k+1)-th secondary structure blocks;

4.2.2) Compute the Manhattan distance, that is, the sum of the absolute differences between corresponding components, between the distance vector of individual x′_i and the prior-knowledge feature vector D＝{d₁,…,d_m,d_1,2,…,d_k,k+1}, and denote it similarity_mutation_1. If similarity_mutation_1≤δ, the individual x″_i generated by mutation satisfies the secondary structure spatial distance constraint, so go to step 4.2.4); otherwise go to step 4.2.3);

4.2.3) The counter reject_number starts counting. As long as reject_number≤reject_max, repeat steps 4.2.1) and 4.2.2) to generate a new individual x″_i, stopping as soon as similarity_mutation_1≤δ is satisfied; otherwise, perform only step 4.2.1) to generate the new individual x″_i;

4.2.4) In the same way as steps 4.2.1) and 4.2.2), perform fragment assembly on individual x′_j and compute the corresponding Manhattan distance value similarity_mutation_2, finally obtaining the new individual x″_j;

4.2.5) Compute the Manhattan distance between the distance vector of the target individual x_i and the prior-knowledge feature vector D＝{d₁,…,d_m,d_1,2,…,d_k,k+1};
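The distance-constrained mutation of steps 4.2.1) to 4.2.3) might look like the sketch below. Several points are assumptions made for illustration: the "spatial length" of a block is taken as the distance between the Cα atoms of its first and last residues, the central residue is taken as the middle index of the block, and `assemble` and `features` are hypothetical callables standing in for the fragment assembly move and the feature extraction. The retry loop is one reading of how reject_number and reject_max interact.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

Point = Tuple[float, float, float]  # assumed: C-alpha coordinates of one residue
Block = Tuple[int, int]             # assumed: inclusive 0-based residue range of one block
X = TypeVar("X")


def feature_vector(ca: Sequence[Point], blocks: Sequence[Block]) -> List[float]:
    """Block lengths followed by distances between central residues of adjacent blocks."""
    lengths = [math.dist(ca[start], ca[end]) for start, end in blocks]
    centers = [ca[(start + end) // 2] for start, end in blocks]
    gaps = [math.dist(centers[k], centers[k + 1]) for k in range(len(blocks) - 1)]
    return lengths + gaps


def manhattan(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of absolute component-wise differences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def constrained_mutation(x_cross: X,
                         assemble: Callable[[X], X],
                         features: Callable[[X], Sequence[float]],
                         prior: Sequence[float],
                         delta: float,
                         reject_max: int) -> Tuple[X, float]:
    """Retry fragment assembly until the Manhattan distance to `prior` is <= delta."""
    reject_number = 0
    while reject_number <= reject_max:
        candidate = assemble(x_cross)
        similarity = manhattan(features(candidate), prior)
        if similarity <= delta:
            return candidate, similarity  # constraint satisfied
        reject_number += 1                # count the rejected attempt
    # Budget exhausted: assemble once more and accept the result as it is.
    candidate = assemble(x_cross)
    return candidate, manhattan(features(candidate), prior)
```

The function returns the Manhattan distance together with the individual because the selection step needs both values.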

5) Select among the target individual x_i and the mutated individuals x″_i and x″_j according to their energy and distance similarity, choose the dominant individual, and update the population as follows:

5.1) Using the Rosetta Score3 energy function E, compute the energies E(x_i), E(x″_i), and E(x″_j) of the target individual x_i and the mutated individuals x″_i and x″_j;

5.2) Among the target individual x_i and the mutated individuals x″_i and x″_j, if some individual X∈{x_i,x″_i,x″_j} has a lower energy than the other two individuals and also a smaller Manhattan distance than the other two, that individual is the dominant individual. If some individual X′∈{x_i,x″_i,x″_j} has only a lower energy than the other two individuals, it is set as the dominant individual with selection probability P_s. Likewise, if some individual X″∈{x_i,x″_i,x″_j} has only a smaller Manhattan distance than the other two individuals, it is set as the dominant individual with selection probability P_s. Finally, the dominant individual replaces the target individual and the population is updated;
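The selection rule in step 5.2) can be sketched as below. The text does not say what happens when neither probabilistic draw succeeds, nor in which order the energy and distance criteria are tried, so this sketch assumes that energy is tried first and that the target individual is kept when no dominant individual emerges. The names `strict_argmin` and `select_dominant` are hypothetical.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def strict_argmin(values: Sequence[float]) -> Optional[int]:
    """Index of the unique smallest value, or None if the minimum is tied."""
    best = min(values)
    winners = [k for k, v in enumerate(values) if v == best]
    return winners[0] if len(winners) == 1 else None


def select_dominant(candidates: Sequence[T],
                    energies: Sequence[float],
                    distances: Sequence[float],
                    p_s: float,
                    rng: random.Random) -> T:
    """candidates[0] is the target x_i; the others are the mutated individuals."""
    e_best = strict_argmin(energies)
    d_best = strict_argmin(distances)
    # Best on both energy and Manhattan distance: always dominant.
    if e_best is not None and e_best == d_best:
        return candidates[e_best]
    # Best on energy only: dominant with probability p_s.
    if e_best is not None and rng.random() < p_s:
        return candidates[e_best]
    # Best on distance only: dominant with probability p_s.
    if d_best is not None and rng.random() < p_s:
        return candidates[d_best]
    # Assumption: otherwise the target individual keeps its place.
    return candidates[0]
```

For example, with energies (3, 1, 2) and distances (9, 4, 7), the second candidate wins on both criteria and is returned regardless of p_s.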

6) Check whether the maximum number of generations G_max has been reached; if this termination condition is met, output the result, otherwise return to step 4).
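The outer loop formed by steps 4) and 6) can be sketched as a small driver. This is only an outline under stated assumptions: `step` is a hypothetical callable that bundles crossover, mutation, and selection for one pair and returns the individual that should occupy slot i, and the population is updated in place as soon as each target has been processed, which the text does not specify.

```python
import random
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")


def evolve(population: Sequence[X],
           g_max: int,
           step: Callable[[X, X], X],
           rng: random.Random) -> List[X]:
    """Run g_max generations; each target x_i is paired with a random x_j, j != i."""
    pop = list(population)
    if len(pop) < 2:
        raise ValueError("the population needs at least two individuals")
    for _generation in range(g_max):               # step 6: stop after g_max generations
        for i in range(len(pop)):                  # step 4: every individual is a target once
            j = rng.choice([k for k in range(len(pop)) if k != i])
            pop[i] = step(pop[i], pop[j])          # dominant individual replaces the target
    return pop
```

Passing `step` in as a parameter keeps the driver independent of how crossover, fragment assembly, and scoring are implemented.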

The technical idea of the present invention is as follows: within the basic framework of a genetic algorithm, the spatial length of each secondary structure in the target protein and the spatial distance between the central residues of each pair of adjacent secondary structures form a feature vector that serves as a spatial constraint, so that, for a given energy function, the search is carried out in a smaller conformational space. In addition, spatial distance information is added to the selection operator to compensate for the inaccuracy of the energy function, which effectively improves the accuracy of structural modeling.

The beneficial effects of the present invention are as follows. On the one hand, using the feature vector built from the spatial lengths of the secondary structures and the spatial distances between the central residues of adjacent secondary structures as a spatial constraint reduces the conformational search space and also reduces the error caused by the inaccurate energy function, which greatly improves prediction accuracy. On the other hand, within the genetic algorithm framework, the exchange of information between individuals and the mutation and selection operations on parent individuals speed up convergence and increase population diversity.

Description of the drawings

Figure 1 is the basic flowchart of the protein conformation search method based on secondary structure spatial distance constraints.

Figure 2 is a schematic diagram of the conformation updates when the protein conformation search method based on secondary structure spatial distance constraints is used to predict the structure of protein 1AIL.

Figure 3 is the three-dimensional structure of protein 1AIL predicted by the protein conformation search method based on secondary structure spatial distance constraints.

Detailed description

The present invention is further described below with reference to the accompanying drawings.

Referring to Figures 1 to 3, a protein conformation search method based on secondary structure spatial distance constraints comprises the following steps:

1) Provide the input sequence information;

This embodiment takes the α-fold protein 1AIL, with a sequence length of 73, as the example. The protein conformation search method based on secondary structure spatial distance constraints comprises the following steps:

1) Provide the input sequence information;

2) Parameter initialization: set the population size NP＝200, the maximum number of generations G_max＝2000, the crossover probability P_c＝0.1, the number of iterations for the initial population iteration＝2000, the crossover fragment length frag_length＝9, the assembly counter reject_number＝0, the maximum number of assembly attempts reject_max＝100, the feature vector built from prior knowledge of the spatial lengths of the secondary structures and the spatial distances between the central residues of adjacent secondary structures D＝{3.81085,33.8066,8.38603,30.3193,6.69076,22.1852,19.6409,17.2739,15.4455,14.6372,15.5907,12.43}, the maximum distance constraint range δ＝15, and the selection probability P_s＝0.3;
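For reference, the parameter values of this embodiment can be collected in one immutable configuration object. The sketch below is only a convenience: the class name `SearchParams` and the lowercase field names are hypothetical, while the numeric values are copied from step 2) above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SearchParams:
    """Settings from the 1AIL embodiment; field names are illustrative only."""
    np_size: int = 200          # NP: population size
    g_max: int = 2000           # G_max: maximum number of generations
    p_c: float = 0.1            # P_c: crossover probability
    iteration: int = 2000       # Monte Carlo steps per initial trajectory
    frag_length: int = 9        # crossover fragment length
    reject_number: int = 0      # initial value of the assembly counter
    reject_max: int = 100       # maximum number of assembly attempts
    delta: float = 15.0         # δ: maximum distance constraint range
    p_s: float = 0.3            # P_s: selection probability
    # D: prior-knowledge feature vector, exactly as listed in the embodiment
    d: Tuple[float, ...] = (
        3.81085, 33.8066, 8.38603, 30.3193, 6.69076, 22.1852,
        19.6409, 17.2739, 15.4455, 14.6372, 15.5907, 12.43,
    )


params = SearchParams()
```

Freezing the dataclass prevents the settings from being changed by accident during a long run; a different target protein would be configured by constructing a new instance with its own `d` vector.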

4.2) Perform mutation on the crossover individuals x′_i and x′_j as follows:

Taking the α-fold protein 1AIL, with a sequence length of 73, as the example, the above method yielded a near-native conformation of the protein; the minimum root-mean-square deviation is … and the average root-mean-square deviation is …, and the predicted structure is shown in Figure 3.

The above describes the optimization results obtained by the present invention with the 1AIL protein as an example and does not limit the scope of the invention. Various modifications and improvements made without departing from the scope of the basic content of the invention shall not be excluded from its scope of protection.

Claims

1. A protein conformation search method based on secondary structure spatial distance constraints, characterized in that the conformational space search method comprises the following steps:

1) Provide the input sequence information;

2) Parameter initialization: set the population size NP, the maximum number of generations G_max, the crossover probability P_c, the number of iterations iteration used to generate the initial population, the crossover fragment length frag_length, the assembly counter reject_number, the maximum number of assembly attempts reject_max, the feature vector D＝{d₁,…,d_m,d_1,2,…,d_k,k+1} built from prior knowledge of the spatial length of each secondary structure and the spatial distance between the central residues of adjacent secondary structures, where d_m is the length of the m-th secondary structure block of the target protein and d_k,k+1 is the spatial distance between the central residues of the k-th and (k+1)-th secondary structure blocks, the maximum distance constraint range δ, and the selection probability P_s;

3) Population initialization: start NP Monte Carlo trajectories and run each trajectory for iteration search steps, generating NP initial individuals;

4) For each target individual x_i and a randomly selected individual x_j, where i, j∈{1,…,NP} and j≠i, perform the following operations:

4.1) With probability P_c, perform crossover on individuals x_i and x_j as follows:

4.1.1) Randomly select the crossover start point begin_position within the allowed range [1, total_residue-frag_length] and compute the crossover end point end_position＝begin_position+frag_length, where total_residue is the total number of residues;

4.1.2) At each crossover site position∈[begin_position,end_position], exchange the torsion angles of the two individuals to generate the new individuals x′_i and x′_j, referred to as the crossover individuals;

4.2) Perform mutation on the crossover individuals x′_i and x′_j as follows:

4.2.1) Use fragment assembly to perform a conformational search on the crossover individual x′_i. After fragment assembly, compute the length of each secondary structure and the spatial distance between the central residues of each pair of adjacent secondary structures in x′_i, and arrange them into a distance vector with the same layout as D, whose entries are the length of the m-th secondary structure block in x′_i and the spatial distance between the central residues of the k-th and (k+1)-th secondary structure blocks;

4.2.2) Compute the Manhattan distance, that is, the sum of the absolute differences between corresponding components, between the distance vector of individual x′_i and the prior-knowledge feature vector D＝{d₁,…,d_m,d_1,2,…,d_k,k+1}, and denote it similarity_mutation_1. If similarity_mutation_1≤δ, the individual x″_i generated by mutation satisfies the secondary structure spatial distance constraint, so go to step 4.2.4); otherwise go to step 4.2.3);

4.2.3) The counter reject_number starts counting. As long as reject_number≤reject_max, repeat steps 4.2.1) and 4.2.2) to generate a new individual x″_i, stopping as soon as similarity_mutation_1≤δ is satisfied; otherwise, perform only step 4.2.1) to generate the new individual x″_i;

4.2.4) In the same way as steps 4.2.1) and 4.2.2), perform fragment assembly on individual x′_j and compute the corresponding Manhattan distance value similarity_mutation_2, finally obtaining the new individual x″_j;

4.2.5) Compute the Manhattan distance between the distance vector of the target individual x_i and the prior-knowledge feature vector D＝{d₁,…,d_m,d_1,2,…,d_k,k+1};

5) Select among the target individual x_i and the mutated individuals x″_i and x″_j according to their energy and distance similarity, choose the dominant individual, and update the population as follows:

5.1) Using the Rosetta Score3 energy function E, compute the energies E(x_i), E(x″_i), and E(x″_j) of the target individual x_i and the mutated individuals x″_i and x″_j;

5.2) Among the target individual x_i and the mutated individuals x″_i and x″_j, if some individual X∈{x_i,x″_i,x″_j} has a lower energy than the other two individuals and also a smaller Manhattan distance than the other two, that individual is the dominant individual. If some individual X′∈{x_i,x″_i,x″_j} has only a lower energy than the other two individuals, it is set as the dominant individual with selection probability P_s. Likewise, if some individual X″∈{x_i,x″_i,x″_j} has only a smaller Manhattan distance than the other two individuals, it is set as the dominant individual with selection probability P_s. Finally, the dominant individual replaces the target individual and the population is updated;

6) Check whether the maximum number of generations G_max has been reached; if this termination condition is met, output the result, otherwise return to step 4).