CN105808972A - Method for predicting protein structure from local to global on basis of knowledge spectrum

Info

Publication number: CN105808972A
Application number: CN201610139514.4A
Authority: CN
Inventors: 张贵军; 俞旭锋; 周晓根; 郝小虎; 王柳静; 李章维
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2016-03-11
Filing date: 2016-03-11
Publication date: 2016-07-27

Abstract

A local-to-global protein structure prediction method based on spectral knowledge, comprising the following steps: first, for the query sequence, a high-quality fragment library is obtained by a multi-feature seamless threading method, and residue-residue distance spectrum knowledge is derived from the fragment library by statistical consistency analysis; then, the query sequence is divided into several segments according to the residue information recorded in the distance spectrum; next, for each segment, fragment assembly is used to obtain a structure that has low energy and whose residue-residue spatial distances approximate the predicted distances in the distance spectrum; finally, fragment assembly is performed on the unsegmented structure and the global energy is calculated, yielding a metastable conformation with low energy and a more reasonable structure. The invention has good conformational space sampling capability and high prediction accuracy.

Description

A Local-to-Global Protein Structure Prediction Method Based on Spectral Knowledge

Technical field

The invention relates to the fields of bioinformatics and computer applications, and in particular to a local-to-global protein structure prediction method based on spectral knowledge.

Background

Protein molecules play a vital role in the chemical reactions of living cells. Their structural models and biologically active states are important for understanding and curing many diseases. A protein can perform its specific biological function only after folding into a particular three-dimensional structure. To understand the function of a protein, its three-dimensional structure must therefore be obtained.

Protein tertiary structure prediction is an important task in bioinformatics. The greatest challenge in protein conformation optimization is searching the extremely complex energy function surface of a protein. Protein energy models account for bonded interactions in the molecular system as well as non-bonded interactions such as van der Waals forces, electrostatics, hydrogen bonds, and hydrophobicity. The resulting energy surface is extremely rugged, and the number of local minima grows exponentially with sequence length. A conformation prediction algorithm can find stable protein structures because a large number of metastable structures make up the low-energy regions; the key to finding the globally most stable structure is therefore the algorithm's ability to find many metastable structures, that is, to increase its population diversity. Consequently, for more accurate protein force field models, choosing an effective conformational space optimization algorithm that makes new prediction algorithms more general and more efficient has become a central problem of protein structure prediction in bioinformatics.

At present, protein structure prediction methods fall roughly into two categories: template-based methods and template-free methods. Among them, template-free ab initio prediction is the most widely used. It applies to most proteins with less than 25% homology and generates a new structure from the sequence alone, which is of great significance for protein molecular design and for research on protein folding. Several relatively successful ab initio methods exist: the TASSER (Threading/Assembly/Refinement) method developed by Zhang Yang together with Jeffrey Skolnick, the Rosetta method designed by David Baker and his team, and the FeLTr method designed by Shehu et al. So far, however, no method predicts the three-dimensional structure of proteins in a fully satisfactory way; even where good predictions are obtained, they hold only for certain proteins. The main technical bottlenecks are twofold. The first lies in the sampling method: the prior art samples the conformational space weakly. The second lies in the conformation update method: the prior art still updates conformations with insufficient accuracy.

Existing protein structure prediction methods are therefore deficient and need to be improved.

Summary of the invention

To overcome the weak conformational space sampling capability and low prediction accuracy of existing protein structure prediction methods, the present invention proposes a local-to-global protein structure prediction method based on spectral knowledge that has better conformational space sampling capability and high prediction accuracy.

The technical solution adopted by the present invention to solve this technical problem is:

A local-to-global protein structure prediction method based on spectral knowledge, the method comprising the following steps:

1) Given the query sequence information;

2) Download from the Protein Data Bank (PDB) website high-precision proteins with a resolution finer than a set threshold (in Å, a unit of distance; 1 Å = 10⁻¹⁰ m), and use the sequence alignment algorithm NW-Align to remove amino acid chains with a sequence similarity greater than 30%, obtaining a non-redundant protein template library;

3) According to the multi-feature similarity function:

Score each residue position of the protein chains in the non-redundant template library against the query sequence by seamless threading, giving f(i,j), where i is the residue position in the query sequence and j is the fragment structure. In f(i,j), the subscript q denotes a feature score term of the query sequence and the subscript t denotes a feature score term of the template protein. P_q(i,k) is the sequence frequency profile of the query sequence obtained by PSI-BLAST, where k ranges over a preset number of amino acid types; L_q(i,k) and L_t(j,k) are the logarithmic profiles of the query sequence and the template sequence obtained by PSI-BLAST; ss_t(j) is the secondary structure class of the template protein, calculated by DSSP; ss_q(i) is the secondary structure class of the query sequence, predicted by a trained two-layer neural network; sa_t(j) and sa_q(i) are the solvent accessibility values of the template structure and the query sequence, obtained from EDTSurf and a trained neural network program; ψ_q(j) is the dihedral angle pair of the query sequence, predicted by a trained two-layer neural network; ψ_t(j) is obtained by looking up the protein dictionary; SP_t(j,k) is the structure profile of the template protein; w₁, w₂, w₃, w₄ and w₅ are weights;

4) According to the similarity score f(i,j), select the M highest-scoring fragments at each position of the query sequence to obtain the fragment library file;

5) Collect the distances between pairs of query-sequence residues that derive from the same template fragment; only residue-pair distances below a cutoff are counted. Plotting these distances as a histogram gives the distance spectrum, with the horizontal axis of the histogram divided into fixed-width distance intervals. Whenever the distance between a residue pair in a template falls within an interval, the count of that interval is increased by 1. If the resulting curve shows a peak in some distance interval within the counted range, the interval at the peak is taken as the predicted distance between residue i and residue j of the target sequence, and the recorded distribution is the distance spectrum (profile) between the two residues;

6) Divide the query sequence into n segments according to the positions of the residues in the resulting distance spectrum;

7) Let l = 1, l ∈ {1,2,3,…,n}, and perform the following operations on the segment structures:

7.1) Perform fragment assembly on the structure of segment l;

7.2) Calculate the distances between the residue positions in the segment that carry distance spectrum information, take the deviation of each from its predicted distance, sum the deviations, and record their average as ΔD;

7.3) If ΔD < R, store the conformation as a cell, where R is the structural accuracy constraint;

7.4) Repeat 7.1) to 7.3) until 100 cells are stored, compare the energies of the segment structures in the cells using Rosetta score3, and take the structure with the lowest energy as the predicted structure of the segment;

7.5) l = l+1; determine whether l is greater than or equal to n; if so, go to 8), otherwise return to 7.1);

8) Set the number of iterations to G, let s = 1, and perform the following operations:

8.1) Calculate the energy E(P_target) of the target induced conformation;

8.2) Perform fragment assembly on the unsegmented structure and calculate the energy E(P_trail);

8.3) If E(P_target) > E(P_trail), replace P_target with P_trail;

8.4) s = s+1; determine whether s is greater than or equal to G; if so, go to 9), otherwise return to 8.1);

9) Output the induced conformation, which is the near-native structure of the query sequence.
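Step 5 can be illustrated with a minimal Python sketch for a single residue pair (i, j). The function names, the cutoff `max_dist`, and the bin width `bin_width` are assumptions made for illustration; the numeric cutoff and interval width are not reproduced in this text, so the values used in the usage note are placeholders. Returning the centre of the peak interval as the predicted distance is also an assumption, since the text only says that the peak interval gives the prediction.

```python
import math


def distance_profile(distances, max_dist, bin_width):
    """Histogram the template-derived distances for one residue pair.

    Only distances below max_dist are counted, as described in step 5.
    Returns a list of counts, one per interval of width bin_width.
    """
    n_bins = math.ceil(max_dist / bin_width)
    counts = [0] * n_bins
    for d in distances:
        if 0.0 <= d < max_dist:
            # Clamp the index to guard against floating-point edge cases.
            index = min(int(d // bin_width), n_bins - 1)
            counts[index] += 1
    return counts


def predicted_distance(counts, bin_width):
    """Return the centre of the most populated interval (assumed convention).

    Returns None when no distance was counted, i.e. the pair has no profile.
    """
    if not counts or max(counts) == 0:
        return None
    peak = max(range(len(counts)), key=counts.__getitem__)
    return (peak + 0.5) * bin_width
```

With placeholder values `max_dist=9.0` and `bin_width=0.5`, the call `predicted_distance(distance_profile(ds, 9.0, 0.5), 0.5)` returns the centre of the interval in which most of the distances `ds` fall, and distances at or beyond the cutoff are ignored.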

The technical concept of the present invention is as follows: first, for the query sequence, a high-quality fragment library is obtained by the multi-feature seamless threading method, and residue-residue distance spectrum knowledge is derived from the fragment library by statistical consistency analysis; then, the query sequence is divided into several segments according to the residue information recorded in the distance spectrum; next, for each segment, fragment assembly is used to obtain a structure that has low energy and whose residue-residue spatial distances approximate the predicted distances in the distance spectrum; finally, fragment assembly is performed on the unsegmented structure and the global energy is calculated, yielding a metastable conformation with low energy and a more reasonable structure.
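The local-to-global control flow of steps 7 and 8 can be sketched in Python as follows. This is a schematic under stated assumptions, not the patented implementation: `propose`, `move`, and `energy` are hypothetical callables standing in for fragment assembly and the Rosetta score3 energy, conformations are represented simply as lists of 3D coordinates, and the `max_tries` guard is an addition that the text does not mention.

```python
import math


def mean_deviation(coords, restraints):
    """ΔD of step 7.2: mean absolute deviation between the actual and the
    predicted distance over all residue pairs that carry a restraint.

    restraints maps (i, j) index pairs to predicted distances.
    """
    total = 0.0
    for (i, j), predicted in restraints.items():
        total += abs(math.dist(coords[i], coords[j]) - predicted)
    return total / len(restraints)


def sample_segment(propose, energy, restraints, r_max, n_cells,
                   max_tries=100_000):
    """Steps 7.1 to 7.4 for one segment.

    Proposals with ΔD < r_max are stored as cells until n_cells are
    collected; the lowest-energy cell is returned. max_tries is a safety
    guard added for this sketch.
    """
    cells = []
    for _ in range(max_tries):
        conformation = propose()  # hypothetical fragment-assembly move
        if mean_deviation(conformation, restraints) < r_max:
            cells.append(conformation)
            if len(cells) == n_cells:
                break
    if not cells:
        raise RuntimeError("no conformation satisfied the distance restraints")
    return min(cells, key=energy)


def refine_global(start, move, energy, n_iter):
    """Step 8: the trial replaces the target only if its energy is lower."""
    target = start
    e_target = energy(target)
    for _ in range(n_iter):
        trial = move(target)  # hypothetical move on the unsegmented structure
        e_trial = energy(trial)
        if e_target > e_trial:
            target, e_target = trial, e_trial
    return target
```

In use, `sample_segment` would be called once per segment, the resulting segment structures would be combined into a starting conformation, and `refine_global` would then be run for G iterations. Because the acceptance rule is strictly greedy, the energy of the returned conformation can never exceed that of the starting one.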

The beneficial effects of the invention are good conformational space sampling capability and high prediction accuracy.

Description of drawings

Figure 1 is a schematic diagram of the relationship between RMSD and energy for the test sequence during the population update process.

Figure 2 is a schematic diagram of the three-dimensional structure of protein 1ENH as predicted by the algorithm.

Detailed description

The invention is further described below with reference to the accompanying drawings.

Referring to Figures 1 and 2, a local-to-global protein structure prediction method based on spectral knowledge comprises the following steps:

1) Given the query sequence information;

7.4) Repeat 7.1) to 7.3) until x cells are stored, compare the energies of the segment structures in the cells using Rosetta score3, and take the structure with the lowest energy as the predicted structure of the segment;

This embodiment takes protein 1ENH, with a sequence length of 54, as an example of the local-to-global protein structure prediction method based on spectral knowledge, which comprises the following steps:

1) Given the query sequence information;

2) Download from the Protein Data Bank (PDB) website high-precision proteins with a resolution finer than a set threshold (in Å, a unit of distance; 1 Å = 10⁻¹⁰ m), and use the sequence alignment algorithm NW-Align to remove amino acid chains with a sequence similarity greater than 30%, obtaining a non-redundant template library of 8619 protein chains;

4) According to the similarity score f(i,j), select the 200 highest-scoring fragments at each position of the query sequence to obtain the fragment library file;

8) Set the number of iterations to G = 50000, let s = 1, and perform the following operations:

8.4) s = s+1; determine whether s is greater than or equal to G = 50000; if so, go to 9), otherwise return to 8.1);
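The fragment selection in step 4 of this embodiment (the 200 highest-scoring fragments per query position) amounts to a per-position top-M selection, which can be sketched in Python as below. The data layout, a dictionary mapping each query position to a list of `(score, template_id, start)` tuples, and the function name are assumptions for illustration; the scores would come from the similarity function f(i,j), which is not implemented here.

```python
import heapq


def top_fragments(scores, m=200):
    """Keep the m highest-scoring candidate fragments at each query position.

    scores maps a query position i to a list of (score, template_id, start)
    tuples (a hypothetical layout). The result maps each position to its
    candidates sorted from highest to lowest score.
    """
    return {
        position: heapq.nlargest(m, candidates, key=lambda c: c[0])
        for position, candidates in scores.items()
    }
```

`heapq.nlargest` avoids sorting the full candidate list, which is helpful when each position is scored against every window of a template library of several thousand chains.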

Taking protein 1ENH, with a sequence length of 54, as an example, the above method yields a near-native conformation of the protein. The conformation updates within the conformational ensemble are shown in Figure 1, and the three-dimensional structure predicted by the algorithm is shown in Figure 2.

The above describes the good results shown by one embodiment of the invention. The invention is clearly not limited to this embodiment and may be implemented with various changes that do not depart from its basic spirit or go beyond its essential content.

Claims

1. A protein structure prediction method from local to global based on spectral knowledge, characterized in that: the protein structure prediction method comprises the following steps:

1) given query sequence information;

2) Download high-precision proteins with a resolution finer than a set threshold (in Å, a unit of distance), and use the sequence alignment algorithm NW-Align to remove amino acid chains with a sequence similarity greater than 30%, obtaining a non-redundant protein template library;

3) According to the multi-feature similarity function:

Score each residue position of the protein chains in the non-redundant template library against the query sequence by seamless threading, giving f(i,j), where i is the residue position in the query sequence and j is the fragment structure. In f(i,j), the subscript q denotes a feature score term of the query sequence and the subscript t denotes a feature score term of the template protein. P_q(i,k) is the sequence frequency profile of the query sequence obtained by PSI-BLAST, where k ranges over a preset number of amino acid types; L_q(i,k) and L_t(j,k) are the logarithmic profiles of the query sequence and the template sequence obtained by PSI-BLAST; ss_t(j) is the secondary structure class of the template protein, calculated by DSSP; ss_q(i) is the secondary structure class of the query sequence, predicted by a trained two-layer neural network; sa_t(j) and sa_q(i) are the solvent accessibility values of the template structure and the query sequence, obtained from EDTSurf and a trained neural network program; ψ_q(j) is the dihedral angle pair of the query sequence, predicted by a trained two-layer neural network; ψ_t(j) is obtained by looking up the protein dictionary; SP_t(j,k) is the structure profile of the template protein; w₁, w₂, w₃, w₄ and w₅ are weights;

4) According to the similarity score f(i, j), select the M fragments with the highest score at each position of the query sequence to obtain the fragment library file;

5) Collect the distances between pairs of query-sequence residues that derive from the same template fragment; only residue-pair distances below a cutoff are counted. Plotting these distances as a histogram gives the distance spectrum, with the horizontal axis of the histogram divided into fixed-width distance intervals. Whenever the distance between a residue pair in a template falls within an interval, the count of that interval is increased by 1. If the resulting curve shows a peak in some distance interval within the counted range, the interval at the peak is taken as the predicted distance between residue i and residue j of the target sequence, and the recorded distribution is the distance spectrum (profile) between the two residues;

6) dividing the query sequence into n segments according to the positions of the residues in the obtained distance spectrum;

7) Let l = 1, l ∈ {1,2,3,...,n}, perform the following operations on the segmented structure:

7.1) Perform fragment assembly on the structure of segment l;

7.2) Calculate the distance between the residue positions containing the distance spectrum information, and calculate the deviation from the predicted distance, accumulate the deviation value and take the average and record it as ΔD;

7.3) If ΔD<R, store the conformation as a cell, where R is the structural accuracy constraint;

7.4) Repeat 7.1) to 7.3) until the number of stored cells reaches x cells, compare the energy of the segmented structure in the cell based on RosettaScore3, and select the structure with the lowest energy as the predicted structure of the segment;

7.5) l = l+1; determine whether l is greater than or equal to n; if so, go to 8), otherwise return to 7.1);

8) Set the number of iterations as G, let s=1, perform the following operations:

8.1) Calculate the target-induced conformational energy E(P _target );

8.2) Fragment assembly is performed on the unsegmented structure, and the energy value E(P _trail ) is calculated;

8.3) If E(P _target )>E(P _trail ), replace P _target with P _trail ;

8.4) s=s+1; judge whether s is greater than or equal to G, if so, enter 9), otherwise return to 8.1);

9) Output the induced conformation, and obtain the near-native state structure of the query sequence.