CN110706741A - Multi-modal protein structure prediction method based on sequence niche

Info

Publication number: CN110706741A
Application number: CN201910793341.1A
Authority: CN
Inventors: 张贵军; 夏瑜豪; 饶亮; 刘俊; 彭春祥; 周晓根
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2019-08-27
Filing date: 2019-08-27
Publication date: 2020-01-17
Anticipated expiration: 2039-08-27
Also published as: CN110706741B

Abstract

A multi-modal protein structure prediction method based on sequence niches. Under a Monte Carlo framework, the first round of search uses the original energy function; thereafter, the energy function for each subsequent run is constructed from the conformation information obtained in the preceding run, so that conformations are not repeatedly trapped in the same energy trap. This mitigates the inaccuracy of the energy function, strengthens the sampling capability, raises the sampling efficiency, and improves the prediction accuracy. The invention thus provides a sequence-niche-based multi-modal protein structure prediction method with high prediction accuracy.

Description

Multi-modal protein structure prediction method based on sequence niche

Technical Field

The invention relates to the fields of bioinformatics and computer application, in particular to a sequence niche-based multi-modal protein structure prediction method.

Background

Protein molecules are central to many biochemical processes in cells. They are synthesized, transported, and degraded in cells at precise times and locations, and the functions they perform to sustain the life activities of the organism depend on their three-dimensional structures. Accurately obtaining the three-dimensional structure of a protein and elucidating the relationship between that structure and its biological function is therefore a major challenge.

Currently, the three-dimensional structure of proteins is determined mainly by wet-lab experimental methods such as X-ray diffraction, nuclear magnetic resonance (NMR), and cryo-electron microscopy. X-ray diffraction is at present the most effective method for determining protein structure and achieves an accuracy that other methods cannot match; its main drawbacks are that protein crystals are difficult to grow and that determining a crystal structure takes a long time. NMR can measure the structure of a protein directly in solution, but it requires a large quantity of high-purity sample and can currently be applied only to small proteins. Experimental structure determination has two main problems. First, it is difficult to determine the structure of membrane proteins, which are the main targets of modern drug design. Second, the process is time-consuming and expensive; for example, determining one protein structure by NMR typically requires 15 thousand dollars and half a year. Furthermore, the rate at which structures are determined experimentally lags far behind the rate at which sequences are determined. An efficient, fast, and simple method for predicting the structure of unknown proteins is therefore urgently needed. Anfinsen suggested in 1961 that the amino acid sequence of a protein determines the spatial arrangement responsible for its biological activity, and methods that use computers to predict the three-dimensional structure of a protein from its amino acid sequence were proposed on that basis. Such methods mainly comprise homology modeling and de novo prediction. The de novo prediction method works directly from a physics-based or knowledge-based energy model of the protein and uses an optimization algorithm to search the conformational space for the globally optimal solution.

However, the conformational space of a protein is extremely large and complex, and existing methods have three major shortcomings. First, the energy function is imprecise: a lower-energy conformation is not necessarily closer to the native structure, so satisfactory results cannot be found reliably. Second, the sampling capability of current optimization methods is insufficient: energy barriers are difficult to cross during sampling, so the search is confined to one potential energy trap and the overall prediction accuracy suffers. Third, the traditional Monte Carlo method restarts from scratch on every run, so the trajectories are completely independent and exchange no information; conformations obtained from multiple runs therefore tend to fall into the same trap repeatedly, and structurally diverse conformations are difficult to obtain.
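The difficulty of crossing energy barriers can be made concrete with the Metropolis acceptance rule, which Rosetta-style fragment assembly commonly uses. The patent text does not spell this rule out, so the following is only an illustrative sketch; the function name and the temperature parameter `kt` are assumptions introduced here.

```python
import math


def acceptance_probability(delta_e, kt):
    """Metropolis probability of accepting a move that changes the energy by delta_e.

    delta_e and kt are in the same (arbitrary) energy units; kt must be positive.
    """
    if kt <= 0:
        raise ValueError("kt must be positive")
    if delta_e <= 0:
        # Downhill or neutral moves are always accepted.
        return 1.0
    # Uphill moves are accepted with a probability that decays exponentially.
    return math.exp(-delta_e / kt)


# Acceptance probability for uphill moves of increasing height (in units of kt).
for barrier in (1.0, 5.0, 10.0):
    print(barrier, acceptance_probability(barrier, 1.0))
```

Because the probability falls exponentially with the height of the uphill move, a trajectory tends to remain in whichever trap it reaches first, which is the behavior the third shortcoming describes.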

Existing protein structure prediction methods therefore suffer from an inaccurate energy function, insufficient sampling capability, low sampling efficiency, and insufficient prediction accuracy, and need to be improved.

Disclosure of Invention

To overcome the inaccurate energy function, insufficient sampling capability, low sampling efficiency, and insufficient prediction accuracy of existing protein structure prediction methods, the invention provides a sequence-niche-based multi-modal protein structure prediction method. The method runs several Monte Carlo trajectories in series and, after each run, constructs the energy function for the next run from the conformation information just obtained. Conformations are thereby kept out of the trap found in the previous round, which strengthens the sampling capability, raises the sampling efficiency, and improves the overall prediction accuracy.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A sequence-niche-based multi-modal protein structure prediction method, comprising the following steps:

1) inputting the sequence information of the target protein;

2) acquiring the 3-residue and 9-residue fragment library files from the ROBETTA server (http://www.robetta.org/) according to the target protein sequence;

3) setting the parameters: maximum number of iterations G, energy function coefficient k, and degradation function coefficient m;

4) setting g = 1, where g ∈ {1, 2, ..., G};

5) conformation initialization: generating an initial conformation using the first and second stages of the Rosetta protocol; if g = 1, continuing with step 6); otherwise, going to step 7);

6) the initial modal conformation is generated as follows:

6.1) denoting P as the target conformation, running the fourth stage of the Rosetta protocol from the initial conformation generated in step 5), with the energy function in Rosetta set to M_g(P) = score3(P), and recording the highest-energy accepted conformation and the lowest-energy accepted conformation of the fourth stage;

6.2) taking the dihedral angles of the i-th residue of the lowest-energy accepted conformation and of the highest-energy accepted conformation, with L being the sequence length of the target protein, and calculating the niche radius r^g according to formula (1);

6.3) executing step 8);

7) the multi-modal conformations are generated as follows:

7.1) denoting P as the target conformation, φ_i and ω_i as dihedral angles of the i-th residue of P, and M_g(P) as the energy function of the g-th iteration; taking the highest energy value, the lowest-energy conformation, and the niche radius r^(g-1) of the (g-1)-th iteration, together with the distance between the target conformation and that lowest-energy conformation; running the fourth stage of the Rosetta protocol from the initial conformation generated in step 5), with the energy function M_g(P) computed from these quantities;

7.2) recording the highest-energy accepted conformation and the lowest-energy accepted conformation of the fourth stage of Rosetta, and calculating the niche radius r^g according to formula (1) in step 6.2);

8) setting g = g + 1; if g > G, executing step 9); otherwise, returning to step 5);

9) outputting the G lowest-energy conformations obtained in the G iterations, one for each g ∈ {1, 2, ..., G}, as the final prediction result.
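Step 7.1) relies on a distance between the target conformation and the lowest-energy conformation of the previous iteration, but the formula did not survive in this text. One plausible choice, shown purely as an assumption and not as the patented formula, is the root mean square of the wrapped differences between corresponding dihedral angles. In the sketch below a conformation is a hypothetical list of per-residue angle tuples in degrees, and both function names are introduced here for illustration.

```python
import math


def wrapped_diff(a, b):
    """Smallest signed difference a - b between two angles in degrees, in [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def dihedral_distance(conf_a, conf_b):
    """Root mean square wrapped difference over all dihedral angles.

    ASSUMPTION: this metric is illustrative; the source formula is not available.
    Each conformation is a list with one tuple of dihedral angles per residue.
    """
    if len(conf_a) != len(conf_b) or not conf_a:
        raise ValueError("conformations must be non-empty and of equal length")
    total = 0.0
    count = 0
    for res_a, res_b in zip(conf_a, conf_b):
        for a, b in zip(res_a, res_b):
            total += wrapped_diff(a, b) ** 2
            count += 1
    return math.sqrt(total / count)
```

Wrapping matters because dihedral angles are periodic: 179 degrees and -179 degrees differ by 2 degrees, not 358.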

The technical conception of the invention is as follows: under the Monte Carlo framework, a first round of search is carried out with the original energy function; then, after each run, the energy function for the next run is constructed from the conformation information just obtained, so that conformations are not repeatedly trapped in the same energy trap; finally, the lowest-energy conformation of each run is output as the final prediction. The sequence-niche-based multi-modal protein structure prediction method not only mitigates the inaccuracy of the energy function but also strengthens the sampling capability and raises the sampling efficiency, thereby improving the prediction accuracy.
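The serial scheme can be sketched on a one-dimensional toy landscape. Everything specific in this sketch is an assumption: the double-well energy stands in for score3, the niche radius is a fixed hypothetical parameter rather than the value from formula (1), and the penalty uses an exponential derating function of the kind found in sequential niche techniques, because the patent's own energy formula is not available in this text.

```python
import math
import random


def base_energy(x):
    """Toy double-well landscape (deeper well near x = -1); a stand-in for score3."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def derate(d, radius, m):
    """Exponential derating: m at the niche centre, rising to 1 at the radius."""
    if d >= radius:
        return 1.0
    return math.exp(math.log(m) * (radius - d) / radius)


def modified_energy(x, niches, k, m):
    """ASSUMED penalty: lift the energy inside each niche toward that run's highest energy."""
    e = base_energy(x)
    for centre, radius, e_high in niches:
        g = derate(abs(x - centre), radius, m)
        e += k * max(0.0, e_high - e) * (1.0 - g)
    return e


def monte_carlo(energy, x0, steps, kt, rng):
    """Metropolis search; returns the lowest-energy accepted x and the highest accepted energy."""
    x, e = x0, energy(x0)
    best_x, best_e, worst_e = x, e, e
    for _ in range(steps):
        cand = x + rng.uniform(-0.5, 0.5)
        e_cand = energy(cand)
        if e_cand <= e or rng.random() < math.exp(-(e_cand - e) / kt):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
            worst_e = max(worst_e, e)
    return best_x, worst_e


def sequential_niche_search(G=2, k=1.0, m=0.001, radius=1.0, seed=0):
    """Run G trajectories in series; each run avoids the niches found by earlier runs."""
    rng = random.Random(seed)
    niches, minima = [], []
    for _ in range(G):
        best_x, e_high = monte_carlo(
            lambda x: modified_energy(x, niches, k, m), 0.0, 10000, 0.5, rng
        )
        minima.append(best_x)
        niches.append((best_x, radius, e_high))
    return minima


print(sequential_niche_search())
```

In this toy setting the first run settles in the deeper well and the second run, seeing that well filled in, reports the other one, which mirrors the intent of outputting one lowest-energy conformation per run.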

The beneficial effects of the invention are as follows: the sequence niche strategy improves the sampling efficiency; outputting multiple conformations mitigates the drawback of evaluating conformations with only a single energy function and increases conformational diversity, thereby improving the overall prediction accuracy.

Drawings

FIG. 1 is a schematic diagram of the conformation update when the sequence-niche-based multi-modal protein structure prediction method performs structure prediction on protein 1FNA.

FIG. 2 is the three-dimensional structure obtained when the sequence-niche-based multi-modal protein structure prediction method performs structure prediction on protein 1FNA.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

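Referring to FIG. 1 and FIG. 2, a multi-modal protein structure prediction method based on sequence niches comprises the following steps:

1) inputting the sequence information of the target protein;

2) acquiring the 3-residue and 9-residue fragment library files from the ROBETTA server (http://www.robetta.org/) according to the target protein sequence;

3) setting the parameters: maximum number of iterations G, energy function coefficient k, and degradation function coefficient m;

4) setting g = 1, where g ∈ {1, 2, ..., G};

5) conformation initialization: generating an initial conformation using the first and second stages of the Rosetta protocol; if g = 1, continuing with step 6); otherwise, going to step 7);

6) the initial modal conformation is generated as follows:

6.1) denoting P as the target conformation, running the fourth stage of the Rosetta protocol from the initial conformation generated in step 5), with the energy function in Rosetta set to M_g(P) = score3(P), and recording the highest-energy accepted conformation and the lowest-energy accepted conformation of the fourth stage;

6.2) taking the dihedral angles of the i-th residue of the lowest-energy accepted conformation and of the highest-energy accepted conformation, with L being the sequence length of the target protein, and calculating the niche radius r^g according to formula (1);

6.3) executing step 8);

7) the multi-modal conformations are generated as follows:

7.1) denoting P as the target conformation, φ_i and ω_i as dihedral angles of the i-th residue of P, and M_g(P) as the energy function of the g-th iteration; taking the highest energy value, the lowest-energy conformation, and the niche radius r^(g-1) of the (g-1)-th iteration, together with the distance between the target conformation and that lowest-energy conformation; running the fourth stage of the Rosetta protocol from the initial conformation generated in step 5), with the energy function M_g(P) computed from these quantities;

7.2) recording the highest-energy accepted conformation and the lowest-energy accepted conformation of the fourth stage of Rosetta, and calculating the niche radius r^g according to formula (1) in step 6.2);

8) setting g = g + 1; if g > G, executing step 9); otherwise, returning to step 5);

9) outputting the G lowest-energy conformations obtained in the G iterations, one for each g ∈ {1, 2, ..., G}, as the final prediction result.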

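The present embodiment takes protein 1FNA, with a sequence length of 91, as an example of the sequence-niche-based multi-modal protein structure prediction method, which comprises the following steps:

1) inputting the sequence information of the target protein;

2) acquiring the 3-residue and 9-residue fragment library files from the ROBETTA server (http://www.robetta.org/) according to the target protein sequence;

3) setting the parameters: maximum number of iterations G = 5, energy function coefficient k = 1, and degradation function coefficient m = 0.001;

4) setting g = 1, where g ∈ {1, 2, ..., G};

5) conformation initialization: generating an initial conformation using the first and second stages of the Rosetta protocol; if g = 1, continuing with step 6); otherwise, going to step 7);

6) the initial modal conformation is generated as follows:

6.1) denoting P as the target conformation, running the fourth stage of the Rosetta protocol from the initial conformation generated in step 5), with the energy function in Rosetta set to M_g(P) = score3(P), and recording the highest-energy accepted conformation and the lowest-energy accepted conformation of the fourth stage;

6.2) taking the dihedral angles of the i-th residue of the lowest-energy accepted conformation and of the highest-energy accepted conformation, with L being the sequence length of the target protein, and calculating the niche radius r^g according to formula (1);

6.3) executing step 8);

7) the multi-modal conformations are generated as follows:

7.1) denoting P as the target conformation, φ_i and ω_i as dihedral angles of the i-th residue of P, and M_g(P) as the energy function of the g-th iteration; taking the highest energy value, the lowest-energy conformation, and the niche radius r^(g-1) of the (g-1)-th iteration, together with the distance between the target conformation and that lowest-energy conformation; running the fourth stage of the Rosetta protocol from the initial conformation generated in step 5), with the energy function M_g(P) computed from these quantities;

7.2) recording the highest-energy accepted conformation and the lowest-energy accepted conformation of the fourth stage of Rosetta, and calculating the niche radius r^g according to formula (1) in step 6.2);

8) setting g = g + 1; if g > G, executing step 9); otherwise, returning to step 5);

9) outputting the G lowest-energy conformations obtained in the G iterations, one for each g ∈ {1, 2, ..., G}, as the final prediction result.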

Taking protein 1FNA, with a sequence length of 91, as an example, the above method was used to obtain near-native conformations of the protein. The conformation update diagram is shown in FIG. 1, and the root mean square deviation between each of the 5 structures obtained from the 5 runs and the native structure was calculated. The predicted three-dimensional structure is shown in FIG. 2.
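As a rough illustration of how such a deviation is measured, the sketch below computes the root mean square deviation between two equally long lists of atom coordinates. It assumes that the two structures have already been superimposed and that one atom per residue (for example Cα) is compared; neither detail is stated in the source, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two pre-superimposed coordinate lists.

    ASSUMPTION: no superposition is performed here; each entry is an (x, y, z) tuple.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

A lower value indicates a predicted structure closer to the native one; in practice the structures would first be optimally superimposed before this calculation.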

The foregoing describes one embodiment of the invention and the advantageous results it achieves. The invention is not limited to this embodiment and may be modified in numerous ways without departing from the basic inventive concept or exceeding its scope.

Claims

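1. A multi-modal protein structure prediction method based on sequence niches, characterized in that the method comprises the following steps:

1) inputting the sequence information of the target protein;

2) acquiring the 3-residue and 9-residue fragment library files from the ROBETTA server according to the target protein sequence;

3) setting the parameters: maximum number of iterations G, energy function coefficient k, and degradation function coefficient m;

4) setting g = 1, where g ∈ {1, 2, ..., G};

5) conformation initialization: generating an initial conformation using the first and second stages of the Rosetta protocol; if g = 1, continuing with step 6); otherwise, going to step 7);

6) the initial modal conformation is generated as follows:

6.1) denoting P as the target conformation, running the fourth stage of the Rosetta protocol from the initial conformation generated in step 5), with the energy function in Rosetta set to M_g(P) = score3(P), and recording the highest-energy accepted conformation and the lowest-energy accepted conformation of the fourth stage;

6.2) taking the dihedral angles of the i-th residue of the lowest-energy accepted conformation and of the highest-energy accepted conformation, with L being the sequence length of the target protein, and calculating the niche radius r^g according to formula (1);

6.3) executing step 8);

7) the multi-modal conformations are generated as follows:

7.1) denoting P as the target conformation, φ_i and ω_i as dihedral angles of the i-th residue of P, and M_g(P) as the energy function of the g-th iteration; taking the highest energy value, the lowest-energy conformation, and the niche radius r^(g-1) of the (g-1)-th iteration, together with the distance between the target conformation and that lowest-energy conformation; running the fourth stage of the Rosetta protocol from the initial conformation generated in step 5), with the energy function M_g(P) computed from these quantities;

7.2) recording the highest-energy accepted conformation and the lowest-energy accepted conformation of the fourth stage of Rosetta, and calculating the niche radius r^g according to formula (1) in step 6.2);

8) setting g = g + 1; if g > G, executing step 9); otherwise, returning to step 5);

9) outputting the G lowest-energy conformations obtained in the G iterations, one for each g ∈ {1, 2, ..., G}, as the final prediction result.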