CN107609342B

CN107609342B - Protein conformation search method based on secondary structure space distance constraint

Info

Publication number: CN107609342B
Application number: CN201710683896.1A
Authority: CN
Inventors: 张贵军; 王小奇; 马来发; 周晓根; 谢腾宇; 王柳静; 孙科
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2017-08-11
Filing date: 2017-08-11
Publication date: 2020-08-18
Anticipated expiration: 2037-08-11
Also published as: CN107609342A

Abstract

A protein conformation search method based on secondary structure spatial distance constraints. Within the framework of a genetic algorithm, the spatial length of each secondary structure in the target protein and the spatial distance between the central residues of each pair of adjacent secondary structures form a feature vector that serves as a spatial constraint, so that, for a given energy function, the search is confined to a smaller conformational space. The spatial distance information is also incorporated into the selection operator, which compensates for the inaccuracy of the energy function and effectively improves the accuracy of structure modeling. The method provides higher sampling efficiency, higher prediction accuracy and low computational cost.

Description

Protein conformation search method based on secondary structure space distance constraint

Technical Field

The invention relates to the fields of bioinformatics, artificial intelligence optimization and computer applications, and in particular to a protein conformation search method based on secondary structure spatial distance constraints.

Background

Proteins are biological macromolecules formed by the dehydration condensation of amino acids and play a decisive role in human health; an accurate understanding of protein structure and function is therefore of great importance to disease research, the biopharmaceutical industry and related fields. Protein structures are currently obtained in two main ways: experimental determination and theoretical prediction. Experimental methods include X-ray crystallography, nuclear magnetic resonance spectroscopy and electron microscopy. Although these methods can accurately determine the three-dimensional structure of certain proteins, they are time-consuming and expensive, and the structures of some proteins cannot be obtained experimentally at all. Predicting protein structure by computational methods has therefore become a focus of bioinformatics research. Theoretical prediction uses computer technology and intelligent optimization algorithms to predict the three-dimensional structure of a protein from its primary amino acid sequence, which reduces both the cost and the time of prediction and makes the approach more widely applicable than experimental methods. Owing to the complexity of protein structure, however, predicting the three-dimensional structure of proteins remains an unsolved problem.

In ab initio protein structure prediction, evolutionary algorithms such as genetic algorithms and differential evolution are important methods for optimizing protein conformations, offering fast convergence, a simple structure and strong robustness. However, when the protein sequence is relatively long, the conformational space becomes very large; if the search is guided only by a particular energy function, the inaccuracy of that function means that the lowest-energy conformation found is not guaranteed to be the one closest to the native structure, and the correct fold is often not formed.

Therefore, the existing conformational space search methods have defects in prediction accuracy and sampling efficiency, and need to be improved.

Disclosure of Invention

In order to overcome the defects of low sampling efficiency and low prediction precision of the conventional protein structure prediction conformation space search method, the invention provides a protein conformation search method based on secondary structure space distance constraint, which has high sampling efficiency and high prediction precision.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a method of protein conformation search based on secondary structure spatial distance constraints, the method comprising the steps of:

1) given input sequence information;

2) initializing parameters: set the population size NP, the maximum number of generations G_max, the crossover probability P_c, the number of iterations for generating the initial population iteration, the crossover fragment length frag_length, the assembly counter reject_number, the maximum number of assembly attempts reject_max, the feature vector D = {d_1, …, d_m, d_{1,2}, …, d_{k,k+1}} formed in the prior knowledge from the spatial length of each secondary structure and the spatial distance between the central residues of each pair of adjacent secondary structures, where d_m is the length of the m-th secondary structure block of the target protein and d_{k,k+1} is the spatial distance between the central residues of the k-th and (k+1)-th secondary structure blocks, the maximum distance constraint range, and the selection probability P_s;

3) initializing the population: start NP Monte Carlo trajectories, run each trajectory for iteration search steps, and thereby generate NP initial individuals;

4) for each target individual x_i and a randomly selected individual x_j, where i, j ∈ {1, …, NP} and j ≠ i, perform the following operations:

4.1) with probability P_c, perform the crossover operation on individuals x_i and x_j as follows:

4.1.1) randomly select a crossover start point begin_position within the allowable range [1, total_residual − frag_length], and compute the crossover end point end_position = begin_position + frag_length, where total_residual is the total number of residues;

4.1.2) exchange the torsion angles of x_i and x_j at every position in [begin_position, end_position], generating the new individuals x′_i and x′_j, i.e. the crossed individuals;

4.2) perform the following mutation operations on the crossed individuals x′_i and x′_j:

4.2.1) perform a spatial conformation search on the crossed individual x′_i by the fragment assembly technique, then compute the length of each secondary structure of x′_i after fragment assembly and the spatial distance between the central residues of each pair of adjacent secondary structures, which together form the distance vector

$$D' = \{d'_1, \dots, d'_m, d'_{1,2}, \dots, d'_{k,k+1}\}$$

where $d'_m$ is the length of the m-th secondary structure block of the crossed individual x′_i and $d'_{k,k+1}$ is the spatial distance between the central residues of its k-th and (k+1)-th secondary structure blocks;

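4.2.2) according to the formula

$$\text{similarity\_mutation\_1} = \sum_{p=1}^{m} \left| d'_p - d_p \right| + \sum_{q=1}^{k} \left| d'_{q,q+1} - d_{q,q+1} \right|$$

compute the Manhattan distance similarity_mutation_1 between the distance vector D′ = {d′_1, …, d′_m, d′_{1,2}, …, d′_{k,k+1}} of individual x′_i and the feature vector D = {d_1, …, d_m, d_{1,2}, …, d_{k,k+1}} in the prior knowledge; if similarity_mutation_1 is less than or equal to the maximum distance constraint range, the individual x″_i generated by mutation satisfies the secondary structure spatial distance constraint and the method proceeds to step 4.2.4); otherwise it proceeds to step 4.2.3);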

4.2.3) the counter reject_number starts counting; while reject_number ≤ reject_max, steps 4.2.1) and 4.2.2) are repeated in sequence until similarity_mutation_1 is less than or equal to the maximum distance constraint range, which yields the new individual x″_i; once reject_number exceeds reject_max, step 4.2.1) is executed once more and its result is taken as the new individual x″_i;

4.2.4) apply steps 4.2.1) and 4.2.2) in the same way to individual x′_j: assemble fragments and compute the corresponding Manhattan distance similarity_mutation_2, obtaining the new individual x″_j;

4.2.5) using the same Manhattan distance formula as in step 4.2.2), compute the Manhattan distance between the distance vector of the target individual x_i and the feature vector D = {d_1, …, d_m, d_{1,2}, …, d_{k,k+1}} in the prior knowledge;

5) select the dominant individual according to the energy and the distance similarity of the target individual x_i and the mutated individuals x″_i and x″_j, and update the population, as follows:

5.1) compute the energies E(x_i), E(x″_i) and E(x″_j) of the target individual x_i and the mutated individuals x″_i and x″_j with the Rosetta Score3 function;

5.2) among the target individual x_i and the mutated individuals x″_i and x″_j: if an individual X ∈ {x_i, x″_i, x″_j} has both an energy value lower than those of the other two individuals and a Manhattan distance value lower than those of the other two individuals, it is the dominant individual; if an individual X′ ∈ {x_i, x″_i, x″_j} has only an energy value lower than those of the other two individuals, it is set as the dominant individual with the selection probability P_s; similarly, if an individual X″ ∈ {x_i, x″_i, x″_j} has only a Manhattan distance value lower than those of the other two individuals, it is set as the dominant individual with the selection probability P_s; finally, the dominant individual replaces the target individual and the population is updated;

6) check whether the maximum number of generations G_max has been reached; if this termination condition is satisfied, output the result; otherwise return to step 4).
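The crossover of step 4.1) can be sketched as follows. This is a minimal illustration in Python, not the implementation used in the invention: it assumes that an individual is represented as a list of per-residue torsion-angle tuples, and the function name crossover and the injectable random source rng are hypothetical. Positions are 1-based and the window [begin_position, end_position] is treated as closed, as written in step 4.1.2), so frag_length + 1 residues are exchanged.

```python
import random


def crossover(angles_i, angles_j, frag_length, p_c, rng=random):
    """Exchange torsion angles of two individuals over one contiguous window.

    angles_i / angles_j: lists of per-residue torsion-angle tuples (assumed
    representation). Returns the two children and the 1-based window bounds,
    or (copy, copy, None, None) when no crossover is performed.
    """
    total_residual = len(angles_i)
    if len(angles_j) != total_residual:
        raise ValueError("both individuals must have the same number of residues")
    if not 0 < frag_length < total_residual:
        raise ValueError("frag_length must be positive and shorter than the chain")

    child_i, child_j = list(angles_i), list(angles_j)
    # Step 4.1): crossover is applied only with probability P_c.
    if rng.random() >= p_c:
        return child_i, child_j, None, None

    # Step 4.1.1): start point in [1, total_residual - frag_length] (1-based).
    begin_position = rng.randint(1, total_residual - frag_length)
    end_position = begin_position + frag_length

    # Step 4.1.2): swap angles at every position of the closed window.
    for position in range(begin_position, end_position + 1):
        index = position - 1  # convert to a 0-based list index
        child_i[index], child_j[index] = child_j[index], child_i[index]
    return child_i, child_j, begin_position, end_position
```

Returning the window bounds is a convenience for inspection only; the method itself needs just the two crossed individuals.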

The technical conception of the invention is as follows: within the framework of a genetic algorithm, the spatial length of each secondary structure in the target protein and the spatial distance between the central residues of each pair of adjacent secondary structures form a feature vector that serves as a spatial constraint, so that, for a given energy function, the search is confined to a smaller conformational space; at the same time, the spatial distance information is incorporated into the selection operator, which compensates for the inaccuracy of the energy function and effectively improves the accuracy of structure modeling.
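The way the feature vector restricts the search in steps 4.2.1) to 4.2.3) amounts to a rejection loop around fragment assembly. The sketch below illustrates only that control flow, under stated assumptions: assemble and feature_vector are hypothetical callables standing in for fragment assembly and for the computation of the distance vector, every retry is assumed to restart from the crossed individual, and max_range is a hypothetical name for the maximum distance constraint range.

```python
def manhattan_distance(u, v):
    """Sum of absolute component-wise differences of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def constrained_mutation(individual, prior_vector, max_range, reject_max,
                         assemble, feature_vector):
    """Retry fragment assembly until the distance constraint is met.

    Returns (new_individual, similarity, reject_number). When more than
    reject_max attempts have been rejected, one further assembly is
    accepted unconditionally, following step 4.2.3).
    """
    reject_number = 0
    while True:
        # Step 4.2.1): fragment assembly (hypothetical callable).
        candidate = assemble(individual)
        # Step 4.2.2): Manhattan distance to the prior feature vector D.
        similarity = manhattan_distance(feature_vector(candidate), prior_vector)
        if similarity <= max_range:
            return candidate, similarity, reject_number
        # Step 4.2.3): count the rejection and check the retry budget.
        reject_number += 1
        if reject_number > reject_max:
            candidate = assemble(individual)
            similarity = manhattan_distance(feature_vector(candidate), prior_vector)
            return candidate, similarity, reject_number
```

The unconditional fallback keeps the loop finite, at the price of occasionally admitting a conformation that violates the constraint.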

The beneficial effects of the invention are as follows: on the one hand, the feature vector formed from the spatial length of each secondary structure and the spatial distance between the central residues of adjacent secondary structures serves as a spatial constraint, which reduces the conformational search space, reduces the errors caused by the inaccurate energy function and greatly improves prediction accuracy; on the other hand, within the framework of a genetic algorithm, information exchange among individuals together with the mutation and selection operations on parent individuals accelerates convergence and increases population diversity.
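The selection rule of step 5.2), which lets distance information offset an inaccurate energy function, can be illustrated with a short sketch. It is one possible reading rather than the definitive rule: the function name select_dominant is hypothetical, ties are resolved in favour of the first individual, and the text does not say what happens when neither probabilistic draw succeeds, so keeping the target individual in that case is an assumption.

```python
import random


def select_dominant(energies, distances, p_s, rng=random):
    """Pick which of three individuals survives.

    energies / distances: three values each, ordered as
    (target x_i, mutated x''_i, mutated x''_j).
    Returns the index (0, 1 or 2) of the dominant individual.
    """
    if len(energies) != 3 or len(distances) != 3:
        raise ValueError("expected exactly three energies and three distances")

    best_energy = min(range(3), key=lambda n: energies[n])
    best_distance = min(range(3), key=lambda n: distances[n])

    # Lowest energy AND lowest Manhattan distance: dominant outright.
    if best_energy == best_distance:
        return best_energy
    # Lowest energy only: dominant with probability P_s.
    if rng.random() < p_s:
        return best_energy
    # Lowest Manhattan distance only: dominant with probability P_s.
    if rng.random() < p_s:
        return best_distance
    # Assumption: otherwise the target individual is kept unchanged.
    return 0
```

With P_s = 0 the rule reduces to strict dominance in both criteria; with P_s = 1 it always follows the energy whenever the two criteria disagree.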

Drawings

FIG. 1 is a basic flow diagram of a protein conformation search method based on secondary structure spatial distance constraints.

FIG. 2 is a schematic diagram of conformation update in the structural prediction of protein 1AIL by a protein conformation search method based on secondary structure space distance constraint.

FIG. 3 is a three-dimensional structure diagram of protein 1AIL predicted by a protein conformation search method based on secondary structure space distance constraints.
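The basic flow of steps 4) to 6) can be sketched as a generational loop. The sketch is a hedged outline only: run_search is a hypothetical name, and the crossover, mutate and select arguments are placeholders for the operators of steps 4.1), 4.2) and 5), which in the invention rely on fragment assembly and the Rosetta Score3 energy function.

```python
import random


def run_search(population, g_max, crossover, mutate, select, rng=random):
    """Generational loop over steps 4) to 6) with pluggable operators."""
    population = list(population)  # work on a copy of the initial population
    if len(population) < 2:
        raise ValueError("at least two individuals are needed (j != i)")
    for _generation in range(g_max):            # step 6): stop after G_max
        for i in range(len(population)):        # step 4): every target x_i
            j = rng.choice([n for n in range(len(population)) if n != i])
            crossed_i, crossed_j = crossover(population[i], population[j])
            mutated_i = mutate(crossed_i)       # step 4.2)
            mutated_j = mutate(crossed_j)
            # Step 5): the dominant individual replaces the target.
            population[i] = select(population[i], mutated_i, mutated_j)
    return population
```

Replacing the target in place means that later targets of the same generation may already see updated individuals; whether the original updates the population this way or only at the end of a generation is not stated in the text.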

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 to FIG. 3, a protein conformation search method based on secondary structure spatial distance constraints comprises steps 1) to 6) as set out in the Disclosure of Invention above.
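The distance vector used in steps 4.2.1) and 4.2.5) can be illustrated with a small sketch. The text does not define how the spatial length of a secondary structure block is measured, so the sketch assumes it is the Euclidean distance between the Cα atoms of the first and last residues of the block, and that the central residue is the one at the middle index; the function name feature_vector and the (start, end) block representation are likewise hypothetical.

```python
import math


def feature_vector(ca_coords, blocks):
    """Build [block lengths..., adjacent centre-to-centre distances...].

    ca_coords: list of (x, y, z) C-alpha coordinates, one per residue.
    blocks: list of (start, end) residue indices, 0-based and inclusive,
            one pair per secondary structure block, in sequence order.
    """
    # Assumed definition: end-to-end C-alpha distance of each block.
    lengths = [math.dist(ca_coords[start], ca_coords[end])
               for start, end in blocks]
    # Assumed definition: central residue = middle index of the block.
    centres = [ca_coords[(start + end) // 2] for start, end in blocks]
    # Distances between central residues of adjacent blocks.
    gaps = [math.dist(centres[n], centres[n + 1])
            for n in range(len(centres) - 1)]
    return lengths + gaps
```

For example, ten residues spaced one unit apart on a line with blocks (0, 4) and (6, 9) give the vector [4.0, 3.0, 5.0].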

In this embodiment, the all-α protein 1AIL with a sequence length of 73 is taken as an example. The protein conformation search method based on secondary structure spatial distance constraints comprises the following steps:

1) given input sequence information;

2) initializing parameters: set the population size NP = 200, the maximum number of generations G_max = 2000, the crossover probability P_c = 0.1, the number of iterations for generating the initial population iteration = 2000, the crossover fragment length frag_length = 9, the assembly counter reject_number = 0, the maximum number of assembly attempts reject_max = 100, the feature vector formed in the prior knowledge from the spatial length of each secondary structure and the spatial distance between the central residues of each pair of adjacent secondary structures D = {3.81085, 33.8066, 8.38603, 30.3193, 6.69076, 22.1852, 19.6409, 17.2739, 15.4455, 14.6372, 15.5907, 12.43}, the maximum distance constraint range to 15, and the selection probability P_s = 0.3;

Steps 3) to 6) are then carried out as set out in the Disclosure of Invention above, using these parameter values.
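The parameter values of this embodiment can be collected in a small configuration object, sketched below. The values are those listed in step 2) of the embodiment; the class name SearchConfig, the field names sequence_length and max_distance_range, and the validate method are hypothetical additions for illustration. The counter reject_number is omitted because it is run-time state that starts at 0 rather than a setting.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SearchConfig:
    """Parameter values of the 1AIL embodiment (names partly hypothetical)."""
    sequence_length: int = 73
    NP: int = 200                     # population size
    G_max: int = 2000                 # maximum number of generations
    P_c: float = 0.1                  # crossover probability
    iteration: int = 2000             # iterations per initial trajectory
    frag_length: int = 9              # crossover fragment length
    reject_max: int = 100             # maximum number of assembly attempts
    max_distance_range: float = 15.0  # maximum distance constraint range
    P_s: float = 0.3                  # selection probability
    D: Tuple[float, ...] = (
        3.81085, 33.8066, 8.38603, 30.3193, 6.69076, 22.1852,
        19.6409, 17.2739, 15.4455, 14.6372, 15.5907, 12.43,
    )

    def validate(self) -> None:
        """Raise ValueError if a value is outside its sensible range."""
        if not (0.0 <= self.P_c <= 1.0 and 0.0 <= self.P_s <= 1.0):
            raise ValueError("P_c and P_s must lie in [0, 1]")
        if self.NP < 2:
            raise ValueError("NP must be at least 2 so that j != i is possible")
        if min(self.G_max, self.iteration, self.reject_max) <= 0:
            raise ValueError("G_max, iteration and reject_max must be positive")
        if not 0 < self.frag_length < self.sequence_length:
            raise ValueError("frag_length must be shorter than the sequence")
        if self.max_distance_range < 0 or not self.D:
            raise ValueError("distance range must be >= 0 and D non-empty")


config = SearchConfig()
config.validate()
```

Freezing the dataclass keeps the settings immutable during a run, which makes it harder to change a threshold by accident between generations.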

Taking the all-α protein 1AIL with a sequence length of 73 as an example, the above method was used to obtain a near-native conformation of the protein, with a minimum root-mean-square deviation of [value missing in source] and a mean root-mean-square deviation of [value missing in source]. The predicted structure is shown in FIG. 3.

The above describes the results obtained by the invention with the protein 1AIL as an example and is not intended to limit the scope of the invention; various modifications and improvements can be made without departing from that scope.

Claims

1. A protein conformation search method based on secondary structure space distance constraint is characterized in that: the conformational space search method comprises the following steps:

1) given input sequence information;

2) initializing parameters: setting the population size NP, the maximum number of generations G_max, the crossover probability P_c, the number of iterations for generating the initial population iteration, the crossover fragment length frag_length, the assembly counter reject_number, the maximum number of assembly attempts reject_max, the feature vector D = {d_1, …, d_m, d_{1,2}, …, d_{k,k+1}} formed in the prior knowledge from the spatial length of each secondary structure and the spatial distance between the central residues of each pair of adjacent secondary structures, where d_m is the length of the m-th secondary structure block of the target protein and d_{k,k+1} is the spatial distance between the central residues of the k-th and (k+1)-th secondary structure blocks, the maximum distance constraint range, and the selection probability P_s;

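wherein, in the distance vector of the crossed individual x′_i, d′_m is the length of the m-th secondary structure block of x′_i and d′_{k,k+1} is the spatial distance between the central residues of the k-th and (k+1)-th secondary structure blocks of x′_i;

4.2.2) according to the formula

$$\text{similarity\_mutation\_1} = \sum_{p=1}^{m} \left| d'_p - d_p \right| + \sum_{q=1}^{k} \left| d'_{q,q+1} - d_{q,q+1} \right|$$

calculating the Manhattan distance between the distance vector of individual x′_i and the feature vector D in the prior knowledge;

4.2.5) calculating, by the same Manhattan distance formula, the Manhattan distance between the distance vector of the target individual x_i and the feature vector D in the prior knowledge;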