Background
Nucleic acids are the blueprint of life, and proteins are its machinery. Nucleic acid sequences carry the genetic information, while proteins perform essential tasks in the human body, such as catalyzing biochemical reactions, transporting nutrients, controlling growth and differentiation, and recognizing and transmitting biological signals. Proteins differ in length, amino acid arrangement, and spatial structure, and experimental analysis shows that each protein folds into a specific structure.
Research on protein structure is of great significance, and analyzing the relationship between protein structure and function is an important part of proteomics. Studying the structure of a protein helps to explain its role, how it performs its biological functions, and how it interacts with other proteins or other molecules, which matters for biology as well as for medicine and pharmacy. What, then, determines the spatial structure of a protein? When the spatial structure of a protein is disrupted, that is, when the protein unfolds, it can recover its native folded structure. A large body of experimental results shows that the structure of a protein is determined by its sequence. Although the solution environment of the protein molecule also affects its spatial structure, the information that determines the structure is encoded in the amino acid sequence. Can this code be deciphered, so that the spatial structure of a protein is predicted directly from its amino acid sequence? Although the structure is generally thought to be determined by the amino acid sequence, current knowledge is not sufficient to predict the tertiary structure of a protein accurately.
The existing experimental methods for determining protein structure mainly include X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy, but these methods are expensive and time-consuming. Driven by both theoretical need and practical application, methods that predict protein structure from the amino acid sequence by computational optimization have been adopted. Among them, de novo prediction methods suffer from poor prediction accuracy and insufficient sampling capability.
Therefore, the invention provides a population-based protein structure prediction method based on secondary structure fragment assembly, which addresses the insufficient prediction accuracy and sampling capability of existing protein structure prediction methods.
Disclosure of Invention
In order to overcome the insufficient sampling capability and prediction accuracy of conventional protein structure prediction methods, the invention provides a population-based protein structure prediction method based on secondary structure fragment assembly. The method designs a fragment assembly operation based on secondary structure and, through Loop-based information interaction together with this fragment assembly, samples protein conformations whose energy and structure are closer to the native state, which improves the sampling capability. It also changes the structure of the Loop regions by applying small perturbations to the Loop regions of each conformation, which mitigates the low prediction accuracy caused by inaccurate energy functions.
The technical solution adopted by the invention to solve this technical problem is as follows:
a population-based protein structure prediction method based on secondary structure fragment assembly, the method comprising the following steps:
1) setting parameters, as follows:
reading the sequence information of the target protein and the fragment library information, and setting the population of protein conformations P = {p_1, p_2, ..., p_i, ..., p_n}, where n is the population size and p_i denotes the i-th individual of the population; the iteration counter is G, the maximum number of iterations is G_max, the information interaction probability is R, and the sequence length is L;
2) population initialization, as follows:
starting from the initial linear chain of the protein conformation, replicating this chain to obtain n initial individuals, and performing fragment assembly on each individual p_i of the population with fragments of length 9 until the residues at all positions have been replaced at least once;
3) population information interaction, as follows:
for each individual p_i in the population, deciding according to the information interaction probability R whether the individual takes part in population interaction; if it does, randomly selecting another individual p_j from the population, where i ≠ j, randomly choosing one Loop region of the conformation p_i, and exchanging its dihedral angle information with that of the corresponding region of the conformation p_j, which yields two new individuals p_i' and p_j'; if it does not, proceeding to the next individual, until step 3) has been carried out for all individuals in the population;
4) population fragment assembly based on secondary structure, as follows:
for each individual p_i', i ∈ [1, n], performing fragment assembly with fragments of length 9 and checking the result after each assembly: if the assembled region includes residues of a Loop region, replacing the information of those Loop residues with their information in the conformation before the assembly, which yields the individual p_i''; in this way all individuals in the population undergo fragment assembly based on secondary structure;
5) Loop region perturbation, as follows:
for each individual p_i'', i ∈ [1, n], perturbing the Loop regions by finely adjusting the dihedral angles of every Loop residue of the conformation within a range of ±2, which yields the individual p_i'''; after the perturbation, evaluating the individuals before and after the perturbation with the energy function to obtain E_i and E_i' respectively; if E_i < E_i', jumping back to step 4) to repeat the fragment assembly; if E_i > E_i', ending the variation operation and taking p_i''' as the new individual;
6) population selection using the energy function, as follows:
first merging the initial population and the perturbed population into a new population of size 2n, then calculating the energy of each individual of the new population with the energy function, sorting the merged population by energy, selecting the n individuals with the lowest energy as the selected population, and finally setting G = G + 1;
7) judging whether the maximum number of iterations G_max has been reached; if so, stopping the iteration and outputting the individuals of the last generation; otherwise, returning to step 3).
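The Loop-based information interaction of step 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: a conformation is assumed to be a list of (phi, psi) tuples, Loop regions are assumed to be given as (start, end) index ranges with an exclusive end, and the names exchange_loop and interact are hypothetical.

```python
import random


def exchange_loop(p_i, p_j, loop_regions, rng):
    """Swap the dihedral angles of one randomly chosen Loop region.

    p_i, p_j: lists of (phi, psi) tuples, one per residue (assumed representation).
    loop_regions: list of (start, end) residue index ranges, end exclusive (assumed).
    Returns two new conformations; the inputs are left unchanged.
    """
    start, end = rng.choice(loop_regions)
    new_i, new_j = list(p_i), list(p_j)
    # The right-hand side is built from the untouched inputs before assignment.
    new_i[start:end], new_j[start:end] = p_j[start:end], p_i[start:end]
    return new_i, new_j


def interact(population, loop_regions, R, rng):
    """Visit every individual once and exchange a Loop region with probability R."""
    population = [list(p) for p in population]  # work on copies
    n = len(population)
    for i in range(n):
        if rng.random() < R:
            # Pick a partner j with j != i.
            j = rng.choice([k for k in range(n) if k != i])
            population[i], population[j] = exchange_loop(
                population[i], population[j], loop_regions, rng
            )
    return population


if __name__ == "__main__":
    rng = random.Random(0)
    a = [(-60.0, -45.0)] * 6
    b = [(-120.0, 130.0)] * 6
    new_a, new_b = exchange_loop(a, b, [(2, 4)], rng)
    print(new_a)
    print(new_b)
```

Only the residues inside the chosen Loop range change; helix and strand residues keep their angles, which is what confines the exchange to Loop information.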
The technical concept of the invention is as follows: within the framework of a population-based algorithm, the invention provides a population-based protein structure prediction method based on secondary structure fragment assembly. First, the information interaction probability of the population algorithm controls the convergence speed of the population. Then, the fragment assembly operation based on secondary structure increases conformational diversity, so that conformations closer to the native state can be obtained. Finally, the Loop regions are slightly perturbed, and the selection step optimizes the population by energy, eliminating individuals with higher energy and keeping the better individuals for the next iteration.
The beneficial effects of the invention are as follows: on the one hand, a population algorithm is used in which individuals exchange information, which broadens the search of the conformational space; on the other hand, the fragment assembly operation based on secondary structure and the slight perturbation of residues in the Loop regions increase conformational diversity and improve prediction accuracy.
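The energy-based selection described here (merge the two populations, sort by energy, keep the n lowest) might look like the sketch below. The function name select and the toy_energy function are hypothetical stand-ins; the actual energy function of the method is not reproduced here.

```python
def select(parents, offspring, energy):
    """Merge two populations and keep the len(parents) lowest-energy individuals.

    parents, offspring: lists of conformations of equal size n.
    energy: callable mapping a conformation to a float (lower is better).
    """
    n = len(parents)
    merged = list(parents) + list(offspring)  # merged population of size 2n
    ranked = sorted(merged, key=energy)       # ascending: lowest energy first
    return ranked[:n]


def toy_energy(conformation):
    """Placeholder energy for illustration only: sum of squared angles."""
    return sum(phi * phi + psi * psi for phi, psi in conformation)


if __name__ == "__main__":
    parents = [[(3.0, 0.0)], [(1.0, 0.0)]]
    offspring = [[(2.0, 0.0)], [(0.0, 0.0)]]
    survivors = select(parents, offspring, toy_energy)
    print(survivors)
```

Because parents and offspring compete in one merged pool, a low-energy parent is never lost to a worse offspring.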
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 and fig. 2, a population-based protein structure prediction method based on secondary structure fragment assembly comprises the following steps:
1) setting parameters, as follows:
reading the sequence information of the target protein and the fragment library information, and setting the population of protein conformations P = {p_1, p_2, ..., p_i, ..., p_n}, where n is the population size and p_i denotes the i-th individual of the population; the iteration counter is G, the maximum number of iterations is G_max, the information interaction probability is R, and the sequence length is L;
2) population initialization, as follows:
starting from the initial linear chain of the protein conformation, replicating this chain to obtain n initial individuals, and performing fragment assembly on each individual p_i of the population with fragments of length 9; once the residues at all positions of the conformation have been replaced at least once, the initialization operation ends, and all individuals in the population are initialized in this way;
3) population information interaction, as follows:
for each individual p_i in the population, deciding according to the given information interaction probability R whether the individual takes part in population interaction; if it does, randomly selecting another individual p_j from the population, where i ≠ j, randomly choosing one Loop region of the conformation p_i, and exchanging its dihedral angle information with that of the corresponding region of the conformation p_j, which yields two new individuals p_i' and p_j'; if it does not, proceeding to the next individual, until step 3) has been carried out for all individuals in the population;
4) population fragment assembly based on secondary structure, as follows:
for each individual p_i', i ∈ [1, n], performing fragment assembly with fragments of length 9 and checking the result after each assembly: if the assembled region includes residues of a Loop region, replacing the information of those Loop residues with their information in the conformation before the assembly, that is, retaining the structural information of the Loop region, which yields the individual p_i''; in this way all individuals in the population undergo fragment assembly based on secondary structure;
5) Loop region perturbation, as follows:
for each individual p_i'', i ∈ [1, n], perturbing the Loop regions by finely adjusting the dihedral angles of every Loop residue of the conformation within a range of ±2, which yields the individual p_i'''; after the perturbation, evaluating the individuals before and after the perturbation with the energy function to obtain E_i and E_i' respectively; if E_i < E_i', jumping back to step 4) to repeat the fragment assembly; if E_i > E_i', ending the variation operation and taking p_i''' as the new individual;
6) population selection using the energy function, as follows:
first merging the initial population and the perturbed population into a new population of size 2n, then calculating the energy of each individual of the new population with the energy function, sorting the merged population by energy, selecting the n individuals with the lowest energy as the selected population, and finally setting G = G + 1;
7) judging whether the maximum number of iterations G_max has been reached; if so, stopping the iteration and outputting the individuals of the last generation; otherwise, returning to step 3).
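The Loop-preserving fragment assembly of step 4) can be illustrated with the sketch below. It assumes, purely for illustration, that a conformation is a list of (phi, psi) tuples, that a fragment is a list of 9 such tuples taken from the fragment library, and that Loop residues are given as a set of indices; the name insert_fragment_keep_loops is hypothetical.

```python
FRAGMENT_LENGTH = 9  # fragment length stated in the method


def insert_fragment_keep_loops(conformation, fragment, position, loop_residues):
    """Insert a 9-residue fragment, then restore the angles of Loop residues.

    conformation: list of (phi, psi) tuples, one per residue (assumed representation).
    fragment: list of FRAGMENT_LENGTH (phi, psi) tuples from a fragment library.
    position: index of the first residue replaced by the fragment.
    loop_residues: set of residue indices that belong to Loop regions (assumed input).
    Returns a new conformation; the input is left unchanged.
    """
    if len(fragment) != FRAGMENT_LENGTH:
        raise ValueError("fragment must contain exactly 9 residues")
    if position < 0 or position + FRAGMENT_LENGTH > len(conformation):
        raise ValueError("fragment window falls outside the chain")

    before = list(conformation)  # conformation before the assembly
    after = list(conformation)
    after[position:position + FRAGMENT_LENGTH] = fragment

    # If the assembled window touches a Loop region, put the old Loop
    # angles back so that the Loop structure is retained.
    for k in range(position, position + FRAGMENT_LENGTH):
        if k in loop_residues:
            after[k] = before[k]
    return after


if __name__ == "__main__":
    chain = [(-150.0, 150.0)] * 12
    helix_fragment = [(-60.0, -45.0)] * 9
    result = insert_fragment_keep_loops(chain, helix_fragment, 1, {3, 4})
    for index, angles in enumerate(result):
        print(index, angles)
```

In the usage example, residues 1 to 9 take the fragment angles except residues 3 and 4, which keep their previous values because they are marked as Loop residues.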
This example takes the α-folded protein 1GYZ, which has a sequence length of 60, as the target; a population-based protein structure prediction method based on secondary structure fragment assembly comprises the following steps:
1) setting parameters, as follows:
reading the sequence information of the target protein and the fragment library information, and setting the population of protein conformations P = {p_1, p_2, ..., p_i, ..., p_n}, where n = 100 is the population size and p_i denotes the i-th individual of the population; the iteration counter is G, the maximum number of iterations is G_max, the information interaction probability is R = 0.1, and the sequence length is L = 60;
2) population initialization, as follows:
starting from the initial linear chain of the protein conformation, replicating this chain to obtain 100 initial individuals, and performing fragment assembly on each individual p_i of the population with fragments of length 9; once the residues at all positions of the conformation have been replaced at least once, the initialization operation ends, and all individuals in the population are initialized in this way;
3) population information interaction, as follows:
for each individual p_i in the population, deciding according to the given information interaction probability R whether the individual takes part in population interaction; if it does, randomly selecting another individual p_j from the population, where i ≠ j, randomly choosing one Loop region of the conformation p_i, and exchanging its dihedral angle information with that of the corresponding region of the conformation p_j, which yields two new individuals p_i' and p_j'; if it does not, proceeding to the next individual, until step 3) has been carried out for all individuals in the population;
4) population fragment assembly based on secondary structure, as follows:
for each individual p_i', i ∈ [1, n], performing fragment assembly with fragments of length 9 and checking the result after each assembly: if the assembled region includes residues of a Loop region, replacing the information of those Loop residues with their information in the conformation before the assembly, that is, retaining the structural information of the Loop region, which yields the individual p_i''; in this way all individuals in the population undergo fragment assembly based on secondary structure;
5) Loop region perturbation, as follows:
for each individual p_i'', i ∈ [1, n], perturbing the Loop regions by finely adjusting the dihedral angles of every Loop residue of the conformation within a range of ±2, which yields the individual p_i'''; after the perturbation, evaluating the individuals before and after the perturbation with the energy function "score3" to obtain E_i and E_i' respectively; if E_i < E_i', jumping back to step 4) to repeat the fragment assembly; if E_i > E_i', ending the variation operation and taking p_i''' as the new individual;
6) population selection using the energy function, as follows:
first merging the initial population and the perturbed population into a new population of size 2n, then calculating the energy of each individual of the new population with the energy function "score3", sorting the merged population by energy, selecting the n individuals with the lowest energy as the selected population, and finally setting G = G + 1;
7) judging whether the maximum number of iterations G_max has been reached; if so, stopping the iteration and outputting the individuals of the last generation; otherwise, returning to step 3).
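The Loop perturbation and the energy test of step 5) can be sketched as follows. Several points are assumptions made only for illustration: the conformation is a list of (phi, psi) tuples, the ±2 bound is applied in the same unit as the stored angles, the perturbation is drawn uniformly, and equal energies (a case the text does not cover) are treated as a rejection. The names perturb_loops and accept_perturbation are hypothetical, and the energy values passed to the test stand in for the score3 energies.

```python
import random


def perturb_loops(conformation, loop_residues, max_delta, rng):
    """Shift phi and psi of every Loop residue by a small random amount.

    conformation: list of (phi, psi) tuples (assumed representation).
    loop_residues: set of residue indices in Loop regions (assumed input).
    max_delta: half-width of the perturbation range, e.g. 2.0 for +/-2.
    Returns a new conformation; the input is left unchanged.
    """
    perturbed = list(conformation)
    for k in sorted(loop_residues):  # sorted for a reproducible draw order
        phi, psi = perturbed[k]
        perturbed[k] = (
            phi + rng.uniform(-max_delta, max_delta),
            psi + rng.uniform(-max_delta, max_delta),
        )
    return perturbed


def accept_perturbation(e_before, e_after):
    """Return True when the perturbation lowered the energy (E_i > E_i').

    False means the caller goes back to fragment assembly (step 4).
    Assumption: a tie is treated as a rejection.
    """
    return e_before > e_after


if __name__ == "__main__":
    rng = random.Random(1)
    chain = [(-60.0, -45.0)] * 5
    moved = perturb_loops(chain, {1, 2}, 2.0, rng)
    print(moved)
    print(accept_perturbation(e_before=-10.0, e_after=-12.5))
```

Keeping the perturbation small means that only the local Loop geometry changes, so the energy comparison reflects the Loop adjustment rather than a large structural rearrangement.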
Taking the α-folded protein 1GYZ with a sequence length of 60 as an example, the above method was used to obtain a near-native conformation of the protein, with a minimum root-mean-square deviation of [value missing in source] and a mean root-mean-square deviation of [value missing in source]. The predicted structure is shown in fig. 2.
The above describes the optimization results obtained by the invention with the protein 1GYZ as an example; it is not intended to limit the scope of the invention, and various modifications and improvements can be made without departing from that scope.