CN109767808B - Novel protein evolution simulation model based on cellular automaton

Info

Publication number: CN109767808B
Application number: CN201811571554.1A
Authority: CN
Inventors: 肖绚; 薛广福
Original assignee: Jingdezhen Ceramic Institute
Current assignee: Jingdezhen Ceramic Institute
Priority date: 2018-12-21
Filing date: 2018-12-21
Publication date: 2020-07-28
Anticipated expiration: 2038-12-21
Also published as: CN109767808A

Abstract

The invention provides a novel protein evolution simulation model based on a cellular automaton, which comprises the following steps: finding all proteins belonging to a protein family £ in the NCBI database by using the keyword £; finding the domain information of the protein family £ in the UniProt database through the ID numbers of the proteins to form a training data set for the evolution simulation model; obtaining the common domain of the protein family £ from the data in the training data set; obtaining a front-to-back evolution prior probability table Φ and a back-to-front evolution prior probability table Э from the probabilities of the other domains of the protein family £ appearing in front of and behind the common domain; and simulating a novel protein sequence, which is given in the form of connected domains.

Description

Novel protein evolution simulation model based on cellular automaton

Technical Field

The invention relates to the technical field of bioinformatics, protein structural domains and protein evolution simulation, in particular to a novel protein evolution simulation model based on a cellular automaton.

Background

Understanding the mechanisms of protein evolution is central to many areas such as molecular evolution, comparative genomics, and structural biology. Determining the rate of protein evolution is crucial for quantifying selection and genetic drift and for calculating selective forces from genomic data. Protein evolution analysis also provides a unique tool for studying problems such as the evolution of biological form and aging, and it facilitates the identification of important functional sites (e.g., for protein design), polypeptides, drug targets, and protein interaction networks associated with human genetic diseases.

At present, research on the mechanism of protein evolution is largely limited to simple biological experiments and statistical analysis, which is time-consuming and laborious and mostly remains at the hypothesis stage. As experimental data accumulate and information technology develops, new intelligent algorithms and system modeling methods are becoming powerful tools for solving such problems.

The assumption that all sites in a protein sequence evolve independently of one another is, in most cases, inconsistent with the facts, because any amino acid residue inside a protein interacts with its adjacent amino acids. Yang designed an ingenious and easily computable method that allows different sites in an amino acid sequence to have different evolutionary rates. The method classifies amino acids according to their physicochemical properties, so that substitutions between amino acids with similar properties occur more easily, and it has been shown to conform better to the actual situation. Lorraine found, by studying structural information on hemoglobin from a number of species, that amino acid residues contributing little to the conformational stability of a protein are more prone to mutation, i.e., the structural stability of a protein exerts a selection pressure on mutations. On this basis, protein evolution can also be analyzed from the structural information of proteins and from physical principles.

A cellular automaton (CA) is a dynamical system that is discrete in both time and space. CAs have been applied in many fields, for example to simulate physical and biological phenomena such as sandpile rules, ant rules, and wetting phenomena. The high complexity of living things can be reproduced through a simple evolution process in a cellular automaton, and realistic biological forms can be generated through a simple genetic evolution process, so cellular automata can be used to build new models for bioinformatics.

Sirakoulis designed a DNA sequence evolution model based on a cellular automaton. In this one-dimensional cellular automaton model, each cell has four states representing the four bases A, C, T and G, which are encoded by numbers: A → 0, C → 1, T → 2, G → 3. The neighborhood of a cell is defined as its nearest neighbors, i.e., the evolution of a base is determined by its nearest left and right bases and by the base itself. The evolution rule is to sum the numbers represented by the left neighbor, the right neighbor and the base itself and to take the result modulo 4; the resulting number represents the base after evolution. The model can simulate the evolution of DNA sequences, but its defect is obvious: the evolution rule has no biological meaning, so it is unknown whether the evolved sequences are related to real sequences.
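The nearest-neighbor modulo-4 rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as summarized here, not Sirakoulis's published implementation; the function name `evolve_dna` and the periodic (ring) treatment of the two end cells are assumptions, since the text does not say how the boundaries are handled.

```python
# Encoding stated in the text: A -> 0, C -> 1, T -> 2, G -> 3.
BASE_TO_NUM = {"A": 0, "C": 1, "T": 2, "G": 3}
NUM_TO_BASE = {num: base for base, num in BASE_TO_NUM.items()}


def evolve_dna(seq, steps=1):
    """Apply the (left + self + right) mod 4 rule `steps` times.

    Assumption: periodic boundaries, i.e. the sequence is treated as a ring,
    because the text does not specify what the end cells do.
    """
    cells = [BASE_TO_NUM[base] for base in seq]
    n = len(cells)
    for _ in range(steps):
        # All cells are updated synchronously from the previous generation.
        cells = [
            (cells[i - 1] + cells[i] + cells[(i + 1) % n]) % 4
            for i in range(n)
        ]
    return "".join(NUM_TO_BASE[num] for num in cells)
```

For example, `evolve_dna("ACTG")` updates every base from its two neighbors and itself. The rule is purely arithmetic, which is the lack of biological meaning that the paragraph above criticizes.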

Protein domains are regions of a protein with a specific spatial structure and independent function, and they are the key functional units through which proteins exert their biological effects. Each domain typically consists of 100-300 amino acid residues, has its own spatial structure, and carries a distinct biological function. The several domains in one protein molecule may be identical or different, and domains in different protein molecules may be similar. Existing species evolved from a limited number of ancient species, and existing proteins likewise evolved from simple proteins. During evolution, proteins with novel functions or specificities are produced by insertion or deletion of specific domains, mutation, duplication, fusion with other domains, and so on. Existing protein evolution models are all based on mathematical statistics of the amino acids in protein sequences; the models are highly complex, and simulating protein evolution with them is difficult. Because domains play an important role in protein function, it is necessary for protein evolution research to design a novel protein evolution simulation model based on a cellular automaton whose evolution rule is domain evolution.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a novel protein evolution simulation model based on a cellular automaton which, by evolving protein domains, addresses the problem that the proteins simulated by conventional protein evolution simulation models differ greatly from real protein evolution.

In order to solve the technical problems, the technical scheme of the invention is as follows: a novel protein evolution simulation model based on cellular automata is characterized by comprising the following steps:

(1) using the keyword £ to find all proteins belonging to the protein family £ in the NCBI database, and finding the domain information of the protein family £ in the UniProt database according to the ID numbers of the proteins to form a training data set for the evolution simulation model;

(2) obtaining, from the data in the training data set, the common domain of the protein family £ and the other domains contained in the members of the family £, and calculating a front-to-back evolution prior probability table Φ and a back-to-front evolution prior probability table Э according to the probabilities of the other domains of the protein family £ appearing in front of and behind the common domain;

(3) the cellular automaton model adopts a one-dimensional cellular automaton in which each cell represents a domain; supposing that the protein family £ has ξ domains, each cell has ξ + 2 states, namely the ξ domains, an evolution termination symbol X and an empty domain;

(4) assuming that the common domain of the protein family £ is Г + Д, the domain architecture of the ancestral protein of the protein family is Г + Д, and the other proteins in the family evolved from this ancestor; when the cellular automaton is initialized, the states of the two cells in the middle of the cell space are defined as Г and Д respectively, and the states of all other cells are the empty domain; during the evolution of the cellular automaton, the states of two cells are changed at each step; at the first evolution step, the state of the cell after the cell in state Д and the state of the cell before the cell in state Г are obtained by the roulette method according to the front-to-back evolution prior probability table Φ and the back-to-front evolution prior probability table Э respectively, and the sequence after the first evolution step consists of the newly obtained front state, Г, Д and the newly obtained back state joined in that order; at each subsequent evolution step, the states of the next cell at the back and of the next cell at the front are obtained from Φ and Э in the same way;

(5) each run of the cellular automaton model simulates one new protein sequence, which is given in the form of connected domains.

If protein P is the member of the protein family £ in step (3) with the most domains, and its number of domains is M, the number of cells in the model must be greater than 2 × M.
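As an illustration of step (2), the following Python sketch builds the two prior probability tables from a list of domain architectures. It is a minimal sketch under stated assumptions, not the patented procedure: the text does not give the exact layout of Φ and Э, so the tables are assumed here to be first-order transition tables (the probability of the next domain given the current one), with the end of a protein counted as the termination symbol X. The names `build_prior_tables`, `phi`, `eh` and `TOY_ARCHITECTURES` are hypothetical, and the toy architectures are invented for the example rather than taken from UniProt.

```python
from collections import Counter, defaultdict

X = "X"  # evolution termination symbol from step (3)


def _normalize(counts):
    """Turn nested counts into rows of probabilities that each sum to 1."""
    return {
        cur: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for cur, row in counts.items()
    }


def build_prior_tables(architectures, common=("DH", "PH")):
    """Return (phi, eh), the front-to-back and back-to-front tables.

    Assumption: phi[d][e] is the relative frequency with which domain e
    directly follows domain d from the common pair onward, and eh[d][e]
    the frequency with which e directly precedes d up to the common pair.
    """
    fwd, bwd = defaultdict(Counter), defaultdict(Counter)
    for arch in architectures:
        arch = list(arch)
        # Position of the first occurrence of the common domain pair.
        pos = next(
            (i for i in range(len(arch) - 1)
             if tuple(arch[i:i + 2]) == tuple(common)),
            None,
        )
        if pos is None:
            continue  # members without the common pair are skipped here
        tail = arch[pos + 1:] + [X]      # second common domain ... end
        for cur, nxt in zip(tail, tail[1:]):
            fwd[cur][nxt] += 1
        head = [X] + arch[:pos + 1]      # start ... first common domain
        for prv, cur in zip(head, head[1:]):
            bwd[cur][prv] += 1
    return _normalize(fwd), _normalize(bwd)


# Hypothetical toy training set (domain names only, invented data).
TOY_ARCHITECTURES = [
    ["DH", "PH"],
    ["CH", "DH", "PH"],
    ["DH", "PH", "C2"],
    ["DH", "PH", "C2"],
]
phi, eh = build_prior_tables(TOY_ARCHITECTURES)
```

With this toy data, half of the proteins end after PH and half continue with C2, so the row `phi["PH"]` assigns equal probability to X and C2. The cell-count condition above (more than 2 × M cells) ensures that a protein with M domains fits on either side of the middle cells.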

Because the above method is built on existing protein family data, applying it to protein evolution simulation can reproduce many existing proteins starting from the ancestral protein sequence, which demonstrates the effectiveness of the model; the model can also be used to predict how existing proteins may evolve.

Compared with existing methods that simulate protein evolution by inserting, deleting and copying bases or amino acids, the method provided by the invention uses a simple model, achieves a high degree of simulation fidelity, and has broad application prospects.

Drawings

Fig. 1 is the front-to-back evolution prior probability table Φ of the embodiment;

Fig. 2 is the back-to-front evolution prior probability table Э of the embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the following embodiment. It should be understood that the specific embodiment described herein only illustrates the present invention and is not intended to limit it. The example here simulates the evolution of the human Rho guanine nucleotide exchange factor (Rho GEF) protein family.

The novel protein evolution simulation model based on the cellular automata is adopted, and the specific steps are as follows:

1) The keyword Rho GEF is used to find all proteins belonging to the protein family Rho GEF in the NCBI database, and the keyword Human is then used to find all homologous proteins belonging to human Rho GEF, 1507 in total. The ID numbers of these proteins are used to find their domain information in the UniProt database, forming the training data set of the evolution simulation model.

2) From the data in the training data set, the common domain of the human protein family Rho GEF is obtained as Dbl homology (DH) + Pleckstrin homology (PH). The domains contained in the family members include Breast Cancer C-terminal (BRCT), C2 domain (C2), Calponin-homology (CH), CRAL-TRIO, Dishevelled, Egl-10 and Pleckstrin (DEP), DH, EF-hand (EF), Eps15 homology (EH), FERM, FYVE-type zinc finger (FYVE-type) and Fibronectin type-III. The front-to-back evolution prior probability table Φ (Fig. 1) and the back-to-front evolution prior probability table Э (Fig. 2) are calculated according to the probabilities of the other domains appearing in front of and behind the common domain DH + PH.

3) The cellular automaton model adopts a one-dimensional cellular automaton in which each cell represents a domain. The human protein family Rho GEF has 29 domains, so each cell has 31 states: the 29 domains, an evolution termination symbol X and an empty domain. The protein Q5VST9 has the most domains in the human Rho GEF protein family, 65 in total, so the number of cells in the cellular automaton model is set to 150.

4) When the cellular automaton is initialized, the 75th and 76th cells in the middle of the cell space are in the states DH and PH, and all other cells are in the empty-domain state. During evolution, cell states are obtained by the roulette method according to the front-to-back evolution prior probability table Φ and the back-to-front evolution prior probability table Э. When t = 1, the state of the cell after the 76th cell and the state of the cell before the 75th cell are obtained: the state selected for the cell preceding DH is Sp, so the state of the 74th cell is Sp, and the state of the 77th cell, which follows PH, is selected in the same way. Evolution continues in this manner at the following time steps until it terminates, and the resulting chain of connected domains is the protein simulated by the model.
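The evolution loop of this embodiment can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the patented implementation. The function names `roulette` and `simulate_protein` are hypothetical, and the tables are assumed to map a cell state to a dictionary of next-state probabilities. Growth on each side is assumed to stop independently once the termination symbol X is drawn or the edge of the cell space is reached, and a state missing from a table is treated as immediate termination. The small tables in the usage note are invented and are not the values of Fig. 1 and Fig. 2.

```python
import random

EMPTY = ""  # empty-domain state
X = "X"     # evolution termination symbol


def roulette(dist, rng):
    """Roulette-wheel selection over a non-empty {state: probability} dict."""
    r = rng.random() * sum(dist.values())
    acc = 0.0
    for state, p in dist.items():
        acc += p
        if r < acc:
            return state
    return state  # guard against floating-point round-off


def simulate_protein(phi, eh, n_cells=150, common=("DH", "PH"), seed=None):
    """Grow a domain chain outward from the common pair in the middle cells."""
    rng = random.Random(seed)
    cells = [EMPTY] * n_cells
    front = n_cells // 2 - 1  # 0-based index of the 75th cell when n_cells = 150
    back = front + 1          # 0-based index of the 76th cell
    cells[front], cells[back] = common
    while True:
        grew = False
        # Front-to-back growth: draw the state of the cell after the last one.
        if cells[back] != X and back + 1 < n_cells:
            back += 1
            cells[back] = roulette(phi.get(cells[back - 1], {X: 1.0}), rng)
            grew = True
        # Back-to-front growth: draw the state of the cell before the first one.
        if cells[front] != X and front > 0:
            front -= 1
            cells[front] = roulette(eh.get(cells[front + 1], {X: 1.0}), rng)
            grew = True
        if not grew:
            break
    # Report the protein as connected domains, without X or empty cells.
    return "+".join(c for c in cells if c not in (EMPTY, X))
```

For instance, with the invented tables `phi = {"PH": {"C2": 0.5, "X": 0.5}, "C2": {"X": 1.0}}` and `eh = {"DH": {"CH": 0.25, "X": 0.75}, "CH": {"X": 1.0}}`, repeated calls to `simulate_protein(phi, eh)` return architectures such as DH+PH or CH+DH+PH+C2. Each call corresponds to one run of the model and yields one simulated protein.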

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A novel protein evolution simulation model based on cellular automata is characterized by comprising the following steps:

2. The cellular automaton-based protein evolution simulation model according to claim 1, wherein if the protein P in the protein family £ in (3) has the most domains, and the number of the domains is M, the number of cells in the model must be greater than 2 × M.