CN115083513A - Method for constructing protein complex structure based on medium-resolution cryoelectron microscope image - Google Patents

Publication number
CN115083513A
CN115083513A CN202210709043.1A CN202210709043A CN115083513A CN 115083513 A CN115083513 A CN 115083513A CN 202210709043 A CN202210709043 A CN 202210709043A CN 115083513 A CN115083513 A CN 115083513A
Authority
CN
China
Prior art keywords
main chain
backbone
conformation
chain
conformations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210709043.1A
Other languages
Chinese (zh)
Other versions
CN115083513B (en)
Inventor
黄胜友
何佳铧
林培聪
陈吉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN202210709043.1A
Publication of CN115083513A
Application granted
Publication of CN115083513B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Abstract

The invention discloses a method for constructing a protein complex structure from a medium-resolution cryo-electron microscopy (cryo-EM) density map. The method uses deep-learning guidance to build a multi-chain protein complex structure automatically from the medium-resolution map. By iteratively fitting, optimizing and assembling single-chain protein structures predicted from sequence, high-quality protein complex structures can be constructed without human intervention. Unlike the prior art, which fits protein chains to the cryo-EM density map, the invention fits protein chains to a main-chain probability map, in which the value at each grid point represents the probability of finding a main-chain atom near that grid point. The main-chain probability map is predicted from the input density map by a UNet++-based deep learning network model. Compared with the density map, the main-chain probability map contains more accurate information on main-chain atom positions, so fitting accuracy can be improved.

Description

Method for constructing protein complex structure based on medium-resolution cryoelectron microscope image
Technical Field
The invention belongs to the field of protein complex structure construction, and particularly relates to a method for constructing a protein complex structure based on a medium-resolution cryoelectron microscope image.
Background
Advances in cryo-electron microscopy (cryo-EM) instrumentation, data collection and image reconstruction have led to a rapidly growing number of reconstructed cryo-EM density maps. However, the ultimate goal of cryo-EM is not to reconstruct density maps but to determine atomic structures. For high-resolution maps [resolution range: formula image BDA0003706422660000011 in the original], software originally designed for X-ray crystallography can be used to build high-quality atomic structures. For density maps with a resolution of around [formula image BDA0003706422660000012 in the original], de novo modeling can achieve satisfactory results. For intermediate-resolution maps [resolution range: formula image BDA0003706422660000013 in the original], however, building an accurate structural model remains a challenging and labor-intensive process, which is reflected in the large gap between the number of reconstructed intermediate-resolution density maps and the number of modeled three-dimensional structures. In the public EMDB database, more than 40% of medium-resolution cryo-EM maps have no modeled three-dimensional structure.
Building a protein model de novo from a medium-resolution map is difficult fundamentally because of the large uncertainty in intermediate-resolution cryo-EM maps. This uncertainty is usually compensated for by prior knowledge, which in most cases takes the form of an initial template structure. Starting from a given template, an atomic model corresponding to the cryo-EM map can be built by fitting the template to the map and then optimizing the structure. The template can be taken from an existing high-resolution structure or predicted by a structure prediction method such as homology modeling or deep learning. Rigid fitting and flexible fitting are the common techniques for fitting template structures into medium-resolution cryo-EM maps. Rigid fitting searches the possible relative orientations between the structure and the density map while keeping the template structure rigid. The agreement between the fitted structure and the map is measured by a scoring function such as cross-correlation or mutual information. To date, various rigid fitting tools have been developed, including EMfit, UCSF Chimera, gmfit, MultiFit, Situs, PowerFit, TEMPy, MOFIT, VESPER and Phenix. If the conformation of the initial template deviates appreciably from the true structure, flexible fitting is often needed to refine the rigidly fitted structure so that it better matches the density map.
Existing fitting algorithms are still limited in their ability to build structures from medium-resolution cryo-EM maps. First, experimentally determined density maps typically contain heterogeneous density signals and random noise, so scoring functions that measure how well a structure fits the density map may mislead the search and ranking steps of fitting. Existing methods have to sacrifice scoring accuracy to ensure robustness. Second, most fitting algorithms still require manual intervention, which makes fitting labor-intensive and unfriendly to non-expert users, and the fitting results depend strongly on the specific parameters chosen. Third, many methods are designed for single-chain fitting and therefore require the density map to be segmented before fitting; complex structures can then only be built by manually combining the fitting results of multiple single chains. For some low-quality density maps, accurate segmentation is almost impossible, and in that case building a reasonable complex model is extremely difficult. Finally, proteins are flexible molecules, so flexible fitting and optimization are often required during modeling, which is laborious. Optimization methods based on molecular dynamics have a complicated setup, long computation times and parameter-dependent results, and errors can occur at any stage. Lightweight optimization methods are therefore more common, but they cannot handle large conformational changes.
Disclosure of Invention
Aiming at the defects or improvement requirements of the prior art, the invention provides a method for constructing a protein complex structure based on a medium-resolution cryoelectron microscope image, so that the problems that the existing cryoelectron microscope fitting technology is seriously influenced by noise, needs manual intervention, can only perform single-chain protein fitting, is difficult to realize flexible fitting and the like are solved.
To achieve the above object, according to a first aspect of the present invention, there is provided a method for constructing a protein complex structure based on medium resolution cryoelectron microscopy, comprising:
S1, inputting the cryo-EM density map into a pre-trained prediction model to obtain the corresponding main-chain probability map;
S2, predicting a corresponding atomic model from each of the L single-chain sequences;
S3, rigidly fitting the atomic model of each single chain to the main-chain probability map to obtain a plurality of rigidly fitted conformations, and retaining the M rigidly fitted conformations with the highest main-chain matching scores;
S4, splitting each single-chain atomic model into n short domains, and constructing an undirected graph V1 according to how the n short domains are connected; taking each short domain in turn as the seed domain, and performing simplex-based domain optimization on the M retained rigidly fitted conformations in breadth-first search order on V1, to obtain M × n domain-optimized conformations for each single chain;
S5, for each single chain, selecting the K conformations with the highest main-chain matching scores from its M rigidly fitted conformations and M × n domain-optimized conformations, giving L × K conformations in total, and constructing an undirected graph V2 whose vertices carry the main-chain matching score of each conformation and whose edges carry the atomic collision score between conformations of two different single chains; removing from the undirected graph the edges whose atomic collision score exceeds a threshold, and determining, based on the Bron-Kerbosch search algorithm, the combination of single chains and conformations with the highest total main-chain matching score as the constructed protein complex structure.
According to a second aspect of the present invention, there is provided a system for constructing a protein complex structure based on medium resolution cryo-electron microscopy, comprising: a computer-readable storage medium and a processor;
the computer readable storage medium is used for storing executable instructions;
the processor is configured to read executable instructions stored in the computer-readable storage medium and execute the method according to the first aspect.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
1. The invention provides a method for constructing a protein complex structure from a medium-resolution cryo-EM density map, which uses deep-learning guidance to build a multi-chain protein complex structure automatically from the map. By iteratively fitting, optimizing and assembling single-chain protein structures predicted from sequence, high-quality protein complex structures can be constructed without human intervention.
2. In the method provided by the invention, the protein chain is fitted to a main-chain probability map rather than directly to the original density map. The value at each grid point of the main-chain probability map represents the probability of finding a main-chain atom near that grid point. The main-chain probability map is predicted from the input density map by a UNet++-based deep learning network model. Compared with the density map, the main-chain probability map contains more accurate information on main-chain atom positions, which greatly improves fitting accuracy.
3. On top of rigid fitting, the method provided by the invention includes a semi-flexible protein domain optimization strategy that rapidly optimizes the orientation of protein domains. A search method based on the maximum clique problem then finds the highest-scoring combination among the different protein chain combinations, giving the final complex model. Evaluation on the test set shows that the complex models built by the method are of high quality on every metric, whether the PDB structure or the cryo-EM density map is used as the reference.
Drawings
FIG. 1 is a schematic diagram of a process for constructing a protein complex structure based on a medium-resolution cryoelectron micrograph according to an embodiment of the present invention;
FIG. 2 (a) is a schematic diagram of a protein domain optimization process provided by an embodiment of the present invention, and FIG. 2 (b) is a schematic diagram of the iterative modeling provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The embodiment of the invention provides a method for constructing a protein complex structure based on a medium-resolution cryoelectron microscope image, which comprises the following steps:
S1, inputting the cryo-EM density map into a pre-trained prediction model to obtain the corresponding main-chain probability map.
Preferably, the prediction model is trained with cryo-EM density maps as training samples and, as labels, the main-chain probability maps of the PDB structures corresponding to those density maps.
Specifically, a deep-learning-based main-chain probability prediction module is trained. The deep learning module adopts the UNet++ architecture. The training set consists of a number of experimental cryo-EM maps and the main-chain probability maps derived from the corresponding PDB structures. Given an input cryo-EM density map, the UNet++ network predicts its main-chain probability map.
Preferably, as shown in FIG. 1, the UNet++ network uses 3 encoder sub-blocks and 3 decoder sub-blocks with dense skip connections, and uses 3 × 3 × 3 three-dimensional convolutional layers. A three-dimensional max-pooling layer with stride 2 is used for down-sampling, and a three-dimensional linear (trilinear) interpolation layer with a scale factor of 2 is used for up-sampling. The input to the network is a cryo-EM density box of size 40 × 40 × 40 with a grid spacing of [formula image BDA0003706422660000051 in the original]. Density boxes of this size are cut from the cryo-EM map with a sliding window. Before this step, cryo-EM maps with other grid spacings are resampled by cubic interpolation to a grid spacing of [formula image BDA0003706422660000052 in the original]. The output of the network is a main-chain probability box of the same size. These main-chain probability boxes are then recombined into the final main-chain probability map by averaging the overlapping regions.
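The cropping and recombination steps described above can be sketched along a single axis as follows. This is only an illustration under stated assumptions: the stride of 20 grid points and the function names `window_starts` and `merge_overlapping` are hypothetical (the text gives a box edge of 40 but no stride), and a real implementation would apply the same logic along all three axes of the map.

```python
def window_starts(length, box=40, stride=20):
    """Start indices of sliding windows that cover [0, length) on one axis.

    The stride is an assumed value; the source does not state it.
    """
    if length <= box:
        return [0]
    starts = list(range(0, length - box + 1, stride))
    if starts[-1] != length - box:
        # Add a final window so the tail of the axis is also covered.
        starts.append(length - box)
    return starts


def merge_overlapping(length, starts, boxes):
    """Recombine predicted boxes by averaging wherever they overlap."""
    total = [0.0] * length
    count = [0] * length
    for start, box_values in zip(starts, boxes):
        for offset, value in enumerate(box_values):
            if start + offset < length:
                total[start + offset] += value
                count[start + offset] += 1
    # Positions covered by no box stay at 0.0.
    return [t / c if c else 0.0 for t, c in zip(total, count)]
```

Averaging the overlaps avoids visible seams at box boundaries, at the cost of predicting some voxels more than once.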
It should be noted that, in both training and practical use, the input cryo-EM density map is the primary map, which is usually a sharpened map. A sharpened map is therefore preferred as input for main-chain probability map prediction. Compared with an unsharpened half map, simple RELION sharpening with an automatically determined B-factor can significantly improve the prediction quality of the main-chain probability map.
To train the network model, 209 pairs of experimental cryo-EM maps and associated PDB three-dimensional protein structures are preferably used as the training set. For each density map in the training set, its target main-chain probability map is generated from its associated PDB structure. During training, 20% of the maps are randomly selected from the training set as the validation set. The network is implemented with PyTorch 1.8.1 and CUDA 11.1. Each model is trained for at most 300 epochs with a batch size of 160 boxes. The Adam optimizer is used to minimize the prediction loss.
Two loss functions are used to measure the difference between the predicted and target main-chain probability boxes: the smooth L1 loss and the structural similarity (SSIM) loss. The initial learning rate is set to 1e-3, and no L2 regularization is applied. A learning-rate decay strategy is adopted: if the average training loss does not decrease for 4 consecutive epochs, the learning rate is reduced to 1/2 of its current value. Training stops when the learning rate reaches the minimum value of 1e-5. The network model with the lowest validation loss is used as the final main-chain probability map prediction model.
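The learning-rate decay and stopping rule described above can be sketched in plain Python as follows. The function name `run_schedule` is hypothetical, and "not reduced during 4 consecutive rounds" is interpreted here as "no improvement over the best loss seen so far for 4 consecutive epochs", which is an assumption; the 300-epoch cap is left to the caller.

```python
def run_schedule(train_losses, lr=1e-3, min_lr=1e-5, patience=4, factor=0.5):
    """Replay per-epoch average training losses through the decay rule.

    Returns the learning rate used in each epoch that was actually run.
    Training stops once the learning rate has dropped to min_lr or below.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement (assumed meaning)
    history = []
    for loss in train_losses:
        history.append(lr)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale >= patience:
            lr *= factor  # halve the learning rate
            stale = 0
            if lr <= min_lr:
                break
    return history
```

For example, `run_schedule([1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.8])` keeps the rate at 1e-3 for six epochs and uses 5e-4 in the seventh.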
Preferably, the main-chain probability map is generated from its associated PDB structure on a three-dimensional grid with a grid spacing of [formula image BDA0003706422660000063 in the original], according to the following formula:
[formula image BDA0003706422660000061 in the original]
where x is the position vector of a grid point; P(x) is the probability of finding a main-chain atom in the vicinity of grid point x; the set A consists of the position vectors of all main-chain atoms (N, Cα and C); $k = (\pi/(2.4 + 0.8R))^2$; and R is the resolution of the cryo-EM density map.
To train the deep learning model, for a given cryo-EM density map the target main-chain probability map is generated from its associated PDB structure on a three-dimensional grid with a grid spacing of [formula image BDA0003706422660000064 in the original], according to the following formula:
[formula image BDA0003706422660000062 in the original]
where x is the position vector of a grid point; P(x) is the main-chain probability of grid point x, i.e., the probability of finding a main-chain atom in the vicinity of grid point x; and the set A consists of the position vectors of all main-chain atoms (N, Cα and C). The value of k depends on the resolution R of the cryo-EM density map, i.e., $k = (\pi/(2.4 + 0.8R))^2$.
Preferably, the main-chain probability map is given a simplified representation as main-chain points, which can be generated from the main-chain probability map by a mean-shift clustering algorithm.
To improve computational efficiency, the main-chain probability map predicted by the deep learning model is not used directly; a simplified representation, namely main-chain points, is used instead. The main-chain points can be generated from the main-chain probability map by the mean-shift clustering algorithm. Specifically, a set of seed points [formula images BDA0003706422660000071 and BDA0003706422660000072 in the original] is placed at all grid points of the main-chain probability map whose value is greater than 0, and the seed points are moved to local maxima of the main-chain probability map over multiple iterations:
[formula image BDA0003706422660000073 in the original]
where $x_n$ $(n = 1, \dots, N)$ is the position vector of a grid point, [formula image BDA0003706422660000074 in the original] is a Gaussian kernel function, $p(x_n)$ is the main-chain probability at grid point $x_n$, and t is the iteration number. The Gaussian kernel function is [formula images BDA0003706422660000075 and BDA0003706422660000076 in the original], where the value of k depends on the resolution R of the cryo-EM density map, i.e., $k = (\pi/(2.4 + 0.8R))^2$. The main-chain probability of a seed point, [formula image BDA0003706422660000077 in the original], can be calculated on the main-chain probability map with the Gaussian kernel function, i.e.,
[formula image BDA0003706422660000078 in the original]
After the mean-shift process has converged (that is, the seed points no longer move), seed points that lie close together (within a certain distance threshold) are clustered, and the seed point with the highest main-chain probability in each cluster is selected as its representative. The final set of main-chain points Z consists of the representatives of all clusters, $z_i$ $(i = 1, \dots, L) \in Z$.
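The mean-shift procedure described above can be sketched as follows. Because the update formula survives only as an image, the sketch assumes the standard probability-weighted Gaussian mean-shift update, and it assumes that the probability of a converged seed is the kernel-weighted sum over grid points. The names `main_chain_points` and `merge_dist`, the tolerance and the greedy clustering are illustrative choices, not taken from the source. The loop is quadratic in the number of grid points, so this is a toy version for small inputs.

```python
import math


def gaussian_k(resolution):
    """k = (pi / (2.4 + 0.8 R))^2, as stated in the text."""
    return (math.pi / (2.4 + 0.8 * resolution)) ** 2


def _dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def _weights(x, grid, probs, k):
    # Probability-weighted Gaussian kernel values (assumed form).
    return [p * math.exp(-k * _dist2(x, g)) for g, p in zip(grid, probs)]


def main_chain_points(grid, probs, resolution, merge_dist=1.0,
                      tol=1e-3, max_iter=200):
    """Return [(position, score)] representatives after mean shift."""
    k = gaussian_k(resolution)
    converged = []
    for seed, p in zip(grid, probs):
        if p <= 0:
            continue  # seeds are placed only where the probability is > 0
        x = tuple(float(c) for c in seed)
        for _ in range(max_iter):
            w = _weights(x, grid, probs, k)
            total = sum(w)
            new_x = tuple(sum(wi * g[d] for wi, g in zip(w, grid)) / total
                          for d in range(3))
            moved2 = _dist2(new_x, x)
            x = new_x
            if moved2 < tol * tol:  # seed has stopped moving
                break
        converged.append((sum(_weights(x, grid, probs, k)), x))
    # Greedy clustering: visit seeds from highest to lowest score and keep a
    # seed only if it is farther than merge_dist from every kept seed.
    converged.sort(reverse=True)
    reps = []
    for score, x in converged:
        if all(_dist2(x, r) > merge_dist ** 2 for _, r in reps):
            reps.append((score, x))
    return [(x, score) for score, x in reps]
```

Visiting seeds in descending score order means each cluster is automatically represented by its highest-probability member, as the text requires.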
S2, predicting a corresponding atomic model from each of the L single-chain sequences.
Specifically, the atomic model is predicted from each single-chain sequence with a protein structure prediction program. Preferably, AlphaFold2, which performs excellently in protein structure prediction, is used as the prediction program. Domains are assigned to each predicted single-chain model with SWORD.
S3, rigidly fitting the atomic model of each single chain to the main-chain probability map to obtain a plurality of rigidly fitted conformations, and retaining the M rigidly fitted conformations with the highest main-chain matching scores.
Preferably, the atomic model of each single chain is rigidly fitted to the main-chain probability map by an FFT-based global search followed by the simplex method.
In rigid fitting, a fast Fourier transform (FFT) based strategy fits the protein model of each chain to the main-chain probability map by global search, as shown in FIG. 1. The search comprises a translational search and a rotational search that together traverse the entire sampling space. All translations can be searched in a single round of FFT, whereas the rotational search explores a large set of rotation angles. Specifically, an FFT-based translational search is performed for each rotation of the protein structure.
Preferably, the rotational search is performed over a uniformly sampled discrete Euler-angle space with a 15-degree interval, for a total of 4392 rotations. Because the exhaustive global search is discretized to a fixed grid spacing and angular interval, the simplex method is then used to refine each result of the discrete global search continuously and so further improve fitting accuracy.
The objective function in both fitting and optimization is the main-chain matching score.
The agreement between each fitted conformation and the main-chain probability map is measured by its main-chain matching score; the rigid fitting results are ranked by this score, and the top M rigidly fitted conformations are retained.
Preferably, the main-chain matching score is calculated by the formula:
[formula image BDA0003706422660000081 in the original]
where the formula gives the main-chain matching score of conformation Y, [formula image BDA0003706422660000082 in the original]; z is the position vector of a main-chain point; Z is the set of all main-chain points; P(z) is the main-chain probability of a main-chain point; $y_q$ $(q = 1, \dots, Q)$ are the main-chain atoms in conformation Y; $k = (\pi/(2.4 + 0.8R))^2$; R is the resolution of the cryo-EM density map; and $\theta = (k/\pi)^{1.5}$.
That is, for each main-chain atom $y_q$ $(q = 1, \dots, Q)$ in the fitted conformation Y, its main-chain matching score s' can be calculated according to the following formula:
[formula image BDA0003706422660000083 in the original]
where $z \in Z$ is the position vector of a main-chain point (Z is the set of all main-chain points) and P(z) is the main-chain probability of that point.
[formula image BDA0003706422660000084 in the original]
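To make the roles of k, θ, P(z) and the main-chain atoms concrete, here is a minimal sketch of a Gaussian-weighted matching score. The scoring formulas themselves survive only as images, so the functional forms below are assumptions, not the patented formulas: each atom is assumed to score θ times the sum over main-chain points of P(z)·exp(−k|y − z|²), and the conformation score is assumed to be the sum over its main-chain atoms. The function names are hypothetical.

```python
import math


def gaussian_k(resolution):
    """k = (pi / (2.4 + 0.8 R))^2, as stated in the text."""
    return (math.pi / (2.4 + 0.8 * resolution)) ** 2


def atom_score(atom, points, k):
    """ASSUMED form: s'(y) = theta * sum_z P(z) * exp(-k * |y - z|^2)."""
    theta = (k / math.pi) ** 1.5  # theta = (k / pi)^1.5, as in the text
    return theta * sum(
        p * math.exp(-k * sum((a - b) ** 2 for a, b in zip(atom, z)))
        for z, p in points
    )


def match_score(atoms, points, resolution):
    """ASSUMED form: sum of per-atom scores over the main-chain atoms.

    atoms  : list of (x, y, z) main-chain atom positions
    points : list of ((x, y, z), probability) main-chain points
    """
    k = gaussian_k(resolution)
    return sum(atom_score(y, points, k) for y in atoms)
```

Under these assumptions, an atom sitting exactly on a main-chain point of probability 1 contributes θ, and the contribution decays smoothly as the atom moves away, which is the behavior a continuous simplex refinement needs from its objective.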
S4, splitting each single-chain atomic model into n short domains, and constructing an undirected graph V1 according to how the n short domains are connected; taking each short domain in turn as the seed domain, and performing simplex-based domain optimization (that is, adjusting the position of each domain in the rigidly fitted conformation) on the M rigidly fitted conformations with the highest main-chain matching scores, in breadth-first search order on V1, to obtain M × n domain-optimized conformations for each single chain.
In the undirected graph V1, any two domains that have adjacent residues are connected by an undirected edge.
Preferably, in step S4, each of the L single-chain atomic models is split into n short domains by the SWORD domain-splitting method.
After the protein model of a chain has been rigidly fitted to the main-chain probability map, a semi-flexible domain optimization strategy is used to further improve the agreement between the model and the map; FIG. 2 (a) is a schematic diagram of this protein domain optimization process. Because of conformational deviations between the predicted model and the reference PDB structure, a predicted chain fitted as a single rigid body matches the reference structure poorly. Splitting the protein chain into small domains and locally optimizing the position of each domain in the main-chain probability map can greatly improve the quality of the fit. The lower half of FIG. 2 (a) compares the fitted structure before (left) and after (right) domain optimization. The upper half of FIG. 2 (a) illustrates domain optimization in detail: each domain in the fitted structure is labeled with a unique number, and by locally optimizing each domain separately, starting from the rigidly fitted conformation (left), the protein chain is fitted well to the main-chain probability map (right).
For each protein model, SWORD assigns n short domains. On this basis, a simple graph can be constructed in which any two domains with adjacent residues are connected by an undirected edge. Domain optimization is performed on the M highest-scoring conformations from rigid fitting. Starting from one domain selected as the seed domain, all domains are optimized in turn. Specifically, starting from the seed domain, the current domain is optimized by the simplex method to find a locally optimal match, and then the domains adjacent to the current domain are optimized. This process is repeated until all domains have been locally optimized. The order of optimization is determined by a breadth-first search (BFS) on the graph described above. Each domain is taken as the seed domain in turn, so domain optimization yields n different models from each rigid fitting result. For each chain, the final fitting results are the M × n domain-optimized models plus the M rigidly fitted models. The models are ranked by their matching scores.
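The domain graph and the breadth-first optimization order described above can be sketched as follows. The sketch assumes that "adjacent residues" means residues that are consecutive in sequence, and it represents each domain simply as a set of residue indices; the function names are hypothetical, and the simplex optimization of each domain is omitted.

```python
from collections import deque


def domain_graph(domains):
    """Build the undirected domain graph.

    domains: list of sets of residue indices, one set per domain.
    Two domains are linked if they contain sequence-adjacent residues
    (an assumed reading of "adjacent residues").
    """
    adj = {i: set() for i in range(len(domains))}
    for i in range(len(domains)):
        for j in range(i + 1, len(domains)):
            if any((r + 1) in domains[j] or (r - 1) in domains[j]
                   for r in domains[i]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def bfs_order(adj, seed):
    """Order in which domains are optimized, starting from the seed domain."""
    order, seen, queue = [], {seed}, deque([seed])
    while queue:
        current = queue.popleft()
        order.append(current)
        for neighbour in sorted(adj[current]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order
```

Calling `bfs_order` once per seed domain gives the n optimization orders that, applied to each of the M rigidly fitted conformations, yield the M × n domain-optimized models.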
S5, for each single chain, selecting the K conformations with the highest main-chain matching scores from its M rigidly fitted conformations and M × n domain-optimized conformations, giving L × K conformations in total, and constructing an undirected graph V2 whose vertices carry the main-chain matching score of each conformation and whose edges carry the atomic collision score between conformations of two different single chains; removing from the undirected graph the edges whose atomic collision score exceeds a threshold, and determining, based on the Bron-Kerbosch search algorithm, the combination of single chains and conformations with the highest total main-chain matching score as the constructed protein complex structure.
After rigid fitting and flexible optimization of the individual chains, the single chains need to be assembled into a complex structure. To prevent collisions between different chains, the atomic collision scores between the conformations of different chains are calculated first.
Preferably, the atomic collision score between conformation A and conformation B is calculated by the formula:
$$C(A; B) = \frac{1}{n_A} \sum_{a \in A} c(a; B)$$
where $n_A$ is the number of atoms in A, $c(a; B)$ is the collision score of atom a of conformation A relative to all atoms in conformation B, [formula image BDA0003706422660000102 in the original], and $d_{clash}$ is the cutoff on the distance between atom a and any atom of conformation B.
To prevent collisions between different conformations, the collision score c of an atom a in the atom set A of conformation A, relative to all atoms $b \in B$ of another conformation B, is defined as follows:
[formula image BDA0003706422660000103 in the original]
It can be seen that if the distance between atom a and any atom of conformation B is at or below the cutoff distance $d_{clash}$, the collision score is fixed at 1.0. The collision score C of conformation A relative to conformation B is the average of the collision scores of all atoms of A, i.e.
$$C(A; B) = \frac{1}{n_A} \sum_{a \in A} c(a; B)$$
where $n_A$ is the number of atoms in A.
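The collision score described above can be sketched as follows. The text only states that the per-atom score is fixed at 1.0 when some atom of B lies within the cutoff; the 0.0 value returned otherwise is an assumption, as are the default cutoff of 3.0 Å and the function names.

```python
import math


def atom_collision(atom, others, d_clash):
    """1.0 if any atom in `others` lies within d_clash of `atom`.

    The 0.0 branch is an assumption; the source gives only the 1.0 case.
    """
    return 1.0 if any(math.dist(atom, b) <= d_clash for b in others) else 0.0


def collision_score(conf_a, conf_b, d_clash=3.0):
    """C(A; B): average per-atom collision score over the atoms of A.

    d_clash = 3.0 is a placeholder value, not taken from the source.
    """
    if not conf_a:
        return 0.0
    return sum(atom_collision(a, conf_b, d_clash) for a in conf_a) / len(conf_a)
```

Note that the score is not symmetric: with A = [(0, 0, 0), (10, 0, 0)] and B = [(1, 0, 0)], C(A; B) is 0.5 while C(B; A) is 1.0, because the average is taken over the atoms of the first argument.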
Based on the matching score of each conformation and the collision scores between different chains, the problem of assembling multiple chains into a complex is converted into a maximum clique problem, as shown in FIG. 1. First, an undirected graph is constructed in which each vertex carries the matching score of one conformation, and each edge, which connects two conformations of two different chains, carries the atomic collision score between them. Note that different fitted conformations of the same chain are not connected. Edges whose collision score exceeds a given threshold are then removed from the graph. Once the graph has been constructed, the Bron-Kerbosch algorithm is used to find the combination of chains with the highest total matching score, which is taken as the constructed protein complex structure.
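The graph construction and clique search described above can be sketched as follows, using the Bron-Kerbosch algorithm with pivoting to enumerate maximal cliques and keeping the one with the highest total score. Several details are assumptions: vertices are keyed as hypothetical `(chain, conformation)` pairs, an edge is kept only if the collision score is within the threshold in both directions, and all matching scores are taken to be positive (so that the best clique is always a maximal one).

```python
def build_graph(scores, collision, threshold):
    """scores: {(chain, conf): matching score}; collision(u, v) -> score.

    Conformations of the same chain are never connected.
    """
    adj = {v: set() for v in scores}
    for u in scores:
        for v in scores:
            if u[0] != v[0] and max(collision(u, v), collision(v, u)) <= threshold:
                adj[u].add(v)
    return adj


def best_assembly(scores, adj):
    """Return (clique, total score) for the best-scoring maximal clique."""
    best = {"weight": float("-inf"), "clique": frozenset()}

    def expand(r, p, x):
        if not p and not x:  # r is a maximal clique
            weight = sum(scores[v] for v in r)
            if weight > best["weight"]:
                best["weight"], best["clique"] = weight, frozenset(r)
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(scores), set())
    return best["clique"], best["weight"]
```

Because same-chain conformations are never linked, every clique contains at most one conformation per chain, so the best clique is directly a consistent assignment of conformations to chains.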
FIG. 1 comprises an upper, a middle and a lower part. The upper part is a schematic diagram of the training of the deep-learning-based main-chain probability prediction module. The middle part is the workflow of complex modeling. The inputs are a cryo-EM density map and the sequences of the protein chains. The main-chain probability map is predicted from the input density map with the trained deep learning model. An atomic model is predicted from each single-chain sequence with a structure prediction program. The predicted chain models are fitted individually to the main-chain probability map by fast Fourier transform (FFT) based fitting and optimization. From the fitting results of all single chains, the final protein complex structure is constructed by the Bron-Kerbosch maximum clique algorithm. Finally, the quality of the built model is estimated from the main-chain matching score between the model and the main-chain probability map. The lower part is a schematic diagram of the Bron-Kerbosch maximum clique algorithm: the collision scores between fitted conformations of different chains are computed, and two conformations are connected if their collision score is below a given threshold and left unconnected otherwise. These connections define a graph whose vertices carry the fitting scores of the different conformations of the different chains. The Bron-Kerbosch search algorithm is used to find the highest-scoring combination of different chains as the final protein complex structure.
Some chains may fail to be assembled into the protein complex in a single round of the Bron-Kerbosch algorithm. Therefore, the present invention employs an iterative strategy for complex assembly. Preferably, the method further comprises: S6, removing the region of the existing protein complex structure from the main chain probability map to update the main chain probability map, and returning to step S2 until all single chains are assembled into the protein complex structure.
As shown in fig. 2 (b), an iterative modeling strategy is adopted. After each round of the Bron-Kerbosch algorithm, the main chain probability regions occupied by the already fitted structures are removed, so that they are excluded from fitting in the next round. The unassembled chains are then fitted and optimized against the updated main chain probability map and iteratively assembled onto the complex until the protein complex structure is obtained.
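The control flow of this iterative strategy might look like the sketch below. Everything here is illustrative: `iterative_assembly` and its `fit`, `assemble` and `mask` callbacks are hypothetical placeholders for the fitting, Bron-Kerbosch assembly and map-update steps, and the `max_rounds` guard is an added assumption so that the loop terminates even if some chain can never be placed.

```python
def iterative_assembly(chains, prob_map, fit, assemble, mask, max_rounds=10):
    """Assemble chains round by round (hypothetical driver loop).

    fit(remaining, prob_map)  -> candidate conformations for unassembled chains
    assemble(candidates)      -> list of (chain_id, conformation) placed this round
    mask(prob_map, complex_)  -> probability map with placed regions removed
    """
    complex_, remaining = [], list(chains)
    for _ in range(max_rounds):
        if not remaining:
            break  # every chain has been assembled
        candidates = fit(remaining, prob_map)
        placed = assemble(candidates)
        if not placed:
            break  # no progress this round; stop instead of looping forever
        complex_.extend(placed)
        placed_ids = {chain_id for chain_id, _ in placed}
        remaining = [c for c in remaining if c not in placed_ids]
        # Exclude the regions of the placed chains from the next round.
        prob_map = mask(prob_map, complex_)
    return complex_, remaining
```

Passing the three steps in as callables keeps the loop independent of how fitting and scoring are actually implemented.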
After each round of assembly of the protein complex by the Bron-Kerbosch algorithm, the collision score $c(z_i; D)$ between the current complex structure $D$ and each main chain point $z_i$ ($i = 1, \dots, L$), whose main chain probability is $P(z_i)$, is calculated using the collision score described above (in this process, the set of main chain points can be treated as a set of atoms), and the main chain probabilities are then updated according to the following formula:
$$P'(z_i; D) = P(z_i)\,\bigl(1.0 - c(z_i; D)\bigr)$$
by updating the backbone probability of the backbone points, the regions of the assembled structure can be removed from further structure assembly.
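A small sketch of this update is given below. The update line follows the formula P'(z_i; D) = P(z_i) * (1.0 - c(z_i; D)). The collision score itself is only a stand-in: the binary distance test in `collision_score` and the 3.0 Å default for `d_clash` are assumptions made for illustration, not the score defined by the invention.

```python
import math


def collision_score(point, atoms, d_clash=3.0):
    """Assumed stand-in score: 1.0 if any atom of the assembled structure
    lies within d_clash of the main chain point, otherwise 0.0."""
    return 1.0 if any(math.dist(point, a) < d_clash for a in atoms) else 0.0


def update_probabilities(points, probs, atoms, d_clash=3.0):
    """Apply P'(z_i; D) = P(z_i) * (1.0 - c(z_i; D)) to every main chain point."""
    return [
        p * (1.0 - collision_score(z, atoms, d_clash))
        for z, p in zip(points, probs)
    ]
```

With a binary score, points covered by the assembled structure drop to probability zero while all other points keep their original value, which is what excludes the occupied region from the next round of fitting.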
An embodiment of the invention provides a system for constructing a protein complex structure based on a medium-resolution cryoelectron microscope image, comprising:
a computer-readable storage medium and a processor;
the computer-readable storage medium is used for storing executable instructions;
the processor is used for reading the executable instructions stored in the computer readable storage medium and executing the method according to any one of the above embodiments.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for constructing a protein complex structure based on a medium-resolution cryoelectron microscope image is characterized by comprising the following steps:
S1, inputting the cryoelectron microscope image into a pre-trained prediction model to obtain a corresponding main chain probability map;
S2, predicting a corresponding atomic model from each of the L single-chain sequences;
S3, rigidly fitting each single-chain atomic model to the main chain probability map to obtain a plurality of rigidly fitted conformations, and retaining the top M rigidly fitted conformations with the highest main chain matching scores;
S4, splitting each single-chain atomic model into n short structural domains, and constructing an undirected graph V1 according to the connectivity of the n short structural domains; taking each short structural domain in turn as a seed domain, and performing domain optimization on the M rigidly fitted conformations with the highest main chain matching scores based on the simplex method, following a breadth-first search order on V1, to obtain M×n domain-optimized conformations for each single chain;
S5, for each single chain, selecting the K conformations with the highest main chain matching scores from its M rigidly fitted conformations and its M×n domain-optimized conformations, to obtain L×K conformations from which an undirected graph V2 is constructed, wherein each vertex of V2 is a conformation weighted by its main chain matching score, and each edge carries the atomic collision score between two conformations of two different single chains; and removing edges whose atomic collision score exceeds a threshold from the undirected graph, and determining, based on the Bron-Kerbosch search algorithm, the combination of single-chain conformations with the highest total main chain matching score as the constructed protein complex structure.
2. The method according to claim 1, wherein the prediction model is obtained by training with cryoelectron microscope images as training samples and the main chain probability maps of the PDB structures corresponding to the cryoelectron microscope images as labels.
3. The method of claim 2, wherein the main chain probability map is constructed from the associated PDB structure on a three-dimensional grid whose grid spacing is given by
Figure FDA0003706422650000012
according to the following formula:
Figure FDA0003706422650000011
wherein x is the position vector of a grid point; P(x) is the probability of finding a main chain atom at grid point x; the set A consists of the position vectors of all main chain atoms N, Cα and C; $k = (\pi/(2.4 + 0.8R))^2$; and R is the resolution of the cryoelectron microscope image.
4. The method of claim 3, wherein the main chain probability map is represented in a simplified manner using main chain points; the main chain points are generated from the main chain probability map by a mean shift clustering algorithm.
5. The method of claim 1, wherein the main chain matching score is calculated by:
Figure FDA0003706422650000021
wherein
Figure FDA0003706422650000022
is the main chain matching score of conformation Y; z is the position vector of a main chain point; Z is the set of all main chain points; P(z) is the main chain probability of main chain point z; $y_q$ ($q = 1, \dots, Q$) are the main chain atoms of conformation Y; $k = (\pi/(2.4 + 0.8R))^2$; R is the resolution of the cryoelectron microscope image; and $\theta = (k/\pi)^{1.5}$.
6. The method of claim 1, wherein the atomic collision score between conformation A and conformation B is calculated by the formula:
Figure FDA0003706422650000023
wherein $n_A$ is the number of atoms in A, and c(a; B) is the collision score of atom a of conformation A relative to all atoms in conformation B,
Figure FDA0003706422650000024
wherein $d_{clash}$ is the distance cutoff between atom a and any atom of conformation B.
7. The method of claim 1, wherein the method further comprises: s6, removing the region of the existing protein complex structure from the main chain probability map to update the main chain probability map, and returning to the step S2 until all single chains are assembled into the protein complex structure.
8. The method of claim 1, wherein in step S3, an FFT-based global search and simplex method are used to rigidly fit the atomic model of each single chain to the main chain probability map.
9. The method of claim 1 or 2, wherein in step S4, each single-chain atomic model is split into n short structural domains based on the SWORD domain splitting method.
10. A system for constructing a protein complex structure based on medium resolution cryoelectron microscopy images, comprising: a computer-readable storage medium and a processor;
the computer-readable storage medium is used for storing executable instructions;
the processor is configured to read executable instructions stored in the computer-readable storage medium and execute the method according to any one of claims 1-9.
CN202210709043.1A 2022-06-21 2022-06-21 Method for constructing protein complex structure based on medium-resolution cryoelectron microscope image Active CN115083513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210709043.1A CN115083513B (en) 2022-06-21 2022-06-21 Method for constructing protein complex structure based on medium-resolution cryoelectron microscope image

Publications (2)

Publication Number Publication Date
CN115083513A true CN115083513A (en) 2022-09-20
CN115083513B CN115083513B (en) 2023-03-10

Family

ID=83253592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210709043.1A Active CN115083513B (en) 2022-06-21 2022-06-21 Method for constructing protein complex structure based on medium-resolution cryoelectron microscope image

Country Status (1)

Country Link
CN (1) CN115083513B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115497553A (en) * 2022-09-29 2022-12-20 水木未来(杭州)科技有限公司 Protein three-dimensional structure modeling method and device, electronic device and storage medium

Also Published As

Publication number Publication date
CN115083513B (en) 2023-03-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant