WO2024119597A1 - Neural network-based method for building cryo-electron microscopy protein models, and storage medium - Google Patents

Neural network-based method for building cryo-electron microscopy protein models, and storage medium

Info

Publication number
WO2024119597A1
WO2024119597A1 PCT/CN2023/074086 CN2023074086W WO2024119597A1 WO 2024119597 A1 WO2024119597 A1 WO 2024119597A1 CN 2023074086 W CN2023074086 W CN 2023074086W WO 2024119597 A1 WO2024119597 A1 WO 2024119597A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
cryo
density map
neural network
representation
Prior art date
Application number
PCT/CN2023/074086
Inventor
张强锋
徐魁
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学
Publication of WO2024119597A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the present invention belongs to the technical field of structural biology image processing, and in particular relates to a computer storage medium, a computer system, a method for processing a cryo-electron microscopy density map, and a method for building a cryo-electron microscopy protein model based on a neural network.
  • cryo-EM: cryo-electron microscopy
  • the existing methods of building atomic models have problems such as high image requirements, high technical requirements for implementers, and poor result accuracy.
  • the present invention provides a computer storage medium storing a deep neural network.
  • the deep neural network includes a cryoformation module stack, and the cryoformation module stack includes a plurality of cryoformation modules Cryoformer;
  • Cryoformer includes an encoder and a decoder
  • the decoder is used to learn the matching of sequence-related representations and the three-dimensional spatial information of cryo-EM density maps, and cross-fuse the sequence-related representations with the three-dimensional spatial information output by the encoder.
  • the decoder takes the output of the sequence branch of the deep neural network, the output of the encoder, and the three-dimensional position encoding of the cryo-electron microscopy density map as input, and generates a cross-single sequence representation through a self-attention module and a cross-attention module.
  • each Cryoformer includes N_enc encoders and N_dec decoders;
  • Sequence-related representations include multiple sequence representations and amino acid pairing representations
  • Each decoder passes the multi-sequence representation and the amino acid pair representation through the linear layer and adds them to the crossed single sequence representation, and then passes them through the LayerNorm layer and adds them together to form a new single sequence representation;
  • the new single sequence representation is input into the self-attention module
  • the output of the self-attention module, the amino acid embedding representation, the output of the encoder, and the three-dimensional position encoding of the density map are input into the cross-attention module together to match the cryo-EM density map features and sequence features.
  • the cross-attention module takes three variables Q_c, K_c and V_c as input, where Q_c is the result of adding the output of the self-attention module to the amino acid embedding representation, K_c is the result of adding the density map representation output by the encoder to the three-dimensional position encoding of the density map, and V_c is the density map representation output by the encoder.
  • CryoFold includes the sequence branch, which learns sequence-related representations relevant to protein evolution from protein sequences, including multiple sequence representations and amino acid pairing representations.
  • sequence branch includes an encoding module and an embedding representation learning module
  • the encoding module is used to encode the amino acid sequence, multiple sequence alignment MSA and structure template;
  • the embedding representation learning module is used to embed the encoded amino acid sequence, MSA and structure template to generate multi-sequence representation and amino acid pairing representation;
  • the sequence branch also includes an Evoformer stack, which is used to learn multi-sequence representations and amino acid pair representations, and output new multi-sequence representations and amino acid pair representations.
  • the deep neural network adopts a cryo-folding model CryoFold including Cryoformer
  • CryoFold includes a cryo-EM density map branch, which includes a three-dimensional residual neural network for mapping high-dimensional features into low-dimensional density map representations.
  • cryo-EM density map branch takes the cryo-EM density map as input, passes through a three-dimensional convolutional neural network layer, a batch normalization layer, a rectified linear unit ReLU, and a maximum pooling layer, and then is sequentially input into four three-dimensional residual convolution modules, and then processed by a three-dimensional convolutional neural network layer and output.
  • the present invention also provides a computer system, comprising:
  • One or more processors and one or more non-transitory computer-readable media storing the above-described deep neural network configured to process cryo-EM density maps.
  • the present invention also provides a method for processing a cryo-electron microscopy density map, by using the above-mentioned deep neural network to process the cryo-electron microscopy density map.
  • the present invention also provides a method for building a cryo-electron microscopy protein model based on a neural network, comprising: processing the cryo-electron microscopy density map by using the above-mentioned deep neural network to obtain the atomic model of the corresponding protein complex structure.
  • the present invention proposes a neural network-based method for building cryo-EM protein models, a storage medium, and related subject matter. It uses an end-to-end deep learning network model (referred to as CryoFold in the present invention, a method or model for determining protein structure from cryo-EM density maps) which, by incorporating an advanced sequence-based protein structure prediction method, resolves atomic models of protein complex structures from low-resolution cryo-EM density maps. The approach offers high accuracy, ease of use, and low image resolution requirements, thus expanding the scope of application.
  • on a benchmark dataset of 317 protein complexes, CryoFold reaches a TM-score of 0.91 on low-resolution density maps and 0.95 on high-resolution density maps.
  • compared with AlphaFold-Multimer, it was found that with the help of the cryo-EM density map, CryoFold achieved a 25% improvement.
  • CryoFold also showed significant advantages compared with other similar methods. CryoFold will greatly speed up the process of protein complex structure analysis, especially for heterogeneous conformational states and low-resolution densities not captured in the PDB (Protein Data Bank), including in situ structures.
  • FIG1 shows a schematic diagram of a CryoFold network framework according to an embodiment of the present invention
  • FIG2 shows a schematic diagram of the structure of the cryo-EM density map branch of the CryoFold network according to an embodiment of the present invention
  • FIG3 shows a schematic diagram of the structure of a Cryoformer encoder according to an embodiment of the present invention
  • FIG4 shows a schematic diagram of the structure of a Cryoformer decoder according to an embodiment of the present invention
  • FIG5( a ) shows a schematic diagram of a process of processing a cryo-EM density map according to an embodiment of the present invention
  • FIG5( b ) shows a schematic diagram of a process of constructing an EMPIAR downsampling dataset according to an embodiment of the present invention
  • FIG5( c ) shows a schematic diagram of a process of low-pass filtering a data set according to an embodiment of the present invention
  • FIG5( d ) shows a schematic diagram of a data set simulation process according to an embodiment of the present invention
  • FIG5(e) shows the data distribution of each data set in each resolution range according to an embodiment of the present invention
  • FIG6 shows a data distribution diagram of each resolution interval according to an embodiment of the present invention.
  • FIG7 is a schematic diagram showing a comparison between the results predicted by CryoFold according to an embodiment of the present invention and the published structures in the database;
  • FIG8 is a schematic diagram showing performance indicators of prediction results of CryoFold on data in various resolution intervals according to an embodiment of the present invention.
  • FIG9 is a schematic diagram showing a comparison between CryoFold according to an embodiment of the present invention and other related methods
  • FIG10 is a schematic diagram showing a comparison between CryoFold and AlphaFold-Multimer according to an embodiment of the present invention.
  • FIG11 shows a distribution diagram of the results of CryoFold and AlphaFold-Multimer on Chain-match according to an embodiment of the present invention
  • Figure 12 shows a schematic diagram of the effects of CryoFold and AlphaFold-Multimer on a protein complex structure (PDB ID: 6q0t) according to an embodiment of the present invention.
  • the embodiment of the present invention combines advanced three-dimensional image recognition and protein structure prediction technology to provide an end-to-end deep neural network CryoFold (cryofolding model).
  • CryoFold predicts the structure of protein complexes by combining cryo-electron microscopy density maps, amino acid sequences, multiple sequence alignments (MSA) and structural templates.
  • CryoFold includes multiple core neural network modules, which are referred to as Cryoformer (cryoconversion modules) in the embodiment of the present invention, forming a Cryoformer stack (cryoconversion module stack).
  • by including a Cryoformer stack that combines the three-dimensional information in the density map, the evolutionary information in the MSA, and the homology information in the structural template, CryoFold can effectively learn the main chain and side chain representations of the structural model of the protein complex, as shown in Figure 1. Furthermore, to enforce the geometric constraints on bond lengths and bond angles in the protein structure, a structure module is used in CryoFold to generate the final structural model with three-dimensional coordinates.
  • the deep neural network of the embodiment of the present invention can be stored in a computer storage medium, such as RAM, ROM, EEPROM, EPROM, a flash memory device, a disk, or a combination thereof.
  • when CryoFold is called, it can execute a cryo-electron microscopy density map processing method.
  • the embodiment of the present invention also provides a computer system comprising one or more processors and one or more non-transitory computer-readable media, which store a deep neural network model CryoFold configured to process cryo-EM density maps and, further, to process cryo-EM density maps to obtain atomic models of the corresponding protein complex structures.
  • the CryoFold network model of the embodiment of the present invention can not only realize the model building of protein complexes in high-resolution cryo-EM density maps, but also realize the automatic construction of protein complex models in low-resolution cryo-EM density maps, expanding the scope of use of neural networks to automatically build models.
  • the structure of the cryofolding model CryoFold according to an embodiment of the present invention is exemplarily described below.
  • the network framework of CryoFold includes two input branches, a cryo-conversion module stack, a structure module, and multiple output modules.
  • the two input branches are the cryo-EM density map branch and the sequence branch.
  • the cryo-EM density map branch includes a three-dimensional residual neural network to learn the amino acid information, secondary structure, and protein backbone information in the cryo-EM density map.
  • the sequence branch is used to learn sequence-related representations (including MSA representations and amino acid pairing representations) related to protein evolution from protein sequences.
  • the sequence branch includes an encoding module and an embedded representation learning module.
  • the encoding module is used to encode input information such as amino acid sequences, MSA and structural templates (Templates).
  • the embedded representation learning module is used to embed and learn the encoded amino acid sequences, MSA and structural templates (Templates) information to generate MSA representations (multiple sequence representations) and amino acid pairing representations (referred to as pairing representations).
  • the sequence branch also includes an Evoformer stack (see Figure 4), which is used to learn MSA representations and amino acid pairing representations. After learning, the Evoformer stack outputs new MSA representations and amino acid pairing representations.
  • amino acid sequences, MSA and structural templates can be generated based on the sequence of the input protein complex.
  • the embodiment of the present invention not only performs multiple cycles in the process of generating the atomic structure of the protein, but also simulates the three-dimensional density map of the cryo-electron microscope for the generated structural model, and adds the simulated map as input to the cryo-electron microscope density map branch, and optimizes iteratively.
  • the cryo-EM density map branch takes a cryo-EM density map of shape W×H×L as input. After passing through a three-dimensional convolutional neural network layer, a batch normalization (BatchNormalization) layer, a rectified linear unit (ReLU), and a maximum pooling (MaxPooling) layer, the features are input into four three-dimensional residual convolution modules (ResBlock) in sequence. Finally, a three-dimensional convolutional neural network layer with a convolution kernel size of 1 maps the high-dimensional features into a low-dimensional density map representation, which serves as the first density map representation.
  • the MaxPooling layer reduces the length, width and height of the density map to half of the original, that is, the shape becomes W/2×H/2×L/2.
  • the four residual network modules share the same architecture. The stride of the second residual network module is 2, and the strides of the other three are 1. Therefore, after the second residual network module, the shape of the feature map becomes W/4×H/4×L/4. Finally, the feature map of the density map is mapped to features of dimension 384 by a three-dimensional convolutional neural network layer with a convolution kernel size of 1.
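  • The shape changes described above can be traced with a short sketch. It assumes that the first convolution, batch normalization and ReLU preserve the spatial size and that W, H and L are divisible by 4; the function name and the returned labels are hypothetical and are not taken from the patent.

```python
def density_branch_shapes(w, h, l, out_channels=384):
    """Trace the feature-map shape through the density map branch (illustrative only)."""
    shapes = [("input", (w, h, l))]
    # The MaxPooling layer halves each spatial dimension.
    w, h, l = w // 2, h // 2, l // 2
    shapes.append(("max_pool", (w, h, l)))
    # Four residual blocks; only the second one uses stride 2.
    for i, stride in enumerate([1, 2, 1, 1], start=1):
        w, h, l = w // stride, h // stride, l // stride
        shapes.append((f"res_block_{i}", (w, h, l)))
    # The final convolution with kernel size 1 changes only the channel count.
    shapes.append(("conv_kernel_1", (w, h, l, out_channels)))
    return shapes


if __name__ == "__main__":
    for name, shape in density_branch_shapes(64, 64, 64):
        print(name, shape)
```

  • For a 64×64×64 input, the sketch ends with a 16×16×16 grid carrying 384 features per voxel, matching the W/4×H/4×L/4 shape stated above.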
  • the cryoconversion module stack includes multiple (e.g., 8) Cryoformers.
  • Cryoformer is a key module of CryoFold, and each Cryoformer includes N_enc encoders and N_dec decoders.
  • the encoder is used to learn the global three-dimensional spatial information of amino acids from the cryo-EM density map
  • the decoder is used to learn the matching of multi-sequence representation and the three-dimensional spatial information of the density map, that is, to cross-fuse the multi-sequence representation output by the sequence branch and the three-dimensional spatial information output by the encoder.
  • the Cryoformer encoder takes as input the output of the three-dimensional residual neural network applied to the cryo-EM density map, that is, the first density map representation. It flattens this representation, adds it to the three-dimensional position encoding of the density map, and inputs the result into the self-attention module (Multi-Head Self-Attention module). The output then passes through the first LayerNorm layer, the linear layer (Linear) and the second LayerNorm layer in sequence to generate a new density map representation (density representation), that is, the second density map representation, as shown in Figure 3.
  • the self-attention module enables direct information interaction between the voxels of the entire density map, thereby obtaining a global representation based on the semantic features and three-dimensional position information of the entire density map. This further improves the recognition, from the density map, of information such as amino acid type, secondary structure, protein backbone, protein topology, interaction between domains, and orientation of the entire protein complex.
  • the second density map representation is used as the input of the next encoder.
  • the structures of the N_enc encoders are the same, but the parameters are not shared.
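  • A minimal NumPy sketch of one encoder layer, following the order of operations described above (flatten, add position encoding, self-attention, LayerNorm, linear layer, LayerNorm). It is a simplification under stated assumptions: a single attention head stands in for the multi-head module, learned projections and LayerNorm gains are omitted, no residual connections are added because none are described for the encoder, and all function and variable names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token over its feature dimension (no learned scale or bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention with a single head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def encoder_layer(density, pos, w_lin):
    # density: (C, W, H, L) first density map representation.
    c = density.shape[0]
    # Flatten the voxel grid into a token sequence of shape (W*H*L, C)
    # and add the three-dimensional position encoding.
    tokens = density.reshape(c, -1).T + pos
    x = attention(tokens, tokens, tokens)  # self-attention over all voxels
    x = layer_norm(x)                      # first LayerNorm
    x = x @ w_lin                          # linear layer
    return layer_norm(x)                   # second LayerNorm -> second density map representation
```

  • Because every voxel attends to every other voxel, the cost grows quadratically with the number of voxels, which is one reason the density map branch reduces the grid before the encoder.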
  • the cryoformer decoder (referred to as the decoder) takes the output of the sequence branch, the output of the encoder, and the three-dimensional position encoding of the density map as input, and generates a crossed single sequence representation through the self-attention module, the cross-attention module, the LayerNorm layer, and the linear layer, as shown in Figure 4.
  • each layer of the decoder adds the multi-sequence representation and the pairing representation output by the Evoformer stack to the crossed single sequence representation after passing through the linear layer, and then adds them after passing through the LayerNorm layer to form a new single sequence representation, that is, the new single sequence representation integrates the amino acid embedding representation, the multi-sequence representation, the pairing representation, and the crossed single sequence representation output by the previous decoder.
  • the new single sequence representation is input into the self-attention module in the form of three variables: Q_s, K_s, and V_s.
  • Q_s and K_s are the results of adding the new single sequence representation to the amino acid embedding representation
  • V_s is the new single sequence representation.
  • the amino acid embedding representation is also added to the new single sequence representation of Q_s and K_s.
  • the output of the Evoformer stack is added to the amino acid embedding representation after the LayerNorm layer, and the result of the addition is added to Q_s and K_s respectively.
  • the output of the self-attention module, the amino acid embedding representation, the output of the encoder, and the three-dimensional position encoding of the density map are input into the cross-attention module to match cryo-EM density map features with sequence features.
  • the cross-attention module in the Cryoformer decoder is the key to matching sequence-related representations (including multi-sequence representations and pairing representations) with the three-dimensional spatial information of the density map in the neural network space.
  • the cross-attention module takes three variables, Q_c, K_c and V_c, as input, where Q_c is the result of adding the output of the self-attention module to the amino acid embedding representation, K_c is the result of adding the density map representation output by the encoder to the three-dimensional position encoding of the density map, and V_c is the density map representation output by the encoder.
  • the sequence-related representation is fused with the three-dimensional spatial information from the three-dimensional cryo-EM density map, thereby providing the source of three-dimensional coordinate position information for each amino acid in the sequence, so that the final generated protein all-atom coordinates are based on the atomic model of the cryo-EM density map.
  • the output of the cross-attention module is added to the output of the self-attention module and then input into a LayerNorm layer (the third LayerNorm layer).
  • the output of the third LayerNorm layer is processed by the linear layer, added to the output of the third LayerNorm layer itself, and input into the fourth LayerNorm layer, which outputs a new crossed single sequence representation.
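  • The decoder data flow described above can be sketched in NumPy as follows. This is an illustration under assumptions, not the patented implementation: single-head attention replaces the multi-head modules, learned projection matrices and LayerNorm parameters are omitted, and the names `decoder_step`, `single`, `aa_embed`, `enc_out`, `pos` and `w_lin` are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def attention(q, k, v):
    # Scaled dot-product attention with a numerically stable softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def decoder_step(single, aa_embed, enc_out, pos, w_lin):
    # single:   (n_res, d) new single sequence representation
    # aa_embed: (n_res, d) amino acid embedding representation
    # enc_out:  (n_vox, d) density map representation output by the encoder
    # pos:      (n_vox, d) three-dimensional position encoding of the density map
    # Self-attention: Q_s = K_s = single + aa_embed, V_s = single.
    q_s = k_s = single + aa_embed
    self_out = attention(q_s, k_s, single)
    # Cross-attention: residues (queries) attend to voxels (keys and values).
    q_c = self_out + aa_embed
    k_c = enc_out + pos
    v_c = enc_out
    cross_out = attention(q_c, k_c, v_c)
    x = layer_norm(cross_out + self_out)  # third LayerNorm
    return layer_norm(x @ w_lin + x)      # linear layer, skip connection, fourth LayerNorm
```

  • Note that the position encoding enters only the keys K_c and not the values V_c, so each residue uses voxel positions to decide where to look while aggregating position-free density features.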
  • the CryoFold network model is trained end-to-end using multiple loss functions.
  • the tasks associated with multiple loss functions include amino acid type recognition based on density maps, secondary structure type recognition based on density maps, amino acid semantic segmentation based on density maps, mask recognition of multiple sequence alignments, residue distance prediction, regression of all-atom coordinates, side chain torsion angle prediction, atomic collision prediction, and so on.
  • the loss function used for amino acid type recognition based on density maps is the cross entropy loss L_CLS; the loss function used for secondary structure type recognition based on density maps is the cross entropy loss L_S; the loss function used for amino acid semantic segmentation based on density maps is the cross entropy loss L_seg; the loss function used for mask recognition of multiple sequence alignments is the cross entropy loss L_MSA; the loss function used for residue distance prediction is the cross entropy loss L_dist; the loss functions used for regression of all-atom coordinates are the Frame Aligned Point Error (FAPE) loss L_FAPE and the root mean square error loss L_RMSD; the FAPE loss L_FAPE-BF and the root mean square error loss L_RMSD-BF are used for protein main chain frame prediction; the loss function used for side chain torsion angle prediction is L_angle; the loss function used for atomic collision prediction is L_clash; and the correlation loss L_density between the density map simulated from the predicted structure and the input density map is also used.
  • the CryoFold training process is divided into three stages.
  • the purpose of the first stage is to learn the mask features of multiple sequence alignments, so the weight of the cross entropy loss for MSA mask recognition is larger, while the weights of the two root mean square error losses are smaller.
  • in the first stage, the weight of the cross entropy loss for MSA mask recognition is 160, while the weights of the root mean square error losses for protein main chain frame prediction and for regression of all-atom coordinates are 0.1.
  • the weights of the side chain torsion angle loss, the atomic collision loss, and the density map correlation loss are all 0.
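  • One way to read the weights above is as coefficients of a weighted sum over the individual losses. The source lists the losses and some weights but does not spell out the combination rule, so the sum below is an assumption, and the symbols w_k are introduced here only for illustration.

```latex
\mathcal{L}_{\text{total}} = \sum_{k} w_k \, \mathcal{L}_k ,
\qquad
k \in \{\text{CLS}, \text{S}, \text{seg}, \text{MSA}, \text{dist}, \text{FAPE}, \text{RMSD},
        \text{FAPE-BF}, \text{RMSD-BF}, \text{angle}, \text{clash}, \text{density}\}

% Weights stated for the first training stage:
w_{\text{MSA}} = 160, \qquad
w_{\text{RMSD}} = w_{\text{RMSD-BF}} = 0.1, \qquad
w_{\text{angle}} = w_{\text{clash}} = w_{\text{density}} = 0
```

  • The remaining first-stage weights and the weights of the second and third stages are not given in this text and are therefore left unspecified.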
  • the training uses 24 NVIDIA A100 GPUs (40 GB each), and the three stages take 3 days, 7 days, and 30 days respectively.
  • Adam is used as the optimizer, the initial learning rate is 0.001, and the learning rate is decayed stepwise by one order of magnitude every 10,000 steps.
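  • The learning rate schedule stated above can be written as a small function. The function name and parameter names are hypothetical; only the initial rate of 0.001, the factor of one order of magnitude, and the 10,000-step interval come from the text.

```python
def step_decay_lr(step, initial_lr=1e-3, decay_every=10_000, factor=0.1):
    """Stepwise decay: multiply the learning rate by `factor` every `decay_every` steps."""
    return initial_lr * factor ** (step // decay_every)


if __name__ == "__main__":
    for step in (0, 9_999, 10_000, 25_000):
        print(step, step_decay_lr(step))
```

  • Under this reading, the rate stays at 0.001 for the first 10,000 steps, drops to 0.0001 for the next 10,000, and so on.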
  • the cryo-EM density map processing method of the embodiment of the present invention processes the cryo-EM density map by using the above-mentioned model CryoFold to obtain the intermediate output product or the final output product of the deep neural network.
  • the embodiment of the present invention also provides a method for building a cryo-EM protein model based on a neural network, which uses the above-mentioned deep neural network to process the cryo-EM density map to obtain the atomic model of the corresponding protein complex structure.
  • the cryo-EM density map processing method and the neural network-based cryo-EM protein model building method also include data processing and model training before using CryoFold.
  • cryo-EM 3D density maps and the corresponding published atomic models were obtained from the EMDB. A sample is filtered out in any of the following cases:
  • the publication date is after the specified date
  • the resolution of the PDB structure is greater than
  • the reconstruction method is not based on single-particle cryo-EM analysis (SPA);
  • a valid protein sequence is defined as a sequence of at least 25 amino acids in length with less than 30% unknown residues;
  • the correlation coefficient value between the density map and the atomic model is less than 0.5.
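  • A minimal sketch of the filtering criteria whose thresholds are stated above. The publication date and resolution criteria are left out because their thresholds are not given here. Treating "X" as the unknown-residue symbol, requiring at least one valid sequence per sample, and the names `is_valid_sequence` and `keep_sample` are assumptions made for illustration.

```python
def is_valid_sequence(seq, min_len=25, max_unknown_frac=0.30, unknown="X"):
    """A valid sequence has at least 25 residues and fewer than 30% unknown residues."""
    if len(seq) < min_len:
        return False
    return seq.count(unknown) / len(seq) < max_unknown_frac


def keep_sample(sequences, method, map_model_cc, min_cc=0.5):
    """Return True if a sample passes the filters sketched here."""
    if method != "SPA":          # only single-particle cryo-EM analysis is kept
        return False
    if map_model_cc < min_cc:    # map-to-model correlation coefficient below 0.5
        return False
    # Assumption: a sample needs at least one valid protein sequence.
    return any(is_valid_sequence(s) for s in sequences)


if __name__ == "__main__":
    print(keep_sample(["A" * 30], "SPA", 0.62))
    print(keep_sample(["A" * 30], "SPA", 0.41))
```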
  • the experimental analysis data set consists of 9150 cryo-electron microscopy three-dimensional density maps. In this embodiment, 20 density maps without protein molecules, 30 density maps with multiple related atomic structures, and 123 density maps showing poor structural consistency with atomic structures during manual inspection were deleted. After this process, 8977 density maps were retained.
  • the .cif file from PDB contains only the atomic coordinates of one asymmetric unit. Therefore, ChimeraX is used to apply symmetry operations (_pdbx_struct_oper_list) based on one symmetry unit to obtain a .pdb format file containing all atomic coordinates.
  • since the original cryo-EM density map is usually much larger than the bounding box of the structural model, the density map is first cropped around the structural model (with symmetry applied) using Phenix.map_box, which removes the density outside the structural model and reduces the size of the density map.
  • the cryo-EM density map is resampled to a specific voxel size of 0.6667 by spline interpolation.
  • the density value is then normalized to [0,2] based on interval division. All density map samples are saved as .mrc files.
  • the processing of the cryo-EM density map is shown in Figure 5(a).
  • the processing of sequence data is as follows:
  • the first step in running CryoFold is to take one or more sequences as input and generate input features.
  • the data flow of AlphaFold2 is used to generate features for each chain sequence, and all sequences from 8977 structural models are processed as described below.
  • the specific data processing flow can be described as the following steps:
  • Search MSAs: multiple sequence alignments are searched from sequence databases. HHblits is used to search the BFD and UniRef30 (version 2020_02) databases, and JackHMMER is used to search the UniRef90, MGnify, Metaeuk, and MGY databases. Homologous sequences from different sources are ranked according to their similarity to the query sequence, and duplicate sequences are removed from the MSA.
  • Combine multiple chains: if there are multiple chains in the sample, the features from each chain are combined.
  • Features with the chain length as the first dimension are directly concatenated, including aatype, residue_index, between_segment_residues, seq_length, sequence, and num_alignments.
  • Features with the number of sequences as the first dimension, including msa and deletion_matrix_int, are first zero-padded along the first dimension to the maximum number of sequences. The features from different chains are then concatenated along the second (chain length) dimension.
  • Template features are processed similarly to MSA features. The first dimension, the number of templates (≤20), is first padded to the maximum number of templates over all chains. After concatenation along the chain length dimension, all features from the 8977 samples are saved as compressed pickle files.
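  • The padding and concatenation steps above can be sketched with NumPy. The helper names are hypothetical, and the arrays stand in for features such as msa (number of sequences × chain length) and aatype (chain length first); real feature arrays may carry additional trailing dimensions.

```python
import numpy as np


def merge_per_residue(features):
    """Concatenate features whose first dimension is the chain length."""
    return np.concatenate(features, axis=0)


def merge_msa_like(features):
    """Zero-pad along the sequence-count axis, then concatenate along chain length."""
    max_seqs = max(f.shape[0] for f in features)
    padded = [
        # np.pad defaults to constant zero padding; pad only the end of axis 0.
        np.pad(f, ((0, max_seqs - f.shape[0]), (0, 0)))
        for f in features
    ]
    return np.concatenate(padded, axis=1)


if __name__ == "__main__":
    msa_a = np.ones((3, 4), dtype=np.int32)  # chain A: 3 sequences, 4 residues
    msa_b = np.ones((2, 5), dtype=np.int32)  # chain B: 2 sequences, 5 residues
    print(merge_msa_like([msa_a, msa_b]).shape)
```

  • With these two chains, the merged array has shape (3, 9), and the rows that chain B lacks are filled with zeros.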
  • Splitting the training and validation sets: the purpose is to split the 8977 density maps into training and test data sets with low homology to each other.
  • the clustering file defines many chain clusters with sequence identities above 40%.
  • To construct the test set, a PDB model (which may contain multiple chains) is randomly sampled from all samples each time, and any other PDB model that has a chain with a sequence identity greater than 40% to a chain of the sampled PDB model is also added to the test set. The process is repeated until the test data size reaches 317.
  • the training set consists of 8660 density maps and PDB pairs for training the CryoFold model.
  • the test set consists of 317 density maps and PDB pairs for evaluating the CryoFold model, as shown in Figure 6.
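  • A sketch of the sampling procedure described above, assuming each PDB model has been mapped to the set of 40%-identity cluster IDs of its chains. The function name, the dictionary layout and the fixed random seed are hypothetical; as in the description, only models directly sharing a cluster with the sampled model are pulled in, so the final set can slightly exceed the target size.

```python
import random


def build_test_set(model_clusters, target_size, seed=0):
    """model_clusters maps a model ID to the set of cluster IDs of its chains."""
    rng = random.Random(seed)
    remaining = sorted(model_clusters)
    test = set()
    while remaining and len(test) < target_size:
        picked = rng.choice(remaining)
        # Any model sharing a chain cluster with the sampled model joins the test set.
        related = {
            m for m in remaining
            if model_clusters[m] & model_clusters[picked]
        }
        test |= related
        test.add(picked)
        remaining = [m for m in remaining if m not in test]
    return test


if __name__ == "__main__":
    clusters = {"a": {1}, "b": {1, 2}, "c": {3}, "d": {4}}
    print(sorted(build_test_set(clusters, target_size=2)))
```

  • Everything not selected forms the training set, which keeps homologous chains of a sampled model from landing on the opposite side of the split.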
  • the present invention applies three data augmentation methods to the training set.
  • the first is the EMPIAR downsampled dataset, which downsamples the two-dimensional particle images of each EMDB density map to reconstruct multiple density maps at lower resolutions.
  • the second is a low-pass filtered dataset, in which high-resolution (higher than 1000 pixels) density maps in the EMDB (Electron Microscopy Data Bank) are low-pass filtered to multiple resolution levels.
  • the third is a simulated dataset, which simulates cryo-EM density maps for protein complexes in the PDB dataset that do not have density maps.
  • the steps of constructing the downsampled dataset from EMPIAR (Electron Microscopy Public Image Archive) are shown in Figure 5(b).
  • 88 image datasets were extracted from EMPIAR, and the data processing process corresponding to the image datasets was reproduced, where the data processing process can be obtained from the relevant papers in the archive.
  • a total of 112 density maps were reconstructed, and the number of particle images ranged from 14,262 to 730,118.
  • the particle images were resampled multiple times and a new density map was reconstructed using each subset. These density maps all have the same atomic structure model as the original density map.
  • the resolution is lower than density maps of and compiled a dataset consisting of 19,887 density maps and 112 atomic structures.
  • Low-pass filtered dataset: high-resolution (higher than ) density maps are low-pass filtered.
  • the present invention uses the low-pass filtering method in RELION (a cryo-EM 3D reconstruction software package) with self-set parameters to low-pass filter high-resolution data at different thresholds, generating a large amount of low-resolution data, which is then cropped and resampled to a different voxel size, as shown in Figure 5(c).
  • the CryoFold model can directly infer the all-atom model by inputting the cryo-EM density map and the protein complex sequence.
  • the protein complex structure is a model built by CryoFold based on the cryo-EM density map (EMD-7770) generated by the experiment.
  • EMD-7770 cryo-EM density map
  • Seq-match sequence matching score
  • Seq-match indicator: an amino acid of the target type lies within a sphere of radius [...]
  • CryoFold also outperforms these methods, with an average Seq-match of 0.94, far above Phenix at 0.05, ModelAngelo at 0.43 and DeepTracer at 0.40.
  • the embodiment of the present invention tested 174 protein complexes with fewer than 2500 residues.
  • the results show that CryoFold is superior to AlphaFold-Multimer in all indicators, including Chain-match, TM-score and GDT-TS.
  • the average Chain-match of CryoFold is 0.85, the TM-score is 0.87, and the GDT-TS is 0.73, while the average Chain-match of AlphaFold-Multimer is 0.36, the TM-score is 0.57 and the GDT-TS is 0.31.
  • the cryo-EM density map greatly improves the accuracy of CryoFold in building protein complex structures.
  • CryoFold can accurately build atomic models of protein complexes by simultaneously inputting cryo-electron microscopy density maps and sequences into the neural network.
  • Figure 12 shows the results of CryoFold and AlphaFold-Multimer on one example (EMD: 20552, [...]).
  • the protein complex structure (PDB ID: 6q0t) has 5 protein chains and 1322 modeled residues.
  • the resolution of the cryo-EM density map is [...]. The AlphaFold-Multimer and CryoFold predictions score 0.104 and 0.791 on TM-score, and 0.400 and 0.783 on Chain-match. CryoFold thus shows a large advantage over AlphaFold-Multimer.

Landscapes

  • Spectroscopy & Molecular Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Analysing Materials By The Use Of Radiation (AREA)

Abstract

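The present invention provides a neural-network-based method for building protein models from cryo-electron microscopy (cryo-EM) data, and a storage medium. The storage medium stores a deep neural network that comprises a Cryoformer stack (cryo-transformer module stack) made up of multiple cryo-transformer modules (Cryoformers). Each Cryoformer comprises an encoder and a decoder. The decoder learns the matching between sequence-related representations and the three-dimensional spatial information of the cryo-EM density map, and cross-fuses the sequence-related representations with the three-dimensional spatial information output by the encoder. The deep neural network can process a cryo-EM density map to obtain an atomic model of the corresponding protein complex structure efficiently and accurately, and it can handle low-resolution density maps, which greatly extends the applicability of automated cryo-EM model building.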

Description

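Neural-network-based cryo-EM protein model building method and storage medium
Technical Field
The present invention belongs to the technical field of image processing for structural biology, and in particular relates to a computer storage medium, a computer system, a cryo-EM density map processing method, and a neural-network-based method for building protein models from cryo-EM data.
Background Art
With the breakthrough development of single-particle cryo-electron microscopy, and in particular continuous innovation in hardware and software, cryo-EM has become a key method for resolving the structures of biologically important macromolecules and cellular machines, especially protein complexes.
Advanced machine-learning-based structure prediction algorithms such as AlphaFold and RoseTTAFold are changing how the three-dimensional structures of individual proteins are analyzed. Even so, automatically building a three-dimensional atomic structure model from the electron density map produced by cryo-EM remains very difficult, particularly for protein complexes composed of multiple proteins. The model builder needs a deep understanding of protein structural features and side-chain conformations. Regions with poor density in particular take a great deal of time to work out, and may even require additional experiments to obtain a higher-resolution density map. However, obtaining a high-resolution density map requires the complex under study to be highly homogeneous and requires very advanced cryo-EM equipment, so it is very challenging and sometimes impossible.
In summary, existing methods for building atomic models suffer from demanding image requirements, demanding skill requirements for practitioners, and poor accuracy of the results.
There is therefore an urgent need for a highly accurate, fully automated solution that supports building protein complex structure models from medium- and low-resolution cryo-EM data.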
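Summary of the Invention
In view of the above problems, the present invention provides a computer storage medium storing a deep neural network,
the deep neural network comprises a Cryoformer stack, and the Cryoformer stack comprises multiple cryo-transformer modules (Cryoformers);
each Cryoformer comprises an encoder and a decoder;
the decoder is used to learn the matching between sequence-related representations and the three-dimensional spatial information of the cryo-EM density map, and to cross-fuse the sequence-related representations with the three-dimensional spatial information output by the encoder.
Further, the decoder takes as input the output of the sequence branch of the deep neural network, the output of the encoder, and the three-dimensional positional encoding of the cryo-EM density map, and generates a crossed single representation through a self-attention module and a cross-attention module.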
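Further, each Cryoformer comprises N_enc encoders and N_dec decoders;
the sequence-related representations comprise an MSA (multi-sequence) representation and an amino acid pair representation;
each decoder passes the MSA representation and the pair representation through separate linear layers, adds each to the crossed single representation, passes each sum through its own LayerNorm layer, and adds the results to form a new single representation;
the new single representation is input to the self-attention module;
the output of the self-attention module, the amino acid embedding, the output of the encoder, and the three-dimensional positional encoding of the density map are input together to the cross-attention module, which matches cryo-EM density map features with sequence features.
Further, the cross-attention module takes three variables, Q_c, K_c and V_c, as input, where Q_c is the sum of the self-attention module output and the amino acid embedding, K_c is the sum of the density map representation output by the encoder and the three-dimensional positional encoding of the density map, and V_c is the density map representation output by the encoder.
Further, the output of the cross-attention module is added to the output of the self-attention module and input to a third LayerNorm layer; the output of the third LayerNorm layer is processed by a linear layer and added to the output of the third LayerNorm layer, then input to a fourth LayerNorm layer, which outputs a new crossed single representation.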
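Further, the deep neural network adopts the cryo-folding model CryoFold, which contains the Cryoformer,
CryoFold comprises the sequence branch, which learns evolution-related sequence representations from protein sequences, including the MSA representation and the amino acid pair representation.
Further, the sequence branch comprises an encoding module and an embedding learning module,
the encoding module encodes the amino acid sequence, the multiple sequence alignment (MSA) and the structural templates;
the embedding learning module performs embedding learning on the encoded amino acid sequence, MSA and structural templates to generate the MSA representation and the amino acid pair representation;
the sequence branch further comprises an Evoformer stack, which learns from the MSA representation and the amino acid pair representation and outputs a new MSA representation and a new amino acid pair representation.
Further, the deep neural network adopts the cryo-folding model CryoFold, which contains the Cryoformer,
CryoFold comprises a cryo-EM density map branch, which comprises a three-dimensional residual neural network used to map high-dimensional features to a low-dimensional density map representation.
Further, the cryo-EM density map branch takes the cryo-EM density map as input, passes it through a 3D convolutional layer, a batch normalization layer, a rectified linear unit (ReLU) and a max pooling layer, feeds the result in turn through four 3D residual convolution modules, and then outputs it after processing by a 3D convolutional layer.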
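The present invention also provides a computer system, comprising:
one or more processors and one or more non-transitory computer-readable media storing the above deep neural network configured to process cryo-EM density maps.
The present invention also provides a cryo-EM density map processing method, in which a cryo-EM density map is processed with the above deep neural network.
The present invention also provides a neural-network-based cryo-EM protein model building method, comprising: processing a cryo-EM density map with the above deep neural network to obtain an atomic model of the corresponding protein complex structure.
The present invention proposes a neural-network-based cryo-EM protein model building method, a storage medium and related subject matter. It adopts an end-to-end deep learning network model (referred to herein as CryoFold, a method or model for determining protein structure from cryo-EM density maps) and, by incorporating advanced sequence-based protein structure prediction methods, resolves atomic models of protein complex structures from low-resolution cryo-EM density maps. It is accurate and easy to use and places low demands on image resolution, which broadens its applicability.
On a benchmark dataset of 317 protein complexes, CryoFold achieves a TM-score of 0.91 on low-resolution density maps ([...]) and 0.95 on high-resolution density maps ([...]). A comparison with AlphaFold-Multimer, a sequence-based protein complex prediction method, shows that with the help of the cryo-EM density map CryoFold achieves a 25% improvement over AlphaFold-Multimer. CryoFold also shows significant advantages over other methods of the same kind. CryoFold will greatly accelerate protein complex structure analysis, particularly for heterogeneous conformational states and low-resolution densities not captured in the PDB (Protein Data Bank), including in situ structures.
Other features and advantages of the present invention will be set forth in the description that follows, and will in part become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention can be realized and obtained by the structures pointed out in the description, the claims and the drawings.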
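Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic diagram of the CryoFold network framework according to an embodiment of the present invention;
Figure 2 is a schematic diagram of the structure of the cryo-EM density map branch of the CryoFold network according to an embodiment of the present invention;
Figure 3 is a schematic diagram of the structure of the Cryoformer encoder according to an embodiment of the present invention;
Figure 4 is a schematic diagram of the structure of the Cryoformer decoder according to an embodiment of the present invention;
Figure 5(a) is a schematic diagram of the processing of cryo-EM density maps according to an embodiment of the present invention;
Figure 5(b) is a schematic diagram of the construction of the EMPIAR downsampled dataset according to an embodiment of the present invention;
Figure 5(c) is a schematic diagram of the construction of the low-pass filtered dataset according to an embodiment of the present invention;
Figure 5(d) is a schematic diagram of the construction of the simulated dataset according to an embodiment of the present invention;
Figure 5(e) shows the distribution of each dataset across the resolution ranges according to an embodiment of the present invention;
Figure 6 shows the data distribution across the resolution ranges according to an embodiment of the present invention;
Figure 7 is a schematic comparison of CryoFold predictions with published structures in the database according to an embodiment of the present invention;
Figure 8 is a schematic diagram of the performance metrics of CryoFold predictions on data in each resolution range according to an embodiment of the present invention;
Figure 9 is a schematic comparison of CryoFold with other related methods according to an embodiment of the present invention;
Figure 10 is a schematic comparison of CryoFold with AlphaFold-Multimer according to an embodiment of the present invention;
Figure 11 shows the distribution of CryoFold and AlphaFold-Multimer results on Chain-match according to an embodiment of the present invention;
Figure 12 is a schematic diagram of the results of CryoFold and AlphaFold-Multimer on a protein complex structure (PDB ID: 6q0t) according to an embodiment of the present invention.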
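Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
The embodiments of the present invention combine advanced three-dimensional image recognition and protein structure prediction techniques to provide an end-to-end deep neural network, CryoFold (cryo-folding model). CryoFold predicts protein complex structures by combining the cryo-EM density map, the amino acid sequence, multiple sequence alignments (MSA) and structural templates. CryoFold contains multiple core neural network modules, referred to herein as Cryoformers (cryo-transformer modules), which form a Cryoformer stack (cryo-transformer module stack). By combining the three-dimensional information in the density map, the evolutionary information in the MSA and the homology information in the structural templates, CryoFold with its Cryoformers can effectively learn backbone and side-chain representations of the structural model of the protein complex, as shown in Figure 1. Further, to enforce the geometric constraints on bond lengths and bond angles in the protein structure, CryoFold uses a structure module to generate the final structural model with three-dimensional coordinates.
Without loss of generality, the deep neural network of the embodiments can be stored in a computer storage medium such as RAM, ROM, EEPROM, EPROM, a flash memory device, a magnetic disk, or a combination thereof. When invoked, CryoFold carries out a cryo-EM density map processing method.
The embodiments also provide a computer system comprising one or more processors and one or more non-transitory computer-readable media storing the deep neural network model CryoFold, which is configured to process cryo-EM density maps and, further, to process a cryo-EM map to obtain an atomic model of the corresponding protein complex structure. The CryoFold network model of the embodiments can build protein complex models not only from high-resolution cryo-EM density maps but also automatically from low-resolution ones, which broadens the range of use of automated neural-network model building.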
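The structure of the cryo-folding model CryoFold of the embodiments is described below by way of example.
As shown in Figure 1, the CryoFold network framework comprises two input branches, a Cryoformer stack, a structure module and multiple output modules. The two input branches are the cryo-EM density map branch and the sequence branch. The cryo-EM density map branch comprises a three-dimensional residual neural network that learns amino acid information, secondary structure and protein backbone information from the cryo-EM density map.
The sequence branch learns evolution-related sequence representations (including the MSA representation and the amino acid pair representation) from protein sequences. It comprises an encoding module and an embedding learning module. The encoding module encodes input information such as the amino acid sequence, the MSA and the structural templates (Templates); the embedding learning module performs embedding learning on the encoded amino acid sequence, MSA and template information to generate the MSA representation (multi-sequence representation) and the amino acid pair representation (pair representation for short). The sequence branch also comprises an Evoformer stack (see Figure 4), which learns from the MSA representation and the pair representation and outputs updated versions of both. The amino acid sequence, MSA and structural templates can be generated from the input sequences of the protein complex. Preferably, the embodiments not only run multiple recycling iterations while generating the protein atomic structure but also simulate a three-dimensional cryo-EM density map for the generated structural model and add the simulated map as an input to the cryo-EM density map branch, so that the model is optimized iteratively.
The cryo-EM density map branch takes a cryo-EM density map of shape W×H×L as input. The map passes through a 3D convolutional layer, a batch normalization (BatchNormalization) layer, a rectified linear unit (ReLU) and a max pooling (MaxPooling) layer, and is then fed in turn through four 3D residual convolution modules (ResBlock). Finally, a 3D convolutional layer with kernel size 1 maps the high-dimensional features to a low-dimensional density map representation, the first density map representation. The MaxPooling layer halves each of the three spatial dimensions of the density map, giving a shape of W/2×H/2×L/2. The four residual modules share the same architecture; the second has a stride of 2 and the other three have a stride of 1. After the second residual module the feature map therefore has shape W/4×H/4×L/4. Finally, the 3D convolutional layer with kernel size 1 maps the feature map of the density map to features of dimension 384.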
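The shape bookkeeping of the density map branch can be checked with a short sketch. It only traces shapes and contains no learned layers. The function name `density_branch_shapes` is hypothetical, and the sketch assumes that W, H and L are divisible by 4 and that the stem convolution preserves the grid size, neither of which is stated explicitly in the text.

```python
from typing import List, Tuple


def density_branch_shapes(w: int, h: int, l: int,
                          out_channels: int = 384) -> List[Tuple[str, Tuple[int, ...]]]:
    """Trace the feature-map shape through the density map branch.

    Assumption: the stem convolution keeps the grid size, so only the
    max pooling and the stride-2 residual block change the spatial shape.
    """
    # Stem: 3D conv -> BatchNorm -> ReLU -> MaxPooling; pooling halves each axis.
    shape = (w // 2, h // 2, l // 2)
    trace = [("stem+maxpool", shape)]
    # Four residual blocks with the same architecture; only the second has stride 2.
    for index, stride in enumerate((1, 2, 1, 1), start=1):
        shape = tuple(d // stride for d in shape)
        trace.append((f"resblock{index}", shape))
    # The kernel-size-1 convolution keeps the grid and sets the channel count.
    trace.append(("conv1x1x1", shape + (out_channels,)))
    return trace


for name, shape in density_branch_shapes(64, 64, 64):
    print(name, shape)
```

For a 64×64×64 input the trace ends with a 16×16×16 grid carrying 384 features per voxel, which matches the W/4×H/4×L/4 shape described above.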
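The Cryoformer stack comprises multiple (for example, 8) Cryoformers. The Cryoformer is the key module of CryoFold; each Cryoformer comprises N_enc encoders and N_dec decoders. The encoder learns global three-dimensional spatial information about the amino acids from the cryo-EM density map. The decoder learns the matching between the MSA representation and the three-dimensional spatial information of the density map, that is, it cross-fuses the MSA representation output by the sequence branch with the three-dimensional spatial information output by the encoder.
The Cryoformer encoder (encoder for short) takes as input the output of the three-dimensional residual neural network applied to the cryo-EM density map, that is, the first density map representation. This is flattened, added to the three-dimensional positional encoding of the density map, and input to a self-attention module (Multi-Head Self-Attention module). It then passes in turn through a first LayerNorm layer, a linear layer (Linear) and a second LayerNorm layer to produce a new density map representation (density representation), the second density map representation, as shown in Figure 3. The self-attention module allows direct information exchange between all voxels of the density map, yielding a global representation based on the semantic features and three-dimensional positional information of the whole map. This further improves the recognition of amino acid types, secondary structure, the protein backbone, protein topology, inter-domain interactions and the orientation of the whole protein complex from the density map. The second density map representation serves as the input to the next encoder. The N_enc encoders have identical structures but do not share parameters.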
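The Cryoformer decoder (decoder for short) takes as input the output of the sequence branch, the output of the encoder and the three-dimensional positional encoding of the density map, and generates a crossed single representation through a self-attention module, a cross-attention module, LayerNorm layers and linear layers, as shown in Figure 4. To preserve the useful evolutionary information learned from the multiple sequences, each decoder layer passes the MSA representation and the pair representation output by the Evoformer stack through separate linear layers, adds each to the crossed single representation, passes each sum through its own LayerNorm layer, and adds the results to form a new single representation. The new single representation thus fuses the amino acid embedding, the MSA representation, the pair representation and the crossed single representation output by the previous decoder. Next, the new single representation is input to the self-attention module in the form of three variables, Q_s, K_s and V_s, where Q_s and K_s are both the sum of the new single representation and the amino acid embedding, and V_s is the new single representation. The amino acid embedding is added into Q_s and K_s to strengthen the representation of amino acid type. Specifically, the output of the Evoformer stack passes through a LayerNorm layer and is added to the amino acid embedding, and the result is added to Q_s and to K_s. The output of the self-attention module, the amino acid embedding, the output of the encoder and the three-dimensional positional encoding of the density map are input together to the cross-attention module, which matches cryo-EM density map features with sequence features.
The cross-attention module in the Cryoformer decoder is the key to matching the sequence-related representations (including the MSA representation and the pair representation) with the three-dimensional spatial information of the density map in the neural network space. The cross-attention module takes three variables, Q_c, K_c and V_c, as input, where Q_c is the sum of the self-attention module output and the amino acid embedding, K_c is the sum of the density map representation output by the encoder and the three-dimensional positional encoding of the density map, and V_c is the density map representation output by the encoder. Through the cross-attention module, the sequence-related representations are fused with the three-dimensional spatial information from the three-dimensional cryo-EM density map. This gives each amino acid in the sequence a source of three-dimensional coordinate information, so that the all-atom coordinates of the final protein form an atomic model based on the cryo-EM density map. The output of the cross-attention module is added to the output of the self-attention module and input to a LayerNorm layer (the third LayerNorm layer). The output of the third LayerNorm layer is processed by a linear layer and added to the output of the third LayerNorm layer, then input to a fourth LayerNorm layer, which outputs the new crossed single representation.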
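To make the roles of Q_c, K_c and V_c concrete, the following is a minimal single-head sketch of scaled dot-product cross-attention in plain Python. It is an illustration under stated assumptions, not the patented implementation: the learned projection matrices and the multi-head split of a real attention module are omitted, and the names `cross_attention`, `add_rows` and `softmax` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def add_rows(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(self_attn_out: Matrix, aa_embedding: Matrix,
                    density_repr: Matrix, pos_encoding: Matrix) -> Matrix:
    """Single-head cross-attention from residues (rows of Q) to voxels (rows of K, V)."""
    q = add_rows(self_attn_out, aa_embedding)   # Q_c: self-attention output + amino acid embedding
    k = add_rows(density_repr, pos_encoding)    # K_c: density representation + 3D positional encoding
    v = density_repr                            # V_c: density representation
    scale = 1.0 / math.sqrt(len(q[0]))
    output = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Each residue receives a weighted average of the voxel features.
        output.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                       for c in range(len(v[0]))])
    return output
```

Each output row is a convex combination of voxel features, which is how a residue can pick up positional information from the density map.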
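The CryoFold network model is trained end to end with multiple loss functions. The tasks associated with these loss functions include density-map-based amino acid type recognition, density-map-based secondary structure type recognition, density-map-based amino acid semantic segmentation, masked MSA recognition, residue distance prediction, all-atom coordinate regression, side-chain torsion angle prediction, inter-atomic clash prediction, and so on. Correspondingly, amino acid type recognition uses the cross-entropy loss L_CLS; secondary structure type recognition uses the cross-entropy loss L_SS; amino acid semantic segmentation uses the cross-entropy loss L_seg; masked MSA recognition uses the cross-entropy loss L_MSA; residue distance prediction uses the cross-entropy loss L_dist; all-atom coordinate regression uses the Frame Aligned Point Error (FAPE) loss L_FAPE and the root-mean-square error loss L_RMSD; protein backbone frame prediction uses the FAPE loss L_FAPE-BF and the root-mean-square error loss L_RMSD-BF; side-chain torsion angle prediction uses the loss L_angle; inter-atomic clash prediction uses the loss L_clash; and a correlation loss L_density between the density map simulated from the predicted structure and the input density map is also used.
The expressions of the above loss functions can be obtained from the prior art and are not repeated here.
CryoFold training is divided into three stages. The first stage aims to learn the masked MSA features, so the masked-MSA cross-entropy loss has a large weight and the two root-mean-square error losses have small weights. Empirically, the weight of the masked-MSA cross-entropy loss is 160, and the weights of the backbone-frame and all-atom coordinate regression root-mean-square error losses are 0.1. In addition, to keep training stable, the weights of the side-chain torsion angle loss, the inter-atomic clash loss and the density map correlation loss are all 0. As a reference criterion, this stage ends when the masked-MSA cross-entropy loss has decreased to a stable level. The total loss of the first stage is: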
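$$L = L_{\mathrm{CLS}} + L_{\mathrm{SS}} + L_{\mathrm{dist}} + L_{\mathrm{seg}} + 160\,L_{\mathrm{MSA}} + 0.1\,L_{\mathrm{RMSD}} + L_{\mathrm{FAPE}} + 0.1\,L_{\mathrm{RMSD\text{-}BF}} + L_{\mathrm{FAPE\text{-}BF}}$$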
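The second stage aims to train the positions of the backbone atom coordinates of the protein structure, so the weights of the two root-mean-square error losses are set to 1.0. The total loss of the second stage is: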
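$$L = L_{\mathrm{CLS}} + L_{\mathrm{SS}} + L_{\mathrm{dist}} + L_{\mathrm{seg}} + 160\,L_{\mathrm{MSA}} + L_{\mathrm{RMSD}} + L_{\mathrm{FAPE}} + L_{\mathrm{RMSD\text{-}BF}} + L_{\mathrm{FAPE\text{-}BF}}$$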
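The third stage aims at training for accurate prediction of the all-atom structure, so three loss functions are added: side-chain torsion angle prediction, inter-atomic clashes and density map correlation. Empirically, their weights are 1.0, 0.1 and 1.0 respectively. The total loss of the third stage is: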
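$$L = L_{\mathrm{CLS}} + L_{\mathrm{SS}} + L_{\mathrm{dist}} + L_{\mathrm{seg}} + 160\,L_{\mathrm{MSA}} + L_{\mathrm{RMSD}} + L_{\mathrm{FAPE}} + L_{\mathrm{RMSD\text{-}BF}} + L_{\mathrm{FAPE\text{-}BF}} + L_{\mathrm{angle}} + 0.1\,L_{\mathrm{clash}} + L_{\mathrm{density}}$$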
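The three stage-wise totals differ only in their weights, so they can be expressed as one weighted sum with a per-stage weight table. The sketch below encodes the weights given above; the dictionary keys and the function name `total_loss` are hypothetical labels chosen for illustration, and the individual loss values are assumed to be computed elsewhere.

```python
from typing import Dict

# Weights shared by all three stages.
_BASE = {"cls": 1.0, "ss": 1.0, "dist": 1.0, "seg": 1.0,
         "msa": 160.0, "fape": 1.0, "fape_bf": 1.0}

# Per-stage weights for the remaining loss terms.
STAGE_WEIGHTS: Dict[int, Dict[str, float]] = {
    1: {**_BASE, "rmsd": 0.1, "rmsd_bf": 0.1, "angle": 0.0, "clash": 0.0, "density": 0.0},
    2: {**_BASE, "rmsd": 1.0, "rmsd_bf": 1.0, "angle": 0.0, "clash": 0.0, "density": 0.0},
    3: {**_BASE, "rmsd": 1.0, "rmsd_bf": 1.0, "angle": 1.0, "clash": 0.1, "density": 1.0},
}


def total_loss(losses: Dict[str, float], stage: int) -> float:
    """Weighted sum of the individual loss terms for the given training stage."""
    weights = STAGE_WEIGHTS[stage]
    return sum(weight * losses[name] for name, weight in weights.items())


# Example: with every individual loss equal to 1.0, the total equals the sum of the weights.
unit_losses = {name: 1.0 for name in STAGE_WEIGHTS[3]}
for stage in (1, 2, 3):
    print(stage, round(total_loss(unit_losses, stage), 3))
```

Keeping the zero-weight terms in the table for stages 1 and 2 makes the switch between stages a change of one lookup key rather than a change of code path.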
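By way of example, training uses 24 NVIDIA A100 GPUs with 40 GB of memory each, and the three stages take 3 days, 7 days and 30 days respectively. Adam is used as the optimizer with an initial learning rate of 0.001, and the learning rate is decayed stepwise by one order of magnitude every 10,000 steps.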
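The stepwise learning-rate schedule described above can be written as a one-line function. This is a sketch of the schedule only; the function name `learning_rate` and its parameter names are hypothetical, and how the schedule is wired into the optimizer is not specified in the text.

```python
def learning_rate(step: int, base_lr: float = 1e-3,
                  decay_every: int = 10_000, factor: float = 0.1) -> float:
    """Step decay: multiply the rate by `factor` once per `decay_every` steps."""
    return base_lr * factor ** (step // decay_every)


for step in (0, 9_999, 10_000, 25_000):
    print(step, learning_rate(step))
```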
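The cryo-EM density map processing method of the embodiments uses the CryoFold model described above to process a cryo-EM density map and obtain intermediate or final outputs of the deep neural network. Without loss of generality, the embodiments also provide a neural-network-based cryo-EM protein model building method that uses the deep neural network to process a cryo-EM density map and obtain an atomic model of the corresponding protein complex structure. Both methods further include data processing and model training before CryoFold is used.
The data processing and model training procedures are described below by way of example.
Three-dimensional cryo-EM density maps and the corresponding published atomic models are collected from the EMDB and PDB databases respectively. Samples are filtered out in the following cases:
the release date is after a specified date;
the resolution of the PDB structure is greater than [...];
the reconstruction method is not single-particle analysis (SPA) cryo-EM;
there is no valid protein sequence, where a valid protein sequence is defined as one at least 25 amino acids long with fewer than 30% unknown residues;
the correlation coefficient between the density map and the atomic model is less than 0.5.
The experimentally determined dataset consists of 9150 three-dimensional cryo-EM density maps. In this embodiment, 20 density maps containing no protein molecules, 30 density maps with multiple associated atomic structures, and 123 density maps that showed poor structural agreement with the atomic structure during manual inspection were removed. After this process, 8977 density maps remained. For some symmetric proteins, the .cif file from the PDB contains the atomic coordinates of only one asymmetric unit. ChimeraX is therefore used to apply the symmetry operations (_pdbx_struct_oper_list) to that one unit, giving a .pdb file that contains all atomic coordinates.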
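The sequence-validity criterion in the filter list (at least 25 amino acids, fewer than 30% unknown residues) is simple enough to sketch directly. The function name `is_valid_sequence` is hypothetical, and the sketch assumes unknown residues are written as the one-letter code `X`, which the text does not specify.

```python
def is_valid_sequence(sequence: str, min_length: int = 25,
                      max_unknown_fraction: float = 0.30,
                      unknown_code: str = "X") -> bool:
    """Return True if the sequence is long enough and has few enough unknown residues.

    Assumption: unknown residues are encoded with `unknown_code` ("X").
    """
    if len(sequence) < min_length:
        return False
    unknown_fraction = sequence.count(unknown_code) / len(sequence)
    # "Fewer than 30%" is a strict inequality.
    return unknown_fraction < max_unknown_fraction


print(is_valid_sequence("A" * 25))             # long enough, no unknown residues
print(is_valid_sequence("A" * 24))             # too short
print(is_valid_sequence("A" * 70 + "X" * 30))  # exactly 30% unknown, so rejected
```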
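Because the original cryo-EM density map is usually much larger than the bounding box of the structural model, Phenix.map_box is first used to crop the map around the structural model (with the asymmetric unit applied), which removes density outside the model and reduces the map size. The cryo-EM density map is then resampled by spline interpolation to a fixed voxel size of 0.6667. The density values are then normalized to [0, 2] based on interval partitioning. All density map samples are saved as .mrc files. The processing of the cryo-EM density maps is shown in Figure 5(a).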
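Resampling to a fixed voxel size changes the grid dimensions in proportion to the ratio of voxel sizes. The sketch below computes only the target grid shape; it does not perform the spline interpolation itself. The function name `resampled_shape` is hypothetical, and rounding to the nearest integer is an assumption, since the text does not say how fractional sizes are handled.

```python
from typing import Tuple


def resampled_shape(shape: Tuple[int, ...], voxel_size: float,
                    target_voxel_size: float = 0.6667) -> Tuple[int, ...]:
    """Grid shape that covers the same physical extent at the target voxel size."""
    # Physical extent per axis is n * voxel_size; divide by the new voxel size.
    return tuple(max(1, round(n * voxel_size / target_voxel_size)) for n in shape)


print(resampled_shape((100, 100, 100), 1.3334))  # voxels twice as large as the target
print(resampled_shape((128, 128, 128), 1.0))
```

A map with voxels twice the target size doubles its grid along each axis, so memory use grows by roughly a factor of eight, which is one reason cropping comes before resampling.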
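Sequence data are processed as follows. The first step in running CryoFold is to process the input sequences, which takes one or more sequences as input and produces input features. This embodiment uses the AlphaFold2 data pipeline to generate features for the sequence of each chain, and all sequences from the 8977 structural models are processed as described below. The data processing pipeline consists of the following steps:
Search the sequence databases for multiple sequence alignments (MSA). HHblits is used to search the BFD and UniRef30 (version 2020_02) databases. JackHMMER is used to search the UniRef90, MGnify, Metaeuk and MGY databases. Homologous sequences from the different sources are ranked by similarity to the query sequence, and duplicate sequences are removed from the MSA.
Search PDB70 for homologous templates. hhsearch is run with the UniRef90 MSA profile as input to search the PDB70 database. Once the PDB ID and chain ID are obtained, the corresponding mmCIF file is fetched from a pre-prepared local PDB database. The canonical chain sequence is aligned with the mmCIF residues to parse the 3D atomic coordinates (shape [chain length, 37, 3]) and masks (shape [chain length, 37]). The template residue types, atom positions and atom masks are extracted as template features for the subsequent analysis. At most 20 templates are kept.
Combine multiple chains: if a sample contains multiple chains, the features from each chain are combined. Features whose first dimension is the chain length are concatenated directly; these include aatype, residual_index, between_segment_residues, seq_length, sequence and num_alignments. Features whose first dimension is the sequence index, namely msa and deletion_matrix_int, are first zero-padded along the first dimension to the maximum number of sequences over all chains. These two features from the different chains are then concatenated along the second (chain length) dimension. Template features are handled like the MSA features: the first dimension, the number of templates (≤20), is first padded to the maximum number of templates over all chains, followed by concatenation along the chain-length dimension. After that, all features from the 8977 samples are saved as compressed pickle files.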
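The pad-then-concatenate step for the `msa` feature can be sketched with nested lists. This is an illustration under stated assumptions rather than the pipeline's actual code: each chain's MSA is taken to be a list of integer-encoded rows of equal length, the padding value is assumed to be 0, and the function name `combine_msa_features` is hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def combine_msa_features(chain_msas: List[Matrix], pad_value: int = 0) -> Matrix:
    """Pad each chain's MSA to a common depth, then join along the chain-length axis."""
    # Common depth: the largest number of aligned sequences over all chains.
    depth = max(len(msa) for msa in chain_msas)
    combined: Matrix = [[] for _ in range(depth)]
    for msa in chain_msas:
        chain_length = len(msa[0])
        # Rows of padding for chains with fewer aligned sequences.
        padding = [[pad_value] * chain_length for _ in range(depth - len(msa))]
        for row, extra in zip(combined, msa + padding):
            row.extend(extra)
    return combined


chain_a = [[1, 2], [3, 4]]   # 2 sequences, chain length 2
chain_b = [[5, 6, 7]]        # 1 sequence, chain length 3
print(combine_msa_features([chain_a, chain_b]))
```

The result has one row per sequence index and one column per residue of the concatenated complex; the same pattern applies to `deletion_matrix_int` and, with the template count as the first axis, to the template features.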
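Splitting into training and validation sets: the purpose is to split the 8977 density maps into a training dataset and a test dataset with low homology between them. First, the 40% sequence identity clustering file is downloaded from the RCSB PDB database. The clustering file defines many clusters of chains whose sequence identity is above 40%. To build the test set, a PDB model (possibly containing multiple chains) is repeatedly sampled at random from all samples, and every other PDB model that has a chain with more than 40% sequence identity to the sampled model is also added to the test set. This is repeated until the test set reaches 317 samples. Finally, the training set consists of 8660 density map and PDB pairs used to train the CryoFold model, and the test set consists of 317 density map and PDB pairs used to evaluate it, as shown in Figure 6.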
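The cluster-aware sampling procedure can be sketched as follows. The sketch assumes the clustering file has already been parsed into a mapping from each PDB model to the set of cluster identifiers of its chains; that mapping, the function name `split_by_cluster` and the fixed random seed are all hypothetical. As in the description, a sampled model pulls in only the models that directly share a cluster with it, so the final test set can slightly exceed the requested size.

```python
import random
from typing import Dict, Set, Tuple


def split_by_cluster(model_clusters: Dict[str, Set[int]], test_size: int,
                     seed: int = 0) -> Tuple[Set[str], Set[str]]:
    """Split models into (train, test) so sampled test models bring their homologs along."""
    rng = random.Random(seed)
    order = sorted(model_clusters)
    rng.shuffle(order)
    test: Set[str] = set()
    for model in order:
        if len(test) >= test_size:
            break
        if model in test:
            continue
        test.add(model)
        sampled_clusters = model_clusters[model]
        # Add every model that shares at least one chain cluster with the sampled model.
        for other, other_clusters in model_clusters.items():
            if other_clusters & sampled_clusters:
                test.add(other)
    train = set(model_clusters) - test
    return train, test


clusters = {"a": {1}, "b": {1, 2}, "c": {3}, "d": {4}}
train, test = split_by_cluster(clusters, test_size=1)
print(sorted(train), sorted(test))
```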
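Data augmentation: to improve the performance of the CryoFold model, the embodiments apply three augmentation methods to the training set. The first is the EMPIAR downsampled dataset, in which the two-dimensional particle images of each EMDB density map are downsampled to reconstruct multiple density maps at lower resolution. The second is the low-pass filtered dataset, in which high-resolution (better than [...]) density maps in EMDB (Electron Microscopy Data Bank) are low-pass filtered to multiple resolution levels. The third is the simulated dataset, in which cryo-EM density maps are simulated for protein complexes in the PDB dataset that have no density map.
The steps for building the EMPIAR (Electron Microscopy Public Image Archive) downsampled dataset are shown in Figure 5(b). 88 image datasets were extracted from EMPIAR, and the data processing procedure for each was reproduced; these procedures can be obtained from the related papers in the archive. In this embodiment a total of 112 density maps were reconstructed, with particle image counts ranging from 14,262 to 730,118.
The particle images were resampled multiple times, and a new density map was reconstructed from each subset. These density maps all share the atomic structure model of the original density map.
In this embodiment, density maps with a resolution lower than [...] were discarded, and a dataset of 19,887 density maps and 112 atomic structures was compiled.
Low-pass filtered dataset: a low-pass filter is applied to high-resolution (better than [...]) density maps in EMDB. The embodiment uses the low-pass filtering method in RELION (a cryo-EM three-dimensional reconstruction package) with self-defined parameters to low-pass filter the high-resolution data, including filtering with different thresholds, to generate a large amount of low-resolution data, which is then cropped and resampled to the target voxel size, as shown in Figure 5(c).
Simulated dataset: in the PDB, about 90% of protein complex structures were obtained by X-ray methods, and most of them have no cryo-EM density map. To train on this large amount of labeled data, more than 100,000 cryo-EM density maps were simulated from these PDB entries, with multiple simulations across a range of resolutions, yielding a large amount of simulated density map data, as shown in Figure 5(d).
In the end, the four datasets together comprise more than 500,000 samples, forming a large multi-resolution dataset of complexes, as shown in Figure 5(e).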
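After training, the CryoFold model can infer an all-atom model directly from an input cryo-EM density map and protein complex sequence. Figure 7 shows the protein complex structure model that CryoFold built from an experimentally obtained cryo-EM density map (EMD-7770). As the figure shows, the CryoFold result agrees very well with the structure published in the PDB (PDB: 6cvm), and the side chains also fit the density map very well.
As shown in Figure 8, on the benchmark dataset of 317 protein complexes, the data points to the left of the gray dashed line are high-resolution samples (138 samples), denoted the high-resolution dataset, and those to the right are lower-resolution samples (179 samples), denoted the low-resolution dataset. CryoFold performs well on both TM-score, a metric that requires template alignment, and Chain-match, a metric that does not. On low-resolution cryo-EM density maps, the average TM-score of CryoFold predictions is 0.91; on high-resolution maps it is 0.95. On the Chain-match metric, the average scores for the high-resolution and low-resolution datasets are 0.92 and 0.87 respectively. The root-mean-square error of the Cα atoms of the protein backbone built by CryoFold is [...] on the high-resolution dataset and [...] on the low-resolution dataset. These results show that CryoFold can accurately build atomic models of protein complexes from low-resolution cryo-EM density maps.
On the low-resolution cryo-EM density map dataset above, CryoFold was compared with the commonly used methods Phenix, DeepTracer and ModelAngelo. The results in Figure 9 show that CryoFold outperforms the other methods. On the Chain-match (chain matching score) metric, CryoFold averages 0.87, compared with 0.03 for Phenix and 0.41 for ModelAngelo. On Seq-match (sequence matching score: an amino acid of the target type lies within a sphere of radius [...]), which assesses the accuracy of amino acid types, CryoFold also outperforms these methods, with an average Seq-match of 0.94, far above Phenix at 0.05, ModelAngelo at 0.43 and DeepTracer at 0.40.
For the comparison with AlphaFold-Multimer, the embodiment tested 174 protein complexes with fewer than 2500 residues. As shown in Figure 10, CryoFold outperforms AlphaFold-Multimer on all metrics, including Chain-match, TM-score and GDT-TS. CryoFold averages 0.85 on Chain-match, 0.87 on TM-score and 0.73 on GDT-TS, whereas AlphaFold-Multimer averages 0.36, 0.57 and 0.31. The scatter plot in Figure 11 shows that the cryo-EM density map greatly improves the accuracy with which CryoFold builds protein complex structures. CryoFold's accuracy varies with the resolution of the density map: the higher the resolution, the higher the accuracy. Although AlphaFold2 can accurately predict the structures of most single-chain proteins, there is still considerable room for improvement in its prediction of protein complexes. By feeding the cryo-EM density map and the sequence into the neural network together, CryoFold can accurately build atomic models of protein complexes.
Figure 12 shows the results of CryoFold and AlphaFold-Multimer on one example (EMD: 20552, [...]). The protein complex structure (PDB ID: 6q0t) has 5 protein chains and 1322 modeled residues, and the resolution of the cryo-EM density map is [...]. The AlphaFold-Multimer and CryoFold predictions score 0.104 and 0.791 on TM-score, and 0.400 and 0.783 on Chain-match. CryoFold thus shows a large advantage over AlphaFold-Multimer.
Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.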

Claims (12)

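  1. A computer storage medium, characterized in that it stores a deep neural network,
    the deep neural network comprises a Cryoformer stack, and the Cryoformer stack comprises multiple cryo-transformer modules (Cryoformers);
    each Cryoformer comprises an encoder and a decoder;
    the decoder is used to learn the matching between sequence-related representations and the three-dimensional spatial information of the cryo-EM density map, and to cross-fuse the sequence-related representations with the three-dimensional spatial information output by the encoder.
  2. The computer storage medium according to claim 1, characterized in that
    the decoder takes as input the output of the sequence branch of the deep neural network, the output of the encoder, and the three-dimensional positional encoding of the cryo-EM density map, and generates a crossed single representation through a self-attention module and a cross-attention module.
  3. The computer storage medium according to claim 2, characterized in that
    each Cryoformer comprises N_enc encoders and N_dec decoders;
    the sequence-related representations comprise an MSA (multi-sequence) representation and an amino acid pair representation;
    each decoder passes the MSA representation and the pair representation through separate linear layers, adds each to the crossed single representation, passes each sum through its own LayerNorm layer, and adds the results to form a new single representation;
    the new single representation is input to the self-attention module;
    the output of the self-attention module, the amino acid embedding, the output of the encoder, and the three-dimensional positional encoding of the density map are input together to the cross-attention module, which matches cryo-EM density map features with sequence features.
  4. The computer storage medium according to claim 3, characterized in that
    the cross-attention module takes three variables, Q_c, K_c and V_c, as input, where Q_c is the sum of the self-attention module output and the amino acid embedding, K_c is the sum of the density map representation output by the encoder and the three-dimensional positional encoding of the density map, and V_c is the density map representation output by the encoder.
  5. The computer storage medium according to claim 4, characterized in that
    the output of the cross-attention module is added to the output of the self-attention module and input to a third LayerNorm layer; the output of the third LayerNorm layer is processed by a linear layer and added to the output of the third LayerNorm layer, then input to a fourth LayerNorm layer, which outputs a new crossed single representation.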
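  6. The computer storage medium according to any one of claims 1-5, characterized in that the deep neural network adopts the cryo-folding model CryoFold, which contains the Cryoformer,
    CryoFold comprises the sequence branch, which learns evolution-related sequence representations from protein sequences, including the MSA representation and the amino acid pair representation.
  7. The computer storage medium according to claim 6, characterized in that
    the sequence branch comprises an encoding module and an embedding learning module,
    the encoding module encodes the amino acid sequence, the multiple sequence alignment (MSA) and the structural templates;
    the embedding learning module performs embedding learning on the encoded amino acid sequence, MSA and structural templates to generate the MSA representation and the amino acid pair representation;
    the sequence branch further comprises an Evoformer stack, which learns from the MSA representation and the amino acid pair representation and outputs a new MSA representation and a new amino acid pair representation.
  8. The computer storage medium according to any one of claims 1-5, characterized in that the deep neural network adopts the cryo-folding model CryoFold, which contains the Cryoformer,
    CryoFold comprises a cryo-EM density map branch, which comprises a three-dimensional residual neural network used to map high-dimensional features to a low-dimensional density map representation.
  9. The computer storage medium according to claim 8, characterized in that
    the cryo-EM density map branch takes the cryo-EM density map as input, passes it through a 3D convolutional layer, a batch normalization layer, a rectified linear unit (ReLU) and a max pooling layer, feeds the result in turn through four 3D residual convolution modules, and then outputs it after processing by a 3D convolutional layer.
  10. A computer system, characterized in that it comprises:
    one or more processors and one or more non-transitory computer-readable media storing the deep neural network according to any one of claims 1-9, configured to process cryo-EM density maps.
  11. A cryo-EM density map processing method, characterized in that a cryo-EM density map is processed with the deep neural network according to any one of claims 1-9.
  12. A neural-network-based cryo-EM protein model building method, characterized in that it comprises:
    processing a cryo-EM density map with the deep neural network according to any one of claims 1-9 to obtain an atomic model of the corresponding protein complex structure.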
PCT/CN2023/074086 2022-12-05 2023-02-01 Neural-network-based cryo-EM protein model building method and storage medium WO2024119597A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211551972.0 2022-12-05
CN202211551972.0A CN116230071A (zh) 2022-12-05 2022-12-05 Neural-network-based cryo-EM protein model building method and storage medium

Publications (1)

Publication Number Publication Date
WO2024119597A1 true WO2024119597A1 (zh) 2024-06-13

Family

ID=86571937

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/074086 WO2024119597A1 (zh) 2022-12-05 2023-02-01 Neural-network-based cryo-EM protein model building method and storage medium

Country Status (2)

Country Link
CN (1) CN116230071A (zh)
WO (1) WO2024119597A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
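CN112767997A (zh) * 2021-02-04 2021-05-07 Qilu University of Technology Protein secondary structure prediction method based on a multi-scale convolutional attention neural network
CN113990384A (zh) * 2021-08-12 2022-01-28 Tsinghua University Deep-learning-based method, system and application for building cryo-EM atomic model structures
CN114503203A (zh) * 2019-12-02 2022-05-13 DeepMind Technologies Limited Protein structure prediction from amino acid sequences using self-attention neural networks
US20220189579A1 (en) * 2020-12-14 2022-06-16 University Of Washington Protein complex structure prediction from cryo-electron microscopy (cryo-em) density maps
CN115083513A (zh) * 2022-06-21 2022-09-20 Huazhong University of Science and Technology Method for constructing protein complex structures from medium-resolution cryo-EM maps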
US20220375538A1 (en) * 2021-05-11 2022-11-24 International Business Machines Corporation Embedding-based generative model for protein design

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
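CN114503203A (zh) * 2019-12-02 2022-05-13 DeepMind Technologies Limited Protein structure prediction from amino acid sequences using self-attention neural networks
US20220189579A1 (en) * 2020-12-14 2022-06-16 University Of Washington Protein complex structure prediction from cryo-electron microscopy (cryo-em) density maps
CN112767997A (zh) * 2021-02-04 2021-05-07 Qilu University of Technology Protein secondary structure prediction method based on a multi-scale convolutional attention neural network
US20220375538A1 (en) * 2021-05-11 2022-11-24 International Business Machines Corporation Embedding-based generative model for protein design
CN113990384A (zh) * 2021-08-12 2022-01-28 Tsinghua University Deep-learning-based method, system and application for building cryo-EM atomic model structures
CN115083513A (zh) * 2022-06-21 2022-09-20 Huazhong University of Science and Technology Method for constructing protein complex structures from medium-resolution cryo-EM maps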

Also Published As

Publication number Publication date
CN116230071A (zh) 2023-06-06

Similar Documents

Publication Publication Date Title
Lai et al. Fast and accurate image super-resolution with deep laplacian pyramid networks
CN113593631A (zh) Method and system for predicting protein-peptide binding sites
Qu et al. The algorithm of concrete surface crack detection based on the genetic programming and percolation model
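CN114333986A (zh) Method and apparatus for model training, drug screening and affinity prediction
CN112560966B (zh) Polarimetric SAR image classification method, medium and device based on a scattering graph convolutional network
CN111667880A (zh) Protein residue contact map prediction method based on a deep residual neural network
CN113052955A (zh) Point cloud completion method, system and application
WO2022188643A1 (zh) Molecular structure reconstruction method, apparatus, device, storage medium and program product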
Gudyś et al. QuickProbs 2: towards rapid construction of high-quality alignments of large protein families
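WO2024119597A1 (zh) Neural-network-based cryo-EM protein model building method and storage medium
CN113611354A (zh) Protein torsion angle prediction method based on a lightweight deep convolutional network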
Thom et al. Rapid exact signal scanning with deep convolutional neural networks
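CN116978464A (zh) Data processing method, apparatus, device and medium
CN116189776A (zh) Deep-learning-based antibody structure generation method
CN115731412A (zh) Image classification method and apparatus based on a group-equivariant attention neural network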
Liu et al. Wang-Landau sampling in face-centered-cubic hydrophobic-hydrophilic lattice model proteins
Zhang et al. RandAlign: A Parameter-Free Method for Regularizing Graph Convolutional Networks
Liu et al. Large set microstructure reconstruction mimicking quantum computing approach via deep learning
CN116030883A (zh) Protein structure prediction method, apparatus, device and storage medium
Murtaza et al. Investigating the performance of deep learning methods for Hi-C resolution improvement
Fang et al. Cross knowledge distillation for image super-resolution
Zhang et al. Face super-resolution with progressive embedding of multi-scale face priors
Li et al. Multi-scale cross-fusion for arbitrary scale image super resolution
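CN116741260B (zh) Antibody structure optimization method and apparatus based on a deep learning model
CN114580603B (zh) Method for constructing a single-particle-level energy surface from cryo-EM data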