CN117292743A

CN117292743A - Method, apparatus, medium and program product for predicting protein complex structure

Info

Publication number: CN117292743A
Application number: CN202311262785.5A
Authority: CN
Inventors: 许锦波
Original assignee: Beijing Molecular Heart Technology Co ltd
Current assignee: Beijing Molecular Heart Technology Co ltd
Priority date: 2022-09-05
Filing date: 2023-04-20
Publication date: 2023-12-26
Also published as: CN116206675A; CN116206675B

Abstract

It is an object of the present application to provide a method, apparatus, medium and program product for predicting the structure of a protein complex. The method comprises: querying a protein sequence database to obtain all single-chain MSAs of a target protein complex, wherein each single-chain MSA corresponds to one component chain of the target protein complex; matching protein sequences within all single-chain MSAs based on a protein language model to produce an MSA of the target protein complex; and inputting the MSA of the target protein complex into a deep learning model to obtain a predicted structure of the target protein complex, thereby effectively improving the accuracy and computational efficiency of protein complex structure prediction.

Description

Method, apparatus, medium and program product for predicting protein complex structure

The present application is a divisional application of a method, an apparatus, a medium and a program product for predicting protein complex structure (application number: 202310431117.4, application date: 2023.04.20)

Priority of case CN202211078421.7 (application date 2022-09-05)

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a technique for predicting protein complex structure.

Background

Most proteins function as protein complexes. Thus, obtaining an accurate protein complex structure is crucial for understanding how biological functions are achieved through interactions at the atomic level. In the prior art, a high-resolution protein structure can be obtained by experimental methods such as X-ray crystallography and cryo-electron microscopy, or the protein complex structure can be predicted by computational methods such as protein complex structure prediction (PCP) or protein-protein docking. Obtaining a protein structure by the experimental methods is costly and low-throughput, and requires a large amount of manual work to prepare samples for structure determination; the protein structures obtained by the foregoing computational methods tend to be limited in accuracy.

Disclosure of Invention

It is an object of the present application to provide a method, apparatus, medium and program product for predicting protein complex structure.

According to one aspect of the present application, there is provided a method for predicting the structure of a protein complex, the method comprising:

querying a protein sequence database to obtain all single-chain MSAs of a target protein complex, wherein each single-chain MSA corresponds to one component chain of the target protein complex;

matching protein sequences within all single-chain MSAs based on a protein language model to produce an MSA of the target protein complex;

inputting the MSA of the target protein complex into a deep learning model to obtain a predicted structure of the target protein complex.

According to one aspect of the present application there is provided a computer device for predicting protein complex structure comprising a memory, a processor and a computer program stored on the memory, characterised in that the processor executes the computer program to carry out the steps of any of the methods as described above.

According to one aspect of the present application there is provided a computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of any of the methods described above.

According to one aspect of the present application there is provided a computer program product comprising a computer program, characterized in that the computer program when executed by a processor implements the steps of any of the methods described above.

According to one aspect of the present application, there is provided an apparatus for predicting a protein complex structure, the apparatus comprising:

a first module for querying a protein sequence database to obtain all single-chain MSAs of a target protein complex, wherein each single-chain MSA corresponds to one component chain of the target protein complex;

a second module for matching protein sequences within all single-chain MSAs based on a protein language model to produce an MSA of the target protein complex;

and a third module for inputting the MSA of the target protein complex into a deep learning model to obtain a predicted structure of the target protein complex.

Compared with the prior art, the present application queries a protein sequence database to obtain all single-chain MSAs of the target protein complex, wherein each single-chain MSA corresponds to one component chain of the target protein complex; matches protein sequences within all single-chain MSAs based on a protein language model to produce an MSA of the target protein complex; and inputs the MSA of the target protein complex into a deep learning model to obtain a predicted structure of the target protein complex, thereby effectively improving the accuracy and computational efficiency of protein complex structure prediction.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:

FIG. 1 illustrates a flow chart of a method for predicting protein complex structure according to one embodiment of the present application;

FIG. 2 shows a flow chart for predicting protein complex structure according to one embodiment of the present application;

FIG. 3 shows a performance comparison table of the MSA pairing method ColAttn of the present solution and other MSA pairing methods according to one embodiment of the present application;

FIG. 4 shows a graph of predicted performance improvement versus difficulty for protein complex structures according to one embodiment of the present application;

FIG. 5 illustrates predicted performance graphs of three MSA pairing methods on different test sets according to one embodiment of the present application;

FIG. 6 shows a block diagram of an apparatus for predicting protein complex structure in accordance with one embodiment of the present application;

FIG. 7 illustrates an exemplary system that can be used to implement various embodiments described herein.

The same or similar reference numbers in the drawings refer to the same or similar parts.

Detailed Description

The present application is described in further detail below with reference to the accompanying drawings.

In one typical configuration of the present application, the terminal, the devices of the services network, and the trusted party each include one or more processors (e.g., central processing units (Central Processing Unit, CPU)), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory. Memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PCM), programmable random access memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

The device referred to in the present application includes, but is not limited to, a user device, a network device, or a device formed by integrating a user device and a network device through a network. The user device includes, but is not limited to, any mobile electronic product capable of human-computer interaction with a user (for example, through a touch pad), such as a smartphone or a tablet computer, and the mobile electronic product may run any operating system, such as Android or iOS. The network device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like. The network device includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud composed of multiple servers; here, the cloud is composed of a large number of computers or network servers based on cloud computing, which is a kind of distributed computing: a virtual supercomputer composed of a group of loosely coupled computers. The network includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks, VPNs, wireless ad hoc networks, and the like.
Preferably, the device may also be a program running on the user device, on the network device, or on a device formed by integrating, through a network, the user device and the network device, the network device and a touch terminal, or the like.

Of course, those skilled in the art will appreciate that the above-described devices are merely examples, and that other devices, now known or developed in the future, that are applicable to the present application are also intended to fall within its scope and are incorporated herein by reference.

PCP (protein complex structure prediction) is a fundamental and long-standing challenge in computational structural biology. The inventors have found that existing PCP methods are limited in accuracy. For ab initio protein-protein docking, where only the protein sequences of the single chains are given as input, PCP is more difficult, because auxiliary information about the unbound structures of the single chains and about the complex interface is not available. Deep learning has made substantial progress in a number of computational structural biology tasks, such as protein contact prediction, tertiary structure prediction, and cryo-electron microscopy structure determination. Among these, the recently released AlphaFold-Multimer has been shown to outperform previous protein complex structure prediction systems, e.g., ClusPro, a fast Fourier transform-based rigid protein-protein docking method. However, the accuracy of AlphaFold-Multimer is far from satisfactory compared with the accuracy of AlphaFold2 on monomer folding: its success rate is about 70%, and its average DockQ score (DockQ being a quality measure for protein-protein docking models) is about 0.6 (medium quality as judged by DockQ). The most important input feature of AlphaFold-Multimer is the multiple sequence alignment (MSA). In contrast to AlphaFold2, which takes the MSA of a single protein as input, AlphaFold-Multimer needs a joint MSA to be constructed for protein complex structure prediction. However, how to construct such a joint MSA remains an open problem for heteromers. It is necessary to identify the interacting homologs (interologs) across the MSAs corresponding to the component chains. For heteromers, however, the same species may have multiple sequences similar to the sequence of a component chain (i.e., paralogs), so the pairing of sequences for a heteromeric protein complex may be ambiguous.
In this application, the inventors have studied efficient algorithms for constructing a joint MSA for heteromers.

The application provides a simple and effective MSA pairing algorithm that uses the immediate output of a protein language model to construct a joint MSA. Column attention (ColAttn) is used to construct the joint MSA for protein complex structure prediction. Compared with non-pairing methods such as Block, baseline pairing methods such as Genome or AF-Multimer, and protein language model (PLM)-enhanced pairing methods such as InterLocalCos or InterGlobalCos, ColAttn achieved the best accuracy of protein complex structure prediction on three test sets (pConf70, pConf80 and DockQ49). In the tests, the inventors used the 5 models of AlphaFold-Multimer for structure prediction. When the best-scoring predicted structure among those output by the 5 models is considered, the pairing method ColAttn described herein scores 10.7%, 7.3% and 3.7% higher than the other pairing methods on the above three test sets, respectively. In addition, the inventors found that a hybrid strategy combining ColAttn with other pairing methods can significantly improve the prediction accuracy of protein complex structures compared with a single strategy. Further, the inventors analyzed the structure prediction performance of ColAttn on protein complexes from eukaryotes, bacteria and archaea; ColAttn performed best on eukaryotic test targets, for which interologs are difficult to identify. Furthermore, ColAttn performs better than the other pairing methods when one component chain of the protein complex is from a eukaryote and another is from a bacterium, which strongly indicates that the ColAttn pairing method is equally effective and robust for structure prediction of targets from different superkingdoms (i.e., the domains of the three-domain theory).

FIG. 1 shows a flow chart of a method for predicting protein complex structure according to one embodiment of the present application, the method comprising step S11, step S12, and step S13. In step S11, the device 1 queries a protein sequence database to obtain all single-chain MSAs of the target protein complex, wherein each single-chain MSA corresponds to one component chain of the target protein complex; in step S12, the device 1 matches protein sequences within all single-chain MSAs based on a protein language model to generate the MSA of the target protein complex; in step S13, the device 1 inputs the MSA of the target protein complex into a deep learning model to obtain a predicted structure of the target protein complex.

In step S11, the device 1 queries a protein sequence database to obtain all single-chain MSAs of the target protein complex, wherein each single-chain MSA corresponds to one component chain of the target protein complex. In some embodiments, the device 1 includes, but is not limited to, a user device with information processing or computing capabilities, such as a tablet, a computer, or a server, which may be used for protein complex structure prediction. In some embodiments, the device 1 may use a biological sequence analysis tool to search a protein sequence database with the protein sequence of each component chain of the target protein complex and generate the single-chain MSA corresponding to that component chain. In some embodiments, the biological sequence analysis tool includes, but is not limited to, JackHMMER, BLAST, and FASTA. The protein sequence database includes, but is not limited to, the UniProt database, the InterPro database, and the GENATLAS database. In some embodiments, the single-chain MSA comprises one or more protein sequences from different species that match the corresponding component chain.

In step S12, the device 1 matches protein sequences within all single-chain MSAs based on the protein language model to generate the MSA of the target protein complex. In some embodiments, because the protein language model can capture the biological constraints and co-evolution information encoded in protein sequences, the device 1 may use the protein language model to identify and match the protein sequences in all single-chain MSAs to obtain the corresponding complex homologous sequences (i.e., interologs), and thereby determine the MSA of the target protein complex based on the complex homologous sequences.

In step S13, the device 1 inputs the MSA of the target protein complex into a deep learning model to obtain a predicted structure of the target protein complex. In some embodiments, the deep learning model includes, but is not limited to, the AlphaFold-Multimer model described previously, as well as other existing or future deep learning models that use a protein complex MSA for structure prediction.

In some embodiments, the step S12 includes: the device 1 groups the protein sequences in the single-chain MSAs according to species information and constructs the complex homologous sequences corresponding to each species group, wherein a complex homologous sequence is formed by concatenating protein sequences that have the same rank within the same species group and come from different single-chain MSAs, and each species group may comprise zero or more complex homologous sequences; all complex homologous sequences together constitute a joint MSA, wherein the joint MSA is the MSA of the target protein complex.

In some embodiments, referring to the flow chart for predicting the structure of a protein complex shown in FIG. 2, and taking a target protein complex that is a heterodimer as an example, the device 1 uses JackHMMER to query the UniProt database with the two component chains of the heterodimer (protein sequence A and protein sequence B), thereby obtaining the single-chain MSAs corresponding to protein sequence A and protein sequence B, respectively. The device 1 may group the protein sequences in the single-chain MSAs by species; for example, referring to FIG. 2, the protein sequences in the single-chain MSAs corresponding to protein sequence A and protein sequence B are divided into a mouse group and a duck group. Based on these species groups, the device 1 can concatenate protein sequences that come from different single-chain MSAs and have the same rank within the same species group to obtain the corresponding complex homologous sequences. For example, if the target protein complex corresponds to n single-chain MSAs, a complex homologous sequence is obtained by concatenating n protein sequences, one from each of the n single-chain MSAs, that have the same rank within the same species group. In some cases, there is a single-chain MSA that contains no protein sequence belonging to a certain species; for the species group corresponding to that species, the device 1 cannot concatenate same-rank protein sequences from different single-chain MSAs, and accordingly that species group includes zero complex homologous sequences. In some cases, there is a species group to which at least one protein sequence in every single-chain MSA belongs, and that species group corresponds to at least one complex homologous sequence.
The device 1 may take the set of all complex homologous sequences as the joint MSA, i.e., the MSA corresponding to the target protein complex. Finally, the device 1 may input the joint MSA into the AlphaFold-Multimer (AF-Multimer) model to obtain a predicted structure of the target protein complex.
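The species-grouping and concatenation step described above can be illustrated with a minimal Python sketch. It is not the implementation of the present application: the function name `build_joint_msa`, the `(species, sequence)` tuple layout, and the toy sequences are all hypothetical, and each single-chain MSA is assumed to be already sorted in rank order.

```python
from collections import defaultdict

def build_joint_msa(chain_msas):
    """Concatenate same-rank, same-species sequences across chains.

    chain_msas: one list per component chain; each entry is a
    (species, aligned_sequence) tuple, assumed to be in rank order.
    """
    n_chains = len(chain_msas)
    # species -> one list of sequences per chain
    groups = defaultdict(lambda: [[] for _ in range(n_chains)])
    for chain_idx, msa in enumerate(chain_msas):
        for species, seq in msa:
            groups[species][chain_idx].append(seq)

    joint = []
    for species, per_chain in groups.items():
        # zip stops at the shortest list, so a species that is missing
        # from any chain contributes zero concatenated sequences.
        for row in zip(*per_chain):
            joint.append((species, "".join(row)))
    return joint

# Toy input: chain A has mouse, duck and frog hits; chain B has no frog hit.
msa_a = [("mouse", "AAAA"), ("mouse", "AAAT"), ("duck", "AATA"), ("frog", "ATAA")]
msa_b = [("mouse", "CCC"), ("duck", "CCG"), ("duck", "CGG")]
joint_msa = build_joint_msa([msa_a, msa_b])
```

In this toy input the frog group yields no concatenated sequence because chain B has no frog hit, which mirrors the case of a species group with zero complex homologous sequences.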

In some embodiments, the grouping of protein sequences in the single-chain MSAs according to species information and constructing the complex homologous sequences corresponding to each species group comprises: the device 1 determines one or more species groups according to the species information and the single-chain MSAs, wherein each species group corresponds to one species in the species information, each species group comprises a plurality of sub-classification groups, each sub-classification group corresponds to one single-chain MSA, and a sub-classification group comprises the protein sequences in that single-chain MSA that belong to the species; and determines the complex homologous sequences corresponding to each species group according to the one or more species groups. According to evolutionary theory, protein sequences descended from the same ancestor share a degree of homology, and the more closely related two species are, the higher the homology of their corresponding protein sequences. Thus, the species group to which each protein sequence belongs can be determined by aligning the protein sequences within each single-chain MSA with one another, or by aligning the protein sequences in each single-chain MSA with the protein sequences of the corresponding species in the species information. After grouping, each protein sequence in a species group still retains the information of the single-chain MSA to which it belongs, so that the protein sequences of each single-chain MSA within the species group can subsequently be ranked and paired to obtain the corresponding complex homologous sequences.

In some embodiments, the determining, from the one or more species groups, the complex homologous sequences corresponding to each species group comprises: the device 1 determines similarity score information corresponding to each protein sequence in all sub-classification groups in each species group; and determines the complex homologous sequences corresponding to each species group based on the similarity score information. In some embodiments, the device 1 determines the similarity score information by calculating the similarity between each protein sequence and the corresponding component chain of the target protein complex. Based on the similarity score information, the device 1 may rank the protein sequences of each single-chain MSA within each species group, and thereby concatenate the protein sequences that have the same rank and come from different single-chain MSAs in that species group into the corresponding complex homologous sequences. In some embodiments, if there is a species group that contains no protein sequence from some single-chain MSA, i.e., some single-chain MSA contains no protein sequence belonging to that species group, then the species group includes 0 complex homologous sequences, and the similarity score information need not be determined for it, which saves computational resources.

In some embodiments, the determining similarity score information corresponding to each protein sequence in all sub-classification groups in each species group comprises: the device 1 determines a column attention matrix corresponding to each single-chain MSA; determines a corresponding pair-wise similarity matrix based on the column attention matrix; and determines the similarity score information corresponding to each protein sequence in all sub-classification groups in each species group based on the pair-wise similarity matrix. For example, the device 1 may use the MSA Transformer to obtain the column attention matrix corresponding to a single-chain MSA, $A \in \mathbb{R}^{L \times H \times C \times N \times N}$, where $L$ is the number of layers of the MSA Transformer model, $H$ is the number of attention heads of each layer, $C$ is the length of the component chain corresponding to the single-chain MSA, and $N$ is the number of protein sequences contained in the single-chain MSA. The device 1 first symmetrizes the column attention matrix and then aggregates the symmetrized matrix along the $L$, $H$ and $C$ dimensions to obtain the pair-wise similarity matrix corresponding to the single-chain MSA:

$$S = \mathrm{AGG}_{L,H,C}\left(A + A^{T}\right) \in \mathbb{R}^{N \times N}$$

where the superscript $T$ denotes the transpose and $\mathrm{AGG}(\cdot)$ is an aggregation function. The pair-wise similarity matrix is a symmetric matrix, and the similarity score information corresponding to each protein sequence in the single-chain MSA can be determined from its first row $S_1 \in \mathbb{R}^{N}$, wherein each entry of $S_1$ can be regarded as the similarity score between a protein sequence in the single-chain MSA and the corresponding component chain of the target protein complex. Performing this calculation for all single-chain MSAs determines the similarity score information corresponding to each protein sequence in all sub-classification groups in each species group.
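The symmetrization and aggregation described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the column attention tensor is replaced by random numbers rather than real MSA Transformer output, the mean is assumed as the aggregation function (the text does not fix a particular choice), and the function name `pairwise_similarity` is hypothetical.

```python
import numpy as np

def pairwise_similarity(col_attn, agg=np.mean):
    """Map a column attention tensor of shape (L, H, C, N, N) to an (N, N) matrix."""
    # Symmetrize over the two sequence axes: A + A^T.
    sym = col_attn + np.swapaxes(col_attn, -1, -2)
    # Aggregate over layers (L), heads (H) and alignment columns (C).
    return agg(sym, axis=(0, 1, 2))

# Toy stand-in for MSA Transformer output: L=2, H=3, C=5, N=4.
rng = np.random.default_rng(0)
A = rng.random((2, 3, 5, 4, 4))

S = pairwise_similarity(A)
# First row: score of every sequence against the first (query) sequence.
scores = S[0]
```

Because the symmetrized tensor is aggregated element-wise over the first three axes, the resulting matrix stays symmetric for any such aggregation (mean, sum, max).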

In some embodiments, the determining the complex homologous sequences corresponding to each species group based on the similarity score information comprises: the device 1 ranks the protein sequences in each sub-classification group of the species group based on the similarity score information; and determines the complex homologous sequences corresponding to that species group based on the protein sequences having the same rank in each sub-classification group. For example, referring to the example of FIG. 2, in the mouse group, the similarity scores of the protein sequences in the sub-classification group of the single-chain MSA corresponding to protein sequence A are 0.9, 0.6 and 0.4, respectively, and the similarity scores of the protein sequences in the sub-classification group of the single-chain MSA corresponding to protein sequence B are 0.8 and 0.7, respectively. Ranking the protein sequences in each sub-classification group by these scores gives the (rank, score) pairs (1, 0.9), (2, 0.6) and (3, 0.4) for the sub-classification group corresponding to protein sequence A, and (1, 0.8) and (2, 0.7) for the sub-classification group corresponding to protein sequence B. The protein sequence with score 0.9 for protein sequence A and the protein sequence with score 0.8 for protein sequence B are ranked first in their respective sub-classification groups, so their concatenation is one complex homologous sequence of the species group; similarly, the second-ranked protein sequences are concatenated to obtain another complex homologous sequence of the species group. The third-ranked protein sequence for protein sequence A has no counterpart and remains unpaired.
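The worked example above (scores 0.9, 0.6, 0.4 for chain A and 0.8, 0.7 for chain B) can be reproduced with a short Python sketch. The function name `pair_by_rank` and the sequence identifiers such as `a1` are hypothetical labels introduced only for illustration.

```python
def pair_by_rank(sub_groups):
    """Pair sequences of equal rank across the sub-classification groups.

    sub_groups: one list of (sequence_id, score) tuples per single-chain MSA,
    all belonging to the same species group.
    """
    ranked = [sorted(group, key=lambda item: item[1], reverse=True)
              for group in sub_groups]
    # zip truncates to the shortest sub-group, so surplus sequences stay unpaired.
    return [tuple(seq_id for seq_id, _ in row) for row in zip(*ranked)]

# Mouse group from the example, deliberately given in unsorted order.
group_a = [("a2", 0.6), ("a1", 0.9), ("a3", 0.4)]
group_b = [("b2", 0.7), ("b1", 0.8)]
pairs = pair_by_rank([group_a, group_b])
```

Here `pairs` contains two entries, one per shared rank, and the third-ranked sequence of chain A is left out.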

In some embodiments, the determining, from the one or more species groups, the complex homologous sequences corresponding to each species group comprises: the device 1 determines corresponding cosine similarity information based on each single-chain MSA and the component chain of the target protein complex to which that single-chain MSA corresponds; and determines the complex homologous sequences corresponding to each species group based on the cosine similarity information. In some embodiments, cosine similarity may be used to measure the similarity between the protein sequences in a single-chain MSA and the corresponding component chain of the target protein complex. Similarly, the protein sequences of the sub-classification groups under each species group may be ranked based on the cosine similarity information obtained for each protein sequence, so that the protein sequences having the same rank in the sub-classification groups under each species group are concatenated to obtain the corresponding complex homologous sequences.

In some embodiments, the determining corresponding cosine similarity information based on each single-chain MSA and the component chain of the target protein complex to which that single-chain MSA corresponds comprises: the device 1 determines a first sequence-level embedding corresponding to each single-chain MSA and a second sequence-level embedding corresponding to the component chain of the target protein complex to which the single-chain MSA corresponds; and determines the corresponding cosine similarity information based on the first sequence-level embedding and the second sequence-level embedding.

In some embodiments, the determining the first sequence-level embedding corresponding to each single-chain MSA comprises: the device 1 determines the residue-level embedding set corresponding to each single-chain MSA; and determines the first sequence-level embedding corresponding to each single-chain MSA based on the residue-level embedding set.

In some embodiments, each single-chain MSA is denoted $M \in \mathcal{A}^{N \times C}$, where $C$ is the length of the component chain corresponding to the single-chain MSA and $N$ is the number of protein sequences contained in the single-chain MSA. The device 1 may obtain the residue-level embedding set corresponding to the single-chain MSA, of shape $L \times N \times C \times d$, where $d$ is the embedding dimension and $L$ is the number of layers of the protein language model used to calculate the cosine similarity. By aggregating over the $L$ and $C$ dimensions, the device 1 may obtain the first sequence-level embedding $E_n \in \mathbb{R}^{d}$ of the $n$-th protein sequence. The second sequence-level embedding $E_1$, corresponding to the component chain of the target protein complex to which the single-chain MSA corresponds, may be obtained in the same manner. Based on the first sequence-level embedding $E_n$ and the second sequence-level embedding $E_1$, the device 1 can determine the corresponding cosine similarity information $\cos(E_n, E_1)$. The corresponding complex homologous sequences may then be determined by ranking the protein sequences in each sub-classification group, in a manner similar to that described above for determining complex homologous sequences based on the column attention mechanism.
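The embedding-based scoring can be sketched in NumPy as below. Several points are assumptions made only for this illustration: the residue-level embeddings are random numbers instead of real protein language model output, the axis order is taken to be (L, N, C, d), the mean is used as the aggregation, the component chain is taken to be the first sequence of its single-chain MSA, and the function names are hypothetical.

```python
import numpy as np

def sequence_embeddings(residue_emb):
    """Aggregate residue-level embeddings (L, N, C, d) into sequence-level ones (N, d)."""
    # Mean over the layer axis (0) and the residue-position axis (2).
    return residue_emb.mean(axis=(0, 2))

def cosine_to_query(seq_emb):
    """Cosine similarity of every sequence embedding to row 0 (the query chain)."""
    query = seq_emb[0]
    norms = np.linalg.norm(seq_emb, axis=1) * np.linalg.norm(query)
    return seq_emb @ query / norms

# Toy stand-in for language model output: L=2 layers, N=4 sequences, C=6, d=8.
rng = np.random.default_rng(0)
residue_emb = rng.normal(size=(2, 4, 6, 8))

E = sequence_embeddings(residue_emb)
cos = cosine_to_query(E)
```

The resulting vector `cos` plays the same role as the first row of the pair-wise similarity matrix in the column attention variant: it supplies one score per sequence for ranking within each sub-classification group.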

In some embodiments, the step S12 includes: the device 1 determines a similarity score matrix between the single-chain MSAs, and determines the MSA of the target protein complex based on the similarity score matrix. For example, taking a heterodimer as an example, the query in the aforementioned step S11 yields the 2 single-chain MSAs corresponding to the heterodimer, denoted $M_1$ and $M_2$. Based on $M_1$ and $M_2$, the device 1 obtains the corresponding sequence-level embeddings $E_1$ and $E_2$, respectively. These sequence-level embeddings are obtained in the same or a similar manner as the first sequence-level embedding described above, so the description is not repeated here and is incorporated herein by reference. Based on the sequence-level embeddings, a similarity score matrix $B$ between the 2 single-chain MSAs can be determined, wherein $B_{ij} = \cos(E_1[i], E_2[j])$. Based on the similarity score matrix $B$, the device 1 may pair protein sequences across the chains by using a global maximum optimization algorithm or a local maximum optimization algorithm, so as to obtain the joint MSA corresponding to the target protein complex.
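The following pure-Python sketch shows one possible reading of the two pairing strategies. The text does not specify the algorithms, so both are assumptions: "local" is interpreted as a greedy selection of the highest remaining score, and "global" as the one-to-one assignment with the largest total score, found here by brute force (workable only for tiny matrices; a practical version would use an assignment solver). All function names and the toy matrix are hypothetical.

```python
from itertools import permutations
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def similarity_matrix(emb_1, emb_2):
    """B[i][j] = cos(emb_1[i], emb_2[j])."""
    return [[cosine(u, v) for v in emb_2] for u in emb_1]

def pair_local(B):
    """Greedy pairing: repeatedly take the highest score among unused rows and columns."""
    cells = sorted(((B[i][j], i, j)
                    for i in range(len(B)) for j in range(len(B[0]))),
                   reverse=True)
    used_rows, used_cols, pairs = set(), set(), []
    for _, i, j in cells:
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(pairs)

def pair_global(B):
    """Exhaustive search for the assignment with the largest total score.

    Assumes len(B) <= len(B[0]) and a very small matrix.
    """
    rows, cols = range(len(B)), range(len(B[0]))
    best = max(permutations(cols, len(B)),
               key=lambda perm: sum(B[i][j] for i, j in zip(rows, perm)))
    return list(zip(rows, best))

# Toy score matrix on which the two strategies disagree.
B = [[0.90, 0.80],
     [0.85, 0.10]]
local_pairs = pair_local(B)
global_pairs = pair_global(B)
```

On this toy matrix the greedy strategy locks in the single best score (0.90) and is then forced to accept 0.10, while the exhaustive strategy prefers the pairing with the larger total (0.80 + 0.85).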

Protein contact map and three-dimensional structure prediction based on co-evolution analysis has made substantial progress over the past decade and has demonstrated state-of-the-art accuracy on monomers (i.e., individual protein chains). These methods use MSA information to infer interactions between residues or the three-dimensional structure of the target monomer. AlphaFold2 is one of the co-evolution-based methods and showed unparalleled accuracy in CASP14. AlphaFold-Multimer is a derivative of AlphaFold2 for multimers and achieves good accuracy in complex structure prediction. Unlike many FFT-based methods, AlphaFold-Multimer does not assume that each input monomer is a rigid body, but it requires a joint MSA to be constructed for the target complex. To infer the pairwise correlation between two different chains, the interacting homologous sequences (interologs) of the two chains must be identified, which is challenging for heterodimers.

Several algorithms have been proposed to identify interologs from genomic data, such as analyzing co-evolving genes, searching for co-localized genes, and comparing phylogenetic trees. Genome co-localization and species information are two common heuristics. Genome co-localization is widely used as a heuristic for identifying interologs. It is based on the observation that in bacteria many interacting genes are encoded in the same operon and co-transcribed to perform their function. However, this rule does not apply well to complexes from eukaryotes, which have a large number of paralogs, because it becomes more difficult to disambiguate the correct interologs. ComplexContact first proposed another simple rule for identifying interologs, which AlphaFold-Multimer also uses. This rule is called the phylogeny-based approach: a set of paralogs (sequences from the same species) is first identified in the MSA of each chain and ordered by sequence similarity to the corresponding chain of the target, and then sequences from the same species that have the same rank are paired.

A Protein Language Model (PLM) learns representations of protein sequences or MSAs, and the learned representations can be used as features for tasks such as contact prediction, remote homology detection, and mutation effect prediction.

In this application, the inventors focus on the MSA Transformer, a PLM trained on large databases of single-chain protein MSAs. The intermediate representations generated by the MSA Transformer contain some co-evolution information. The inventors therefore studied how to use the representations learned by the MSA Transformer to identify accurately whether two or more proteins form interologs, and thereby to improve the prediction accuracy of AlphaFold-Multimer.

Specific implementation method

In complex structure prediction, current prediction methods such as AlphaFold-Multimer acquire co-evolution signals between component chains by pairing sequences within the MSA (multiple sequence alignment) of each component chain of the complex. None of the existing MSA pairing algorithms (including the one used by AlphaFold-Multimer) is very accurate. In this patent the inventors propose a new MSA pairing method, ColAttn, which constructs a combined MSA (joint MSA) from the MSAs of two or more component chains. The combined MSA can then be input into a deep learning model (e.g., AlphaFold-Multimer, RaptorX, or other software) to predict the structure of the complex. The problem of pairing the MSAs of two component chains is defined as follows: given the respective MSAs of the two component chains, denoted M1 and M2, the goal is to find a correspondence from sequences in M1 to sequences in M2. The correspondence requires that two different sequences in M1 cannot correspond to the same sequence in M2 (and, accordingly, two different sequences in M2 cannot correspond to the same sequence in M1), but it allows some sequences in M1 or M2 to have no corresponding sequence. This definition generalizes readily to pairing the MSAs of more than two component chains.
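The pairing constraint defined above (a partial one-to-one correspondence) can be expressed as a small check. This is only an illustrative sketch; the function name `is_valid_pairing` and the representation of a pairing as a list of index pairs are assumptions, not part of the application.

```python
def is_valid_pairing(pairs, n1, n2):
    """Check that `pairs` is a partial one-to-one correspondence.

    pairs: list of (i, j), where i indexes a sequence in M1 (0 <= i < n1)
    and j indexes a sequence in M2 (0 <= j < n2). Sequences that do not
    appear in any pair are simply left unpaired, which is allowed.
    """
    rows = [i for i, _ in pairs]
    cols = [j for _, j in pairs]
    in_range = all(0 <= i < n1 for i in rows) and all(0 <= j < n2 for j in cols)
    # No sequence of M1 or M2 may be used in more than one pair.
    one_to_one = len(set(rows)) == len(rows) and len(set(cols)) == len(cols)
    return in_range and one_to_one

ok = is_valid_pairing([(0, 1), (2, 0)], n1=3, n2=2)    # sequence 1 of M1 unpaired
bad = is_valid_pairing([(0, 1), (1, 1)], n1=3, n2=2)   # M2 sequence 1 used twice
```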

This section describes ColAttn, the inventors' MSA pairing method based on a protein language model (PLM). ColAttn exploits the strengths of a PLM to obtain a more accurate MSA pairing strategy and thereby improve the structure prediction of protein complexes. A PLM can learn co-evolutionary signals in protein sequences as well as constraints on protein spatial structure. Furthermore, MSA-based PLMs can explicitly capture the co-evolutionary information contained in an MSA through an axial attention mechanism. The inventors use the most advanced MSA-based PLM, the MSA Transformer, to pair single-chain protein sequences within the component chain MSAs and construct reasonable interologs (complex homologous sequences) so as to improve complex structure prediction. Other protein language models may of course also be used.

The general framework of the inventors' approach is shown in fig. 2. Given a heterodimer whose structure is to be predicted, the combined MSA is constructed in the following steps:

1) First step (corresponding to step 1 of fig. 2): the inventors search the protein sequence database to obtain an MSA for each component chain.

2) Second step (corresponding to step 2 of fig. 2): the inventors group the sequences within each single-chain MSA by species.

3) Third step (corresponding to step 3 of fig. 2): the inventors use the MSA Transformer to generate, for the MSA of each component chain, a series of column attention matrices A_lhc, where l is the index of the MSA Transformer layer to which the matrix corresponds, h is the index of the attention head, and c is the position of a residue (i.e., an amino acid in a protein) in the amino acid sequence of the protein. The column attention weight matrix computed for each column of the MSA can be regarded as a measure of the pairwise similarity between the aligned residues in that column. The inventors symmetrize each column attention matrix generated by the MSA Transformer and then aggregate the symmetrized matrices along the three dimensions l, h and c to obtain a pairwise similarity matrix between the MSA sequences, denoted S. Symmetrizing a matrix means computing the arithmetic mean of the matrix and its transpose. S is symmetric, and its first row can be regarded as the similarity scores between the query sequence (i.e., the amino acid sequence of one of the component chains of the complex whose structure is to be predicted) and the other sequences in the MSA. Then, within each species group, the sequences are ranked from high to low by their similarity score to the corresponding query sequence.

4) Fourth step (corresponding to step 4 of fig. 2): sequences that have the same rank in the same species group and come from the MSAs of different chains are concatenated into complex homologous sequences (interologs), which constitute the combined MSA of the target complex. Optionally, sequences that are not paired can also be placed into the combined MSA by padding with gaps. If a sequence A1 in MSA M1 has no partner in MSA M2, the inventors append gaps to the end of A1 (as many as the sequence length in M2) and then add A1 to the combined MSA. Similarly, if a sequence B1 in MSA M2 has no partner in MSA M1, the inventors prepend gaps to the front of B1 (as many as the sequence length in M1) and then add B1 to the combined MSA.
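Steps 3 and 4 above can be sketched on toy data as follows. The sketch makes several assumptions that the text does not fix: the symmetrized column attention matrices are aggregated by summation, column attention is supplied as nested Python lists rather than real MSA Transformer output, the query row of the combined MSA is omitted for brevity, and all function names are hypothetical.

```python
from collections import defaultdict

def pairwise_similarity(col_attn):
    """col_attn[l][h][c] is an N x N column-attention matrix (nested lists).

    Each matrix is symmetrized as (A + A^T) / 2 and the results are summed
    over layers, heads and columns. Summation is an assumption; the text
    only says the matrices are aggregated.
    """
    n = len(col_attn[0][0][0])
    S = [[0.0] * n for _ in range(n)]
    for layer in col_attn:
        for head in layer:
            for mat in head:
                for i in range(n):
                    for j in range(n):
                        S[i][j] += 0.5 * (mat[i][j] + mat[j][i])
    return S

def rank_by_species(seqs, species, scores):
    """Group sequences by species; sort each group by score, high to low."""
    groups = defaultdict(list)
    for seq, sp, score in zip(seqs, species, scores):
        groups[sp].append((score, seq))
    return {sp: [seq for _, seq in sorted(g, key=lambda t: -t[0])]
            for sp, g in groups.items()}

def build_combined_msa(ranked1, ranked2, len1, len2, gap="-"):
    """Concatenate same-rank sequences per species; gap-pad unpaired ones."""
    rows = []
    for sp in sorted(set(ranked1) | set(ranked2)):
        g1, g2 = ranked1.get(sp, []), ranked2.get(sp, [])
        k = min(len(g1), len(g2))
        rows += [a + b for a, b in zip(g1, g2)]      # interologs
        rows += [a + gap * len2 for a in g1[k:]]     # unpaired, from M1
        rows += [gap * len1 + b for b in g2[k:]]     # unpaired, from M2
    return rows

# Toy column attention for chain 1: one layer, one head, one column, N = 3
# (row 0 is the query sequence).
col_attn = [[[[[0.0, 0.8, 0.2],
               [0.4, 0.0, 0.1],
               [0.6, 0.3, 0.0]]]]]
S = pairwise_similarity(col_attn)
scores1 = S[0][1:]          # similarity of each homolog to the query
scores2 = [0.8, 0.6]        # assumed to come from chain 2 in the same way

ranked1 = rank_by_species(["AAC", "AAG"], ["human", "human"], scores1)
ranked2 = rank_by_species(["GG", "GC"], ["human", "mouse"], scores2)
combined = build_combined_msa(ranked1, ranked2, len1=3, len2=2)
```

In this toy case the two top-ranked human sequences are concatenated, the second human sequence of chain 1 is padded with trailing gaps, and the mouse sequence of chain 2 is padded with leading gaps.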

The above describes the construction of a combined MSA for a dimer, but the inventors' ColAttn method can easily be extended to construct a combined MSA for a complex with more than two protein chains.

Although the average performance of the inventors' MSA pairing method ColAttn is better than that of other MSA pairing methods, the inventors found experimentally that different MSA pairing methods each have their own strengths, i.e., the predictions produced with different pairing methods are complementary. The inventors therefore developed a mixing strategy that combines the complex structure predictions produced with the various pairing methods. Specifically, the inventors first obtain the predictions (i.e., predicted complex structures) for the different MSA pairing methods, then rank them from high to low by predicted interface score (i.e., the interface TM-score predicted by the deep learning model, ipTM), and select the prediction with the highest score as the final output. The inventors tested mixing ColAttn with two other MSA pairing methods: a pairing method based on sequence identity and a pairing method based on genetic distance. The inventors found that the mixing strategy is more effective than any single MSA pairing method.
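The selection step of the mixing strategy can be sketched as below. The record layout (`method`, `iptm`, `structure`) and the function name are hypothetical, and the ipTM values and file names are made up purely for illustration; they are not results from the application.

```python
def select_final_prediction(predictions):
    """Return the prediction with the highest predicted interface score (ipTM)."""
    ranked = sorted(predictions, key=lambda p: p["iptm"], reverse=True)
    return ranked[0]

# Illustrative records, one per MSA pairing method (values are invented).
predictions = [
    {"method": "ColAttn", "iptm": 0.71, "structure": "colattn_model.pdb"},
    {"method": "sequence identity", "iptm": 0.64, "structure": "identity_model.pdb"},
    {"method": "genetic distance", "iptm": 0.58, "structure": "genome_model.pdb"},
]
final = select_final_prediction(predictions)
```

Because the selection uses only the model's own confidence estimate, it needs no knowledge of the true structure and can be applied at prediction time.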

Experiments

Experimental setup: evaluation metric

The inventors used the DockQ score (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0161879) to assess the accuracy of a predicted protein complex. Specifically, for each test target, the inventors ranked the predicted models by their prediction confidence scores and calculated the highest DockQ score among the top N models. The inventors refer to this metric as the best DockQ in top-N predictions.
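The metric described above can be computed as in the following sketch, assuming each predicted model is represented by a (confidence, DockQ) pair. The function name and the toy values are illustrative only.

```python
def best_dockq_top_n(models, n):
    """Best DockQ among the n models with the highest confidence.

    models: non-empty list of (confidence, dockq) tuples; n >= 1.
    """
    ranked = sorted(models, key=lambda m: m[0], reverse=True)
    return max(dockq for _, dockq in ranked[:n])

# Five toy models for one target (invented values).
models = [(0.9, 0.30), (0.8, 0.55), (0.7, 0.10), (0.6, 0.70), (0.5, 0.20)]
top1 = best_dockq_top_n(models, 1)
top5 = best_dockq_top_n(models, 5)
```

With these toy values the Top-1 score is the DockQ of the most confident model, while the Top-5 score picks up a better model that was ranked lower by confidence.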

Data set

To test the performance of the inventors' method, the inventors constructed a test set that meets the following criteria:

1. At least 100 sequences can be paired under the species constraint.

2. The sequence similarity between the two component chains of each heterodimeric test target is at most 90%, each component chain has 20 to 1024 residues (due to the constraints of the MSA Transformer), and the total number of residues in each dimer is less than 1600 (due to GPU memory limitations).
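The two criteria above amount to a simple filter over candidate heterodimers. The sketch below assumes that the number of pairable sequences, the two chain lengths and the inter-chain sequence similarity (as a fraction) have already been computed; the function and parameter names are hypothetical.

```python
def passes_criteria(n_paired, chain_lengths, chain_similarity):
    """Apply the test-set criteria to one candidate heterodimer.

    n_paired: number of sequences that can be paired under the species constraint
    chain_lengths: (length of chain A, length of chain B) in residues
    chain_similarity: sequence similarity between the two chains, in [0, 1]
    """
    len_a, len_b = chain_lengths
    return (
        n_paired >= 100
        and chain_similarity <= 0.90
        and all(20 <= n <= 1024 for n in (len_a, len_b))
        and len_a + len_b < 1600
    )

keep = passes_criteria(150, (300, 400), 0.35)
too_long = passes_criteria(150, (900, 800), 0.35)   # 1700 residues in total
```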

The inventors randomly selected heterodimers from the PDB (Protein Data Bank) as test targets. The inventors define two dimers as at most x% similar if the maximum sequence similarity between their component chains does not exceed x%. The inventors randomly sampled 801 test complexes that are at most 40% similar to the other complexes in the dataset and that meet both criteria. Finally, the inventors predicted their complex structures using AlphaFold-Multimer (with its own MSA pairing algorithm) and constructed three test sets according to the prediction results: 1) pConf70: 92 targets whose prediction confidence score (pConf) is less than 0.7; 2) DockQ49 (referred to as Quality49 elsewhere in this description): 155 targets whose best DockQ score is less than 0.49; 3) pConf80: 168 complex targets whose pConf is less than 0.8.

Baselines

The inventors tested the following two heuristic MSA pairing strategies.

AlphaFold-Multimer default strategy. This is the pairing strategy built into AlphaFold-Multimer. It was first proposed in ComplexContact: sequences from the component chain MSAs are first grouped by species and, within each group, ordered by similarity to the query sequence. Finally, if a species group contains multiple sequences, the strategy concatenates all sequences of the same rank within that species group.
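A rough sketch of this species-and-rank baseline is given below. It is not AlphaFold-Multimer's actual code: similarity to the query is approximated here by the fraction of identical aligned positions, which is an assumption, and all names are hypothetical.

```python
from collections import defaultdict

def identity(aligned, query):
    """Fraction of query positions where the aligned sequence matches."""
    matches = sum(a == q and a != "-" for a, q in zip(aligned, query))
    return matches / len(query)

def rank_in_species(msa, species, query):
    """Group aligned sequences by species and sort by identity to the query."""
    groups = defaultdict(list)
    for seq, sp in zip(msa, species):
        groups[sp].append(seq)
    return {sp: sorted(g, key=lambda s: identity(s, query), reverse=True)
            for sp, g in groups.items()}

def pair_same_rank(ranked1, ranked2):
    """Pair sequences of equal rank within each species shared by both chains."""
    pairs = []
    for sp in sorted(set(ranked1) & set(ranked2)):
        pairs += list(zip(ranked1[sp], ranked2[sp]))
    return pairs

# Toy aligned sequences for two chains (queries "ACDE" and "KL").
ranked1 = rank_in_species(["AGGF", "ACDF", "ACDE"],
                          ["ecoli", "ecoli", "yeast"], "ACDE")
ranked2 = rank_in_species(["KV", "KL", "ML"],
                          ["ecoli", "ecoli", "human"], "KL")
pairs = pair_same_rank(ranked1, ranked2)
```

Only the species present in both MSAs contribute pairs here; the yeast and human sequences have no partner and are left out of `pairs`.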

Genetic distance. In bacteria, interacting genes are sometimes located in the same operon and co-transcribed to form a protein complex. Whether two proteins interact can therefore be inferred from the distance between the positions of their genes. This strategy pairs sequences from the same species and then disambiguates among candidate sequences based on the distance between their positions in the contig; the contig is retrieved from ENA (the European Nucleotide Archive). In the inventors' implementation, a given sequence from the first chain is paired with the sequence from the second chain that is closest to it in terms of genetic distance. If there are multiple closest sequences, the inventors select the one with the lowest E-value with respect to the query sequence of the second chain; the E-value is calculated by the MSA search algorithm used to construct the chain MSA.
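The genetic-distance baseline can be sketched as follows. The record fields (`id`, `species`, `contig`, `position`, `evalue`) are hypothetical, positions are assumed to be comparable gene coordinates within one contig, and the sketch does not enforce that each second-chain sequence is used only once.

```python
def pair_by_genetic_distance(chain1, chain2):
    """Pair each chain-1 sequence with the closest chain-2 sequence.

    Candidates must come from the same species and the same contig. Ties
    in distance are broken by the lowest E-value of the chain-2 sequence.
    """
    pairs = []
    for a in chain1:
        candidates = [b for b in chain2
                      if b["species"] == a["species"] and b["contig"] == a["contig"]]
        if not candidates:
            continue   # no partner in the same contig: leave unpaired
        best = min(candidates,
                   key=lambda b: (abs(b["position"] - a["position"]), b["evalue"]))
        pairs.append((a["id"], best["id"]))
    return pairs

chain1 = [
    {"id": "a1", "species": "ecoli", "contig": "c1", "position": 100, "evalue": 1e-30},
    {"id": "a2", "species": "yeast", "contig": "c9", "position": 10, "evalue": 1e-12},
]
chain2 = [
    {"id": "b1", "species": "ecoli", "contig": "c1", "position": 105, "evalue": 1e-10},
    {"id": "b2", "species": "ecoli", "contig": "c1", "position": 95, "evalue": 1e-20},
    {"id": "b3", "species": "ecoli", "contig": "c1", "position": 500, "evalue": 1e-40},
]
pairs = pair_by_genetic_distance(chain1, chain2)
```

In the toy data `b1` and `b2` are equally close to `a1`, so the E-value decides in favour of `b2`; `a2` has no candidate in its contig and stays unpaired.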

The inventors report the average Top-5 best DockQ score, the Top-1 DockQ score and the success rate (DockQ ≥ 0.23) on the pConf70, Quality49 and pConf80 test sets.
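For illustration, the average score and the success rate over a test set can be computed as below, given one DockQ score per target. The function names and the toy scores are illustrative; only the 0.23 threshold comes from the text.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def success_rate(dockq_scores, threshold=0.23):
    """Fraction of targets whose DockQ score reaches the success threshold."""
    return sum(score >= threshold for score in dockq_scores) / len(dockq_scores)

# Toy per-target scores (invented values), e.g. Top-5 best DockQ of four targets.
per_target = [0.05, 0.23, 0.60, 0.10]
average_dockq = mean(per_target)
sr = success_rate(per_target)
```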

The MSA pairing method ColAttn of the inventor is superior to other MSA pairing methods in heterodimer prediction

The inventors predicted the structure of each target complex using the five AI models from the AlphaFold-Multimer software package, and report the average Top-5 best DockQ score, the Top-1 DockQ score, and the corresponding success rate (SR) in the table shown in fig. 3. ColAttn outperforms the AF-Multimer default pairing strategy on all three test sets (Top-5 best DockQ scores of 0.259 versus 0.234 on pConf70, 0.423 versus 0.406 on pConf80, and 0.265 versus 0.242 on Quality49). The inventors' approach is also superior to the genetic-distance-based approach.

The MSA pairing method ColAttn of the inventor performs better on complex targets that are harder to predict

As shown in the table of fig. 3, the advantage of ColAttn over AF-Multimer is narrower on pConf80 than on pConf70, with improvement rates of 3.7% and 10.7%, respectively.

For further analysis, the inventors quantified the correlation between the prediction confidence score (pConf) estimated by AF-Multimer and the gap in Top-5 DockQ score between ColAttn and AF-Multimer, as shown in fig. 4. Fig. 4 shows the relationship between the improvement in prediction performance and the difficulty of the complex structure. Fig. 4 (a) shows the distribution of the prediction confidence score (pConf, x-axis) against the relative improvement (%, y-axis). The red curve is a fitted linear regression model. The Pearson correlation coefficient is about -0.49, which indicates that the relative improvement of ColAttn over AF-Multimer narrows as pConf increases. Fig. 4 (b) divides pConf into five intervals of width 0.2 and shows the distribution of the improvement in each interval, indicating that ColAttn performs better than AF-Multimer on low-confidence targets.

It can be seen that the relative improvement is negatively correlated with the prediction confidence score (Pearson correlation coefficient of -0.49). When pConf is less than 0.2, the relative improvement reaches 100%, whereas when pConf is greater than 0.8, the performance of ColAttn is almost equivalent to that of AF-Multimer. This is because AF-Multimer already performs very well on relatively easy targets, which leaves little room for further improvement.
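The two quantities used in this analysis can be computed as in the sketch below. The relative improvement is assumed to be (new - baseline) / baseline expressed in percent, which is consistent with the 10.7% reported for pConf70 (0.259 versus 0.234); the per-target numbers in the example are invented and only illustrate a negative correlation.

```python
import math

def relative_improvement(new, base):
    """Relative improvement of `new` over `base`, in percent (assumed definition)."""
    return 100.0 * (new - base) / base

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Average Top-5 best DockQ on pConf70, as reported in the text.
pconf70_gain = relative_improvement(0.259, 0.234)

# Invented per-target data: confidence versus relative improvement (%).
pconf = [0.1, 0.3, 0.5, 0.7, 0.9]
improvement = [90.0, 40.0, 30.0, 10.0, 0.0]
r = pearson(pconf, improvement)
```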

ColAttn has higher prediction precision on eukaryotic targets

As shown in fig. 5, the inventors further compared the DockQ distributions of ColAttn, AF-Multimer, and Genome (the genetic-distance baseline) over three domains (kingdoms): eukaryotes, eubacteria, and eubacteria & eukaryotes (Euba & Euca). On eukaryotic data ColAttn is significantly better than the other two MSA pairing methods: ColAttn scores 0.420, AF-Multimer 0.402, and Genome 0.369. Because complex homologous sequences (interologs) are difficult to identify in eukaryotes, ColAttn has a clear advantage there. On eubacterial data the three strategies perform similarly (about 0.35 overall). Most notably, ColAttn performs better than the other two methods on the Euba & Euca data: ColAttn scores 0.394, AF-Multimer 0.314, and Genome 0.277 across the entire data. Euba & Euca is a special category in which the two component chains of the heterodimer belong to the two domains respectively. Specifically, the heterodimers in the inventors' dataset come from eukaryotes, eubacteria, viruses, archaea, and eubacteria & eukaryotes. The inventors classify data from eubacteria, viruses and archaea as the eubacterial domain.

Hybrid MSA pairing strategy to improve prediction accuracy

The inventors found that different MSA pairing methods each have their own advantages, which means that they can complement each other. To verify this, the inventors combined the five models predicted with each of two MSA pairing methods, yielding ten predicted models per target, and report the average Top-5 best DockQ score. The mixed strategy is significantly better than a single strategy. For example, the DockQ score of ColAttn+ColAttn is 0.269 while that of ColAttn alone is 0.259, indicating that simply increasing the number of predicted models per target also improves structure prediction accuracy. Combinations that include ColAttn always perform better than those without it; for example, the success rate of ColAttn+Genome is 44.6%, whereas that of AF-Multimer+Genome is 40.4%. Finally, the mix of all three strategies achieves the best performance, with a DockQ score of 0.285 and a success rate of 46.8%.

Fig. 6 shows a block diagram of an apparatus for predicting protein complex structure according to one embodiment of the present application. The apparatus 1 includes a one-one module 11, a one-two module 12, and a one-three module 13. The one-one module 11 queries a protein sequence database for all single-stranded MSAs of the target protein complex, wherein each single-stranded MSA corresponds to a component chain of the target protein complex; the one-two module 12 matches protein sequences within all single-stranded MSAs based on a protein language model to produce the MSA of the target protein complex; the one-three module 13 inputs the MSA of the target protein complex into a deep learning model to obtain a predicted structure of the target protein complex. Here, the specific embodiments of the one-one module 11, the one-two module 12, and the one-three module 13 shown in fig. 6 are the same as or similar to the specific embodiments of the foregoing step S11, step S12, and step S13, respectively, so they are not described again in detail and are incorporated herein by reference.

FIG. 7 illustrates an exemplary system that may be used to implement various embodiments described herein.

In some embodiments, as shown in fig. 7, system 300 can function as any of the devices of the various described embodiments. In some embodiments, system 300 can include one or more computer-readable media (e.g., system memory or NVM/storage 320) having instructions and one or more processors (e.g., processor(s) 305) coupled with the one or more computer-readable media and configured to execute the instructions to implement the modules to perform the actions described herein.

For one embodiment, the system control module 310 may include any suitable interface controller to provide any suitable interface to at least one of the processor(s) 305 and/or any suitable device or component in communication with the system control module 310.

The system control module 310 may include a memory controller module 330 to provide an interface to the system memory 315. Memory controller module 330 may be a hardware module, a software module, and/or a firmware module.

The system memory 315 may be used, for example, to load and store data and/or instructions for the system 300. For one embodiment, system memory 315 may include any suitable volatile memory, such as, for example, a suitable DRAM. In some embodiments, the system memory 315 may comprise a double data rate type four synchronous dynamic random access memory (DDR 4 SDRAM).

For one embodiment, system control module 310 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 320 and communication interface(s) 325.

For example, NVM/storage 320 may be used to store data and/or instructions. NVM/storage 320 may include any suitable nonvolatile memory (e.g., flash memory) and/or may include any suitable nonvolatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).

NVM/storage 320 may include storage resources that are physically part of the device on which system 300 is installed or which may be accessed by the device without being part of the device. For example, NVM/storage 320 may be accessed over a network via communication interface(s) 325.

Communication interface(s) 325 may provide an interface for system 300 to communicate over one or more networks and/or with any other suitable device. The system 300 may wirelessly communicate with one or more components of a wireless network in accordance with any of one or more wireless network standards and/or protocols.

For one embodiment, at least one of the processor(s) 305 may be packaged together with logic of one or more controllers (e.g., memory controller module 330) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be packaged together with logic of one or more controllers of the system control module 310 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 305 may be integrated on the same die as logic of one or more controllers of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic of one or more controllers of the system control module 310 to form a system on chip (SoC).

In various embodiments, the system 300 may be, but is not limited to being: a server, workstation, desktop computing device, or mobile computing device (e.g., laptop computing device, handheld computing device, tablet, netbook, etc.). In various embodiments, system 300 may have more or fewer components and/or different architectures. For example, in some embodiments, system 300 includes one or more cameras, keyboards, Liquid Crystal Display (LCD) screens (including touch screen displays), non-volatile memory ports, multiple antennas, graphics chips, Application Specific Integrated Circuits (ASICs), and speakers.

In addition to the methods and apparatus described in the above embodiments, the present application also provides a computer-readable storage medium storing computer code which, when executed, performs a method as described in any one of the preceding claims.

The present application also provides a computer program product which, when executed by a computer device, performs a method as claimed in any preceding claim.

The present application also provides a computer device comprising:

one or more processors;

a memory for storing one or more computer programs;

the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any preceding claim.

It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions as described above. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.

Furthermore, portions of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application by way of operation of the computer. Those skilled in the art will appreciate that the form of computer program instructions present in a computer readable medium includes, but is not limited to, source files, executable files, installation package files, etc., and accordingly, the manner in which the computer program instructions are executed by a computer includes, but is not limited to: the computer directly executes the instruction, or the computer compiles the instruction and then executes the corresponding compiled program, or the computer reads and executes the instruction, or the computer reads and installs the instruction and then executes the corresponding installed program. Herein, a computer-readable medium may be any available computer-readable storage medium or communication medium that can be accessed by a computer.

Communication media includes media whereby a communication signal containing, for example, computer readable instructions, data structures, program modules, or other data, is transferred from one system to another. Communication media may include conductive transmission media such as electrical cables and wires (e.g., optical fibers, coaxial, etc.) and wireless (non-conductive transmission) media capable of transmitting energy waves, such as acoustic, electromagnetic, RF, microwave, and infrared. Computer readable instructions, data structures, program modules, or other data may be embodied as a modulated data signal, for example, in a wireless medium, such as a carrier wave or similar mechanism, such as that embodied as part of spread spectrum technology. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. The modulation may be analog, digital or hybrid modulation techniques.

By way of example, and not limitation, computer-readable storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer-readable storage media include, but are not limited to, volatile memory, such as random access memory (RAM, DRAM, SRAM); nonvolatile memory, such as flash memory, various read only memory (ROM, PROM, EPROM, EEPROM), and magnetic and ferromagnetic/ferroelectric memory (MRAM, FeRAM); magnetic and optical storage devices (hard disk, tape, CD, DVD); or other media, now known or later developed, that can store computer-readable information/data for use by a computer system.

An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to operate a method and/or a solution according to the embodiments of the present application as described above.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims

1. A method for predicting protein complex structure, wherein the method comprises:

matching protein sequences within all single-stranded MSAs based on a protein language model to generate MSAs of a target protein complex, wherein the MSAs of the target protein complex are determined based on a similarity score matrix between the single-stranded MSAs or based on complex homology sequences of different species groups corresponding to the single-stranded MSAs;

2. The method of claim 1, wherein the matching protein sequences within all single-stranded MSAs based on a protein language model to generate MSAs of a target protein complex comprises:

grouping protein sequences in the single-stranded MSA according to species information, and constructing a complex homologous sequence corresponding to each species group, wherein the complex homologous sequence is formed by connecting protein sequences which are ranked identically in the same species group and come from different single-stranded MSAs, and each species group comprises zero or more complex homologous sequences;

all complex homologous sequences constitute one combined MSA, wherein the combined MSA is the MSA of the target protein complex.

3. The method of claim 2, wherein the grouping protein sequences in the single-stranded MSA according to species information and constructing a corresponding complex homology sequence for each species group comprises:

determining one or more species groups according to species information and the single-chain MSA, wherein each species group corresponds to one species in the species information, each species group comprises a plurality of sub-classification groups, each sub-classification group corresponds to one single-chain MSA, and the sub-classification groups comprise protein sequences belonging to the species in the single-chain MSA;

and determining the homologous sequence of the complex corresponding to each species group according to the one or more species groups.

4. A method according to claim 3, wherein said determining, from said one or more species groups, the corresponding complex homology sequence for each species group comprises:

determining similarity scoring information corresponding to each protein sequence in all the sub-taxonomic groups in each species group;

and determining the homologous sequence of the complex corresponding to each species group based on the similarity scoring information.

5. The method of claim 4, wherein said determining similarity score information for each protein sequence in all sub-taxonomic groups in each species group comprises:

determining a column attention matrix corresponding to each single-chain MSA;

determining a corresponding pair-wise similarity matrix based on the column attention matrix;

and determining similarity scoring information corresponding to each protein sequence in all sub-classification groups in each species group based on the pair-wise similarity matrix.

6. The method of claim 4, wherein the determining the corresponding complex homology sequences for each species group based on the similarity score information comprises:

sorting protein sequences corresponding to each sub-taxonomic group in the species group based on the similarity score information;

based on the same ranked protein sequences in each subcategory, the corresponding complex homologous sequences for that group of species are determined.

7. A method according to claim 3, wherein said determining, from said one or more species groups, the corresponding complex homology sequence for each species group comprises:

determining corresponding cosine similarity information based on each single-stranded MSA and the single-stranded MSA corresponding to a component chain of the target protein complex;

and determining the complex homologous sequence corresponding to each species group based on the cosine similarity information.

8. The method of claim 7, wherein said determining respective cosine similarity information based on each single-stranded MSA and the single-stranded MSA corresponding to a component chain of the target protein complex comprises:

determining a first sequence level embedding corresponding to each single-stranded MSA and a second sequence level embedding corresponding to the single-stranded MSA of a component chain of the target protein complex;

and determining corresponding cosine similarity information based on the first sequence level embedding and the second sequence level embedding.

9. The method of claim 8, wherein the determining a first sequence level embedding for each single-stranded MSA comprises:

determining a residue level embedding set corresponding to each single-chain MSA;

and determining the first sequence level embedding corresponding to each single-stranded MSA based on the residue level embedding set.

10. A computer device for predicting protein complex structures, comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1 to 9.

11. A computer readable storage medium having stored thereon a computer program/instruction which when executed by a processor performs the steps of the method according to any of claims 1 to 9.

12. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.