CN111145833B

CN111145833B - Deep multi-sequence alignment method for protein complex

Info

Publication number: CN111145833B
Application number: CN201911290749.3A
Authority: CN
Inventors: 於东军; 刘子; 朱一亨
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2022-09-20
Anticipated expiration: 2039-12-16
Also published as: CN111145833A

Abstract

The invention discloses a deep multiple sequence alignment method for protein complexes, comprising the following steps: step 1, searching for multiple sequence alignments of the protein monomers and joining the two monomer alignments based on genome distance; step 2, searching for multiple sequence alignments of the protein monomers and joining the monomer alignments based on species matching; and step 3, searching for multiple sequence alignments of the protein monomers and joining the monomer alignments based on protein interaction network relationships. The method addresses a problem in predicting contact maps between the proteins of a complex: a multiple sequence alignment of a protein complex cannot be constructed directly from the sequence information of the complex, which prevents large-scale prediction and analysis of protein complex contact maps. The method offers high prediction accuracy and strong generalization capability.

Description

Deep multi-sequence alignment method for protein complex

Technical Field

The invention relates to the field of deep multiple sequence alignment of protein complexes in bioinformatics, and in particular to a method for identifying protein-family similarity relationships between protein monomer sequences and the protein sequences that constitute a complex.

Background

Bioinformatics is a young discipline formed at the intersection of biology and information science. It is currently one of the major frontier fields of the life sciences and natural sciences, and its research focuses mainly on genomics and proteomics. Bioinformatics research is important for deepening our understanding of human life processes and for helping people improve their living environment and quality of life, and it is widely valued by scholars in China and abroad.

Proteins, as one of the material bases of life, are essential components of all cellular and tissue structures. They take part in many important processes in living organisms and are the principal agents of life activities. Although deoxyribonucleic acid (DNA) is the carrier of genetic information, the replication, transcription and expression of that information all depend on the cooperation of many proteins. Proteomics can explain life phenomena more directly and accurately than genomics, has developed rapidly in recent years, and has attracted close attention from researchers worldwide. In the post-genome era, the rapid development of protein sequencing technology has caused protein sequence data to grow explosively. The well-known protein database UniProtKB already holds more than 120,243,849 primary protein sequences (as of 2018-07-16) and continues to grow rapidly. Despite this volume of sequence information, only 0.1% (140,000) of the sequenced proteins have a solved three-dimensional structure, and only 0.3% of true protein complexes have been experimentally verified, structurally solved and deposited in the well-known protein structure database PDB. This gap keeps widening as sequencing technology advances and matures.

A survey of the literature shows that abundant results have been obtained for multiple sequence alignment of protein monomer sequences, and several papers of high theoretical and practical significance have been published. Classical protein monomer sequence alignment methods include BLAST (Kent, W. James. "BLAT-the BLAST-like alignment tool." Genome research 12.4 (2002): 656-664.), PSI-BLAST (Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic acids research 25.17 (1997): 3389-3402.), HHblits (Remmert, Michael, et al. "HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment.") and HH (Hammert, M., 2016). However, current research still mainly investigates how to improve the quality of protein monomer multiple sequence alignments. Protein complex multiple sequence alignments are built by simply following the monomer methods, as in ComplexContact (Zeng, Hong, et al. "ComplexContact: a web server for inter-protein contact prediction using deep learning." Nucleic acids research 46.W1 (2018): W432-W437.), GREEN (Joan, Cancer cell 25.6 (2014): 716-717) and EVPf (Hooma, Thomas, et al.).

Although these studies can serve as protein complex multiple sequence alignment methods, challenges remain. First, these methods focus on predicting the secondary structure of the protein complex, so the accuracy of their multiple sequence alignments is not high. Second, the protein complex multiple sequence alignment database is built with a single strategy, so the resulting alignment often contains only the query sequence and its accuracy is poor. Third, different databases are combined mechanically to build the protein complex database, so the alignment results from the different databases interfere with one another, which limits any improvement in alignment accuracy.

Disclosure of Invention

The invention aims to provide a protein complex multiple sequence alignment method that yields high-quality, deep alignments drawn from a wide range of sequence sources and that generalizes well. It addresses the low quality of protein complex multiple sequence alignments caused by reliance on a single database and a shallow search depth.

The technical solution for realizing the purpose of the invention is as follows: a protein complex multiple sequence alignment method comprises the following steps:

Step 1, constructing a protein monomer sequence database and a genome distance search algorithm: first, the Uniclust30 protein monomer sequence database, built from whole-genome protein monomer sequence data, is downloaded. Second, the multiple sequence alignment software HHblits is used to search the protein sequence database Uniclust30 with each protein monomer sequence, giving the multiple sequence alignment of each monomer sequence. Third, the multiple sequence alignment of each monomer sequence is compared against the genome database (ENA). Finally, the multiple sequence alignments of the two different monomer sequences are joined according to the genomic distance between their entries, as determined from the genome database comparison, giving the genome-distance-based multiple sequence alignment of the protein complex;

Step 2, constructing a protein monomer sequence database and a species similarity search algorithm: first, the species classification database (Taxonomy) is downloaded from the public database of the National Center for Biotechnology Information (NCBI). Second, HHblits is used to search the protein monomer sequence database Uniclust30 constructed in step 1 with each protein monomer sequence, giving the multiple sequence alignment of each monomer sequence. Third, the multiple sequence alignment of each monomer sequence is compared against the species classification database (Taxonomy). Finally, the two monomer multiple sequence alignments are joined according to the species comparison results, giving the species-based multiple sequence alignment of the protein complex;

Step 3, constructing a protein interaction network database and a protein interaction search algorithm: first, the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) are downloaded from the public protein interaction network database (STRING). Second, HHblits is used to search the protein interaction sequence information database with each protein monomer sequence, giving the multiple sequence alignment of each monomer sequence. Finally, the multiple sequence alignments of the two different protein monomer sequences are joined according to the protein interaction information, giving the protein-interaction-network-based multiple sequence alignment of the protein complex sequence;

Step 4, selecting a protein complex multiple sequence alignment method: first, the number of effective sequences in the genome-distance-based protein complex multiple sequence alignment from step 1 is calculated. Second, if the number of effective sequences in the step 1 alignment meets the requirement, the step 1 alignment is used as the input to the redundancy removal step. Otherwise, the step 1 alignment is merged with the species-based alignment from step 2, and the number of effective sequences is calculated again. Third, if the number of effective sequences in the merged step 1 and step 2 alignment meets the requirement, the merged result is used as the input to the redundancy removal step. Otherwise, the alignments from steps 1 and 2 are merged with the protein-interaction-network-based alignment from step 3, and the merged result is used as the input to the redundancy removal step;

Step 5, removing redundancy from the multiple sequence alignment: redundancy is removed from the protein complex multiple sequence alignment generated in step 4, so that the similarity between any two sequences in the resulting alignment is less than 90%.

Step 6, online prediction: for a given protein complex sequence to be predicted, the method of steps 1 to 5 is used to generate the multiple sequence alignment of that protein complex sequence, and the result is returned to the user as a web page or by email for the convenience of researchers.
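Steps 1 to 3 each begin with an HHblits search for every monomer sequence. The sketch below only assembles the command lines for such searches and runs nothing. The file names, the database prefix `db/uniclust30`, and the iteration and CPU counts are illustrative assumptions, since the text does not state the search parameters used.

```python
def hhblits_command(query_fasta, database_prefix, output_a3m, iterations=3, cpus=4):
    """Build the argument list for one HHblits search (nothing is executed here).

    All paths and the iteration/CPU counts are hypothetical placeholders.
    """
    return [
        "hhblits",
        "-i", str(query_fasta),      # query monomer sequence
        "-d", str(database_prefix),  # database to search, e.g. a Uniclust30 prefix
        "-oa3m", str(output_a3m),    # write the resulting alignment in A3M format
        "-n", str(iterations),       # number of search iterations (assumed value)
        "-cpu", str(cpus),           # number of CPU threads (assumed value)
    ]


# One search per chain of the complex: sequence A and sequence B.
commands = {
    chain: hhblits_command(f"{chain}.fasta", "db/uniclust30", f"MSA_{chain}.a3m")
    for chain in ("A", "B")
}
```

Each list can be handed to `subprocess.run` once HHblits and the database are installed; the same builder serves step 3 by pointing `database_prefix` at a database built from the STRING sequences.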

Compared with the prior art, the invention has the following notable advantages. (1) Greater alignment depth: first, the depth of the multiple sequence alignment refers to the depth of the strategy hierarchy, rather than alignment with a single search algorithm or database; second, the alignment methods at the different levels are not combined mechanically, but each further level is invoked only according to the number of effective sequences in the alignment from the previous level, which optimizes alignment speed; finally, redundant sequences are removed from the alignments of the different levels, so that the sequences merged from all levels are non-redundant. (2) Higher alignment quality: different protein monomer databases and three different strategies are used to connect the monomer alignments within the protein complex, which avoids the failure to build a protein complex alignment that occurs when a single strategy cannot connect the two monomer sequences. The three search-and-connect strategies therefore ensure that an alignment is obtained, broaden the range of databases searched, and improve alignment quality. (3) Stronger generalization: because three different strategies for connecting monomer alignments are used (based on gene distance, species, and the protein interaction network), a multiple sequence alignment can be generated for any query protein complex sequence, which improves the generalization capability of the model.

Drawings

FIG. 1 is a schematic diagram of the deep multiple sequence alignment method of protein complexes of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings.

The figure shows the system structure of the multiple sequence alignment method of the invention. With reference to the figure, according to an embodiment of the invention, a method for deep multiple sequence alignment of protein complexes comprises the following steps. First, a protein monomer sequence database is constructed, the database is searched with multiple sequence alignment software to obtain the monomer multiple sequence alignments, and the protein monomers are connected according to gene distance information. Next, the multiple sequence alignments of the protein monomer sequences are compared against a species classification database (Taxonomy), and the two monomer alignments are connected according to the species comparison results. Next, the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) are constructed; HHblits is used to search this database with each protein monomer sequence to obtain the monomer multiple sequence alignments, and the alignments of the two protein monomer sequences are connected according to the protein interaction information. Then, the number of effective sequences obtained by each strategy is used to decide whether to construct the alignment of the next-level strategy. Redundancy is then removed from the resulting protein complex multiple sequence alignment, so that the similarity between any two sequences in the alignment is less than 90%. Finally, for a given protein complex sequence to be predicted, the method of steps 1 to 5 is used to generate the corresponding multiple sequence alignment, and the result is returned to the user as a web page or by email for the convenience of researchers. The foregoing process is described in more detail below with reference to the figure.

Step 1, constructing a protein monomer sequence database and a genome distance search algorithm:

Given a protein complex containing two monomer sequences, sequence A and sequence B, a protein database is searched with sequence A and with sequence B separately using a multiple sequence alignment search algorithm, and the strategy 1 multiple sequence alignment is then constructed from gene distance information. The specific steps are as follows:

(1) downloading the Uniclust30 protein monomer sequence database, built from whole-genome protein monomer sequence data (https://uniclust.mmseqs.com/);

(2) searching the protein sequence database Uniclust30 with sequence A and with sequence B using the multiple sequence alignment software HHblits, to obtain the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B respectively;

(3) comparing the multiple sequence alignments MSA_A and MSA_B against the genome database, to obtain the gene information MSA_A_gene and MSA_B_gene of the two alignments respectively;

(4) calculating the gene distance Δgene between two proteins i and j from the same genome in MSA_A_gene and MSA_B_gene, and connecting protein i with protein j if 1 ≤ Δgene ≤ 20;

(5) constructing the gene-distance-based protein complex multiple sequence alignment (MSA) according to steps (1) to (4).
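The pairing rule of sub-step (4) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that each alignment row has already been annotated with a genome accession and an ordinal gene position, that "gene distance" means the absolute difference of those positions within one genome, and that "connecting" two proteins means concatenating their aligned rows. The `Hit` type and its field names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    protein_id: str
    genome: str      # assumed: genome accession obtained from the genome database lookup
    gene_index: int  # assumed: ordinal position of the gene on that genome
    aligned: str     # aligned row taken from the monomer MSA


def pair_by_gene_distance(msa_a, msa_b, min_dist=1, max_dist=20):
    """Connect rows of MSA_A and MSA_B whose genes lie 1..20 genes apart on one genome."""
    by_genome = defaultdict(list)
    for hit_b in msa_b:
        by_genome[hit_b.genome].append(hit_b)

    pairs = []
    for hit_a in msa_a:
        for hit_b in by_genome.get(hit_a.genome, []):
            delta = abs(hit_a.gene_index - hit_b.gene_index)
            if min_dist <= delta <= max_dist:
                # A complex row is the chain A row followed by the chain B row.
                pairs.append((hit_a.protein_id, hit_b.protein_id,
                              hit_a.aligned + hit_b.aligned))
    return pairs


msa_a = [Hit("a1", "g1", 10, "AC-"), Hit("a2", "g2", 5, "ACD")]
msa_b = [Hit("b1", "g1", 12, "DE"), Hit("b2", "g1", 40, "DF"), Hit("b3", "g3", 6, "EE")]
paired = pair_by_gene_distance(msa_a, msa_b)
```

In this toy input only `a1` and `b1` are connected: they share genome `g1` and lie two genes apart, while `b2` is 30 genes away and the remaining hits come from different genomes.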

Step 2, building a monomer sequence database and species similarity search algorithm

(1) Downloading the species classification database (Taxonomy) from the public database of the U.S. National Center for Biotechnology Information (NCBI);

(2) Comparing the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B from step 1 against the species classification database (Taxonomy), to obtain the species of each protein in MSA_A and MSA_B;

(3) Within each species in MSA_A and in MSA_B, ranking the proteins by similarity to the query sequence from high to low;

(4) Let P₁, P₂, …, Pₘ be the proteins of a particular species in MSA_A ordered by sequence similarity, and let Q₁, Q₂, …, Qₙ be the proteins of that species in MSA_B ordered by sequence similarity. Pᵢ and Qᵢ are then connected for i = 1, …, min(m, n).
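The species-matching rule can be illustrated with a short sketch. It reads the rule as pairing the i-th ranked protein of a species in MSA_A with the i-th ranked protein of the same species in MSA_B, for i up to min(m, n); that reading, the tuple layout of the input, and the use of taxonomy identifiers as species keys are assumptions made for the example.

```python
from collections import defaultdict


def pair_by_species_rank(msa_a, msa_b):
    """Pair proteins of the same species by similarity rank.

    msa_a, msa_b: lists of (protein_id, species, similarity_to_query) tuples,
    a hypothetical layout chosen for this sketch.
    """
    def ranked(msa):
        groups = defaultdict(list)
        for protein_id, species, similarity in msa:
            groups[species].append((similarity, protein_id))
        # Highest similarity first within each species.
        return {
            species: [pid for _, pid in sorted(group, key=lambda item: -item[0])]
            for species, group in groups.items()
        }

    ranked_a, ranked_b = ranked(msa_a), ranked(msa_b)
    pairs = []
    for species in sorted(ranked_a.keys() & ranked_b.keys()):
        # zip stops at the shorter list, i.e. after min(m, n) pairs.
        pairs.extend(zip(ranked_a[species], ranked_b[species]))
    return pairs


msa_a = [("P2", "9606", 0.7), ("P1", "9606", 0.9), ("P3", "562", 0.8)]
msa_b = [("Q1", "9606", 0.95), ("Q2", "562", 0.6), ("Q3", "562", 0.5)]
species_pairs = pair_by_species_rank(msa_a, msa_b)
```

Here species `9606` contributes one pair because MSA_B holds only one protein for it, and species `562` contributes one pair because MSA_A holds only one.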

Step 3, constructing a protein interaction network database and a protein interaction search algorithm

(1) Downloading the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) from the public protein interaction network database (STRING);

(2) Searching the protein interaction sequence information database with sequence A and with sequence B using the multiple sequence alignment software HHblits, to obtain the multiple sequence alignments MSA_stringA and MSA_stringB respectively;

(3) Finally, determining from the protein interaction information whether any two proteins, one from MSA_stringA and one from MSA_stringB, interact. If they interact, the two are connected.
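Sub-step (3) amounts to a membership test against the set of known interactions. The sketch below assumes the interaction information has already been parsed into pairs of protein identifiers and that an interaction is symmetric; the identifiers shown are made up for illustration.

```python
def pair_by_interaction(msa_string_a, msa_string_b, interactions):
    """Connect a protein from MSA_stringA with one from MSA_stringB if they interact.

    msa_string_a, msa_string_b: lists of protein identifiers found by the searches.
    interactions: iterable of (protein_id_1, protein_id_2) pairs, assumed to have
    been parsed from the downloaded interaction file beforehand.
    """
    # Store each interaction as an unordered pair so lookup ignores direction.
    linked = {frozenset(pair) for pair in interactions}
    return [
        (a, b)
        for a in msa_string_a
        for b in msa_string_b
        if a != b and frozenset((a, b)) in linked
    ]


hits_a = ["9606.P1", "9606.P2"]
hits_b = ["9606.Q1", "9606.Q2"]
known = [("9606.Q1", "9606.P1"), ("9606.P2", "9606.X")]
interaction_pairs = pair_by_interaction(hits_a, hits_b, known)
```

Only `9606.P1` and `9606.Q1` are connected, because the other listed interaction involves a protein that is not in MSA_stringB.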

Step 4, selecting a protein complex multiple sequence alignment method

The number of effective sequences in the multiple sequence alignment, Necs, is calculated

where L is the chain length of the protein complex, N is the number of sequences in the protein complex multiple sequence alignment (MSA), S_iA,jA is the sequence similarity score between chain A in sequence i and chain A in sequence j, and S_iB,jB is the sequence similarity score between chain B in sequence i and chain B in sequence j. In addition, the optimized Necs value is 128. That is, if Necs ≥ 176, the alignment of the next strategy is not carried out; otherwise, the alignment continues.
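The level-by-level selection of step 4 can be sketched as a short cascade. The effective-sequence count is passed in as a function and the cutoff as a parameter, because the text does not give a usable formula or a single unambiguous threshold; the `len` stand-in and the threshold values in the example are placeholders only, and merging is assumed to be plain concatenation, with redundancy left to step 5.

```python
def select_complex_msa(msa_genome, msa_species, msa_string, count_effective, threshold):
    """Return the first merged alignment whose effective sequence count reaches the threshold.

    count_effective: callable giving the number of effective sequences of a list of rows
    (hypothetical hook; the document's Necs formula is not reproduced here).
    """
    merged = list(msa_genome)
    if count_effective(merged) >= threshold:
        return merged, "genome"

    merged = merged + list(msa_species)
    if count_effective(merged) >= threshold:
        return merged, "genome+species"

    # Last level: everything is merged regardless of the count.
    merged = merged + list(msa_string)
    return merged, "genome+species+string"


# Toy rows; len() is only a placeholder for a real effective-sequence count.
rows, level = select_complex_msa(["r1", "r2"], ["r3"], ["r4", "r5"], len, threshold=3)
```

With these toy inputs the genome-distance alignment alone falls short of the cutoff, so the species-based rows are merged in and the cascade stops at the second level.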

The duplicates generated in the above step were clustered using the protein structure clustering algorithm software SPICKER, and the average of the atomic coordinates in all conformations in each class was calculated. The obtained atomic coordinate mean value is used as the atomic coordinate of the cluster center conformation.

Step 5, removing the redundancy of multi-sequence alignment

Redundancy is removed from the protein complex multiple sequence alignment generated in step 4, so that the similarity between any two sequences in the resulting alignment is less than 90%.
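One simple way to meet the 90% condition is a greedy filter that keeps a row only if it is less than 90% identical to every row already kept. The text does not say which redundancy removal procedure or similarity measure is used, so both the greedy order and the identity definition below (identical residues over columns where neither row has a gap) are assumptions.

```python
def identity(row_x, row_y):
    """Fraction of identical residues over columns where neither row has a gap."""
    compared = [(x, y) for x, y in zip(row_x, row_y) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def remove_redundancy(rows, max_identity=0.9):
    """Greedily keep rows whose identity to every kept row is below max_identity."""
    kept = []
    for row in rows:  # rows are visited in order, so the query (first row) is always kept
        if all(identity(row, other) < max_identity for other in kept):
            kept.append(row)
    return kept


alignment = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
filtered = remove_redundancy(alignment)
```

The second row is an exact copy of the first and the third is exactly 90% identical to it, so both are dropped; the fourth row shares no residues with the first and is kept.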

Step 6, on-line prediction

For a given protein complex sequence to be predicted, the method of steps 1 to 5 is used to generate the corresponding multiple sequence alignment of the protein complex, and the result is returned to the user as a web page or by email for the convenience of researchers.

In summary, first, the depth of the multiple sequence alignment refers to the depth of the strategy hierarchy rather than alignment with a single search algorithm or database; the alignment methods at the different levels are not combined mechanically, but each further level is invoked only according to the number of effective sequences in the alignment from the previous level, which optimizes alignment speed; and redundant sequences are removed from the alignments of the different levels so that the sequences merged from all levels are non-redundant. The invention therefore increases the depth of the multiple sequence alignment. Second, different protein monomer databases and three different strategies are used to connect the monomer alignments within the protein complex, which avoids the failure to build a protein complex alignment that occurs when a single strategy cannot connect the two monomer sequences. The three search-and-connect strategies therefore ensure that an alignment is obtained, broaden the range of databases searched, and improve alignment quality. Finally, because three different strategies for connecting monomer alignments are used (based on gene distance, species, and the protein interaction network), a multiple sequence alignment can be generated for any query protein complex sequence, which improves the generalization capability of the model.

Claims

1. A method for deep multiple sequence alignment of protein complexes, characterized by comprising the following steps:

step 1, constructing a protein monomer sequence database and a genome distance search algorithm:

firstly, downloading the Uniclust30 protein monomer sequence database, built from whole-genome protein monomer sequence data;

secondly, using the multiple sequence alignment software HHblits to search the protein sequence database Uniclust30 with each protein monomer sequence, to obtain the multiple sequence alignment of each protein monomer sequence;

thirdly, comparing the multiple sequence alignment of each protein monomer sequence against the genome database ENA;

finally, joining the multiple sequence alignments of the two different monomer sequences according to the genomic distance between their entries, as determined from the genome database comparison, thereby obtaining the genome-distance-based multiple sequence alignment of the protein complex;

step 2, constructing a protein monomer sequence database and a species similarity search algorithm:

firstly, downloading the species classification database Taxonomy from the public database of the National Center for Biotechnology Information NCBI;

secondly, using the multiple sequence alignment software HHblits to search the protein monomer sequence database Uniclust30 constructed in step 1 with each protein monomer sequence, to obtain the multiple sequence alignment of each protein monomer sequence;

thirdly, comparing the multiple sequence alignment of each protein monomer sequence against the species classification database Taxonomy;

finally, joining the two monomer multiple sequence alignments according to the species comparison results, thereby obtaining the species-based multiple sequence alignment of the protein complex;

step 3, constructing a protein interaction network database and a protein interaction search algorithm:

firstly, downloading the protein interaction information STRING linker and the protein interaction sequence information database STRING database from the public protein interaction network database STRING;

secondly, using the multiple sequence alignment software HHblits to search the protein interaction sequence information database constructed in the preceding step with each protein monomer sequence, to obtain the multiple sequence alignment of each protein monomer sequence;

finally, joining the multiple sequence alignments of the two different protein monomer sequences according to the protein interaction information, thereby obtaining the protein-interaction-network-based multiple sequence alignment of the protein complex sequence;

step 4, selecting a protein complex multiple sequence alignment method:

firstly, calculating the number of effective sequences in the genome-distance-based protein complex multiple sequence alignment of step 1;

secondly, if the number of effective sequences in the multiple sequence alignment of step 1 meets the requirement, using the alignment of step 1 as the input to the redundancy removal step; otherwise, merging the multiple sequence alignment of step 1 with the species-based multiple sequence alignment of step 2 and calculating the number of effective sequences;

thirdly, if the number of effective sequences in the merged alignment of step 1 and step 2 meets the requirement, using the merged result as the input to the redundancy removal step; otherwise, merging the alignments of step 1 and step 2 with the protein-interaction-network-based multiple sequence alignment of step 3 and using the merged result as the input to the redundancy removal step;

step 5, removing multiple sequence alignment redundancy: removing redundancy from the protein complex multiple sequence alignment generated in step 4, so that the similarity between any two sequences in the resulting alignment is less than 90%.

2. The multiple sequence alignment method of claim 1, wherein, in step 1:

(1) the Uniclust30 protein monomer sequence database, built from whole-genome protein monomer sequence data, is downloaded;

(2) the protein sequence database Uniclust30 is searched with sequence A and with sequence B using the multiple sequence alignment software HHblits, to obtain the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B respectively;

(3) the multiple sequence alignments MSA_A and MSA_B are compared against the genome database, to obtain the gene information MSA_A_gene and MSA_B_gene of the two alignments respectively;

(5) the gene-distance-based protein complex multiple sequence alignment MSA is constructed according to steps (1) to (4).

3. The multiple sequence alignment method of claim 2, wherein, in step 2:

(1) the species classification database Taxonomy is downloaded from the public database of the National Center for Biotechnology Information NCBI;

(2) the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B from step 1 are compared against the species classification database Taxonomy, to obtain the species of each protein in MSA_A and MSA_B;

let P₁, P₂, …, Pₘ be the proteins of a particular species in MSA_A ordered by sequence similarity, and let Q₁, Q₂, …, Qₙ be the proteins of that species in MSA_B ordered by sequence similarity; Pᵢ and Qᵢ are then connected for i = 1, …, min(m, n).

4. The multiple sequence alignment method of claim 1, wherein, in step 3, the protein interaction information STRING linker and the protein interaction sequence information database STRING database are downloaded from the public protein interaction network database STRING;

(1) the protein interaction sequence information database is searched with sequence A and with sequence B using the multiple sequence alignment software HHblits, to obtain the multiple sequence alignments MSA_stringA and MSA_stringB respectively;

(2) whether any two proteins, one from MSA_stringA and one from MSA_stringB, interact is determined from the protein interaction information, and the two proteins are connected if they interact.

5. The multiple sequence alignment method of claim 1, wherein, in step 4, the number of effective sequences in the multiple sequence alignment is calculated

where L is the chain length of the protein complex, N is the number of sequences in the protein complex multiple sequence alignment MSA, S_iA,jA is the sequence similarity score between chain A in sequence i and chain A in sequence j, and S_iB,jB is the sequence similarity score between chain B in sequence i and chain B in sequence j;

the copies generated in the above step are clustered using the protein structure clustering algorithm software SPICKER, and the average of the atomic coordinates over all conformations in each class is calculated; the resulting mean atomic coordinates are used as the atomic coordinates of the cluster center conformation.