CN111145833A - A method for deep multiple sequence alignment of protein complexes

Info

Publication number: CN111145833A
Application number: CN201911290749.3A
Authority: CN
Inventors: 於东军; 刘子; 朱一亨
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2020-05-12
Anticipated expiration: 2039-12-16
Also published as: CN111145833B

Abstract

The invention discloses a deep multiple sequence alignment method for protein complexes, comprising the following steps: step 1, searching a multiple sequence alignment for each protein monomer, then joining the two monomer multiple sequence alignments based on genome distance; step 2, searching a multiple sequence alignment for each protein monomer, then joining the monomer multiple sequence alignments based on species matching; and step 3, searching a multiple sequence alignment for each protein monomer, then joining the monomer multiple sequence alignments based on protein interaction network relationships. The method addresses a problem in predicting contact maps between protein complexes: because a multiple sequence alignment of a protein complex cannot be constructed directly from the sequence information of the complex, the contact maps of protein complexes cannot be predicted and analyzed on a large scale. The method offers high prediction accuracy and strong generalization ability.

Description

Deep multiple sequence alignment method for protein complexes

Technical Field

The invention relates to the field of deep multiple sequence alignment of protein complexes in bioinformatics, and in particular to a method for identifying protein family similarity relationships between protein monomer sequences and the proteins that constitute a complex sequence.

Background

Bioinformatics is a young discipline formed at the intersection of biology and information science and is currently one of the major frontier fields of the life and natural sciences. Its research focuses mainly on two areas, genomics and proteomics. Bioinformatics research is important for deepening our understanding of life processes and for helping people improve their living environment and quality of life, and it is widely valued by scholars at home and abroad.

Proteins are one of the material bases of life phenomena and important components of all cell and tissue structures; they participate in many important life processes in living organisms and are key players in life activities. Although deoxyribonucleic acid (DNA) is the carrier of genetic information, the replication, transcription, and expression of genetic information all rely on the cooperation of various proteins. Proteomics explains life phenomena more directly and accurately than genomics, has developed rapidly in recent years, and has drawn close attention from scholars worldwide. In the post-genome era, with the rapid development of protein sequencing technology, protein sequence data have grown explosively: the well-known protein database UniProtKB already holds more than 120,243,849 protein primary sequences (as of 2018-07-16), and rapid growth continues. However, of the proteins sequenced so far, only 0.1% (140,000) have a solved three-dimensional structure, and only 0.3% of true protein complexes have been experimentally verified with a solved three-dimensional structure and deposited in the well-known protein structure database PDB. This gap keeps widening as the technology advances and matures.

A review of the literature shows that multiple sequence alignment of protein monomer sequences has produced substantial results, with many papers of high theoretical significance and practical value. The classical methods for protein monomer sequence alignment include BLAST (Kent, W. James. "BLAT-the BLAST-like alignment tool." Genome Research 12.4 (2002): 656-664.), PSI-BLAST (Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Research 25.17 (1997): 3389-3402.), and HHblits (Remmert, Michael, et al. "HHblits: lightning-fast iterative protein sequence search by HMM-HMM alignment." Nature Methods 9.2 (2012): 173-175.), among others. However, current research mainly investigates how to improve the quality of protein monomer multiple sequence alignments through further filtering and analysis, while protein complex multiple sequence alignment simply follows the methods used for protein monomers, as in ComplexContact (Zeng, Hong, et al. "ComplexContact: a web server for inter-protein contact prediction using deep learning." Nucleic Acids Research 46.W1 (2018): W432-W437.), GREMLIN, and EVcomplex (Hopf, Thomas A., et al. "Sequence co-evolution gives 3D contacts and structures of protein complexes." eLife 3 (2014): e03430.).

Although these approaches can serve as protein complex multiple sequence alignment methods, challenges remain. First, these methods focus on predicting the secondary structure of the protein complex, so the accuracy of the resulting multiple sequence alignments is not high. Second, they use a single strategy to construct the protein complex multiple sequence alignment database, so the alignment often contains only the query sequence and its precision is poor. In addition, they combine different databases mechanically to build a protein complex database, so the alignment results from different databases interfere with one another, which limits any improvement in alignment accuracy.

Disclosure of Invention

The invention aims to provide a protein complex multiple sequence alignment method that yields high-quality, deep alignments drawn from a wide range of sequence sources and that generalizes well, in order to overcome the low quality of protein complex multiple sequence alignments caused by relying on a single database and a shallow search.

The technical solution for realizing the purpose of the invention is as follows: a protein complex multiple sequence alignment method comprises the following steps:

step 1, constructing a protein monomer sequence database and a genome distance search algorithm: first, the Uniclust30 protein monomer sequence database is downloaded from the whole-genome protein monomer sequence data. Second, the multiple sequence alignment software HHblits is used to search the protein sequence database Uniclust30 with each protein monomer sequence, yielding the multiple sequence alignment of each protein monomer sequence. Next, the multiple sequence alignment result of each protein monomer sequence is compared against the genome database (ENA). Finally, the multiple sequence alignments of the two different monomer sequences are joined according to the distances between their matches in the genome database, thereby obtaining the genome distance-based multiple sequence alignment of the protein complex;

step 2, constructing a protein monomer sequence database and a species similarity search algorithm: first, the species classification database (Taxonomy) is downloaded from the public database of the National Center for Biotechnology Information (NCBI). Second, HHblits is used to search the protein monomer sequence database Uniclust30 constructed in step 1 with each protein monomer sequence, yielding the multiple sequence alignment of each protein monomer sequence. Next, the multiple sequence alignment result of each protein monomer sequence is matched against the species classification database (Taxonomy). Finally, the multiple sequence alignments of the two different monomers are joined according to the species matching results, thereby obtaining the species-based multiple sequence alignment of the protein complex;

step 3, constructing a protein interaction network database and a protein interaction search algorithm: first, the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) are downloaded from the public protein interaction network database (STRING). Second, HHblits is used to search the protein interaction sequence information database constructed above with each protein monomer sequence, yielding the multiple sequence alignment of each protein monomer sequence. Finally, the multiple sequence alignments of the two different protein monomer sequences are joined according to the protein interaction information, thereby obtaining the protein interaction network-based multiple sequence alignment of the protein complex sequence;

step 4, selecting a protein complex multiple sequence alignment method: first, the number of effective sequences in the genome distance-based protein complex multiple sequence alignment from step 1 is calculated. Second, if the number of effective sequences in the step 1 alignment meets the requirement, that alignment is used as the input to the redundancy removal step; otherwise, the step 1 alignment is merged with the species-based alignment from step 2 and the number of effective sequences is recalculated. Third, if the number of effective sequences in the merged step 1 and step 2 alignment meets the requirement, the merged result is used as the input to the redundancy removal step; otherwise, the alignments from steps 1, 2 and 3 (the last being based on the protein interaction network) are merged and used as the input to the redundancy removal step;

step 5, removing redundancy from the multiple sequence alignment: redundancy is removed from the protein complex multiple sequence alignment generated in step 4, so that the similarity between any two sequences in the resulting alignment is less than 90%;

step 6, online prediction: given a protein complex sequence to be predicted, the multiple sequence alignment of that sequence is generated using steps 1 to 5 and returned to the user as a web page or by e-mail, for the convenience of researchers.

Compared with the prior art, the invention has the following notable advantages: (1) Greater alignment depth: first, the depth of the multiple sequence alignment refers to the depth of the strategy hierarchy, rather than alignment with a single search algorithm or database; second, the alignment methods at different levels are not combined mechanically; instead, the next level is invoked only according to the number of effective sequences in the alignment from the previous level, which optimizes alignment speed; finally, redundant sequences are removed from the alignments of the different levels so that the merged sequences from all levels are non-redundant. (2) Higher alignment quality: different protein monomer databases and three different strategies are used to join the monomer multiple sequence alignments of a protein complex, which avoids the failure to construct a protein complex multiple sequence alignment when a single strategy cannot join the two protein monomer sequences. The three search-and-join strategies therefore ensure that an alignment result is obtained, widen the range of databases covered, and improve alignment quality. (3) Better generalization: with three different strategies for joining protein monomer multiple sequence alignments (based on gene distance, species, and the protein interaction network), a multiple sequence alignment can be generated for any query protein complex sequence, which improves the generalization ability of the model.

Drawings

FIG. 1 is a schematic diagram of the deep multiple sequence alignment method of protein complexes of the present invention.

Detailed Description

The invention will be further explained with reference to the drawings.

The figure shows the system structure of the multiple sequence alignment method of the invention. With reference to the drawing, according to an embodiment of the invention, a method for deep multiple sequence alignment of protein complexes comprises the following steps: first, a protein monomer sequence database is constructed; the database is searched with multiple sequence alignment software to obtain the monomer multiple sequence alignments, and the protein monomers are then joined according to gene distance information. Next, the multiple sequence alignments of the protein monomer sequences are matched against the species classification database (Taxonomy), and the alignments of the two different monomers are joined according to the species matching results. Next, the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) are constructed; HHblits is used to search this database with each protein monomer sequence to obtain its multiple sequence alignment, and the alignments of the two different protein monomer sequences are joined according to the protein interaction information. Then, the number of effective sequences obtained by each strategy determines whether the alignment of the next-level strategy is constructed. After that, redundancy is removed from the resulting protein complex multiple sequence alignment, so that the similarity between any two sequences in it is less than 90%.
Finally, given a protein complex sequence to be predicted, steps 1 to 5 are used to generate the multiple sequence alignment of that sequence, and the result is returned to the user as a web page or by e-mail, for the convenience of researchers. The foregoing process is described in more detail below with reference to the drawing.

Step 1, constructing a protein monomer sequence database and a genome distance search algorithm:

Given a protein complex containing two monomer sequences, sequence A and sequence B, a multiple sequence alignment search algorithm is used to search a protein database with sequence A and with sequence B, and the strategy 1 multiple sequence alignment is then constructed from gene distance information. The specific steps are as follows:

(1) downloading the Uniclust30 protein monomer sequence database from the whole-genome protein monomer sequence data (https://uniclust.mmseqs.com/);

(2) using the multiple sequence alignment software HHblits to search the protein sequence database Uniclust30 with sequence A and with sequence B, to obtain the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B respectively;

(3) comparing the multiple sequence alignment results MSA_A and MSA_B against a genome database, to obtain the gene information MSA_A_gene and MSA_B_gene of the two alignment results respectively;

(4) calculating the gene distance Δgene between two proteins i and j that share the same gene in MSA_A_gene and MSA_B_gene, and joining protein i with protein j if 1 ≤ Δgene ≤ 20;

(5) according to steps (1) to (4), constructing the gene distance-based protein complex multiple sequence alignment (MSA).
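The joining rule in sub-step (4) can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the `Hit` record and its fields are hypothetical, "sharing the same gene" is read here as "mapped to the same genome record", and the gene distance is assumed to be the absolute difference of ordinal gene positions on that record.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One homolog in a monomer MSA (all field names are hypothetical)."""
    seq_id: str      # identifier of the hit in the monomer MSA
    genome: str      # genome record the hit's gene was mapped to (assumption)
    gene_index: int  # ordinal position of the gene on that record (assumption)
    aligned: str     # aligned row taken from the monomer MSA


def pair_by_gene_distance(msa_a, msa_b, min_dist=1, max_dist=20):
    """Join rows of MSA_A and MSA_B whose genes satisfy 1 <= delta <= 20."""
    # Index the hits of MSA_B by genome so only same-genome pairs are compared.
    by_genome = defaultdict(list)
    for hit_b in msa_b:
        by_genome[hit_b.genome].append(hit_b)

    paired = []
    for hit_a in msa_a:
        for hit_b in by_genome.get(hit_a.genome, []):
            delta = abs(hit_a.gene_index - hit_b.gene_index)
            if min_dist <= delta <= max_dist:
                # A joined row is the A row followed by the B row.
                paired.append((hit_a.seq_id, hit_b.seq_id,
                               hit_a.aligned + hit_b.aligned))
    return paired
```

Each returned tuple carries the two identifiers and the concatenated aligned row, which corresponds to one sequence of the gene distance-based complex MSA.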

Step 2, building a monomer sequence database and species similarity search algorithm

(1) Downloading the species classification database (Taxonomy) from the public database of the National Center for Biotechnology Information (NCBI);

(2) Matching the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B from step 1 against the species classification database (Taxonomy), to obtain the species information of the proteins in MSA_A and MSA_B respectively;

(3) Within each species in MSA_A and in MSA_B, ranking the proteins by similarity to the query sequence from high to low;

(4) Let P₁, P₂, …, P_m be the proteins of a particular species in MSA_A ordered by sequence similarity, and let Q₁, Q₂, …, Q_n be the proteins of the same species in MSA_B ordered by sequence similarity. Then P_i and Q_i are joined, where i = min(m, n).
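As an illustration of the species-based joining, the sketch below groups the hits of each monomer alignment by species, sorts each group by similarity to the query, and joins hits of equal rank. Joining every rank up to min(m, n) is one reading of the rule above and is an assumption of this sketch; the tuple layout `(seq_id, species, similarity, aligned_row)` is likewise hypothetical.

```python
from collections import defaultdict


def pair_by_species(msa_a, msa_b):
    """Join equally ranked hits of the same species from MSA_A and MSA_B.

    Each MSA is a list of (seq_id, species, similarity, aligned_row) tuples;
    this layout is an assumption made for the sketch.
    """
    def group(msa):
        groups = defaultdict(list)
        for seq_id, species, similarity, row in msa:
            groups[species].append((similarity, seq_id, row))
        for hits in groups.values():
            # Highest similarity to the query first.
            hits.sort(key=lambda h: h[0], reverse=True)
        return groups

    groups_a, groups_b = group(msa_a), group(msa_b)
    paired = []
    # Only species present in both alignments can contribute joined rows.
    for species in sorted(groups_a.keys() & groups_b.keys()):
        # zip stops at the shorter list, i.e. after min(m, n) pairs.
        for (_, id_a, row_a), (_, id_b, row_b) in zip(groups_a[species],
                                                     groups_b[species]):
            paired.append((species, id_a, id_b, row_a + row_b))
    return paired
```

Species that occur in only one of the two alignments contribute nothing, which is one reason a single joining strategy can leave the complex alignment shallow.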

Step 3, constructing a protein interaction network database and a protein interaction search algorithm

(1) Downloading the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) from the public protein interaction network database (STRING);

(2) Using the multiple sequence alignment software HHblits to search the protein interaction sequence information database with sequence A and with sequence B, to obtain the multiple sequence alignments MSA_stringA and MSA_stringB respectively;

(3) Finally, determining from the protein interaction information whether any protein in MSA_stringA interacts with any protein in MSA_stringB. If there is an interaction, the two are joined.
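A minimal sketch of the interaction-based joining in sub-step (3) is given below. It assumes the interaction information has already been parsed into pairs of protein identifiers; the parsing itself, the tuple layout `(protein_id, aligned_row)`, and the function name are hypothetical and not taken from the source.

```python
def pair_by_interaction(msa_a, msa_b, interactions):
    """Join hits from MSA_stringA and MSA_stringB that are listed as interacting.

    msa_a, msa_b: lists of (protein_id, aligned_row) tuples (assumed layout).
    interactions: iterable of (protein_id_1, protein_id_2) pairs (assumed layout).
    """
    # Store each interaction as an unordered pair so lookup ignores direction.
    linked = {frozenset(pair) for pair in interactions}

    paired = []
    for id_a, row_a in msa_a:
        for id_b, row_b in msa_b:
            if id_a != id_b and frozenset((id_a, id_b)) in linked:
                paired.append((id_a, id_b, row_a + row_b))
    return paired
```

The nested loop is quadratic in the number of hits; for large alignments an index from protein identifier to interaction partners would be the natural refinement.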

Step 4, selecting a protein complex multiple sequence alignment method

Calculating the number of effective sequences in the multi-sequence comparison

Here, L is the chain length of the protein complex, N is the number of sequences in the protein complex multiple sequence alignment (MSA), S_{iA,jA} is the sequence similarity score between chain A of sequence i and chain A of sequence j, and S_{iB,jB} is the sequence similarity score between chain B of sequence i and chain B of sequence j. In addition, the optimized Necs value is 128. That is, if Necs is greater than or equal to 176, the alignment of the next strategy is not carried out; otherwise, the alignment continues.
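The level-by-level selection of step 4 can be sketched as a short cascade. The sketch is deliberately agnostic about how the number of effective sequences is computed: `count_effective` is a hypothetical callable supplied by the caller, and `threshold` is left as a parameter rather than fixed to a particular value.

```python
def select_complex_msa(msa_gene, msa_species, msa_string,
                       count_effective, threshold):
    """Return (alignment, level) following the step 4 cascade.

    msa_gene, msa_species, msa_string: joined rows from strategies 1, 2 and 3.
    count_effective: hypothetical callable giving the number of effective
        sequences of an alignment.
    threshold: required number of effective sequences.
    """
    # Level 1: the gene distance-based alignment alone.
    merged = list(msa_gene)
    if count_effective(merged) >= threshold:
        return merged, 1

    # Level 2: add the species-based alignment and test again.
    merged = merged + list(msa_species)
    if count_effective(merged) >= threshold:
        return merged, 2

    # Level 3: add the interaction-based alignment; no further test is made.
    return merged + list(msa_string), 3
```

Because later strategies are only consulted when the earlier ones are too shallow, the searches behind them could in principle be run lazily; the sketch takes precomputed alignments for simplicity.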

The duplicates generated in the above step were clustered using the protein structure clustering algorithm software SPICKER, and the average of the atomic coordinates in all conformations in each class was calculated. The obtained atomic coordinate mean value is used as the atomic coordinate of the cluster center conformation.

Step 5, removing the redundancy of multi-sequence alignment

Redundancy is removed from the protein complex multiple sequence alignment generated in step 4, so that the similarity between any two sequences in the resulting alignment is less than 90%.
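One simple way to reach the stated property is a greedy filter, sketched below. The source does not say how similarity is measured or which tool performs the filtering, so two assumptions are made here: similarity is taken as the fraction of identical residues over columns where both rows are ungapped, and rows are visited in order so that the first row (normally the query) is always kept.

```python
def identity(row_a, row_b):
    """Fraction of identical residues over columns where both rows are ungapped."""
    matches = compared = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        matches += x == y
    return matches / compared if compared else 0.0


def remove_redundancy(rows, max_identity=0.9):
    """Greedy filter: keep a row only if it is below max_identity to every kept row."""
    kept = []
    for row in rows:
        if all(identity(row, k) < max_identity for k in kept):
            kept.append(row)
    return kept
```

After filtering, every pair of kept rows has identity strictly below `max_identity`, which matches the "less than 90%" condition under the assumed similarity measure.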

Step 6, on-line prediction

For a given protein complex sequence to be predicted, the corresponding multiple sequence alignment of the protein complex is generated using steps 1 to 5, and the multiple sequence alignment result is returned to the user as a web page or by e-mail, for the convenience of researchers.

In summary, the invention first improves the depth of the multiple sequence alignment: depth here refers to the depth of the strategy hierarchy, rather than alignment with a single search algorithm or database; the alignment methods at different levels are not combined mechanically, but the next level is invoked only according to the number of effective sequences in the alignment from the previous level, which optimizes alignment speed; and redundant sequences are removed from the alignments of the different levels so that the merged sequences from all levels are non-redundant. Second, different protein monomer databases and three different strategies are used to join the monomer multiple sequence alignments of a protein complex, which avoids the failure to construct a protein complex multiple sequence alignment when a single strategy cannot join the two protein monomer sequences. The three search-and-join strategies therefore ensure that an alignment result is obtained, widen the range of databases covered, and improve alignment quality. Finally, with three different strategies for joining protein monomer multiple sequence alignments (based on gene distance, species, and the protein interaction network), a multiple sequence alignment can be generated for any query protein complex sequence, which improves the generalization ability of the model.

Claims

1. A method for deep multiple sequence alignment of protein complexes, characterized by comprising the following steps:

step 1, constructing a protein monomer sequence database and a genome distance search algorithm:

firstly, downloading the Uniclust30 protein monomer sequence database from the whole-genome protein monomer sequence data;

secondly, using the multiple sequence alignment software HHblits to search the protein sequence database Uniclust30 with each protein monomer sequence, to obtain the multiple sequence alignment of each protein monomer sequence;

thirdly, comparing the multiple sequence alignment result of each protein monomer sequence against the genome database ENA;

finally, joining the multiple sequence alignments of the two different monomer sequences according to the distances between their matches in the genome database, thereby obtaining the genome distance-based multiple sequence alignment of the protein complex;

step 2, constructing a protein monomer sequence database and species similarity search algorithm:

firstly, downloading the species classification database Taxonomy from the public database of the National Center for Biotechnology Information NCBI;

secondly, using the multiple sequence alignment software HHblits to search the protein monomer sequence database Uniclust30 constructed in step 1 with each protein monomer sequence, to obtain the multiple sequence alignment of each protein monomer sequence;

thirdly, matching the multiple sequence alignment result of each protein monomer sequence against the species classification database Taxonomy;

finally, joining the multiple sequence alignments of the two different monomers according to the species matching results, thereby obtaining the species-based multiple sequence alignment of the protein complex;

step 3, constructing a protein interaction network database and a protein interaction search algorithm:

firstly, downloading the protein interaction information STRING linker and the protein interaction sequence information database STRING database from the public protein interaction network database STRING;

secondly, using the multiple sequence alignment software HHblits to search the protein interaction sequence information database constructed above with each protein monomer sequence, to obtain the multiple sequence alignment of each protein monomer sequence; finally, joining the multiple sequence alignments of the two different protein monomer sequences according to the protein interaction information, thereby obtaining the protein interaction network-based multiple sequence alignment of the protein complex sequence;

step 4, selecting a protein complex multiple sequence alignment method:

firstly, calculating the number of effective sequences in the genome distance-based protein complex multiple sequence alignment from step 1;

secondly, if the number of effective sequences in the step 1 alignment meets the requirement, using that alignment as the input to the redundancy removal step; otherwise, merging the step 1 alignment with the species-based multiple sequence alignment from step 2 and recalculating the number of effective sequences;

thirdly, if the number of effective sequences in the merged step 1 and step 2 alignment meets the requirement, using the merged result as the input to the redundancy removal step; otherwise, merging the alignments from step 1, step 2 and step 3 (the last being based on the protein interaction network) and using the result as the input to the redundancy removal step;

2. The method of multiple sequence alignment of claim 1, wherein, in step 1:

(1) downloading the Uniclust30 protein monomer sequence database from the whole-genome protein monomer sequence data;

(2) using the multiple sequence alignment software HHblits to search the protein sequence database Uniclust30 with sequence A and with sequence B, to obtain the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B respectively;

(3) comparing the multiple sequence alignment results MSA_A and MSA_B against a genome database, to obtain the gene information MSA_A_gene and MSA_B_gene of the two alignment results respectively;

(5) according to steps (1) to (4), constructing the gene distance-based protein complex multiple sequence alignment MSA.

3. The method of multiple sequence alignment of claim 1, wherein, in step 2:

(1) downloading the species classification database Taxonomy from the public database of the National Center for Biotechnology Information NCBI;

(2) matching the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B from step 1 against the species classification database Taxonomy, to obtain the species information of the proteins in MSA_A and MSA_B respectively;

let P₁, P₂, …, P_m be the proteins of a particular species in MSA_A ordered by sequence similarity, and let Q₁, Q₂, …, Q_n be the proteins of the same species in MSA_B ordered by sequence similarity; then P_i and Q_i are joined, where i = min(m, n).

4. The method of multiple sequence alignment of claim 1, wherein, in step 3, the protein interaction information STRING linker and the protein interaction sequence information database STRING database are downloaded from the public protein interaction network database STRING;

(1) using the multiple sequence alignment software HHblits to search the protein interaction sequence information database with sequence A and with sequence B, to obtain the multiple sequence alignments MSA_stringA and MSA_stringB respectively;

(2) determining from the protein interaction information whether any protein in MSA_stringA interacts with any protein in MSA_stringB, and joining the two proteins if they interact.

5. The method of multiple sequence alignment of claim 1, wherein, in step 4, the number of effective sequences in the multiple sequence alignment is calculated

where L is the chain length of the protein complex, N is the number of sequences in the protein complex multiple sequence alignment MSA, S_{iA,jA} is the sequence similarity score between chain A of sequence i and chain A of sequence j, and S_{iB,jB} is the sequence similarity score between chain B of sequence i and chain B of sequence j;

clustering the copies generated in the step by using protein structure clustering algorithm software SPICKER, and calculating the average value of atomic coordinates in all conformations in each class; the obtained atomic coordinate mean value is used as the atomic coordinate of the cluster center conformation.