CN111145833B - Deep multi-sequence alignment method for protein complex - Google Patents

Deep multi-sequence alignment method for protein complex

Info

Publication number
CN111145833B
Authority
CN
China
Prior art keywords
sequence
protein
database
msa
sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911290749.3A
Other languages
Chinese (zh)
Other versions
CN111145833A (en)
Inventor
於东军
刘子
朱一亨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology
Priority to CN201911290749.3A
Publication of CN111145833A
Application granted
Publication of CN111145833B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Abstract

The invention discloses a deep multiple sequence alignment method for protein complexes, comprising the following steps: step 1, searching multiple sequence alignments for the protein monomers, and then joining the two monomer multiple sequence alignments based on genome distance; step 2, searching multiple sequence alignments for the protein monomers, and then joining the monomer multiple sequence alignments based on species matching; and step 3, searching multiple sequence alignments for the protein monomers, and then joining the monomer multiple sequence alignments based on protein interaction network relationships. The method addresses a problem in predicting inter-protein contact maps of protein complexes: because a multiple sequence alignment of a protein complex cannot be constructed directly from the sequence information of the complex, contact maps of protein complexes cannot be predicted and analyzed on a large scale. The method has the advantages of high prediction accuracy and strong generalization capability.

Description

Deep multi-sequence alignment method for protein complex
Technical Field
The invention relates to the field of deep multiple sequence alignment of protein complexes in bioinformatics, and in particular to a method for identifying protein-family similarity relationships between protein monomer sequences and the proteins that make up a complex sequence.
Background
Bioinformatics is a young discipline formed at the intersection of biology and information science. It is currently one of the major frontier fields of the life sciences and natural sciences, and its research focuses mainly on genomics and proteomics. Bioinformatics research is important for deepening our understanding of life processes and for helping people improve their living environment and quality of life, and it is widely valued by scholars at home and abroad.
Proteins, as one of the material bases of life, are important components of all cell and tissue structures; they participate in many important life processes of the organism and are the main agents of life activities. Although deoxyribonucleic acid (DNA) is the carrier of genetic information, the replication, transcription and expression of genetic information all rely on the cooperation of various proteins. Proteomics explains life phenomena more directly and accurately than genomics, has developed rapidly in recent years, and has attracted close attention from researchers worldwide. In the post-genome era, with the rapid development of protein sequencing technology, protein sequence data have grown explosively: the well-known protein database UniProtKB already holds more than 120,243,849 protein primary sequences (as of 2018-07-16), and this rapid growth continues. Against this huge volume of sequence data, however, only 0.1% (140,000) of the sequenced proteins have a solved three-dimensional structure, and only 0.3% of true protein complexes have been experimentally verified, structurally solved, and deposited in the well-known protein structure database PDB. This gap keeps widening as the technology advances and matures.
The literature shows that abundant results have been obtained in multiple sequence alignment of protein monomer sequences, and several papers of high theoretical and practical significance have been published. The classical protein monomer sequence alignment methods include BLAST (Kent, W. James. "BLAT-the BLAST-like alignment tool." Genome Research 12.4 (2002): 656-664.), PSI-BLAST (Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Research 25.17 (1997): 3389-3402.), HHblits (Remmert, Michael, et al. "HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment." Nature Methods 9.2 (2012): 173-175.), and HH (Hammert, M., 2016). However, current research mainly investigates how to improve the quality of protein monomer multiple sequence alignments through further analysis, while protein complex multiple sequence alignments simply reuse the monomer alignment methods, as in ComplexContact (Zeng, Hong, et al. "ComplexContact: a web server for inter-protein contact prediction using deep learning." Nucleic Acids Research 46.W1 (2018): W432-W437.), GREEN (Joan, "proteins binding the mechanism of computer cell differentiation." Cancer Cell 25.6 (2014): 716-717.), and EVcomplex (Hopf, Thomas A., et al. "Sequence co-evolution gives 3D contacts and structures of protein complexes." eLife 3 (2014): e03430.).
Although these methods can be used for protein complex multiple sequence alignment, challenges remain. First, they focus on predicting the secondary structure of the protein complex, so the accuracy of the resulting multiple sequence alignment is not high. Second, they use a single strategy to construct the protein complex multiple sequence alignment database, so the alignment often contains only the query sequence and its accuracy is poor. In addition, they combine different databases mechanically to construct the protein complex database, so the alignment results from different databases interfere with one another, which limits any improvement in alignment accuracy.
Disclosure of Invention
The invention aims to provide a protein complex multiple sequence alignment method that yields high-quality, deep alignments drawn from a wide range of sequence sources and that generalizes well, in order to overcome the low quality of protein complex multiple sequence alignments caused by relying on a single database and a shallow search.
The technical solution for achieving this aim is a protein complex multiple sequence alignment method comprising the following steps:
Step 1, constructing a protein monomer sequence database and a genome distance search algorithm: first, the Uniclust30 protein monomer sequence database is downloaded from the whole-genome protein monomer sequence data. Second, the multiple sequence alignment software HHblits is used to search the protein sequence database Uniclust30 with each protein monomer sequence, giving the multiple sequence alignment of each monomer sequence. Next, the multiple sequence alignment of each monomer sequence is matched against the genome database (ENA). Finally, the multiple sequence alignments of the two different monomer sequences are joined according to the distances between their entries in the genome database, giving the genome-distance-based multiple sequence alignment of the protein complex;
Step 2, constructing a protein monomer sequence database and a species similarity search algorithm: first, the species classification database (Taxonomy) is downloaded from the public database of the National Center for Biotechnology Information (NCBI). Second, HHblits is used to search the protein monomer sequence database Uniclust30 constructed in step 1 with each protein monomer sequence, giving the multiple sequence alignment of each monomer sequence. Next, the multiple sequence alignment of each monomer sequence is matched against the species classification database (Taxonomy) to assign species. Finally, the two different monomer multiple sequence alignments are joined according to the species matching result, giving the species-based multiple sequence alignment of the protein complex;
Step 3, constructing a protein interaction network database and a protein interaction search algorithm: first, the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) are downloaded from the public protein interaction network database (STRING). Second, HHblits is used to search the protein interaction sequence information database constructed above with each protein monomer sequence, giving the multiple sequence alignment of each monomer sequence. Finally, the multiple sequence alignments of the two different protein monomer sequences are joined according to the protein interaction information, giving the protein-interaction-network-based multiple sequence alignment of the protein complex sequence;
Step 4, selecting the protein complex multiple sequence alignment: first, the number of effective sequences in the genome-distance-based protein complex multiple sequence alignment of step 1 is calculated. Second, if that number meets the requirement, the alignment of step 1 is used as the input to the redundancy removal step. Otherwise, the alignment of step 1 is merged with the species-based alignment of step 2 and the number of effective sequences is recalculated. Third, if the number of effective sequences in the merged alignment of steps 1 and 2 meets the requirement, the merged alignment is used as the input to the redundancy removal step. Otherwise, the alignments of steps 1, 2 and 3 (the last based on the protein interaction network) are merged and used as the input to the redundancy removal step;
Step 5, removing redundancy from the multiple sequence alignment: the protein complex multiple sequence alignment generated in step 4 is made non-redundant, so that the similarity between any two sequences in the resulting alignment is less than 90%.
Step 6, online prediction: for a given protein complex sequence to be predicted, the method of steps 1 to 5 is used to generate the corresponding multiple sequence alignment, and the result is returned to the user as a web page or by e-mail for the convenience of researchers.
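The "joining" of two monomer alignments in steps 1 to 3 can be pictured as concatenating paired rows. The following is a minimal sketch under stated assumptions: the function name `join_paired_rows` is hypothetical, rows are assumed to be already aligned to their respective query sequences, and the patent text does not specify the data layout.

```python
from typing import List, Tuple


def join_paired_rows(query_a: str, query_b: str,
                     pairs: List[Tuple[str, str]]) -> List[str]:
    """Concatenate paired aligned rows into complex-level rows.

    The first output row is the query pair (sequence A followed by sequence B).
    Every other row is an aligned hit for A followed by its linked hit for B.
    """
    len_a, len_b = len(query_a), len(query_b)
    rows = [query_a + query_b]
    for row_a, row_b in pairs:
        # Aligned rows must have the same number of columns as their query.
        if len(row_a) != len_a or len(row_b) != len_b:
            raise ValueError("aligned rows must match the query lengths")
        rows.append(row_a + row_b)
    return rows
```

Which rows get paired is decided separately by each of the three strategies (genome distance, species, interaction network); this helper only performs the concatenation.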
Compared with the prior art, the invention has the following notable advantages. (1) The depth of the multiple sequence alignment is improved: first, depth here refers to the depth of the strategy hierarchy, rather than to alignment with a single search algorithm or database; second, the alignment methods of the different levels are not combined mechanically, but are invoked according to the number of effective sequences in the alignment of the previous level, which optimizes alignment speed; and finally, redundant sequences are removed from the alignments of the different levels, so that the sequences merged from all levels are distinct. (2) The quality of the multiple sequence alignment is improved: different protein monomer databases are used and three different strategies are adopted to join the monomer alignments of the protein complex, which avoids the situation in which a single strategy cannot connect the two monomer sequences and construction of the complex alignment fails. The three search-and-join strategies therefore guarantee an alignment result, widen the range of databases used, and improve alignment quality. (3) The generalization capability of the model is improved: three different strategies for joining monomer alignments (based on gene distance, species, and protein interaction network) are used, so that a multiple sequence alignment can be generated for any query protein complex sequence.
Drawings
FIG. 1 is a schematic diagram of the deep multiple sequence alignment method of protein complexes of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
The figure shows the system structure of the multiple sequence alignment method of the invention. With reference to the figure, according to an embodiment of the invention, a deep multiple sequence alignment method for protein complexes comprises the following steps. First, a protein monomer sequence database is constructed, the database is searched with multiple sequence alignment software to obtain the monomer multiple sequence alignments, and the protein monomers are then connected according to gene distance information. Next, the multiple sequence alignments of the protein monomer sequences are matched against the species classification database (Taxonomy), and the two different monomer alignments are joined according to the species matching result. Next, the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) are constructed; HHblits is used to search this database with each protein monomer sequence, giving the monomer multiple sequence alignments, and the alignments of the two different protein monomer sequences are joined according to the protein interaction information. Then, the number of effective sequences obtained by each strategy determines whether the multiple sequence alignment of the next-level strategy is constructed. After that, the protein complex multiple sequence alignment generated in the above steps is made non-redundant, so that the similarity between any two sequences in the resulting alignment is less than 90%.
Finally, for a given protein complex sequence to be predicted, the method of steps 1 to 5 is used to generate the corresponding multiple sequence alignment, and the result is returned to the user as a web page or by e-mail for the convenience of researchers. The foregoing process is described in more detail below with reference to the accompanying drawings.
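Each strategy starts from an HHblits search of a sequence database with one monomer sequence. As a rough illustration, the sketch below only builds the command line. The options `-i`, `-d`, `-oa3m`, `-n`, `-e` and `-cpu` are standard HHblits options, but the file names and the parameter values are assumptions, since the patent text does not state the search settings.

```python
from typing import List


def hhblits_command(query_fasta: str, database_prefix: str, out_a3m: str,
                    iterations: int = 3, e_value: float = 1e-3,
                    cpus: int = 4) -> List[str]:
    """Build an HHblits command line for one monomer search.

    The default values are illustrative assumptions, not settings from the patent.
    """
    return [
        "hhblits",
        "-i", query_fasta,       # query sequence (FASTA) or alignment
        "-d", database_prefix,   # database prefix, e.g. a Uniclust30 build
        "-oa3m", out_a3m,        # write the resulting alignment in A3M format
        "-n", str(iterations),   # number of search iterations
        "-e", str(e_value),      # E-value cutoff for inclusion
        "-cpu", str(cpus),       # number of CPU threads
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once for sequence A and once for sequence B, which keeps the two monomer searches independent.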
Step 1, constructing a protein monomer sequence database and a genome distance search algorithm:
Given a protein complex containing two monomer sequences, sequence A and sequence B, a multiple sequence alignment search algorithm is used to search a protein database with sequence A and with sequence B separately, and the strategy 1 multiple sequence alignment is then constructed from gene distance information. The specific steps are as follows:
(1) downloading the Uniclust30 protein monomer sequence database from the whole-genome protein monomer sequence data (https://uniclust.mmseqs.com/);
(2) using the multiple sequence alignment software HHblits to search the protein sequence database Uniclust30 with sequence A and with sequence B separately, giving the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B, respectively;
(3) matching the multiple sequence alignments MSA_A and MSA_B against the genome database to obtain their gene information, MSA_A_gene and MSA_B_gene, respectively;
(4) calculating the gene distance Δgene between two proteins i and j having the same gene in MSA_A_gene and MSA_B_gene, and, if 1 ≤ Δgene ≤ 20, connecting protein i with protein j;
(5) constructing the gene-distance-based protein complex multiple sequence alignment (MSA) according to steps (1) to (4).
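The linking rule in sub-step (4) can be sketched as follows. This is an illustration under assumptions: each hit is taken to carry a genome identifier and an integer gene position, Δgene is taken to be the absolute difference of positions within the same genome, and the names `GeneHit` and `pair_by_gene_distance` are hypothetical. Only the bounds 1 and 20 come from the text.

```python
from typing import Dict, List, NamedTuple, Tuple


class GeneHit(NamedTuple):
    protein_id: str
    genome_id: str    # assumed: identifier of the genome the gene lies on
    gene_index: int   # assumed: ordinal position of the gene on that genome


def pair_by_gene_distance(hits_a: List[GeneHit], hits_b: List[GeneHit],
                          min_dist: int = 1,
                          max_dist: int = 20) -> List[Tuple[str, str]]:
    """Link hits of A and B whose genes lie 1 to 20 positions apart."""
    # Index the hits of B by genome so only same-genome pairs are compared.
    by_genome: Dict[str, List[GeneHit]] = {}
    for hit in hits_b:
        by_genome.setdefault(hit.genome_id, []).append(hit)

    pairs = []
    for a in hits_a:
        for b in by_genome.get(a.genome_id, []):
            delta = abs(a.gene_index - b.gene_index)
            if min_dist <= delta <= max_dist:
                pairs.append((a.protein_id, b.protein_id))
    return pairs
```

A lower bound of 1 excludes linking a gene to itself, while the upper bound of 20 restricts links to genes that are close together on the genome.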
Step 2, building a monomer sequence database and species similarity search algorithm
(1) downloading the species classification database (Taxonomy) from the public database of the National Center for Biotechnology Information (NCBI);
(2) matching the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B from step 1 against the species classification database (Taxonomy) to obtain the species of each protein in MSA_A and MSA_B, respectively;
(3) within each species of MSA_A and of MSA_B, ranking the proteins by similarity to the query sequence from high to low;
(4) letting $P_1, P_2, \ldots, P_m$ be the proteins of a particular species in MSA_A, ordered by sequence similarity, and $Q_1, Q_2, \ldots, Q_n$ be the proteins of the same species in MSA_B, ordered by sequence similarity, and then connecting $P_i$ with $Q_i$ for $i = 1, 2, \ldots, \min(m, n)$.
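The rank-based pairing in sub-steps (3) and (4) can be sketched as below. The record type `SpeciesHit` and the function name are hypothetical, and the species label is assumed to be any hashable identifier such as an NCBI taxonomy ID stored as a string.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class SpeciesHit(NamedTuple):
    protein_id: str
    species: str        # assumed: species label, e.g. a taxonomy ID
    similarity: float   # similarity of the hit to its query sequence


def pair_by_species(hits_a: List[SpeciesHit],
                    hits_b: List[SpeciesHit]) -> List[Tuple[str, str]]:
    """Pair the i-th best hit of A with the i-th best hit of B per species."""

    def group(hits: List[SpeciesHit]) -> Dict[str, List[SpeciesHit]]:
        grouped: Dict[str, List[SpeciesHit]] = defaultdict(list)
        for hit in hits:
            grouped[hit.species].append(hit)
        for members in grouped.values():
            # Rank from most to least similar to the query.
            members.sort(key=lambda h: h.similarity, reverse=True)
        return grouped

    group_a, group_b = group(hits_a), group(hits_b)
    pairs = []
    for species in sorted(group_a.keys() & group_b.keys()):
        # zip stops at the shorter list, i.e. after min(m, n) pairs.
        for p, q in zip(group_a[species], group_b[species]):
            pairs.append((p.protein_id, q.protein_id))
    return pairs
```

Species present in only one of the two alignments contribute no pairs, which is one reason this strategy alone may not produce enough sequences.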
Step 3, constructing a protein interaction network database and a protein interaction search algorithm
(1) downloading the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) from the public protein interaction network database (STRING);
(2) using the multiple sequence alignment software HHblits to search the protein interaction sequence information database with sequence A and with sequence B separately, giving the multiple sequence alignments MSA_stringA and MSA_stringB, respectively;
(3) finally, determining from the protein interaction information whether any two proteins, one from MSA_stringA and one from MSA_stringB, interact, and connecting them if they do.
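The interaction check in sub-step (3) amounts to a set lookup. The sketch below assumes the interaction information has already been loaded as pairs of protein identifiers. The function name is hypothetical, and parsing of the STRING download is not shown.

```python
from typing import Iterable, List, Tuple


def pair_by_interaction(hits_a: List[str], hits_b: List[str],
                        interactions: Iterable[Tuple[str, str]]
                        ) -> List[Tuple[str, str]]:
    """Link every hit of A to every hit of B that it is recorded to interact with."""
    # Store each interaction as an unordered pair so direction does not matter.
    linked = {frozenset(pair) for pair in interactions}
    return [(a, b) for a in hits_a for b in hits_b
            if frozenset((a, b)) in linked]
```

Treating interactions as unordered is an assumption made here for simplicity; a scored network could additionally be filtered by a confidence cutoff before the lookup.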
Step 4, selecting a protein complex multiple sequence alignment method
The number of effective sequences in the multiple sequence alignment is calculated as follows:
[Equation images BDA0002319032920000061, BDA0002319032920000062 and BDA0002319032920000063 are not reproduced in the text.]
where L is the chain length of the protein complex, N is the number of sequences in the protein complex multiple sequence alignment (MSA), $S_{iA,jA}$ is the sequence similarity score between chain A of sequence i and chain A of sequence j, and $S_{iB,jB}$ is the sequence similarity score between chain B of sequence i and chain B of sequence j. In addition, the optimized Necs value is 128. That is, if Necs is greater than or equal to 176, the alignment of the next strategy is not performed; otherwise, it is.
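The level-by-level selection of step 4 can be sketched as a cascade that stops as soon as the alignment is deep enough. Because the formulas for the number of effective sequences are not reproduced in the text, the sketch takes the counting function as a hypothetical parameter `count_effective`, and because the text mentions both 128 and 176, the cutoff is left as a parameter too.

```python
from typing import Callable, List, Sequence


def select_alignment(msa_gene: Sequence[str], msa_species: Sequence[str],
                     msa_string: Sequence[str],
                     count_effective: Callable[[List[str]], float],
                     threshold: float) -> List[str]:
    """Merge strategy results level by level until the alignment is deep enough."""
    # Level 1: genome-distance-based alignment alone.
    merged = list(msa_gene)
    if count_effective(merged) >= threshold:
        return merged
    # Level 2: add the species-based alignment.
    merged += list(msa_species)
    if count_effective(merged) >= threshold:
        return merged
    # Level 3: add the interaction-network-based alignment unconditionally.
    return merged + list(msa_string)
```

The output of this cascade is what the redundancy removal step receives; duplicates introduced by merging are expected to be handled there.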
The duplicates generated in the above step are clustered using the protein structure clustering software SPICKER, and the mean of the atomic coordinates over all conformations in each cluster is calculated. The resulting mean atomic coordinates are used as the atomic coordinates of the cluster-center conformation.
Step 5, removing the redundancy of multi-sequence alignment
The protein complex multiple sequence alignment generated in step 4 is made non-redundant, so that the similarity between any two sequences in the resulting alignment is less than 90%.
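One simple way to meet the 90% condition is a greedy filter. The sketch below is not taken from the patent: it assumes that similarity means the fraction of identical residues over columns where neither row has a gap, and that rows are processed in order with the query row first, so the query is always kept.

```python
from typing import List


def identity(row_x: str, row_y: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    compared = 0
    matches = 0
    for x, y in zip(row_x, row_y):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def remove_redundancy(rows: List[str], max_identity: float = 0.9) -> List[str]:
    """Keep a row only if it is less than 90% identical to every row already kept."""
    kept: List[str] = []
    for row in rows:
        if all(identity(row, other) < max_identity for other in kept):
            kept.append(row)
    return kept
```

The greedy pass is quadratic in the number of rows; for very deep alignments a dedicated filtering tool would normally be used instead.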
Step 6, on-line prediction
For a given protein complex sequence to be predicted, the method of steps 1 to 5 is used to generate the corresponding multiple sequence alignment of the protein complex, and the result is returned to the user as a web page or by e-mail for the convenience of researchers.
In summary, the invention first improves the depth of the multiple sequence alignment: depth here refers to the depth of the strategy hierarchy, rather than to alignment with a single search algorithm or database; the alignment methods of the different levels are not combined mechanically, but are invoked according to the number of effective sequences in the alignment of the previous level, which optimizes alignment speed; and redundant sequences are removed from the alignments of the different levels, so that the sequences merged from all levels are distinct. Second, different protein monomer databases are used and three different strategies are adopted to join the monomer alignments of the protein complex, which avoids the situation in which a single strategy cannot connect the two monomer sequences and construction of the complex alignment fails. The three search-and-join strategies therefore guarantee an alignment result, widen the range of databases used, and improve alignment quality. Finally, three different strategies for joining monomer alignments (based on gene distance, species, and protein interaction network) are used, so that a multiple sequence alignment can be generated for any query protein complex sequence. The method therefore improves the generalization capability of the model.

Claims (5)

1. A deep multiple sequence alignment method for protein complexes, characterized by comprising the following steps:
step 1, constructing a protein monomer sequence database and a genome distance search algorithm:
first, downloading the Uniclust30 protein monomer sequence database from the whole-genome protein monomer sequence data;
second, using the multiple sequence alignment software HHblits to search the protein sequence database Uniclust30 with each protein monomer sequence, giving the multiple sequence alignment of each protein monomer sequence;
third, matching the multiple sequence alignment of each protein monomer sequence against the genome database ENA;
finally, joining the multiple sequence alignments of the two different monomer sequences according to the distances between their entries in the genome database, thereby obtaining the genome-distance-based multiple sequence alignment of the protein complex;
step 2, constructing a protein monomer sequence database and a species similarity search algorithm:
first, downloading the species classification database Taxonomy from the public database of the National Center for Biotechnology Information NCBI;
second, using the multiple sequence alignment software HHblits to search the protein monomer sequence database Uniclust30 constructed in step 1 with each protein monomer sequence, giving the multiple sequence alignment of each protein monomer sequence;
third, matching the multiple sequence alignment of each protein monomer sequence against the species classification database Taxonomy;
finally, joining the two different monomer multiple sequence alignments according to the species matching result, thereby obtaining the species-based multiple sequence alignment of the protein complex;
step 3, constructing a protein interaction network database and a protein interaction search algorithm:
first, downloading the protein interaction information STRING linker and the protein interaction sequence information database STRING database from the public protein interaction network database STRING;
second, using the multiple sequence alignment software HHblits to search the protein interaction sequence information database constructed above with each protein monomer sequence, giving the multiple sequence alignment of each protein monomer sequence; finally, joining the multiple sequence alignments of the two different protein monomer sequences according to the protein interaction information, thereby obtaining the protein-interaction-network-based multiple sequence alignment of the protein complex sequence;
step 4, selecting the protein complex multiple sequence alignment:
first, calculating the number of effective sequences in the genome-distance-based protein complex multiple sequence alignment of step 1;
second, if the number of sequences in the multiple sequence alignment of step 1 meets the requirement, using the alignment of step 1 as the input to the redundancy removal step; otherwise, merging the multiple sequence alignment of step 1 with the species-based multiple sequence alignment of step 2 and calculating the number of effective sequences;
third, if the number of effective sequences in the merged alignment of step 1 and step 2 meets the requirement, using the merged alignment as the input to the redundancy removal step; otherwise, merging the multiple sequence alignments of step 1, step 2 and step 3, the last being based on the protein interaction network, as the input to the redundancy removal step;
step 5, removing multiple sequence alignment redundancy: removing redundancy from the protein complex multiple sequence alignment generated in step 4, so that the similarity between any two sequences in the resulting alignment is less than 90%.
2. The multiple sequence alignment method of claim 1, wherein, in step 1:
(1) the Uniclust30 protein monomer sequence database is downloaded from the whole-genome protein monomer sequence data;
(2) the multiple sequence alignment software HHblits is used to search the protein sequence database Uniclust30 with sequence A and with sequence B separately, giving the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B, respectively;
(3) the multiple sequence alignments MSA_A and MSA_B are matched against the genome database to obtain their gene information, MSA_A_gene and MSA_B_gene, respectively;
(4) the gene distance Δgene between two proteins i and j having the same gene in MSA_A_gene and MSA_B_gene is calculated, and, if 1 ≤ Δgene ≤ 20, protein i is connected with protein j;
(5) the gene-distance-based protein complex multiple sequence alignment MSA is constructed according to steps (1) to (4).
3. The multiple sequence alignment method of claim 2, wherein, in step 2:
(1) the species classification database Taxonomy is downloaded from the public database of the National Center for Biotechnology Information NCBI;
(2) the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B from step 1 are matched against the species classification database Taxonomy to obtain the species of each protein in MSA_A and MSA_B, respectively;
(3) within each species of MSA_A and of MSA_B, the proteins are ranked by similarity to the query sequence from high to low;
(4) letting $P_1, P_2, \ldots, P_m$ be the proteins of a particular species in MSA_A, ordered by sequence similarity, and $Q_1, Q_2, \ldots, Q_n$ be the proteins of the same species in MSA_B, ordered by sequence similarity, $P_i$ is connected with $Q_i$ for $i = 1, 2, \ldots, \min(m, n)$.
4. The method of multiple sequence alignment of claim 1, wherein, in step 3, the protein interaction information STRING linker and the protein interaction sequence information database STRING database are downloaded from the public protein interaction network database STRING;
(1) using the multiple sequence alignment software HHblits to search the protein interaction sequence information database with sequence A and sequence B, respectively, obtaining the multiple sequence alignment information MSA_stringA and MSA_stringB;
(2) judging, according to the protein interaction information, whether any two proteins in MSA_stringA and MSA_stringB interact, and connecting the two proteins if they do.
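Step (2) of this claim amounts to a lookup of candidate pairs in a set of known interactions. The sketch below assumes the interaction information has already been parsed into an iterable of protein-identifier pairs. The function name `pair_by_interaction` and the identifiers in the toy data are hypothetical, and no particular STRING file layout is implied.

```python
def pair_by_interaction(msa_string_a, msa_string_b, interactions):
    """msa_string_a / msa_string_b: protein identifiers found by the two
    searches. interactions: iterable of (protein_1, protein_2) pairs
    reported as interacting. Returns the connected (a, b) pairs."""
    # Store each interaction as an unordered pair so lookup is symmetric.
    known = {frozenset(pair) for pair in interactions}
    return [
        (a, b)
        for a in msa_string_a
        for b in msa_string_b
        if a != b and frozenset((a, b)) in known
    ]


# Toy data with made-up identifiers.
interactions = [("x.P1", "x.P7"), ("y.P3", "y.P2")]
msa_string_a = ["x.P1", "y.P2", "z.P9"]
msa_string_b = ["x.P7", "y.P3", "z.P4"]
pairs = pair_by_interaction(msa_string_a, msa_string_b, interactions)
```

Storing pairs as `frozenset` objects means an interaction listed as (B, A) is still found when queried as (A, B), which matters if the interaction table lists each pair only once.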
5. The method of multiple sequence alignment of claim 1, wherein, in step 4, the number of effective sequences in the multiple sequence alignment is calculated as:
Figure FDA0003756176130000031
Figure FDA0003756176130000032
Figure FDA0003756176130000033
where L is the chain length of the protein complex, N is the number of sequences in the protein complex multiple sequence alignment MSA, S_{iA,jA} is the sequence similarity score of chain A in sequence i with chain A in sequence j, and S_{iB,jB} is the sequence similarity score of chain B in sequence i with chain B in sequence j;
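The claimed formulas are present only as images and are not reproduced in this text, so the sketch below does not restate them. It shows one widely used convention for counting effective sequences, applied to the symbols defined above: each sequence is down-weighted by the number of other sequences similar to it, and the sum is normalized by the square root of L. The 0.8 threshold, the requirement that both chains exceed it, the square-root normalization, and the function name `effective_sequences` are all assumptions made for illustration.

```python
import math


def effective_sequences(sim_a, sim_b, chain_length, threshold=0.8):
    """sim_a[i][j]: similarity score of chain A in sequence i vs sequence j.
    sim_b[i][j]: the same for chain B. chain_length: L of the complex.
    Returns an effective sequence count under the assumed convention."""
    n = len(sim_a)
    total = 0.0
    for i in range(n):
        # Count other sequences that resemble sequence i on both chains.
        neighbours = sum(
            1 for j in range(n)
            if j != i
            and sim_a[i][j] >= threshold
            and sim_b[i][j] >= threshold
        )
        # A sequence with k near-duplicates contributes 1 / (1 + k).
        total += 1.0 / (1 + neighbours)
    return total / math.sqrt(chain_length)


# Toy data: sequences 0 and 1 are near-duplicates, sequence 2 is distinct.
sim_a = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
sim_b = [[1.0, 0.85, 0.3], [0.85, 1.0, 0.1], [0.3, 0.1, 1.0]]
neff = effective_sequences(sim_a, sim_b, chain_length=100)
```

In the toy data the two near-duplicates contribute one half each and the distinct sequence contributes one, so the weighted count is 2 before normalization by the square root of L.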
clustering the copies generated in this step using the protein structure clustering software SPICKER, and calculating the mean of the atomic coordinates over all conformations in each cluster; the resulting mean atomic coordinates are used as the atomic coordinates of the cluster-center conformation.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911290749.3A CN111145833B (en) 2019-12-16 2019-12-16 Deep multi-sequence alignment method for protein complex

Publications (2)

Publication Number Publication Date
CN111145833A (en) 2020-05-12
CN111145833B (en) 2022-09-20

Family

ID=70518302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911290749.3A Active CN111145833B (en) 2019-12-16 2019-12-16 Deep multi-sequence alignment method for protein complex

Country Status (1)

Country Link
CN (1) CN111145833B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant