CN111145833A - Deep multi-sequence alignment method for protein complex - Google Patents

Publication number
CN111145833A
Authority
CN
China
Prior art keywords
sequence
protein
database
msa
sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911290749.3A
Other languages
Chinese (zh)
Other versions
CN111145833B (en)
Inventor
於东军
刘子
朱一亨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology
Priority to CN201911290749.3A
Publication of CN111145833A
Application granted
Publication of CN111145833B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a deep multiple sequence alignment method for protein complexes, comprising the following steps: step 1, searching for multiple sequence alignments of the protein monomers and joining the two monomer alignments based on genome distance; step 2, searching for multiple sequence alignments of the protein monomers and joining the monomer alignments based on species matching; and step 3, searching for multiple sequence alignments of the protein monomers and joining the monomer alignments based on protein interaction network relationships. The method addresses a problem in predicting contact maps between protein complexes: a protein complex multiple sequence alignment cannot be constructed directly from the complex sequence information, so contact maps of protein complexes cannot be predicted and analyzed on a large scale. The method has the advantages of high prediction accuracy and strong generalization ability.

Description

Deep multi-sequence alignment method for protein complex
Technical Field
The invention relates to deep multiple sequence alignment of protein complexes in bioinformatics, and in particular to a method for identifying protein-family similarity relationships between protein monomer sequences and the proteins that constitute the complex sequence.
Background
Bioinformatics is a young discipline formed at the intersection of biology and information science. It is one of the major frontier fields of the life and natural sciences, and its research focuses mainly on genomics and proteomics. Bioinformatics research is important for deepening our understanding of human life processes and for improving living environments and quality of life, and it is highly valued by scholars at home and abroad.
Proteins are one of the material bases of life: they are essential components of all cell and tissue structures, participate in many important processes in living organisms, and are the main executors of life activities. Although deoxyribonucleic acid (DNA) is the carrier of genetic information, the replication, transcription, and expression of that information all depend on the cooperation of various proteins. Proteomics explains life phenomena more directly and accurately than genomics; it has developed rapidly in recent years and has drawn close attention from researchers worldwide. In the post-genome era, with the rapid development of protein sequencing technology, protein sequence data have grown explosively: the well-known protein database UniProtKB already holds more than 120,243,849 primary protein sequences (as of 2018-07-16), and rapid growth continues. Against this volume of sequence data, however, only about 0.1% (140,000) of the sequenced proteins have a solved three-dimensional structure, and only 0.3% of true protein complexes have been experimentally verified, had their three-dimensional structures solved, and been deposited in the well-known protein structure database PDB. This gap keeps widening as sequencing technology advances and matures.
A review of the literature shows that substantial results have been achieved in multiple sequence alignment of protein monomer sequences, with many papers of theoretical and practical value. Classical methods for protein monomer sequence alignment include BLAST (Kent, W. James. "BLAT-the BLAST-like alignment tool." Genome Research 12.4 (2002): 656-664.), PSI-BLAST (Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Research 25.17 (1997): 3389-3402.), HHblits (Remmert, Michael, et al. "HHblits: lightning-fast iterative protein sequence search by HMM-HMM alignment." Nature Methods 9.2 (2012): 173-175.), and one further method [citation illegible, 2016]. However, current work has mainly investigated how to improve the quality of protein monomer multiple sequence alignments through further analysis, whereas protein complex multiple sequence alignments are built by simply following the monomer alignment methods, as in ComplexContact (Zeng, Hong, et al. "ComplexContact: a web server for inter-protein contact prediction using deep learning." Nucleic Acids Research 46.W1 (2018): W432-W437.), GREEN [citation illegible, 2014], and EVcomplex (Hopf, Thomas A., et al. "Sequence co-evolution gives 3D contacts and structures of protein complexes." eLife 3 (2014): e03430.).
Although these studies can serve as protein complex multiple sequence alignment methods, challenges remain. First, these methods focus on predicting the secondary structure of the protein complex, so the accuracy of their multiple sequence alignments is not high. Second, they build the protein complex multiple sequence alignment with a single strategy, so the resulting alignment often contains only the query sequence and alignment accuracy is poor. In addition, they mechanically combine different databases into one protein complex database, so alignment results from different databases interfere with each other, which limits any improvement in alignment accuracy.
Disclosure of Invention
The invention aims to provide a protein complex multiple sequence alignment method that yields high-quality, deep alignments drawn from a wide range of sequence sources and that generalizes well, in order to overcome the low quality of protein complex multiple sequence alignments caused by reliance on a single database and a shallow search.
The technical solution for realizing the purpose of the invention is as follows: a protein complex multiple sequence alignment method comprises the following steps:
step 1, constructing a protein monomer sequence database and a genome distance search algorithm: first, download the Uniclust30 protein monomer sequence database, which is derived from whole-genome protein monomer sequence data. Second, use the multiple sequence alignment software HHblits to search the Uniclust30 protein sequence database with each protein monomer sequence, obtaining a multiple sequence alignment for each monomer. Next, compare each monomer multiple sequence alignment against the genome database (ENA). Finally, join the multiple sequence alignments of the two different monomer sequences according to the distances between their matched entries in the genome database, thereby obtaining the genome-distance-based multiple sequence alignment of the protein complex;
step 2, constructing a protein monomer sequence database and a species similarity search algorithm: first, download the species classification database (Taxonomy) from the public database of the National Center for Biotechnology Information (NCBI). Second, use HHblits to search the Uniclust30 protein monomer sequence database constructed in step 1 with each protein monomer sequence, obtaining a multiple sequence alignment for each monomer. Next, match each monomer multiple sequence alignment against the species classification database (Taxonomy). Finally, join the two monomer multiple sequence alignments according to the species matching results, thereby obtaining the species-based multiple sequence alignment of the protein complex;
step 3, constructing a protein interaction network database and a protein interaction search algorithm: first, download the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) from the public protein interaction network database (STRING). Second, use HHblits to search the protein interaction sequence information database with each protein monomer sequence, obtaining a multiple sequence alignment for each monomer. Finally, join the multiple sequence alignments of the two different protein monomer sequences according to the protein interaction information, thereby obtaining the protein-interaction-network-based multiple sequence alignment of the protein complex sequence;
step 4, selecting a protein complex multiple sequence alignment method: first, calculate the number of effective sequences in the genome-distance-based protein complex multiple sequence alignment from step 1. Second, if that number meets the requirement, use the step 1 alignment as the input to the redundancy removal step; otherwise, merge the step 1 alignment with the species-based alignment from step 2 and calculate the number of effective sequences again. Third, if the number of effective sequences in the merged step 1 and step 2 alignment meets the requirement, use the merged result as the input to the redundancy removal step; otherwise, merge the alignments from steps 1 and 2 with the protein-interaction-network-based alignment from step 3 and use the result as the input to the redundancy removal step;
step 5, removing redundancy from the multiple sequence alignment: redundancy is removed from the protein complex multiple sequence alignment generated in step 4, so that the similarity of any two sequences in the resulting alignment is less than 90%;
step 6, online prediction: given a protein complex sequence to be predicted, the corresponding multiple sequence alignment is generated using the method of steps 1 to 5 and returned to the user as a web page or by email, for the convenience of researchers.
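Steps 1 to 3 each begin with an HHblits search for every monomer. As a rough illustration only, the sketch below builds the command line for one such search. The options `-i`, `-d`, `-oa3m`, `-n`, and `-cpu` are standard HHblits options; the file names, the database path, the helper name `hhblits_command`, and the default of three iterations are assumptions made for the example and are not specified by the patent.

```python
from typing import List


def hhblits_command(query_fasta: str, database_prefix: str, out_a3m: str,
                    iterations: int = 3, cpus: int = 4) -> List[str]:
    """Build the argument list for one HHblits search (hypothetical helper)."""
    return [
        "hhblits",
        "-i", query_fasta,       # query monomer sequence in FASTA format
        "-d", database_prefix,   # database prefix, e.g. a local Uniclust30 copy
        "-oa3m", out_a3m,        # write the resulting alignment in A3M format
        "-n", str(iterations),   # number of search iterations (assumed value)
        "-cpu", str(cpus),       # number of CPU threads (assumed value)
    ]


# Hypothetical file names: one search per monomer of the complex.
cmd_a = hhblits_command("seqA.fasta", "db/uniclust30", "MSA_A.a3m")
cmd_b = hhblits_command("seqB.fasta", "db/uniclust30", "MSA_B.a3m")

# To run the searches (requires HHblits and the database on disk):
#   import subprocess
#   subprocess.run(cmd_a, check=True)
#   subprocess.run(cmd_b, check=True)
```

Building the argument list separately from running it keeps the example usable on a machine without HHblits installed.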
Compared with the prior art, the invention has the following notable advantages. (1) Greater multiple sequence alignment depth: first, depth here refers to the depth of the strategy hierarchy rather than to alignment with a single search algorithm or database; second, the alignment methods at different levels are not mechanically combined, but are invoked according to the number of effective sequences in the alignment result of the previous level, which optimizes alignment speed; finally, redundant sequences are removed from the alignment results of the different levels so that the sequences merged from all levels remain distinct. (2) Higher multiple sequence alignment quality: different protein monomer databases are used and three different strategies are adopted to join the monomer alignments of the protein complex, which avoids the case where two protein monomer sequences cannot be joined under a single strategy and construction of the complex alignment fails. The three search-and-join strategies therefore ensure that an alignment result is obtained, widen the range of databases covered, and improve alignment quality. (3) Stronger generalization ability: because three different strategies for joining monomer multiple sequence alignments are used (based on gene distance, species, and the protein interaction network), a multiple sequence alignment can be generated for any query protein complex sequence, which improves the generalization ability of the model.
Drawings
FIG. 1 is a schematic diagram of the deep multiple sequence alignment method of protein complexes of the present invention.
Detailed Description
The invention will be further explained with reference to the drawings.
The figure shows the system structure of the multiple sequence alignment method of the invention. With reference to the drawing, according to an embodiment of the invention, a deep multiple sequence alignment method for protein complexes comprises the following steps. First, a protein monomer sequence database is constructed; the database is searched with multiple sequence alignment software to obtain the monomer alignments, and the protein monomers are then connected according to gene distance information. Next, the monomer alignments are matched against a species classification database (Taxonomy), and the two monomer alignments are connected according to the species matching results. Next, the protein interaction information (STRING linker) and a protein interaction sequence information database (STRING database) are constructed; HHblits is used to search this database with each protein monomer sequence to obtain the monomer alignments, and the alignments of the two protein monomer sequences are connected according to the protein interaction information. Then, the number of effective sequences obtained under each strategy determines whether the alignment of the next-level strategy is constructed. The protein complex multiple sequence alignment generated in the above steps is then made non-redundant, so that the similarity of any two sequences in the resulting alignment is less than 90%.
Finally, given a protein complex sequence to be predicted, the method of steps 1 to 5 is used to generate the corresponding multiple sequence alignment, which is returned to the user as a web page or by email for the convenience of researchers. The process is described in more detail below with reference to the drawing.
Step 1, constructing a protein monomer sequence database and a genome distance search algorithm:
Given a protein complex containing two monomer sequences, sequence A and sequence B, a multiple sequence alignment search algorithm is used to search a protein database with sequence A and sequence B separately, and the strategy 1 multiple sequence alignment is then constructed from gene distance information. The specific steps are as follows:
(1) downloading the Uniclust30 protein monomer sequence database, derived from whole-genome protein monomer sequence data (https://uniclust.mmseqs.com/);
(2) using the multiple sequence alignment software HHblits to search the protein sequence database Uniclust30 with sequence A and with sequence B, obtaining the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B respectively;
(3) comparing the multiple sequence alignments MSA_A and MSA_B against a genome database to obtain the gene information MSA_A_gene and MSA_B_gene of the two alignments;
(4) calculating the gene distance Δgene of two proteins i and j with the same gene in MSA_A_gene and MSA_B_gene, and connecting protein i with protein j if 1 ≤ Δgene ≤ 20;
(5) constructing the gene-distance-based protein complex multiple sequence alignment (MSA) according to steps (1) to (4).
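The pairing rule of sub-step (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each hit is assumed to have already been mapped to a genome identifier and an ordinal gene position, the condition "with the same gene" is read as "located on the same genome", and the gene distance is taken as the absolute difference of gene positions. The function name `pair_by_gene_distance` and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: hit id -> (genome id, ordinal position of the gene).
GeneInfo = Dict[str, Tuple[str, int]]


def pair_by_gene_distance(
    msa_a_gene: GeneInfo,
    msa_b_gene: GeneInfo,
    min_dist: int = 1,
    max_dist: int = 20,
) -> List[Tuple[str, str]]:
    """Return (hit_a, hit_b) pairs that lie on the same genome and whose
    gene distance satisfies min_dist <= distance <= max_dist."""
    # Index the B hits by genome so that each A hit is compared only
    # with candidates from its own genome.
    b_by_genome: Dict[str, List[Tuple[str, int]]] = {}
    for hit_b, (genome, pos) in msa_b_gene.items():
        b_by_genome.setdefault(genome, []).append((hit_b, pos))

    pairs: List[Tuple[str, str]] = []
    for hit_a, (genome, pos_a) in msa_a_gene.items():
        for hit_b, pos_b in b_by_genome.get(genome, []):
            if min_dist <= abs(pos_a - pos_b) <= max_dist:
                pairs.append((hit_a, hit_b))
    return pairs


# Toy data: only A1/B1 share a genome at a distance between 1 and 20.
msa_a_gene = {"A1": ("genome1", 10), "A2": ("genome2", 5)}
msa_b_gene = {"B1": ("genome1", 12), "B2": ("genome2", 40), "B3": ("genome1", 10)}
pairs = pair_by_gene_distance(msa_a_gene, msa_b_gene)
```

In the toy data, A1 and B3 sit at distance 0 and A2 and B2 at distance 35, so neither pair is connected.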
Step 2, building a monomer sequence database and species similarity search algorithm
(1) Downloading the species classification database (Taxonomy) from the public database of the National Center for Biotechnology Information (NCBI);
(2) matching the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B from step 1 against the species classification database (Taxonomy) to obtain the species information of the proteins in MSA_A and MSA_B;
(3) within each species in MSA_A and in MSA_B, ranking the proteins by similarity to the query sequence from high to low;
(4) let $P_1, P_2, \dots, P_m$ be the proteins of a particular species in MSA_A ordered by sequence similarity, and $Q_1, Q_2, \dots, Q_n$ the proteins of that species in MSA_B ordered by sequence similarity. Then $P_i$ and $Q_i$ are connected for $i = 1, \dots, \min(m, n)$.
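The rank-based pairing of sub-steps (3) and (4) can be sketched as below. The sketch reads the rule as joining the i-th ranked protein of a species in MSA_A with the i-th ranked protein of the same species in MSA_B for every rank up to min(m, n). The record layout (hit id, species id, similarity) and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical hit record: (hit id, species id, similarity to the query).
Hit = Tuple[str, str, float]


def rank_by_species(hits: List[Hit]) -> Dict[str, List[str]]:
    """Group hits by species and sort each group by similarity, highest first."""
    groups: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for hit_id, species, similarity in hits:
        groups[species].append((similarity, hit_id))
    return {
        species: [hit_id for _, hit_id in sorted(members, key=lambda m: -m[0])]
        for species, members in groups.items()
    }


def pair_by_species(msa_a: List[Hit], msa_b: List[Hit]) -> List[Tuple[str, str]]:
    """Join equally ranked hits of the same species from the two alignments."""
    ranked_a = rank_by_species(msa_a)
    ranked_b = rank_by_species(msa_b)
    pairs: List[Tuple[str, str]] = []
    for species, p in ranked_a.items():
        q = ranked_b.get(species, [])
        # zip stops at the shorter list, i.e. at rank min(m, n).
        pairs.extend(zip(p, q))
    return pairs


# Toy data: species "9606" has two hits in A and one in B, so one pair results.
msa_a = [("P_x", "9606", 0.7), ("P_y", "9606", 0.9), ("P_z", "562", 0.8)]
msa_b = [("Q_x", "9606", 0.95), ("Q_z", "10090", 0.5)]
species_pairs = pair_by_species(msa_a, msa_b)
```

Species present in only one of the two alignments contribute no pairs, which is one reason a single strategy may fail to produce a deep enough alignment.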
Step 3, constructing a protein interaction network database and a protein interaction search algorithm
(1) Downloading the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) from the public protein interaction network database (STRING);
(2) using the multiple sequence alignment software HHblits to search the protein interaction sequence information database with sequence A and with sequence B, obtaining the multiple sequence alignments MSA_stringA and MSA_stringB respectively;
(3) finally, determining from the protein interaction information whether a protein in MSA_stringA and a protein in MSA_stringB interact, and connecting the two if they do.
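Sub-step (3) amounts to a lookup in a set of known interactions. A minimal sketch follows, assuming that the interaction information has been loaded as unordered pairs of protein identifiers and that the hits in MSA_stringA and MSA_stringB carry the same kind of identifier. The function name and the toy identifiers are hypothetical.

```python
from typing import Iterable, List, Tuple


def pair_by_interaction(
    hits_a: Iterable[str],
    hits_b: Iterable[str],
    interactions: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Connect a hit from MSA_stringA with a hit from MSA_stringB whenever
    the interaction table lists the two proteins as interacting."""
    # Interactions are undirected, so store each one as a frozenset.
    linked = {frozenset(pair) for pair in interactions}
    hits_b = list(hits_b)
    return [
        (a, b)
        for a in hits_a
        for b in hits_b
        if a != b and frozenset((a, b)) in linked
    ]


# Toy data: only s1 and s3 are listed as interacting.
hits_a = ["s1", "s2"]
hits_b = ["s3", "s4"]
interactions = [("s3", "s1"), ("s2", "s9")]
interaction_pairs = pair_by_interaction(hits_a, hits_b, interactions)
```

Storing each interaction as a frozenset makes the lookup independent of the order in which the two proteins appear in the table.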
Step 4, selecting a protein complex multiple sequence alignment method
The number of effective sequences (Necs) in the multiple sequence alignment is calculated as follows:
[Formula images BDA0002319032920000061 to BDA0002319032920000063, defining the number of effective sequences, are not reproduced in the text]
where L is the chain length of the protein complex, N is the number of sequences in the protein complex multiple sequence alignment (MSA), $S_{iA,jA}$ is the sequence similarity score of chain A in sequence i with chain A in sequence j, and $S_{iB,jB}$ is the sequence similarity score of chain B in sequence i with chain B in sequence j. In addition, the optimized Necs value is 128. That is, if Necs is greater than or equal to 176, the alignment of the next strategy is not carried out; otherwise, the alignment continues with the next strategy.
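The cascade described for step 4 can be sketched as a loop that merges the strategy results in order and stops as soon as the merged alignment is deep enough. Because the Necs formulas are not available in the text, the effective-sequence count is passed in as a function; the toy example uses the plain row count as a stand-in, which is not the patent's formula. The threshold is likewise left as a parameter, since the text mentions both 128 and 176. All names are hypothetical.

```python
from typing import Callable, List, Sequence

PairedMSA = List[str]  # each row: concatenated chain A + chain B sequence


def select_alignment(
    strategy_msas: Sequence[PairedMSA],
    effective_count: Callable[[PairedMSA], float],
    threshold: float,
) -> PairedMSA:
    """Merge strategy results in order (genome distance, species, interaction
    network) until the merged alignment reaches the threshold."""
    merged: PairedMSA = []
    for msa in strategy_msas:
        merged.extend(msa)
        if effective_count(merged) >= threshold:
            break  # deep enough: skip the remaining strategies
    return merged


# Toy data; len() is only a stand-in for the Necs computation.
genome_msa = ["AAAA", "AAAC"]
species_msa = ["AAGA"]
string_msa = ["ATTA", "CTTA"]
selected = select_alignment(
    [genome_msa, species_msa, string_msa], effective_count=len, threshold=3
)
```

Here the genome-based alignment alone has two rows, so the species-based rows are merged in; the merged alignment then reaches the threshold and the interaction-based rows are never added.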
The duplicates generated in the above step were clustered using the protein structure clustering algorithm software SPICKER, and the average of the atomic coordinates in all conformations in each class was calculated. The obtained atomic coordinate mean value is used as the atomic coordinate of the cluster center conformation.
Step 5, removing the redundancy of multi-sequence alignment
Redundancy is removed from the protein complex multiple sequence alignment generated in step 4, so that the similarity of any two sequences in the resulting alignment is less than 90%.
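One simple way to meet the "less than 90%" condition is a greedy filter that keeps a row only if it is sufficiently different from every row already kept. The sketch below assumes that similarity means the fraction of identical columns between two aligned rows of equal length; the patent does not specify the similarity measure or the filtering procedure, so both are illustrative choices.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of identical columns between two aligned rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def remove_redundancy(msa: List[str], max_identity: float = 0.9) -> List[str]:
    """Greedy filter: keep a row only if its identity to every row kept so
    far is strictly below max_identity. The first row (the query) is kept."""
    kept: List[str] = []
    for row in msa:
        if all(identity(row, other) < max_identity for other in kept):
            kept.append(row)
    return kept


# Toy alignment: rows 2 and 4 are exactly 90% identical to row 1 and are dropped.
toy_msa = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHAAA", "MCDEFGHIKL"]
filtered = remove_redundancy(toy_msa)
```

Processing rows in order keeps the query sequence and favors hits that appear earlier in the alignment; a quadratic scan like this is adequate for a sketch but would be slow on very deep alignments.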
Step 6, on-line prediction
For a given protein complex sequence to be predicted, the multiple sequence alignment of the protein complex is generated using the method of steps 1 to 5 and is returned to the user as a web page or by email, for the convenience of researchers.
In summary, the invention first improves the depth of multiple sequence alignment: depth refers to the depth of the strategy hierarchy rather than to alignment with a single search algorithm or database; the alignment methods at different levels are not mechanically combined, but are invoked according to the number of effective sequences in the alignment result of the previous level, which optimizes alignment speed; and redundant sequences are removed from the alignment results of the different levels so that the sequences merged from all levels remain distinct. Second, different protein monomer databases are used and three different strategies are adopted to join the monomer alignments of the protein complex, which avoids the case where two protein monomer sequences cannot be joined under a single strategy and construction of the complex alignment fails. The three search-and-join strategies therefore ensure that an alignment result is obtained, widen the range of databases covered, and improve alignment quality. Finally, because three different strategies for joining monomer multiple sequence alignments are used (based on gene distance, species, and the protein interaction network), a multiple sequence alignment can be generated for any query protein complex sequence, which improves the generalization ability of the model.

Claims (5)

1. A method for deep multi-sequence alignment of protein complexes is characterized by comprising the following steps:
step 1, constructing a protein monomer sequence database and a genome distance search algorithm:
firstly, downloading the Uniclust30 protein monomer sequence database, derived from whole-genome protein monomer sequence data;
secondly, using the multiple sequence alignment software HHblits to search the protein sequence database Uniclust30 with each protein monomer sequence to obtain the multiple sequence alignment of each protein monomer sequence;
thirdly, respectively comparing the multi-sequence comparison result of the protein monomer sequence with a genome database ENA;
finally, connecting the multiple sequence alignments of two different monomer sequences according to the distances between the multiple sequence alignments of the two different monomer sequences and the alignments of the genome database species, thereby obtaining the multiple sequence alignments of the protein complex based on the genome distances;
step 2, constructing a protein monomer sequence database and species similarity search algorithm:
firstly, downloading the species classification database Taxonomy from the public database of the National Center for Biotechnology Information (NCBI);
secondly, using the multiple sequence alignment software HHblits to search the protein monomer sequence database Uniclust30 constructed in step 1 with each protein monomer sequence to obtain the multiple sequence alignment of each protein monomer sequence;
thirdly, matching the multiple sequence alignment of each protein monomer sequence against the species classification database Taxonomy;
finally, two different monomer multiple sequences are compared and connected according to the species comparison result, so that the species-based multiple sequence comparison result of the protein complex is obtained;
step 3, constructing a protein interaction network database and a protein interaction search algorithm:
firstly, downloading a protein interaction information STRING linker and a protein interaction sequence information database STRING database from a public database protein interaction network database STRING;
secondly, using multi-sequence comparison software HHblits software to respectively search the protein monomer sequences for multi-sequence comparison of the protein interaction sequence information database constructed in the above step to obtain multi-sequence comparison information of the protein monomer sequences; finally, comparing and connecting the multiple sequences of two different protein monomer sequences according to the protein interaction information, thereby obtaining a multiple sequence comparison result of the protein complex sequence based on the protein interaction network;
step 4, selecting a protein complex multiple sequence alignment method:
firstly, calculating the number of effective sequences in the protein complex multi-sequence alignment based on the genome distance in the step 1;
secondly, if the number of sequences in the multiple sequence alignment in the step 1 meets the requirement, the alignment of the sequence in the step 1 is used as the input of the step of removing redundant sequences, otherwise, the multiple sequence alignment in the step 1 and the multiple sequence alignment based on species types in the step 2 are merged, and the number of effective sequences is calculated;
thirdly, if the number of the combined effective sequences in the step 1 and the step 2 meets the condition, taking the combined result as the input of the step of removing the redundant sequences, or else, comparing and combining the multiple sequences based on the protein interaction network in the step 1, the step 2 and the step 3 as the input of the step of removing the redundant sequences;
step 5, removing redundancy from the multiple sequence alignment: redundancy is removed from the protein complex multiple sequence alignment generated in step 4, so that the similarity of any two sequences in the resulting alignment is less than 90%.
2. The method of multiple sequence alignment of claim 1, wherein: in step 1,
(1) downloading the Uniclust30 protein monomer sequence database, derived from whole-genome protein monomer sequence data;
(2) using the multiple sequence alignment software HHblits to search the protein sequence database Uniclust30 with sequence A and with sequence B, so as to obtain the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B respectively;
(3) comparing the multiple sequence alignments MSA_A and MSA_B against a genome database to obtain the gene information MSA_A_gene and MSA_B_gene of the two alignments;
(4) calculating the gene distance Δgene of two proteins i and j with the same gene in MSA_A_gene and MSA_B_gene, and connecting protein i with protein j if 1 ≤ Δgene ≤ 20;
(5) constructing the gene-distance-based protein complex multiple sequence alignment MSA according to steps (1) to (4).
3. The multiple sequence alignment method of claim 1, wherein in step 2:
(1) the species classification database Taxonomy is downloaded from the public database of the National Center for Biotechnology Information (NCBI);
(2) the multiple sequence alignments MSA_A and MSA_B of sequence A and sequence B from step 1 are each compared against the species classification database Taxonomy to obtain the species information of the proteins in MSA_A and MSA_B;
(3) within each species in MSA_A and MSA_B respectively, the proteins are ranked by similarity to the query sequence from high to low;
let P_1, P_2, …, P_m be the proteins of a particular species in MSA_A ordered by sequence similarity, and Q_1, Q_2, …, Q_n be the proteins of a particular species in MSA_B ordered by sequence similarity; then P_i and Q_i are connected, where i = min(m, n).
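The species-based pairing of claim 3 might look like the sketch below. It assumes each hit is a `(seq_id, species, similarity)` tuple and reads the condition on i as pairing the ranked proteins P_i and Q_i for ranks 1 through min(m, n) within the same species; both the data layout and that reading are assumptions, not statements from the claim.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SpeciesHit = Tuple[str, str, float]  # (seq_id, species, similarity to query)


def rank_by_species(hits: List[SpeciesHit]) -> Dict[str, List[str]]:
    """Group hits by species and sort each group by similarity, high to low."""
    groups: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for seq_id, species, similarity in hits:
        groups[species].append((similarity, seq_id))
    return {
        species: [seq_id for _, seq_id in sorted(group, key=lambda t: -t[0])]
        for species, group in groups.items()
    }


def pair_by_species(hits_a: List[SpeciesHit],
                    hits_b: List[SpeciesHit]) -> List[Tuple[str, str]]:
    """Connect P_i with Q_i for every species present in both alignments."""
    ranked_a = rank_by_species(hits_a)
    ranked_b = rank_by_species(hits_b)
    pairs: List[Tuple[str, str]] = []
    for species in sorted(ranked_a.keys() & ranked_b.keys()):
        # zip stops at the shorter list, i.e. after min(m, n) pairs.
        pairs.extend(zip(ranked_a[species], ranked_b[species]))
    return pairs
```

Species that occur in only one of the two alignments contribute no pairs under this reading.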
4. The multiple sequence alignment method of claim 1, wherein in step 3, the protein interaction information (STRING linker) and the protein interaction sequence information database (STRING database) are downloaded from the public protein interaction network database STRING;
(1) the multiple sequence alignment software HHblits is used to search the protein interaction sequence information database with sequence A and sequence B respectively, yielding the multiple sequence alignments MSA_stringA and MSA_stringB;
(2) whether any two proteins in MSA_stringA and MSA_stringB interact is determined according to the protein interaction information, and the two proteins are connected if they interact.
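The interaction-based pairing of claim 4 reduces to a membership test against the interaction list. In this sketch the interaction information is assumed to be available as an iterable of `(protein_1, protein_2)` identifier pairs and is treated as undirected; the function name and data layout are hypothetical.

```python
from typing import Iterable, List, Set, Tuple, FrozenSet


def pair_by_interaction(ids_a: List[str], ids_b: List[str],
                        links: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Connect a protein from MSA_stringA with one from MSA_stringB
    whenever the interaction list contains that pair (in either order)."""
    interacting: Set[FrozenSet[str]] = {frozenset(link) for link in links}
    return [
        (a, b)
        for a in ids_a
        for b in ids_b
        if a != b and frozenset((a, b)) in interacting
    ]
```

Storing each link as a `frozenset` makes the lookup independent of the order in which the two proteins are listed.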
5. The multiple sequence alignment method of claim 1, wherein in step 4, the number of effective sequences in the multiple sequence alignment is calculated as:
Figure FDA0002319032910000031
Figure FDA0002319032910000032
Figure FDA0002319032910000033
where L is the chain length of the protein complex, N is the number of sequences in the protein complex multiple sequence alignment MSA, S_{iA,jA} is the sequence similarity score between chain A of sequence i and chain A of sequence j, and S_{iB,jB} is the sequence similarity score between chain B of sequence i and chain B of sequence j;
clustering the copies generated in the step using the protein structure clustering software SPICKER, and calculating the mean of the atomic coordinates over all conformations in each cluster; the resulting mean atomic coordinates are used as the atomic coordinates of the cluster-center conformation.
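The formulas for the number of effective sequences in claim 5 survive here only as figure references, so they cannot be reproduced. As a rough illustration of how such a count can be computed from L, N, S_{iA,jA} and S_{iB,jB}, the sketch below uses one commonly used form: each sequence is down-weighted by the number of near-identical neighbours and the sum is normalised by the square root of L. The 0.8 identity threshold, the rule that both chains must exceed it, and the normalisation are assumptions and may differ from the patented formulas.

```python
import math
from typing import List


def chain_identity(a: str, b: str) -> float:
    """Fraction of aligned columns with identical characters (gaps included)."""
    if len(a) != len(b) or not a:
        raise ValueError("rows must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def effective_count(chains_a: List[str], chains_b: List[str],
                    threshold: float = 0.8) -> float:
    """Assumed Neff: (1 / sqrt(L)) * sum_i 1 / (1 + number of neighbours of i),
    where j is a neighbour of i if both chain A and chain B are similar."""
    n = len(chains_a)  # N: number of paired rows in the complex alignment
    if n == 0 or n != len(chains_b):
        raise ValueError("need the same non-zero number of rows for A and B")
    total_length = len(chains_a[0]) + len(chains_b[0])  # L: complex length

    weight_sum = 0.0
    for i in range(n):
        neighbours = 0
        for j in range(n):
            if i == j:
                continue
            similar_a = chain_identity(chains_a[i], chains_a[j]) >= threshold
            similar_b = chain_identity(chains_b[i], chains_b[j]) >= threshold
            if similar_a and similar_b:
                neighbours += 1
        weight_sum += 1.0 / (1 + neighbours)
    return weight_sum / math.sqrt(total_length)
```

Under these assumptions, two identical rows together contribute the same weight as a single row, which is what makes the count "effective".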
CN201911290749.3A 2019-12-16 2019-12-16 Deep multi-sequence alignment method for protein complex Active CN111145833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911290749.3A CN111145833B (en) 2019-12-16 2019-12-16 Deep multi-sequence alignment method for protein complex

Publications (2)

Publication Number Publication Date
CN111145833A (en) 2020-05-12
CN111145833B (en) 2022-09-20

Family

ID=70518302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911290749.3A Active CN111145833B (en) 2019-12-16 2019-12-16 Deep multi-sequence alignment method for protein complex

Country Status (1)

Country Link
CN (1) CN111145833B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634988A (en) * 2021-01-07 2021-04-09 内江师范学院 Python language-based gene variation detection method and system
CN114300038A (en) * 2021-12-27 2022-04-08 山东师范大学 Multiple sequence alignment method and system based on improved biogeography-based optimization algorithm
CN116206675A (en) * 2022-09-05 2023-06-02 北京分子之心科技有限公司 Method, apparatus, medium and program product for predicting protein complex structure

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002067181A1 (en) * 2001-02-20 2002-08-29 Genmetrics, Inc. Methods for establishing a pathways database and performing pathway searches
CN106202998A (en) * 2016-07-05 2016-12-07 集美大学 Method for structural analysis of gene sequences in non-model organism transcriptomes
CN107103205A (en) * 2017-05-27 2017-08-29 湖北普罗金科技有限公司 Bioinformatics method for annotating eukaryotic genomes based on proteomic mass spectrometry data
CN109153980A (en) * 2015-10-22 2019-01-04 布罗德研究所有限公司 VI-B type CRISPR enzyme and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634988A (en) * 2021-01-07 2021-04-09 内江师范学院 Python language-based gene variation detection method and system
CN112634988B (en) * 2021-01-07 2021-10-08 内江师范学院 Python language-based gene variation detection method and system
CN114300038A (en) * 2021-12-27 2022-04-08 山东师范大学 Multiple sequence alignment method and system based on improved biogeography-based optimization algorithm
CN114300038B (en) * 2021-12-27 2023-09-29 山东师范大学 Multiple sequence alignment method and system based on improved biogeography-based optimization algorithm
CN116206675A (en) * 2022-09-05 2023-06-02 北京分子之心科技有限公司 Method, apparatus, medium and program product for predicting protein complex structure
CN116206675B (en) * 2022-09-05 2023-09-15 北京分子之心科技有限公司 Method, apparatus, medium and program product for predicting protein complex structure

Also Published As

Publication number Publication date
CN111145833B (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN111145833B (en) Deep multi-sequence alignment method for protein complex
Zhang et al. Searching genomes for noncoding RNA using FastR
Sun et al. Smolign: a spatial motifs-based protein multiple structural alignment method
Rangwala et al. Introduction to protein structure prediction
Olson et al. Enhancing sampling of the conformational space near the protein native state
Fang et al. Discover protein sequence signatures from protein-protein interaction data
JP2003515148A (en) Automated methods for identifying cognate biomolecular sequences
Nguyen et al. A knowledge-based multiple-sequence alignment algorithm
Wani et al. Position Specific Scoring Matrix and Synergistic Multiclass SVM for Identification of Genes
Bernardes et al. Structural descriptor database: a new tool for sequence-based functional site prediction
GB2356401A (en) Method for manipulating protein or DNA sequence data
Haritha et al. A Comprehensive Review on Protein Sequence Analysis Techniques
Kharsikar et al. A weighted k-nearest neighbor method for gene ontology based protein function prediction
Croce Towards a genome-scale coevolutionary analysis
Mariano et al. Bioinformatics and Computational Biology Research at the Computer Science Department at UFMG
Hu et al. Testing whether hot regions in protein-protein interactions are conserved in different species
Jiquan et al. Sequence Assembly Method Based on a Single Reference Genome
Chowdhury et al. An optimized approach for annotation of large eukaryotic genomic sequences using genetic algorithm
Chen et al. PseU-KeMRF: A novel method for identifying RNA pseudouridine sites
Mosig et al. APPLICATIONS NOTE
El Haji et al. A categorization of relevant sequence alignment algorithms with respect to data structures
Ko et al. The development of a proteomic analyzing pipeline to identify proteins with multiple RRMs and predict their domain boundaries
Askari Rad 3-Way Alignment can Improve Multiple Sequence Alignment of Highly Diverged Sequences
Johnson A Search for Protein Function Similarity using Protein Structure Alignment Quality Metrics
Bhat et al. PSSM amino-acid composition based rules for gene identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant