CN112185459A

CN112185459A - Prediction method for interaction of plant and pathogenic bacteria protein

Info

Publication number: CN112185459A
Application number: CN202011020892.3A
Authority: CN
Inventors: 张利达; 郑存俭; 刘源; 孙方楠
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2020-09-25
Filing date: 2020-09-25
Publication date: 2021-01-05

Abstract

The invention relates to a method for predicting protein interactions between plants and pathogenic bacteria, comprising the following steps: 1) collecting positive host-pathogen protein interaction data; 2) collecting the spatial structures of protein complex templates and analyzing the interaction interfaces of subunit pairs; 3) performing homology modeling on host and pathogen protein sequences to obtain homologous protein structure models; 4) comparing the homologous protein structure models with the protein complex template structures to obtain structural features; 5) extracting non-structural features; and 6) building a machine learning model based on the structural and non-structural features, testing and tuning it, and predicting rice-rice blast fungus protein interactions at the genome scale. Compared with the prior art, the invention makes full use of experimentally determined protein structure data together with homology and domain interaction information, and can extract interaction features relevant to plant-pathogen proteins effectively, quickly and simply.

Description

Prediction method for interaction of plant and pathogenic bacteria protein

Technical Field

The invention relates to the technical field of biological data processing, in particular to a prediction method for the interaction between plants and pathogenic bacteria proteins.

Background

Plant-pathogen interaction is a two-way process of biological communication. On the one hand, plants attempt to recognize molecules secreted by pathogens in order to avoid infection; on the other hand, pathogens manipulate plants as far as possible to make the host environment more favorable to themselves. As a result, many known methods for predicting intra-species protein interactions are unsuitable for plant-pathogen systems, and little research has focused on predicting plant-pathogen protein interactions.

Although experimental methods for detecting protein interactions have been developed, they are time-consuming and laborious and have accumulated little data, and most of the available data concern interactions between humans and pathogens (especially viruses). By contrast, protein interaction data for other hosts, especially plant-pathogen data, are very limited.

Although protein interactions are readily explained in terms of protein spatial structure, protein structures are complex and the number of proteins with known structures is limited. How to make full use of experimentally determined protein structure data to extract relevant interaction features has therefore become a key problem urgently awaiting solution in the study of plant-pathogen interactions.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a method for predicting the interaction between the plant and the pathogenic bacteria protein, which can effectively, quickly and simply extract the plant-pathogenic bacteria protein related interaction characteristic information by means of the measured protein spatial structure data and the information of homology, domain interaction and the like.

The purpose of the invention can be realized by the following technical scheme:

A method for predicting the interaction of a plant with a pathogen protein, comprising the steps of:

S1, collecting positive host-pathogen protein interaction data and genome data of rice and the rice blast fungus;

Positive host-pathogen protein interaction data are collected from the HPIDB database; the data must have been obtained by at least one experimental protein interaction detection method, such as yeast two-hybrid.

Genome data of rice are downloaded from the MSU database, and transposon genes are deleted. Genome data of the rice blast fungus are downloaded from the Ensembl Genomes database; transmembrane helix prediction is performed on the TMHMM website, and proteins with more than 0 predicted transmembrane helices are selected; signal peptide prediction is performed on the SignalP website and subcellular localization prediction on the WoLF PSORT website, and proteins that are predicted to have a signal peptide and to be localized outside the cell are taken as secreted proteins of the rice blast fungus. After duplicate proteins across these steps are removed, the rice blast fungus proteins that potentially interact with rice proteins are obtained.

S2, collecting the spatial structure of the protein complex template, and splitting the protein complex into different subunits to obtain the interaction interface of the subunit pair;

Experimentally determined protein three-dimensional structure data are acquired from the PDB protein structure database; the structures must have been determined by at least one experimental method among nuclear magnetic resonance, X-ray crystal diffraction and electron microscopy. After the three-dimensional structure data are obtained, each protein complex is split into its subunits, the structural data of the subunit pairs are read with PIBASE software, and the interaction interface information is extracted.

S3, taking the spatial structure of the protein complex template in step S2 as a template, and performing homology modeling on the host and pathogen protein sequences with MODPIPE to obtain homologous protein structure models;

S4, comparing the protein homologous spatial structure with the protein complex template spatial structure to obtain structural characteristics;

Further, the homologous protein structure models are compared with the protein complex template structures using TM-align software to obtain the structural features. The structural features comprise the similarity and the structural deviation between the homologous protein structure model and the protein complex, and the number and proportion of conserved residues at the interaction interface between the homologous protein structure model and the protein complex template structure.

S5, collecting protein interaction data of the model organisms, acquiring a positive interaction data set of the model organisms, and extracting non-structural features;

The cross-species conservation of plant-pathogen protein interactions is analyzed by homology mapping to obtain the protein homology mapping relationship, and a domain interaction data set is then used to obtain the interacting protein pairs supported by interacting domains, that is, the domain interaction relationship.

S6, building a machine learning model based on the structural features and the non-structural features, testing and adjusting it, and predicting rice-rice blast fungus protein interactions at the genome scale.

Specifically, sequence clustering and random combination are performed on the positive host-pathogen protein interaction data set obtained in step S1 to generate a certain number of negative samples; the positive and negative data sets are divided into a training set and a test set in a certain proportion; an initial machine learning model is built with the scikit-learn random forest from the structural and non-structural features of the training set; the parameters of the initial model are tested and tuned in batches with a grid search function; the optimized model is used to predict the relationship of every possible pairwise rice-rice blast fungus protein pair at the genome scale; and a rice-rice blast fungus protein interaction network is drawn with Cytoscape software from the prediction results.

Compared with the prior art, the method is based on the existing biological data, and can effectively, quickly and simply extract the plant-pathogenic bacteria protein related interaction characteristic information by means of the determined protein space structure data and the information of homology, structural domain interaction and the like, so as to obtain the plant-pathogenic bacteria protein interaction data and provide reference for the research of plant disease-resistant molecular mechanisms.

Drawings

FIG. 1 is a schematic flow chart of a method for predicting plant-pathogen protein interaction in the examples;

FIG. 2 is a rice-blast protein interaction network at the genomic scale in the examples.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

Examples

Computer prediction of protein interactions requires extraction of valuable features from large amounts of data using methods such as statistics, machine learning, and the like. With the exponential growth of biological data, the machine learning method can be applied to the analysis of biological data through improvement. The invention provides a prediction method of plant and pathogenic bacteria protein interaction based on a protein space structure, and a high-accuracy plant-pathogenic bacteria protein interaction network is constructed on a genome scale according to the prediction method.

Specifically, as shown in FIG. 1, the method for predicting protein interactions between plants and pathogenic bacteria comprises the following steps:

Step one, collecting positive host-pathogen protein interaction data and genome data of rice and the rice blast fungus

A positive Host-Pathogen protein Interaction dataset was collected from the HPIDB Database (Host-Pathogen Interaction Database). The data must be obtained by at least one experimental method in protein interaction detection means such as yeast two-hybrid detection.

When rice is infected by the rice blast fungus, the membrane proteins and secreted proteins of the fungus are the most likely to interact with rice proteins inside the plant. The invention obtains the rice blast fungus proteins that potentially interact with rice proteins based on the HPIDB database. Specifically, the genome data of rice were downloaded from the MSU database, and transposon genes were deleted. The genome data of the rice blast fungus were downloaded from the Ensembl Genomes database, and transmembrane helix prediction was performed on the TMHMM website; proteins with more than 0 predicted transmembrane helices were regarded as membrane proteins, 2317 in total. Signal peptide prediction was performed on the SignalP website and subcellular localization prediction on the WoLF PSORT website; proteins predicted to have a signal peptide and to be localized extracellularly were regarded as secreted proteins of the rice blast fungus, 1402 in total. After duplicates were removed, a total of 3491 rice blast fungus proteins potentially interacting with rice proteins were obtained.
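The screening rule in this step (membrane proteins plus secreted proteins, with duplicates removed) can be sketched as a set union. This is a minimal illustration, assuming the three web tools' outputs have already been parsed into one record per protein; the `Prediction` record, the function name and the extracellular label `"extr"` are assumptions made for this sketch, not part of the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Hypothetical per-protein record assembled from parsed tool outputs."""
    protein_id: str
    tm_helices: int           # predicted transmembrane helix count
    has_signal_peptide: bool  # signal peptide call
    localization: str         # subcellular localization label


def select_candidates(predictions, extracellular_label="extr"):
    """Return (membrane, secreted, union) sets of protein IDs.

    membrane: more than 0 predicted transmembrane helices.
    secreted: signal peptide present AND localized outside the cell.
    The union removes proteins that fall into both classes.
    """
    membrane = {p.protein_id for p in predictions if p.tm_helices > 0}
    secreted = {
        p.protein_id
        for p in predictions
        if p.has_signal_peptide and p.localization == extracellular_label
    }
    return membrane, secreted, membrane | secreted


# Toy records; the IDs and labels are invented for illustration only.
toy = [
    Prediction("P1", 3, False, "plas"),
    Prediction("P2", 0, True, "extr"),
    Prediction("P3", 1, True, "extr"),
    Prediction("P4", 0, False, "cyto"),
]
membrane, secreted, candidates = select_candidates(toy)
```

Because the union is a set operation, a protein predicted to be both a membrane protein and a secreted protein is counted once, which matches the duplicate-removal step described above.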

Step two, collecting the spatial structure of the protein complex template and analyzing the subunit interaction interface

Experimentally determined protein three-dimensional structure data are downloaded from the PDB protein structure database; the structures must have been determined by at least one experimental method among nuclear magnetic resonance, X-ray crystal diffraction and electron microscopy. Subunit interaction interface analysis means dividing each PDB protein complex into pairwise interacting protein subunit pairs: the complex is split into its subunits, and PIBASE software is used to read the structural data of each subunit pair and extract the interaction interface information.
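The idea of splitting a complex into subunit pairs and extracting an interface can be illustrated with a simple distance criterion. This is only a sketch: the patent relies on PIBASE and does not state its interface definition, so the 6.0 Å cutoff, the input layout (chain ID, then residue ID, then atom coordinates) and the function names are all assumptions.

```python
from itertools import combinations
from math import dist


def interface_residues(chain_a, chain_b, cutoff=6.0):
    """Residues of each chain having any atom within `cutoff` of the other chain.

    Each chain is a dict: residue ID -> list of (x, y, z) atom coordinates.
    The cutoff value is an assumption, not taken from the patent.
    """
    iface_a, iface_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            if any(dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                iface_a.add(res_a)
                iface_b.add(res_b)
    return iface_a, iface_b


def subunit_pair_interfaces(complex_chains, cutoff=6.0):
    """Split a complex into all chain pairs and keep the pairs that touch."""
    interfaces = {}
    for id_a, id_b in combinations(sorted(complex_chains), 2):
        iface_a, iface_b = interface_residues(
            complex_chains[id_a], complex_chains[id_b], cutoff
        )
        if iface_a:  # a non-empty interface means the subunits interact
            interfaces[(id_a, id_b)] = (iface_a, iface_b)
    return interfaces


# Toy complex with one atom per residue; coordinates are invented.
toy_complex = {
    "A": {1: [(0.0, 0.0, 0.0)], 2: [(20.0, 0.0, 0.0)]},
    "B": {5: [(3.0, 0.0, 0.0)]},
    "C": {9: [(50.0, 0.0, 0.0)]},
}
pairs = subunit_pair_interfaces(toy_complex)
```

In the toy complex only chains A and B are close enough to form an interface, so only that subunit pair is kept as a template.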

Step three, protein homologous structure modeling

Using the experimentally determined protein three-dimensional structure data from step two as templates, homology modeling is performed on the host and pathogen protein sequences with MODPIPE software to obtain spatial structure models of the host and pathogen proteins.

Taking the host-pathogen protein interaction data set of step one as an example, the protein sequences were downloaded from the UniProt database and subjected to homology modeling, using sequence-sequence, profile-sequence and profile-profile alignment. The quality of each homology model was evaluated with MPQS, a composite score comprising sequence similarity, template coverage and three independent evaluation scores: e-value, Z-DOPE and GA341. The e-value measures the significance of the alignment between the modeled protein and the template; Z-DOPE is based on DOPE (Discrete Optimized Protein Energy), an atomic distance-dependent statistical potential derived from a sample of native structures on the basis of probability theory, without any adjustable parameters; GA341 is a statistics-based model reliability score. By inspecting the probability distribution function of the scores, a threshold of MPQS ≥ 0.5 was set, and models meeting it were regarded as stable homologous structure models.

The homologous structure models were also scored by sequence length, in order to filter out models too short for the presence of an interaction interface to be judged. The score is MODSEQ-score = L-MOD / L-SEQ, where L-MOD is the length of the homology-modeled sequence and L-SEQ is the length of the corresponding gene sequence. Based on the probability density distribution function of MODSEQ-score, and balancing data quantity against data quality, the threshold was set to 30%, giving homology modeling results for 14628 proteins in total.
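The two model filters described above (the MPQS threshold of 0.5 and the 30% model-to-sequence length ratio) can be combined in a small predicate. This is a sketch under assumptions: the function names are illustrative, and treating the 30% threshold as inclusive is an assumption because the text does not say whether the boundary value passes.

```python
def modseq_score(model_length, sequence_length):
    """Length ratio L-MOD / L-SEQ: fraction of the sequence covered by the model."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return model_length / sequence_length


def keep_model(mpqs, model_length, sequence_length,
               mpqs_min=0.5, modseq_min=0.30):
    """Keep a homology model only if it passes both quality filters.

    mpqs_min follows the stated MPQS >= 0.5 threshold; the inclusive
    comparison against modseq_min is an assumption of this sketch.
    """
    return (mpqs >= mpqs_min
            and modseq_score(model_length, sequence_length) >= modseq_min)


# Toy values, invented for illustration.
examples = [
    keep_model(0.8, 120, 300),  # good score, ratio 0.40
    keep_model(0.4, 300, 300),  # full coverage but MPQS too low
    keep_model(0.9, 50, 400),   # good score but ratio 0.125
]
```

Applying both filters together reflects the trade-off stated in the text: the length threshold discards models too short to carry an interface, while the MPQS threshold discards unreliable models.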

Step four, superposing and comparing the homologous protein structure models with the complex template structures to obtain the structural features

TM-align software was used to compare the spatial structures of the host and pathogen homologous structure models with the complex templates. Taking the host-pathogen protein interaction data set of step one as an example, the TM-score was required to be greater than 0.4, which finally yielded structure comparison results for 10148 positive homologous templates against the complex subunits. The RMSD value, the TM-score value, the number of conserved residues at the interaction interface and the proportion of conserved residues between the protein homology model and the complex template were calculated as the structural features. Calculating these four values from the structure comparison results is prior art and is not described in detail here.
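The four structural features can be assembled as follows once a structural alignment is available. This is a sketch only: the patent treats the calculation as prior art and does not define "conserved", so counting a template interface residue as conserved when the aligned model residue has the same amino acid is an assumption, as are the input layout and function names. The TM-score and RMSD are taken as given, since they come from the alignment tool.

```python
def interface_conservation(template_interface, aligned_model):
    """Count template interface residues whose aligned model residue is identical.

    template_interface: template residue index -> one-letter amino acid
    aligned_model:      template residue index -> aligned model amino acid
                        (indices missing from this dict are unaligned)
    """
    conserved = sum(
        1 for pos, aa in template_interface.items()
        if aligned_model.get(pos) == aa
    )
    ratio = conserved / len(template_interface) if template_interface else 0.0
    return conserved, ratio


def structural_features(tm_score, rmsd, template_interface, aligned_model,
                        tm_min=0.4):
    """Return the four structural features, or None if TM-score <= tm_min."""
    if tm_score <= tm_min:
        return None
    conserved, ratio = interface_conservation(template_interface, aligned_model)
    return {
        "tm_score": tm_score,
        "rmsd": rmsd,
        "n_conserved": conserved,
        "conserved_ratio": ratio,
    }


# Toy alignment; residues and scores are invented for illustration.
template_iface = {10: "K", 11: "D", 12: "L", 13: "E"}
model_aligned = {10: "K", 11: "E", 12: "L"}  # position 13 is unaligned
features = structural_features(0.62, 2.1, template_iface, model_aligned)
```

In the toy case two of the four template interface residues are matched by identical model residues, so the count is 2 and the proportion is 0.5.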

Step five, analyzing and extracting non-structural features

Protein interaction data of 7 model organisms including arabidopsis, mice, nematodes, humans, escherichia coli, yeast and drosophila are collected from five public databases of BioGRID, IntAct, DIP, BIND and MINT, and a model organism positive interaction data set is obtained.

InParanoid software and BLAST software were used to analyze the orthologous relationships between the rice and rice blast fungus proteins obtained in step one and the proteomes of the 7 model organisms, yielding the first non-structural feature: the homology mapping relationship. From the InParanoid results, combined with the model organism positive data set, 5720 pairs of rice-rice blast fungus protein interactions supported by homology mapping were obtained. For BLAST, three parameters (e-value, sequence identity and sequence coverage) were adjusted and finally set to an e-value of 1e-5, a sequence identity of 45% and a sequence coverage of 50%, giving 5702 pairs of rice-rice blast fungus protein interactions.

Protein domain information was read with PfamScan and combined with the domain interaction data set collected from the 3did database to obtain the interacting protein pairs supported by interacting domains, yielding the second non-structural feature: the domain interaction relationship.
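The homology mapping feature can be sketched in two parts: filtering BLAST hits with the stated thresholds, and transferring known model-organism interactions to rice-pathogen pairs through homologs. This is a minimal illustration; the function names and input layout are assumptions, and treating each threshold as an inclusive bound is also an assumption.

```python
def passes_blast_filter(evalue, identity_pct, coverage_pct,
                        max_evalue=1e-5, min_identity=45.0, min_coverage=50.0):
    """Apply the three BLAST thresholds given in the text (inclusive bounds assumed)."""
    return (evalue <= max_evalue
            and identity_pct >= min_identity
            and coverage_pct >= min_coverage)


def homology_supported_pairs(rice_homologs, pathogen_homologs, model_interactions):
    """Rice-pathogen pairs whose homologs interact in a model organism.

    rice_homologs / pathogen_homologs: protein ID -> set of model-organism IDs
    model_interactions: set of frozenset({id_a, id_b}) positive interactions
    """
    supported = set()
    for rice_id, rice_homs in rice_homologs.items():
        for path_id, path_homs in pathogen_homologs.items():
            if any(frozenset((a, b)) in model_interactions
                   for a in rice_homs for b in path_homs):
                supported.add((rice_id, path_id))
    return supported


# Toy data; all identifiers are invented for illustration.
rice = {"R1": {"A1"}, "R2": {"A2"}}
fungus = {"F1": {"B1"}}
known = {frozenset(("A1", "B1"))}
supported = homology_supported_pairs(rice, fungus, known)
```

Storing each known interaction as a `frozenset` makes the lookup independent of the order in which the two partners are listed in the source database.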
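The domain interaction feature amounts to checking whether any domain of one protein is recorded as interacting with any domain of the other. The sketch below assumes the domain annotations and the domain-domain interaction set have already been parsed into Python sets; the function name and the placeholder domain identifiers are illustrative, not real Pfam accessions.

```python
def domain_supported(domains_a, domains_b, domain_interactions):
    """True if any domain of protein A is recorded as interacting with any domain of B.

    domains_a / domains_b: sets of domain identifiers for the two proteins
    domain_interactions:   set of frozenset({dom_x, dom_y}) known domain pairs
    """
    return any(
        frozenset((da, db)) in domain_interactions
        for da in domains_a
        for db in domains_b
    )


# Placeholder identifiers, invented for illustration.
ddi = {frozenset(("DOM1", "DOM2")), frozenset(("DOM3",))}  # DOM3 self-interacts
rice_domains = {"DOM1", "DOM7"}
fungus_domains = {"DOM2"}
is_supported = domain_supported(rice_domains, fungus_domains, ddi)
```

A self-interacting domain is represented by a one-element `frozenset`, so two proteins that both carry that domain are also reported as supported.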

Step six, construction and optimization of the machine learning model

Sequence clustering and random combination were applied to the host-pathogen protein interaction data set of step one to generate a certain number of negative samples, and the positive data set from step one and this negative data set were divided into a training set and a test set in a certain proportion. From the 4 structural features and 2 non-structural features of the training set, an initial machine learning model was built with the scikit-learn random forest. The parameters of the initial model were optimized and adjusted in batches with a grid search function, and the final parameters were: maximum number of iterations 60, maximum decision tree depth 13, minimum number of samples required to split an internal node 120, minimum number of samples per leaf node 20, maximum number of features 7, random seed 10, with all other parameters left at their defaults. The optimized model was used to predict the relationship of every possible pairwise rice-rice blast fungus protein pair at the genome scale with a screening threshold of 0.5; a rice-rice blast fungus protein interaction network was drawn from all the prediction results with Cytoscape software, and the visualization is shown in FIG. 2.
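The negative-set generation by random combination and the train/test split can be sketched with the standard library alone. This is an illustration under assumptions: the sequence clustering step and the classifier itself are not shown, the 20% test fraction stands in for the unspecified "certain proportion", and the function names are invented for this sketch. The seed of 10 simply mirrors the random seed mentioned in the text.

```python
import random


def sample_negatives(positives, n_negatives, seed=10):
    """Randomly combine hosts and pathogens into pairs that are not known positives."""
    hosts = sorted({host for host, _ in positives})
    pathogens = sorted({pathogen for _, pathogen in positives})
    positive_set = set(positives)
    candidates = [
        (h, p) for h in hosts for p in pathogens if (h, p) not in positive_set
    ]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_negatives, len(candidates)))


def split_train_test(labeled_pairs, test_fraction=0.2, seed=10):
    """Shuffle labeled pairs and split them; test_fraction is an assumed ratio."""
    shuffled = list(labeled_pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy positives; identifiers are invented for illustration.
positives = [("H1", "P1"), ("H2", "P2"), ("H3", "P3")]
negatives = sample_negatives(positives, n_negatives=3)
labeled = [(pair, 1) for pair in positives] + [(pair, 0) for pair in negatives]
train, test = split_train_test(labeled)
```

Sorting the host and pathogen identifiers before sampling keeps the result reproducible for a fixed seed, regardless of the order in which the positive pairs were loaded.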

Based on the existing biological data, the invention can effectively, quickly and simply extract the plant-pathogenic bacteria protein related interaction characteristic information by means of the determined protein space structure data and the information of homology, structural domain interaction and the like, thereby obtaining the plant-pathogenic bacteria protein interaction data and providing reference for the research of plant disease-resistant molecular mechanisms.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and those skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for predicting the protein interaction between a plant and a pathogen, comprising the steps of:

1) collecting host-pathogen protein interaction positive data;

2) collecting the spatial structure of the protein complex template, and splitting the protein complex into different subunits to obtain an interaction interface of a subunit pair;

3) taking the spatial structure of the protein complex template in the step 2) as a template, and carrying out homologous structure modeling on a host-pathogenic bacteria protein sequence to obtain a protein homologous spatial structure model;

4) comparing the protein homologous spatial structure with the protein complex template spatial structure to obtain structural characteristics;

5) collecting protein interaction data of the model organism, acquiring a positive interaction data set of the model organism, and extracting non-structural features;

6) building a machine learning model based on the structural features and the non-structural features, testing and adjusting the model, and predicting rice-rice blast fungus protein interactions at the genome scale.

2. The method according to claim 1, wherein in step 1), positive host-pathogen protein interaction data obtained by at least one experimental protein interaction detection method, such as yeast two-hybrid, are collected using the HPIDB database.

3. The method for predicting plant-pathogen protein interaction according to claim 1, wherein the step 2) comprises:

4. The method for predicting the protein interaction between plants and pathogenic bacteria according to claim 3, wherein in the step 3), the three-dimensional structure data of the protein experimentally measured in the step 2) is used as a template, and MODPIPE is used for carrying out homologous structure modeling on the protein sequence of the host-pathogenic bacteria to obtain a protein homologous spatial structure model.

5. The method according to claim 1, wherein the structural characteristics are obtained by comparing the spatial structure of homology of protein with the spatial structure of the template of protein complex in step 4) using TM-align software.

6. The method of claim 5, wherein the structural features include similarity of protein homology spatial structure to protein complex, structural deviation, and the number of conserved residues and the ratio of conserved residues at the interaction interface between protein homology spatial structure and protein complex template spatial structure.

7. The method of claim 1, wherein in step 5), the cross-species conservation of plant-pathogen protein interactions is analyzed using homology mapping to obtain a protein homology mapping relationship, and a domain interaction data set is combined to obtain the interacting protein pairs supported by interacting domains, that is, the domain interaction relationship.

8. The method for predicting plant-pathogen protein interaction according to claim 1, wherein the step 6) comprises:

carrying out sequence clustering and random combination on the positive host-pathogen protein interaction data set obtained in step 1) to generate a certain number of negative samples, dividing the positive data set and the negative data set into a training set and a test set in a certain proportion, building a machine learning model with the scikit-learn random forest from the structural features and the non-structural features of the training set, adjusting and optimizing the parameters of the machine learning model with a grid search function, predicting the interactions of rice-rice blast fungus protein pairs at the genome scale, and drawing a rice-rice blast fungus protein interaction network with Cytoscape software.