CN110223730B - Prediction method and prediction device for protein and small molecule binding site - Google Patents

Prediction method and prediction device for protein and small molecule binding site

Info

Publication number
CN110223730B
Authority
CN
China
Prior art keywords
protein
small molecule
binding
binding site
predicting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910492586.0A
Other languages
Chinese (zh)
Other versions
CN110223730A (en
Inventor
王伟
李克亮
张仕光
吕贺贺
赵远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Normal University
Original Assignee
Henan Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Normal University filed Critical Henan Normal University
Priority to CN201910492586.0A priority Critical patent/CN110223730B/en
Publication of CN110223730A publication Critical patent/CN110223730A/en
Application granted granted Critical
Publication of CN110223730B publication Critical patent/CN110223730B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention relates to a prediction method and a prediction device for binding sites of proteins and small molecules, belonging to the technical field of binding site prediction for small molecule binding proteins. The invention provides a novel method for predicting protein and small molecule binding sites in which data are extracted with a sliding sampling window. Because protein binding is influenced by the environment surrounding the binding residue, the features of the residue at the central position are represented by the data extracted over the whole window, which gives a better representation. An XGBoost classification model constructed from the extracted features has a better prediction effect and can predict the binding sites of proteins and small molecules more accurately.

Description

Prediction method and prediction device for protein and small molecule binding site
Technical Field
The invention relates to a prediction method and a prediction device for a binding site of protein and small molecules, belonging to the technical field of prediction of the binding site of the small molecule binding protein.
Background
Proteins do not act in isolation: to perform their biological functions they must interact with other proteins and with molecules such as DNA, RNA and small molecule ligands. The protein-small molecule binding site is an active site on the protein surface through which the protein performs its biological function. Knowing where a protein interacts with small molecules helps scientists understand the biological function of the protein and provides technical support for drug design targeting the protein.
Binding sites of small molecules on the protein surface can usually be found by biological experiments, and binding sites obtained this way are very reliable. However, identifying binding sites experimentally is often inefficient and costly, and is subject to technical, time and economic constraints. In contrast, predicting protein and small molecule binding sites by bioinformatics means saves considerable time and economic cost. Structure-based drug design is an extremely important research field in bioinformatics, and its first step is to accurately predict the position of the small molecule binding site on the protein surface.
At present, several methods for predicting protein-small molecule binding sites have been developed. For example, the Zhejiang University degree thesis 'New algorithm research and development for protein-small molecule binding site prediction' develops a novel computational prediction algorithm that integrates traditional algorithms such as LIGSITE, PASS, Q-SiteFinder and SURFNET and achieves a high prediction effect. However, when such algorithms construct and evaluate a prediction model, they usually focus only on the properties of a single amino acid residue and do not consider the position as a whole, including its surroundings, so the prediction effect is relatively limited.
Disclosure of Invention
The invention aims to provide a method for predicting a binding site of a protein and a small molecule.
The invention also provides a device for predicting the binding site of the protein and the small molecule, which can more accurately predict the binding site of the protein and the small molecule.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the prediction method of the binding site of the protein and the small molecule comprises the following steps:
1) extracting a small molecule binding protein data set capable of interacting with small molecules from a protein database; extracting a characteristic matrix corresponding to the small molecule binding protein residues by using a sliding sampling window method, wherein if the residues at the central position of the window are binding sites, the matrix extracted by the sampling window is a positive set; if the residue at the center of the window is a non-binding site, the matrix extracted by the sampling window is a negative set;
2) converting the characteristic matrix corresponding to the small molecule binding protein residue into a one-dimensional vector, and constructing a classification model;
3) inputting the corresponding characteristics of the protein to be detected into the classification model, and predicting the binding site of the protein to be detected.
According to the prediction method of the protein and small molecule binding site, a sliding sampling window method is used for extracting data, and the protein binding effect can be influenced by the surrounding environment of a binding residue, so that the data extracted by the sampling window method is used for representing the characteristics of a central position residue, and a better representation effect is achieved; after the extracted features are used for constructing the classification model, the classification model has a better prediction effect and can more accurately predict the binding sites of the proteins and the small molecules.
The corresponding features in step 1) and step 3) include: one or more of amino acid class, PSSM matrix, hydrophilicity, hydrophobicity, and electrostatic charge.
The analysis in the invention shows that differences in the distribution of residue types may be related to the functions of binding-site and non-binding-site residues; the evolutionary conservation of residues at different positions in a protein sequence is reflected by the Position Specific Scoring Matrix (PSSM) of the protein; residues in the small molecule binding region of a protein show a stronger hydrophilic tendency, and binding residues are also more strongly charged than non-binding residues. Therefore, the classification model can be constructed on the basis of one or more of the five features.
The window length in the sliding sampling window method in step 1) is 13 to 17.
The binding of proteins to small molecules is related not only to the characteristics of the binding residues but also to the environment surrounding them. When the window is too short, the sampled matrix cannot effectively represent the environment surrounding the binding residues; when the window is too long, the sampled matrix contains a large amount of irrelevant information, which degrades the classification model. Based on repeated experiments with different lengths, a sampling window length of 13 to 17 was finally selected as the feature calculation standard.
The classification model in step 2) is constructed using the XGBoost algorithm.
Tree boosting is an efficient and widely used machine learning method. The invention uses XGBoost, a scalable end-to-end tree boosting system that is widely used by scientists and has achieved significant results in many machine learning tasks. Gradient boosting is an improvement on boosting: each new model is fitted in the gradient direction that reduces the residual left by the previous models, so that the residual shrinks continuously as models are added.
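The residual-reduction idea can be illustrated with a toy example. The sketch below is not XGBoost and is not the patent's implementation; it is a minimal gradient boosting loop for squared loss (where the negative gradient equals the residual) using one-split regression stumps. The function names, the learning rate and the toy data are all assumptions made for illustration.

```python
# Toy gradient boosting for squared loss with one-split regression stumps.
# Illustrative only: XGBoost adds regularization, second-order terms and more.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must leave samples on both sides
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all x identical: predict the mean residual everywhere
        m = sum(residuals) / len(residuals)
        return (xs[0], m, m)
    return best[1:]

def predict(model, x):
    base, lr, stumps = model
    return base + lr * sum(lv if x <= t else rv for t, lv, rv in stumps)

def boost(xs, ys, rounds=20, lr=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    model = (sum(ys) / len(ys), lr, [])
    for _ in range(rounds):
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        model[2].append(fit_stump(xs, residuals))
    return model

def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(mse(boost(xs, ys, rounds=0), xs, ys), mse(boost(xs, ys, rounds=20), xs, ys))
```

Each round fits the next stump to what the current ensemble still gets wrong, so the training error falls as rounds are added.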
The protein database in step 1) is the SC-PDB database, which collects a comparatively large amount of protein and ligand small molecule data.
All small molecule binding protein data in step 1) are subjected to redundancy removal to obtain the small molecule binding protein data set. Specifically: sequences with homology of 30% or more are removed, sequences with a length of 40 or less are removed, and structures that fail the resolution criterion are removed (the resolution threshold is given only as an image in the original, Figure BDA0002087521990000031).
The data set obtained after the redundancy removal processing has stronger effectiveness.
Screening one or more types of small molecule binding proteins selected from the group consisting of ACO, ADP, ANP, ATP, COA, FAD, FMN, GDP, GNP, NAD, NAP, NDP, SAH and SAM in step 1) as a small molecule binding protein data set.
The data size of the 14 kinds of small molecule binding protein is large enough to be used for classification, and training and testing of a classification model.
The device for predicting the binding site of the protein and the small molecule comprises the following modules:
a module for extracting a small molecule binding protein data set from a protein database;
a module for obtaining corresponding characteristics of small molecule binding protein residues;
a module for extracting a characteristic matrix corresponding to the small molecule binding protein residue by adopting a sliding sampling window method; if the residue at the center of the window is a binding site, the matrix extracted by the sampling window is a positive set; if the residue at the center of the window is a non-binding site, the matrix extracted by the sampling window is a negative set;
a module for converting the characteristic matrix corresponding to the small molecule binding protein residue into a one-dimensional vector; a module for constructing a classification model;
and the module is used for inputting the corresponding characteristics of the protein to be detected into the classification model and predicting the binding site of the protein to be detected.
The prediction device of the protein and small molecule binding site comprises a module for extracting a characteristic matrix corresponding to small molecule binding protein residues by adopting a sliding sampling window method, and data are extracted by adopting the sampling window method to represent the characteristics of the residues at the central position, so that the prediction device has better representation effect; after the extracted features are used for constructing the classification model, the classification model has a better prediction effect, and further the prediction device can more accurately predict the binding sites of the proteins and the small molecules.
The corresponding features include: one or more of amino acid class, PSSM matrix, hydrophilicity, hydrophobicity, and electrostatic charge.
The analysis in the invention shows that differences in the distribution of residue types may be related to the functions of binding-site and non-binding-site residues; the evolutionary conservation of residues at different positions in a protein sequence is reflected by the Position Specific Scoring Matrix (PSSM) of the protein; residues in the small molecule binding region of a protein show a stronger hydrophilic tendency, and binding residues are also more strongly charged than non-binding residues. Therefore, the classification model can be constructed on the basis of one or more of the five features.
The window length in the sliding sampling window method is 13 to 17.
The binding of proteins to small molecules is related not only to the characteristics of the binding residues but also to the environment surrounding them. When the window is too short, the sampled matrix cannot effectively represent the environment surrounding the binding residues; when the window is too long, the sampled matrix contains a large amount of irrelevant information, which degrades the classification model. In summary, a sampling window length of 13 to 17 is selected as the feature calculation criterion.
Drawings
FIG. 1 is a flowchart illustrating the analysis and prediction of small molecule binding sites of proteins according to example 1 of the method for predicting small molecule binding sites of proteins of the present invention;
FIG. 2 is a diagram showing a PSSM matrix file corresponding to the protein number 12GS in the PDB database in example 1 of the method for predicting binding sites of proteins and small molecules according to the present invention;
FIG. 3 is a diagram showing an example of the geometry of the binding region of the RNA-binding protein in example 1 of the method for predicting the binding site of a protein and a small molecule of the present invention;
FIG. 4 is a schematic diagram of the sliding window sampling of the protein sequence in example 1 of the method for predicting the binding site of the protein and small molecule of the present invention;
FIG. 5 is a diagram showing the distribution of binding residues and non-binding residues of 20 types of amino acids in a small molecule sequence of a protein according to example 1 of the method for predicting binding sites between a protein and a small molecule of the present invention;
FIG. 6 is a diagram showing the distribution of the hydrophilicity property values of the residues of the small molecule binding site and non-binding site of the protein in example 1 of the method for predicting the binding site of a protein and a small molecule according to the present invention;
FIG. 7 is a distribution diagram of the hydrophobicity property values of the residues of the small molecule binding site and the non-binding site of the protein in example 1 of the method for predicting the binding site of the protein and the small molecule of the present invention;
FIG. 8 is a graph showing electrostatic charge property distribution of residues at the small molecule binding site and non-binding site of the protein in example 1 of the method for predicting the binding site of the protein and the small molecule according to the present invention;
FIG. 9 is a diagram showing the distribution of the hydrogen bond property values of the residues of the small molecule binding site and the non-binding site of the protein in example 1 of the method for predicting the binding site of the protein and the small molecule according to the present invention;
FIG. 10 is a graph showing the relationship between the sampling window w and the AUC of the classification results in example 1 of the method for predicting a binding site of a protein or a small molecule of the present invention;
FIG. 11 is a graph showing the relationship between the sampling window w and the accuracy of classification results in example 1 of the method for predicting binding sites of proteins and small molecules according to the present invention;
FIG. 12 is a graph showing the calculation of importance scores for various features by the average reduction accuracy method in example 1 of the method for predicting binding sites of proteins and small molecules according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
Example 1 method for predicting binding site of protein and Small molecule
The invention provides a novel method for predicting the binding sites of small molecules on proteins based on protein-small molecule PDB data. Physicochemical properties, evolutionary information and other features of the residues in a protein sequence are extracted to generate the data sets required for training and testing; a classification model is trained and tested on them and is finally used to classify residues as binding sites or non-binding sites. PSSM information has been applied by many scientists to tasks such as predicting binding sites, predicting secondary structure and analyzing protein function, because whether a residue in a protein sequence mutates during evolution is determined by many factors that also affect the binding of the protein to small molecules. The physicochemical properties of binding and non-binding residues also differ. To investigate these differences, the hydrophobicity, hydrophilicity, electrostatic charge and hydrogen bonding of binding and non-binding residues were analyzed, and the type of each amino acid residue was expressed as a one-hot encoding. On this basis, an XGBoost classification model was constructed to classify residues as binding or non-binding sites. The model was trained and tested on data sets for 14 classes of small molecule ligands and achieved an average AUC of 0.918 and an average accuracy of 0.913. The experimental flow is shown in FIG. 1.
1. Construction of protein Small molecule datasets
Protein-ligand small molecule data sets were extracted from the SC-PDB database. As of December 2018, data for 4782 proteins and 6326 small molecule ligands had been collected from the SC-PDB database. During the experiment, all small molecule binding protein data were processed to remove redundancy using the PISCES program (http://dunback.fccc.edu/Guoli/PISCES.php). The first screening condition set in the software is that sequence homology must not exceed 30%. To remove short protein sequences, the sequence length is required to be greater than 40. To obtain protein data of relatively high precision, low-resolution structures are removed, and only data meeting the resolution criterion are kept (the threshold is given only as an image in the original, Figure BDA0002087521990000052). Finally, 5090 protein-ligand small molecule entries were screened out, and 14 classes of small molecule data whose size is large enough for training and testing were selected for the experiment. Table 1 shows, for each class, the number of protein sequences and the numbers of binding and non-binding residues in those sequences.
TABLE 1 sequence and binding site information for different classes of protein Small molecule datasets
Figure BDA0002087521990000051
Figure BDA0002087521990000061
2. Feature calculation
1) Distribution of amino acid classes
A total of 6582 protein sequence data were collected from the SC-PDB database in combination with 14 species of ligand small molecules. The distribution of various residues is obtained by calculating the amino acid classes of the binding residues and the non-binding residues in the protein sequence.
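As a rough illustration of this tally, the sketch below counts amino acid classes separately for binding and non-binding residues and normalizes the counts to fractions. The function name, the one-letter sequence encoding and the 1/0 label convention are assumptions made for the example, not details taken from the patent.

```python
from collections import Counter

def class_distribution(sequence, labels):
    """Return (binding, non_binding) dicts mapping amino acid -> fraction.

    sequence: string of one-letter amino acid codes.
    labels:   per-residue labels, 1 = binding residue, 0 = non-binding residue.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    binding = Counter(aa for aa, y in zip(sequence, labels) if y == 1)
    non_binding = Counter(aa for aa, y in zip(sequence, labels) if y == 0)

    def normalize(counts):
        total = sum(counts.values())
        return {aa: n / total for aa, n in counts.items()} if total else {}

    return normalize(binding), normalize(non_binding)

# Hypothetical six-residue example
bind, non_bind = class_distribution("GAGKLA", [1, 0, 1, 1, 0, 0])
print(bind, non_bind)
```

Comparing the two dictionaries residue by residue gives the kind of binding versus non-binding distribution that FIG. 5 plots.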
2) PSSM matrix
The evolutionary conservation of residues at different positions in a protein sequence can be reflected by a position weight matrix of the protein. Position Weight Matrices (PWM), also known as Position Specific Weight Matrices (PSWM) or Position Specific Scoring Matrices (PSSM), are a common representation of motifs (patterns) in biological sequences. The position weight matrix was introduced by the American geneticist Gary Stormo and colleagues in 1982.
PSI-BLAST is the Position-Specific Iterative Basic Local Alignment Search Tool. It is used to find distant relatives of a protein. First, a list of all closely related proteins is created, and these are combined into a general "profile" that summarizes the important features present in the sequences. This profile is then used to query a protein database and find a larger group of proteins. The larger group is used to build another profile, and the process is repeated. PSI-BLAST is more sensitive than the standard BLAST program at detecting distant evolutionary relationships. In the experiments, PSI-BLAST was used to search the NCBI database with the E-value threshold set to 0.001 and the number of iterations set to 3. The PSSM file obtained after processing the protein data with PSI-BLAST is shown in FIG. 2.
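To make the position weight matrix idea concrete, the sketch below builds a log-odds matrix from a small set of already aligned sequences. It is a simplification and not what PSI-BLAST computes internally: it assumes a uniform background distribution, a fixed pseudocount and no sequence weighting, and the function name and toy alignment are hypothetical.

```python
import math

def position_weight_matrix(aligned, alphabet, pseudocount=1.0):
    """Return one dict per column mapping symbol -> log2-odds score.

    aligned:  equal-length sequences (gap handling is omitted in this sketch).
    alphabet: symbols to score, e.g. the 20 amino acid letters.
    The background frequency is assumed uniform over the alphabet.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(alphabet)
    denom = len(aligned) + pseudocount * len(alphabet)
    matrix = []
    for col in range(length):
        column = [s[col] for s in aligned]
        matrix.append({
            a: math.log2(((column.count(a) + pseudocount) / denom) / background)
            for a in alphabet
        })
    return matrix

# Toy alignment over a four-letter alphabet, for readability only
pwm = position_weight_matrix(["ACG", "ACG", "ATG", "ACG"], "ACGT")
print(pwm[1])
```

A strongly conserved column gives one symbol a high positive score and the others negative scores, which is how the matrix encodes conservation at each position.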
3) Physical and chemical properties
AAindex is a database of amino acid attribute indices, which represent the physicochemical properties of amino acids. At present, 544 amino acid indices are collected in the database. In an entry, as shown in FIG. 3, H is the index number of the attribute type, D is the description of the attribute, A is the author who published the attribute values, T is the title of the related paper, J is the journal in which the paper was published, and C gives the correlation coefficients between this attribute and other attributes. In proteome studies, a protein sequence is a symbol sequence composed of the 20 amino acids in varying order. In the experiment, the symbol sequence is converted into a sequence of amino acid attribute indices so that the protein is encoded numerically.
Hopp provides a hydrophilicity scale with fixed values for the 20 amino acids (AAindex ID: HOPT810101). Jones gives a hydrophobicity scale for the 20 amino acids (AAindex ID: JOND750101). It is generally considered that hydrophobic amino acids tend to lie inside the protein structure, while hydrophilic amino acids tend to be distributed on the surface of the protein molecule. Hydrophilic amino acids are therefore more likely to interact with small molecules, a property that helps a classification model distinguish binding from non-binding residues.
The electrostatic charge of the small molecule binding region is one of the properties that most influence the interaction of a protein with a ligand molecule. Electrostatic complementarity facilitates non-specific binding of proteins to small molecules. Of the 20 standard amino acids, ARG, HIS and LYS are generally considered to be positively charged, and ASP and GLU negatively charged (AAindex IDs: FAUJ880111 and FAUJ880112).
Hydrogen bonds are an interaction that is stronger than van der Waals forces but weaker than covalent and ionic bonds. Hydrogen bonding plays a key role in determining the specificity of ligand molecule binding, so the hydrogen bonding of amino acids is one of the important physicochemical properties to be studied. Amino acid hydrogen bond property values are available from the AAindex database (AAindex ID: FAUJ880109).
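The charge assignment above translates directly into a per-residue numeric encoding. The sketch below is one possible encoding, assuming one-letter amino acid codes and a pair of 0/1 indicators (positive, negative) per residue; the function name and the indicator layout are illustrative choices rather than the patent's stated format.

```python
# Charge classes as stated in the text: ARG, HIS, LYS positive; ASP, GLU negative.
POSITIVE = {"R", "H", "K"}
NEGATIVE = {"D", "E"}

def charge_features(sequence):
    """Map each residue to a (positive, negative) pair of 0/1 indicators."""
    return [
        (1 if aa in POSITIVE else 0, 1 if aa in NEGATIVE else 0)
        for aa in sequence
    ]

# Hypothetical four-residue peptide: Arg, Asp, Gly, Lys
print(charge_features("RDGK"))
```

Other AAindex scales, such as hydrophilicity, could be encoded the same way by looking each residue up in a 20-entry table.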
3. Sliding sampling window
In predicting the binding sites of proteins and small molecules, extracting information from protein sequences to construct feature vectors is a key step. Since binding residues alternate with non-binding residues along a protein sequence, a sliding window is used to sample across the sequence. As shown in FIG. 4, each amino acid residue in the protein sequence is taken in turn as the central point, (w-1)/2 residues are taken on each side of it, and the matrix of attribute values of these residues is extracted. The sampling window therefore has length w and contains the central residue and its flanking residues; nine odd values of w, from 3 to 19, were tested as the window length. If the residue at the center of the window is a binding site, the matrix extracted by the sampling window belongs to the positive set; if it is a non-binding site, the matrix belongs to the negative set. The sampling window method is used to extract data because protein binding is affected by the environment surrounding the binding residues.
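A minimal sketch of the sliding sampling window follows. It assumes per-residue feature rows of equal width and 1/0 binding labels, and it pads the sequence ends with a constant so that every residue can be a window center. The padding strategy, the function name and the flattening order are assumptions, since the text does not specify how windows at the sequence ends are handled.

```python
def sliding_windows(features, labels, w=15, pad_value=0.0):
    """Split per-residue windows into positive and negative sets.

    features: list of per-residue feature rows (L rows, d values each).
    labels:   per-residue labels, 1 = binding site, 0 = non-binding site.
    w:        odd window length; (w - 1) // 2 residues are taken on each side.
    Returns (positives, negatives), each a list of flattened vectors of length w * d.
    """
    if w < 1 or w % 2 == 0:
        raise ValueError("w must be a positive odd integer")
    half = (w - 1) // 2
    d = len(features[0])
    pad = [pad_value] * d  # assumed end padding, not specified in the source
    padded = [pad] * half + [list(row) for row in features] + [pad] * half
    positives, negatives = [], []
    for i, y in enumerate(labels):
        window = padded[i:i + w]  # centered on residue i
        vector = [v for row in window for v in row]  # matrix -> 1-D vector
        (positives if y == 1 else negatives).append(vector)
    return positives, negatives

# Toy example: one feature per residue, window length 3
pos, neg = sliding_windows([[1], [2], [3], [4]], [0, 1, 0, 0], w=3)
print(pos, neg)
```

The flattening step corresponds to converting each window's feature matrix into the one-dimensional vector that is later passed to the classifier.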
4. Results and discussion
1) Distribution of amino acid classes
The distribution of amino acid classes is shown in FIG. 5, with a greater distribution of ALA, LEU, GLY, VAL and GLU for non-binding residues. The four residues GLY, GLU, CYS and LYS show statistical differences. The difference in the distribution of the above residues may be related to the function of the residues at the binding site as well as the non-binding site.
2) Protein binding site physicochemical property analysis
In addition to statistical amino acid class distribution, the distribution of hydrophilicity, hydrophobicity, electrostatic charge and hydrogen bonding of binding sites and non-binding sites was also calculated in the experiment. The results are shown in fig. 6 and 7, and the difference between the hydrophilicity value and the hydrophobicity value between the protein binding site and the non-binding site is very significant. This is because the binding site residues are distributed mainly on the protein surface and therefore have a greater probability of coming into contact with water molecules. As shown in fig. 8, the non-binding region has a smaller electrostatic charge distribution than the binding region because electrostatic complementation facilitates the binding of proteins to small molecules. Previous studies found that hydrogen bonds play an important role in the binding of ligands to proteins, but calculations showed that the distribution of hydrogen bonds between bound and unbound residues did not show a statistically significant difference (as shown in figure 9). In summary, the hydrophilicity, hydrophobicity, and electrostatic charge of small protein molecule residues are helpful in constructing classification models that predict binding and non-binding sites.
3) Determining the length of a sampling window
The experiment trains, evaluates and tests the constructed classification model on data sets for 14 classes of small molecule ligands: ACO, ADP, ANP, ATP, COA, FAD, FMN, GDP, GNP, NAD, NAP, NDP, SAH and SAM. In feature extraction, the length of the sampling window determines the quality of the extracted data set and therefore affects the classification results. The experiment thus tested window lengths of 3, 5, 7, 9, 11, 13, 15, 17 and 19 on the 14 data sets. Because the numbers of binding and non-binding sites in the data sets differ greatly, downsampling was used to balance the positive and negative samples. For each data set, hydrophilicity, hydrophobicity, electrostatic charge, amino acid class and the PSSM were combined into a feature matrix for training and testing.
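The balancing step can be sketched as random undersampling of the larger class. The source does not say which downsampling scheme was used, so uniform random sampling without replacement, the fixed seed and the function name below are assumptions.

```python
import random

def downsample(positives, negatives, seed=0):
    """Randomly reduce the larger set so both sets have the same size."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)

# Hypothetical imbalance: 3 binding windows versus 10 non-binding windows
pos = [[1.0], [2.0], [3.0]]
neg = [[float(i)] for i in range(10, 20)]
bal_pos, bal_neg = downsample(pos, neg)
print(len(bal_pos), len(bal_neg))
```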
Finally, the XGBoost classification model was trained and tested using 10-fold cross validation, with AUC and accuracy as the comparison criteria. The relationship between AUC and window length is shown in FIG. 10: AUC rises gradually as the window length increases from 3, peaks at a window length of 15 for the majority of the small molecule classes, and gradually declines as the window grows further. As shown in FIG. 11, accuracy behaves similarly: it increases gradually with the window length, peaks at w = 15 for all data sets, and then begins to decrease. This is because the binding of proteins to small molecules is related not only to the nature of the binding residues but also to the environment surrounding them. When the window is short, the sampled matrix does not effectively characterize the environment around the binding residues, so AUC and accuracy are low. On the other hand, a small molecule has few atoms relative to the protein, so the number of binding sites in the sequence is limited; when the window is too long, the sampled matrix contains a large amount of irrelevant information, which degrades the classification model and lowers AUC and accuracy. In conclusion, a sampling window length of 15 was selected as the final feature calculation criterion.
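The evaluation protocol can be sketched with three small helpers: a fold splitter, AUC and accuracy. These are generic textbook definitions and not the patent's code; the interleaved fold assignment, the 0.5 decision threshold and all function names are assumptions.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; fold f tests samples with i % k == f."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def auc(labels, scores):
    """Probability that a random positive scores above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return hits / len(labels)

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(auc(labels, scores), accuracy(labels, scores))
```

In a full run, a model would be trained on each `train` split and scored on the matching `test` split, and the per-fold AUC and accuracy would be averaged for each window length.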
4) Feature importance testing
Six types of features were analyzed above: amino acid class, PSSM, hydrophilicity, hydrophobicity, electrostatic charge and hydrogen bonding. To measure how important each feature is to the classification model, the mean decrease accuracy method was used to calculate the influence of each of the six feature types on the model. The basic idea is to replace the values of one feature with random numbers, input the modified feature matrix into the classification model, and observe how much the accuracy of the model drops. Randomizing a feature of low importance has little effect on the accuracy, whereas randomizing an important feature reduces the accuracy substantially. The results are shown in FIG. 12: the PSSM has the highest importance score, followed by amino acid class, hydrophilicity, hydrophobicity and charge, and hydrogen bonding contributes least to the classification model. This is consistent with the preceding analysis of protein binding sites.
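The mean decrease accuracy idea can be sketched as follows. This version randomizes a feature by shuffling its column, a common variant that keeps the value distribution intact; the text only says the values are changed into random numbers, so the shuffling, the repeat count, the toy threshold classifier and all names are assumptions.

```python
import random

def mean_decrease_accuracy(predict, X, y, n_repeats=20, seed=0):
    """Return the average accuracy drop per feature column when it is shuffled.

    predict: function mapping one feature row (list) to a 0/1 prediction.
    X, y:    feature rows (lists) and their true 0/1 labels.
    """
    rng = random.Random(seed)

    def acc(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = acc(X)
    drops = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        total = 0.0
        for _ in range(n_repeats):
            shuffled = column[:]
            rng.shuffle(shuffled)  # break the link between feature j and the label
            perturbed = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, shuffled)]
            total += baseline - acc(perturbed)
        drops.append(total / n_repeats)
    return drops

# Toy model that only looks at feature 0, so feature 1 should score zero
toy_predict = lambda row: 1 if row[0] > 0.5 else 0
X = [[0.0, 5.0], [0.0, 1.0], [0.0, 3.0], [0.0, 2.0],
     [1.0, 4.0], [1.0, 0.0], [1.0, 6.0], [1.0, 7.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(mean_decrease_accuracy(toy_predict, X, y))
```

A feature the model ignores produces no accuracy drop, while a feature the model depends on produces a clear drop, which is the ordering FIG. 12 reports for the real features.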
5. Constructing classification models
The feature matrix constructed in the experiment fuses the information of amino acid class, PSSM matrix, hydrophilicity, hydrophobicity and electrostatic charge. The feature matrix corresponding to each small molecule binding protein residue is converted into a one-dimensional vector, and a classification model is constructed using the XGBoost algorithm. The sampling window length in the feature calculation process is 15.
Subsequently, 10-fold cross-validation tests were performed on the 14 classes of protein-small molecule datasets using the XGBoost classification model. The test results are shown in Table 2. On the ATP small molecule dataset, the classification model achieved an AUC of 0.935 and an accuracy of 0.927, the highest of all the small molecule datasets. The average AUC and accuracy over all datasets are 0.918 and 0.913, respectively. Larger AUC and accuracy values indicate a better classification effect of the trained model, so these results show that the classification model constructed in the experiment classifies well. The precision, recall, and F1 (which combines precision and recall) values in the table are provided for reference.
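As a rough illustration of how a per-residue feature matrix becomes a one-dimensional training vector, the sketch below slides a window over a sequence, labels each window by its central residue, and flattens the window matrix row by row. The function name `window_samples` and the zero padding at the sequence termini are assumptions made for this example; this section does not state how windows that extend past the ends of the sequence are handled.

```python
def window_samples(features, is_binding, window=15):
    """Build flattened window vectors and labels for one protein.

    features:   list of per-residue feature rows, all of the same length
    is_binding: list of booleans, True where the residue is a binding site
    Returns (vectors, labels); label 1 marks the positive set, 0 the negative set.
    """
    if window % 2 == 0:
        raise ValueError("window length must be odd so that it has a central residue")
    half = window // 2
    n_feat = len(features[0])
    # Assumption: positions beyond the termini are filled with zeros.
    pad = [[0.0] * n_feat for _ in range(half)]
    padded = pad + [list(row) for row in features] + pad
    vectors, labels = [], []
    for center in range(len(features)):
        matrix = padded[center:center + window]              # window x n_feat matrix
        vectors.append([v for row in matrix for v in row])  # row-major flattening
        labels.append(1 if is_binding[center] else 0)
    return vectors, labels
```

With a window length of 15 and `n_feat` features per residue, each vector has 15 × `n_feat` entries; vectors of this form could then be passed to a classifier such as XGBoost.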
TABLE 2 Classification prediction results of the XGBoost model on the protein-small molecule datasets
[Table 2 is reproduced as images in the original publication: Figure BDA0002087521990000091 and Figure BDA0002087521990000101]
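For reference, the five evaluation measures named above (AUC, accuracy, precision, recall and F1) can be computed as in the following sketch. The function names are illustrative only, labels are assumed to be 1 for binding and 0 for non-binding residues, and the AUC is computed directly as the fraction of positive-negative pairs that are ranked correctly, which is simple but quadratic in the number of samples.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = binding site)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Accuracy, precision, recall and F1 depend on a decision threshold applied to the classifier's scores, whereas AUC is computed from the scores themselves.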
6. Prediction of unknown proteins
The corresponding features of the protein to be tested are input into the classification model, and the binding sites of the protein to be tested are predicted. The corresponding features here are the amino acid class, PSSM matrix, hydrophilicity, hydrophobicity and electrostatic charge features mentioned above; the feature matrix corresponding to each protein residue is converted into a one-dimensional vector before being input into the classification model.
7. Summary
The invention focuses on classifying and predicting the binding sites and non-binding sites of proteins and small molecules. The amino acid composition of the binding sites and non-binding sites was analyzed, and the results show that residues such as ALA and LEU are more abundant than other residues. The distributions of four residues, GLY, GLU, CYS and LYS, differ clearly between binding and non-binding sites. The experiment also analyzed the hydrophilicity, hydrophobicity, electrostatic charge and hydrogen bonding of the residues to compare the physicochemical differences between the two types of residues. The analysis shows that residues in the small molecule binding region of the protein have a stronger hydrophilic tendency, that binding residues carry a stronger charge than non-binding residues, and that the hydrogen bond values of the two types of residues differ little, which is consistent with the result of the feature importance analysis. Feature vectors were then constructed from these features fused with the PSSM matrix. Based on the 14 protein-small molecule datasets, a classification model was constructed using the XGBoost algorithm and evaluated with 10-fold cross-validation. The final classification model achieved an average AUC of 0.918 and an average accuracy of 0.913. This shows that the method proposed in the present invention can effectively predict the binding sites and non-binding sites of proteins and small molecules, and the analysis carried out during the experiment also helps in understanding the binding mechanism between proteins and small molecules.
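The prediction step for an unknown protein can be sketched as below. This is only an outline under assumptions: `predict_binding_sites` and the `score_vector` callback are hypothetical names, `score_vector` stands in for the trained classification model and is assumed to return a binding probability for one flattened window, the 0.5 threshold is an arbitrary default, and windows at the sequence termini are zero-padded.

```python
def predict_binding_sites(features, score_vector, window=15, threshold=0.5):
    """Return the indices of residues predicted to be binding sites.

    features:     list of per-residue feature rows for the protein to be tested
    score_vector: hypothetical callable mapping one flattened window vector
                  to a binding probability (stands in for the trained model)
    """
    half = window // 2
    n_feat = len(features[0])
    # Assumption: zero rows are used where the window extends past the termini.
    pad = [[0.0] * n_feat for _ in range(half)]
    padded = pad + [list(row) for row in features] + pad
    sites = []
    for i in range(len(features)):
        # Flatten the window centered on residue i into a one-dimensional vector.
        vector = [v for row in padded[i:i + window] for v in row]
        if score_vector(vector) >= threshold:
            sites.append(i)
    return sites
```

The returned indices are zero-based positions in the input sequence; mapping them back to residue numbers in a structure file is left to the caller.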
Example 1 of a device for predicting binding sites of proteins and small molecules
The device for predicting the binding site of the protein and the small molecule in the embodiment comprises the following modules:
a module for extracting a small molecule binding protein data set from a protein database;
a module for obtaining corresponding characteristics of small molecule binding protein residues; the corresponding features include: one or more of amino acid class, PSSM matrix, hydrophilicity, hydrophobicity, and electrostatic charge;
a module for extracting a characteristic matrix corresponding to the small molecule binding protein residue by adopting a sliding sampling window method; if the residue at the center of the window is a binding site, the matrix extracted by the sampling window is a positive set; the matrix extracted by the sampling window is the negative set if the residue at the center of the window is a non-binding site. The window length in the sliding sampling window method is equal to 13-17, preferably 15;
a module for converting the characteristic matrix corresponding to the small molecule binding protein residue into a one-dimensional vector; a module for constructing a classification model;
and a module for inputting the corresponding features of the protein to be tested into the classification model and predicting the binding sites of the protein to be tested. The corresponding features include: one or more of amino acid class, PSSM matrix, hydrophilicity, hydrophobicity, and electrostatic charge.

Claims (10)

1. The prediction method of the binding site of the protein and the small molecule is characterized in that: the method comprises the following steps:
1) extracting a small molecule binding protein data set capable of interacting with small molecules from a protein database; extracting a characteristic matrix corresponding to the small molecule binding protein residues by using a sliding sampling window method, wherein if the residues at the central position of the window are binding sites, the matrix extracted by the sampling window is a positive set; if the residue at the center of the window is a non-binding site, the matrix extracted by the sampling window is a negative set;
2) converting the corresponding characteristic matrix of the small molecule binding protein residues into a one-dimensional vector to construct a classification model;
3) inputting the corresponding characteristics of the protein to be detected into the classification model, and predicting the binding site of the protein to be detected.
2. The method of predicting the binding site of a protein to a small molecule according to claim 1, wherein: the corresponding features in step 1) and step 3) include: one or more of amino acid class, PSSM matrix, hydrophilicity, hydrophobicity, and electrostatic charge.
3. The method of predicting the binding site of a protein to a small molecule according to claim 1, wherein: the window length in the sliding sampling window method in the step 1) is equal to 13-17.
4. The method of predicting the binding site of a protein to a small molecule according to claim 1, wherein: the classification model in the step 1) is constructed by using an XGboost algorithm.
5. The method of predicting the binding site of a protein to a small molecule according to claim 1, wherein: the protein database in the step 1) is an SC-PDB database.
6. The method of predicting the binding site of a protein to a small molecule according to claim 1, wherein: carrying out redundancy removal treatment on all small molecule binding protein data in the step 1) to obtain a small molecule binding protein data set; the method specifically comprises the following steps: removing sequences with homology more than or equal to 30 percent, removing sequences with sequence length less than or equal to 40 and removing sequences with resolution less than or equal to 3 Å.
7. The method of predicting the binding site of a protein to a small molecule according to claim 1, wherein: screening one or more types of small molecule binding proteins selected from the group consisting of ACO, ADP, ANP, ATP, COA, FAD, FMN, GDP, GNP, NAD, NAP, NDP, SAH and SAM in step 1) as a small molecule binding protein data set.
8. The device for predicting the binding site of the protein and the small molecule is characterized in that: the system comprises the following modules:
a module for extracting a small molecule binding protein data set from a protein database;
a module for obtaining corresponding characteristics of small molecule binding protein residues;
a module for extracting a characteristic matrix corresponding to the small molecule binding protein residue by adopting a sliding sampling window method; if the residue at the center of the window is a binding site, the matrix extracted by the sampling window is a positive set; if the residue at the center of the window is a non-binding site, the matrix extracted by the sampling window is a negative set;
a module for converting the characteristic matrix corresponding to the small molecule binding protein residue into a one-dimensional vector; a module for constructing a classification model;
and the module is used for inputting the corresponding characteristics of the protein to be detected into the classification model and predicting the binding sites of the protein to be detected.
9. The device for predicting the binding site of a protein or small molecule according to claim 8, wherein: the corresponding features include: one or more of amino acid class, PSSM matrix, hydrophilicity, hydrophobicity, and electrostatic charge.
10. The apparatus for predicting the binding site of a protein to a small molecule according to claim 8, wherein: the window length in the sliding sampling window method is equal to 13-17.
CN201910492586.0A 2019-06-06 2019-06-06 Prediction method and prediction device for protein and small molecule binding site Active CN110223730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910492586.0A CN110223730B (en) 2019-06-06 2019-06-06 Prediction method and prediction device for protein and small molecule binding site

Publications (2)

Publication Number Publication Date
CN110223730A CN110223730A (en) 2019-09-10
CN110223730B true CN110223730B (en) 2022-09-27

Family

ID=67816036

Country Status (1)

Country Link
CN (1) CN110223730B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111048145B (en) * 2019-12-20 2024-01-19 东软集团股份有限公司 Method, apparatus, device and storage medium for generating protein prediction model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001092990A2 (en) * 2000-06-01 2001-12-06 Variagenics, Inc. Structure-based methods for assessing amino acid variances
CN106650309A (en) * 2016-12-30 2017-05-10 中国科学院深圳先进技术研究院 Prediction method and prediction device for membrane protein residue interaction relation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Prediction of protein residue B-factors by support vector regression; Yin Hui et al.; Computers and Applied Chemistry; 20111128 (No. 11); full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant