CN112053742A - Method and device for screening molecular target protein, computer equipment and storage medium - Google Patents

Method and device for screening molecular target protein, computer equipment and storage medium

Info

Publication number
CN112053742A
Authority
CN
China
Prior art keywords
target
target protein
prediction
screening
molecules
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010716290.5A
Other languages
Chinese (zh)
Inventor
印明柱
曹东升
杨素青
陈翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangya Hospital of Central South University
Original Assignee
Xiangya Hospital of Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangya Hospital of Central South University
Priority to CN202010716290.5A
Publication of CN112053742A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B5/20 Probabilistic models
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Bioethics (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Artificial Intelligence (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The application relates to a method and an apparatus for screening molecular target proteins, a computer device, and a storage medium. The method comprises: obtaining a target molecule and performing target prediction on it using different target prediction modes to obtain predicted target proteins; screening the predicted target proteins according to the number of prediction modes that predicted each of them, to obtain potential target proteins corresponding to the target molecule; obtaining a quantitative structure-activity relationship model corresponding to each potential target protein; inputting the target molecule into the quantitative structure-activity relationship model to obtain the binding activity probability of the target molecule and the potential target protein; and screening key target proteins matched with the target molecule from the potential target proteins according to the binding activity probability and a preset probability threshold. Accurate and reliable screening of target proteins is thereby achieved.

Description

Method and device for screening molecular target protein, computer equipment and storage medium
Technical Field
The application relates to the technical field of biological medicines, in particular to a molecular target protein screening method, a molecular target protein screening device, a computer device and a storage medium.
Background
In the drug discovery and development stages, target identification is the foundation of modern drug development. Because of its high cost in time and money, experimental target validation is difficult to apply when large numbers of protein targets are involved. In addition, target identification by Affinity-based Protein Profiling (ABPP) may involve a large amount of non-specifically bound protein.
In conventional approaches, target prediction is carried out computationally: from a huge protein space, a small number of proteins with a high probability of interacting with the query compound are identified, so that candidate proteins are enriched while a high recall is maintained and the experimental workload is reduced.
However, current target prediction methods suffer from low accuracy and reliability.
Disclosure of Invention
In view of the above, there is a need to provide a method, an apparatus, a computer device and a storage medium for screening a molecular target protein, which can improve accuracy and reliability.
A method of screening for molecular target proteins, comprising:
obtaining target molecules, and performing target prediction on the target molecules by adopting different target prediction modes to obtain predicted target proteins;
screening the predicted target protein according to the number of prediction modes corresponding to the predicted target protein to obtain a potential target protein corresponding to the target molecule;
obtaining a quantitative structure-activity relationship model corresponding to the potential target protein;
inputting the target molecule into a quantitative structure-activity relationship model to obtain the binding activity probability of the target molecule and the potential target protein;
and screening out key target proteins matched with the target molecules from the potential target proteins according to the binding activity probability and a preset probability threshold.
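The last two steps above (scoring the target molecule with a per-target model and applying the preset probability threshold) can be sketched as follows. This is a minimal illustration only: the function name `screen_key_targets`, the toy models, the target labels and the use of `>=` for the threshold comparison are assumptions made for the example and are not specified by the application.

```python
from typing import Callable, Dict, List


def screen_key_targets(
    molecule: str,
    qsar_models: Dict[str, Callable[[str], float]],
    threshold: float = 0.5,
) -> List[str]:
    """Return the potential targets whose predicted binding activity
    probability for the molecule reaches the preset threshold."""
    key_targets = []
    for target, model in qsar_models.items():
        probability = model(molecule)  # binding activity probability in [0, 1]
        if probability >= threshold:   # assumption: inclusive comparison
            key_targets.append(target)
    return key_targets


# Hypothetical stand-ins for trained per-target QSAR models.
toy_models = {
    "TARGET_A": lambda smiles: 0.91,
    "TARGET_B": lambda smiles: 0.32,
    "TARGET_C": lambda smiles: 0.67,
}
key = screen_key_targets("CCO", toy_models, threshold=0.6)
```

In this toy run only the targets scoring at or above 0.6 are retained as key target proteins.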
In one embodiment, the target prediction of the target molecule by using different target prediction modes to obtain the predicted target protein comprises:
respectively inputting target molecules into a plurality of target prediction platforms based on ligands to obtain prediction target proteins corresponding to the target prediction platforms;
screening the predicted target protein according to the number of prediction modes corresponding to the predicted target protein to obtain the potential target protein corresponding to the target molecule, wherein the screening comprises the following steps:
respectively obtaining the number of target prediction platforms corresponding to the same prediction target protein, and screening the prediction target proteins of which the number of the corresponding target prediction platforms is not less than 2;
and marking the predicted target protein obtained by screening as a potential target protein corresponding to the target molecule.
In one embodiment, after obtaining a target molecule and performing target prediction on the target molecule by using different target prediction modes to obtain a predicted target protein, the method further includes:
introducing the predicted target protein into a preset functional annotation tool;
performing GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment analysis on the predicted target protein based on the functional annotation tool to obtain enriched GO biological processes and KEGG pathways;
performing reliability verification on the predicted target protein according to the enriched GO biological process and the KEGG pathway;
and updating the predicted target protein according to the verified predicted target protein.
In one embodiment, the quantitative structure-activity relationship model comprises an integrated quantitative structure-activity relationship model;
obtaining an integrated quantitative structure-activity relationship model corresponding to the potential target protein comprises:
obtaining active molecules of potential target proteins;
performing data preprocessing on the active molecules;
obtaining a plurality of molecular descriptors of the preprocessed active molecules;
constructing a single quantitative structure-activity relationship model corresponding to each molecular descriptor;
and carrying out integrated processing on the single quantitative structure-activity relationship model to obtain an integrated quantitative structure-activity relationship model corresponding to the potential target protein.
In one embodiment, the data preprocessing of the active molecules comprises at least one of:
screening active molecules according to a preset quantitative activity data type, and quantifying the activity data type;
removing undefined active molecules and active molecules with abnormal activity values in the active molecules;
carrying out molecular structure format standardization on the active molecules, and carrying out active molecule duplication removal according to format standardization results;
performing molecular skeleton analysis on the active molecules, and removing redundant active molecules with the same skeleton;
and performing any one of metal ion removal, deprotonation and hydrogen addition on the active molecules.
In one embodiment, after screening the potential target proteins for a key target protein matching the target molecule according to the binding activity probability, the method further comprises:
identifying a target protein of interest with a three-dimensional structure among the potential target proteins;
obtaining a target specificity re-scoring function model corresponding to the three-dimensional structure of the target protein;
inputting the target molecule into the target-specific re-scoring function model to obtain the binding activity probability of the target molecule and the target protein of interest, and screening out verification target proteins whose binding activity probability meets a preset probability requirement;
and when the key target proteins include each of the verification target proteins, feeding back a verification result indicating that the key target proteins have passed verification.
In one embodiment, obtaining a target-specific re-scoring function model corresponding to the three-dimensional structure of the target protein of interest comprises:
performing molecular docking treatment on a target molecule and a target protein according to the three-dimensional structure of the target protein;
re-scoring is carried out according to the optimal docking conformation generated by docking the target protein and the corresponding binding small molecules to obtain energy terms of a plurality of scoring functions;
and combining the energy terms of the scoring functions to obtain a target specificity re-scoring function model.
An apparatus for screening molecular target proteins, comprising:
the target protein prediction module is used for obtaining target molecules and performing target prediction on the target molecules by adopting different target prediction modes to obtain predicted target proteins;
the potential target protein screening module is used for screening the predicted target protein according to the number of prediction modes corresponding to the predicted target protein to obtain the potential target protein corresponding to the target molecule;
the model acquisition module is used for acquiring a quantitative structure-activity relationship model corresponding to the potential target protein;
the binding activity probability analysis module is used for inputting the target molecules into the quantitative structure-activity relationship model to obtain the binding activity probability of the target molecules and the potential target proteins;
and the key target protein screening module is used for screening key target proteins matched with the target molecules from the potential target proteins according to the binding activity probability and a preset probability threshold.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
obtaining target molecules, and performing target prediction on the target molecules by adopting different target prediction modes to obtain predicted target proteins;
screening the predicted target protein according to the number of prediction modes corresponding to the predicted target protein to obtain a potential target protein corresponding to the target molecule;
obtaining a quantitative structure-activity relationship model corresponding to the potential target protein;
inputting the target molecule into a quantitative structure-activity relationship model to obtain the binding activity probability of the target molecule and the potential target protein;
and screening out key target proteins matched with the target molecules from the potential target proteins according to the binding activity probability and a preset probability threshold.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
obtaining target molecules, and performing target prediction on the target molecules by adopting different target prediction modes to obtain predicted target proteins;
screening the predicted target protein according to the number of prediction modes corresponding to the predicted target protein to obtain a potential target protein corresponding to the target molecule;
obtaining a quantitative structure-activity relationship model corresponding to the potential target protein;
inputting the target molecule into a quantitative structure-activity relationship model to obtain the binding activity probability of the target molecule and the potential target protein;
and screening out key target proteins matched with the target molecules from the potential target proteins according to the binding activity probability and a preset probability threshold.
In the above method, apparatus, computer device and storage medium for screening molecular target proteins, target prediction is performed on the target molecule using different target prediction modes, yielding predicted target proteins from several different data sources. The predicted target proteins are then screened according to the number of prediction modes that predicted each of them, giving the potential target proteins corresponding to the target molecule. The binding activity probability of the target molecule and each potential target protein is analyzed with the quantitative structure-activity relationship model corresponding to that protein, and key target proteins matched with the target molecule are screened from the potential target proteins based on the binding activity probability and a preset probability threshold. Combining several different target prediction modes avoids the large prediction bias that arises when target prediction relies too heavily on a single method, and the additional binding activity analysis by the quantitative structure-activity relationship model makes the screening of target proteins more accurate and reliable.
Drawings
FIG. 1 is a diagram illustrating an environment in which the method for screening a molecular target protein is applied in one embodiment;
FIG. 2 is a schematic flow chart showing a method for screening a molecular target protein according to one embodiment;
FIG. 3 is a schematic flow chart showing a method for screening a molecular target protein according to another embodiment;
FIG. 4 is a flowchart of a method for screening a molecular target protein according to an embodiment;
FIG. 5 is a block diagram showing the structure of a screening apparatus for a molecular target protein according to an embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The method for screening molecular target proteins provided by the application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The server 104 obtains the target molecule uploaded by the terminal 102 and performs target prediction on it using different target prediction modes to obtain predicted target proteins. It screens the predicted target proteins according to the number of prediction modes corresponding to each of them to obtain the potential target proteins, obtains the quantitative structure-activity relationship models corresponding to the potential target proteins, and inputs the target molecule into these models to obtain the binding activity probability of the target molecule and each potential target protein. It then screens out the key target proteins matched with the target molecule according to the binding activity probability and a preset probability threshold, and feeds the screening result back to the terminal 102. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a method for screening a molecular target protein is provided, which is exemplified by the application of the method to the server in fig. 1, and includes the following steps 202 to 210.
Step 202, obtaining target molecules, and performing target prediction on the target molecules by adopting different target prediction modes to obtain predicted target proteins.
The target molecule is the object for which target proteins are to be predicted, for example a drug molecule whose target protein needs to be determined. The target protein is the target on which the molecule acts. Taking a drug molecule as an example, determining the drug target gives a more thorough explanation of the drug's clinical application. Determining drug targets is particularly important for natural products from traditional Chinese medicine and supports the global development of traditional Chinese medicine. For example, the traditional Chinese medicine derived from toad has an anti-tumor effect, which was later attributed to its main component, bufalin, inhibiting Na+/K+-ATPase. Determining a drug's off-targets helps guide structural modification of the drug and optimize its selectivity, leaving more room for drug development, and discovering new targets of a drug facilitates drug repurposing.
There are many approaches to target prediction, which fall mainly into three categories: receptor-based, ligand-structure-based, and data-mining-based.
The receptor-based method generally refers to molecular docking, whose theoretical basis is the 'lock-and-key principle': the interaction between a ligand molecule and a macromolecular biological target is simulated by docking, and the better-scoring candidates are taken as potential targets of the query molecule according to their docking scores and binding modes. Molecular docking is suitable for exploring the binding affinity between a small-molecule compound and a target protein with a known three-dimensional structure.
Ligand-structure-based methods infer targets from the similarity between ligands, for example by chemical similarity searching. The rationale is that compounds that are similar in structure or physicochemical properties may act on the same target. The query molecule is therefore compared, in structure and physicochemical properties, with small-molecule compounds of known targets; the target proteins are then ranked by similarity score, and the top-ranked targets are the potential targets of the query molecule.
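As an illustration of the ligand-similarity idea, the sketch below ranks targets by the highest Tanimoto similarity between a query fingerprint and the fingerprints of each target's known ligands. Fingerprints are represented simply as sets of "on" bit indices; the scoring rule (maximum similarity per target), the function names and the toy data are assumptions for illustration and do not reproduce the method of any specific platform.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_targets(
    query_fp: Set[int],
    known_ligands: Dict[str, List[Set[int]]],
) -> List[Tuple[str, float]]:
    """Score each target by its most similar known ligand, best first."""
    scores = {
        target: max((tanimoto(query_fp, fp) for fp in fps), default=0.0)
        for target, fps in known_ligands.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy fingerprints (hypothetical bit indices, not real molecules).
query = {1, 2, 3, 4}
ligands = {
    "TARGET_A": [{1, 2, 3, 5}, {7, 8}],
    "TARGET_B": [{1, 9}],
}
ranking = rank_targets(query, ligands)
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written sets.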
Target prediction based on data mining mainly uses machine learning algorithms, which predict relevant targets by mining the activity-related information hidden in the large body of existing bioactivity data. From the machine learning perspective, target prediction is a typical classification problem, and a target classified as positive is a potential target.
In the embodiment, the adopted target prediction modes comprise a plurality of ligand-based target prediction modes, and target prediction is performed on the target molecules through a plurality of different target prediction modes to obtain target protein prediction results corresponding to the modes.
Step 204, screening the predicted target proteins according to the number of prediction modes corresponding to each predicted target protein to obtain the potential target proteins corresponding to the target molecule.
Each predicted target protein can be obtained by predicting through one or more prediction modes, and the predicted target proteins meeting the preset quantity requirements are screened as the potential target proteins corresponding to the target molecules according to the quantity of the prediction modes corresponding to the predicted target proteins.
In one embodiment, performing target prediction on the target molecule using different target prediction modes to obtain predicted target proteins comprises: inputting the target molecule into each of a plurality of ligand-based target prediction platforms to obtain the predicted target proteins corresponding to each platform.
Screening the predicted target protein according to the number of prediction modes corresponding to the predicted target protein to obtain the potential target protein corresponding to the target molecule, wherein the screening comprises the following steps: respectively obtaining the number of target prediction platforms corresponding to the same prediction target protein, screening the prediction target proteins with the number of the corresponding target prediction platforms not less than 2, and marking the screened prediction target proteins as potential target proteins corresponding to target molecules.
In one embodiment, the target space is reduced quickly and efficiently to obtain the potential target proteins. First, four ligand-based target prediction platforms, SEA, PPB2, SwissTargetPrediction and HitPickV2, are used as the first step of the target screening strategy to perform target prediction on the query molecule and obtain target prediction results. The four platforms each use a different prediction method. SEA is an online target prediction tool based on two-dimensional molecular fingerprints (ECFP4); it calculates similarity scores between the query molecule and the ligands of known targets, mainly from ligand topology, and ranks them with a statistical model to obtain the predicted targets of the query molecule. SwissTargetPrediction calculates the similarity of the query molecule to known target ligands using 2D and 3D similarity measures (FP2 and ElectroShape, respectively) to obtain the final score for each predicted target. HitPickV2 uses the k-nearest neighbor (k-NN) method to search for compounds structurally similar to the query molecule for target prediction. PPB2 offers combinations of molecular fingerprints (ECFP4, MQN, Xfp) and prediction methods (k-NN, NB), among which the combination of k-NN and NB based on ECFP4 fingerprints has the higher prediction accuracy. The top N predicted target proteins from each platform (N is a natural number, for example 20) are then selected, collated and pooled as the predicted target proteins of the query molecule, and finally the target proteins predicted simultaneously by at least two of the ligand-based platforms are retained as potential targets for in-depth screening.
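The top-N selection and the "at least two platforms" rule can be sketched as a simple vote count. The sketch assumes each platform's output has already been retrieved as a ranked list of target identifiers; the function name `consensus_targets` and the placeholder identifiers `P1` to `P5` are hypothetical, and no platform API is called.

```python
from collections import Counter
from typing import Dict, List


def consensus_targets(
    ranked: Dict[str, List[str]],
    top_n: int = 20,
    min_platforms: int = 2,
) -> List[str]:
    """Keep targets found in the top-N list of at least `min_platforms` platforms."""
    votes = Counter()
    for targets in ranked.values():
        # set() ensures a platform contributes at most one vote per target
        votes.update(set(targets[:top_n]))
    return sorted(t for t, n in votes.items() if n >= min_platforms)


# Ranked outputs per platform (placeholder target identifiers).
predictions = {
    "SEA": ["P1", "P2", "P3"],
    "PPB2": ["P2", "P4"],
    "SwissTargetPrediction": ["P3", "P2"],
    "HitPickV2": ["P5"],
}
potential = consensus_targets(predictions, top_n=20)
```

Lowering `top_n` tightens the filter, since a target must then rank near the top on at least two platforms to survive.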
In one embodiment, after obtaining a target molecule and performing target prediction on the target molecule by using different target prediction modes to obtain a predicted target protein, the method further includes:
The predicted target proteins are imported into a preset functional annotation tool. GO and KEGG enrichment analysis is performed on the predicted target proteins with the functional annotation tool to obtain enriched GO biological processes and KEGG pathways. The reliability of the predicted target proteins is verified according to the enriched GO biological processes and KEGG pathways. The predicted target proteins obtained directly by prediction are then updated according to those that pass verification.
To verify the accuracy and reliability of the predicted target proteins, the DAVID 6.8 database is used to perform function and pathway enrichment analysis on them. DAVID is a functional annotation tool for high-throughput biological data that aims to provide comprehensive, systematic biological function annotation in order to extract the biological meaning of a target protein list.
GO enrichment refers to enrichment analysis of the gene ontology functions of the predicted target proteins, and KEGG enrichment refers to enrichment analysis of pathways. Specifically, the gene ontology comprises three categories: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF).
After enrichment analysis, the predicted target proteins are assigned to their corresponding GO biological processes and KEGG pathways. Each GO biological process and KEGG pathway with P < 0.05 is regarded as having passed reliability verification, and the target proteins enriched in it are the predicted target proteins that pass verification.
Specifically, the server imports the obtained predicted target proteins into the functional annotation tool as a target protein list and performs Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. The species is set to human (Homo sapiens); the three GO categories, Biological Process (BP), Cellular Component (CC) and Molecular Function (MF), are selected for GO enrichment analysis; KEGG pathways are then selected for pathway enrichment analysis, and enriched terms are selected using P < 0.05 as the threshold. The terms are ranked by the number of enriched target proteins, the reliability of the predicted targets is verified using the top-ranked GO biological processes and KEGG pathways, and the predicted target proteins that pass verification are used as the targets of subsequent study.
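The post-processing of the enrichment results (P < 0.05 cutoff, ranking by the number of enriched proteins, collecting the verified proteins) can be sketched as below. The record layout with `term`, `p_value` and `proteins` fields, the `top_k` parameter and the placeholder term names are assumptions for illustration; they do not reflect the actual export format of the annotation tool.

```python
from typing import Dict, List, Set


def verified_targets(
    terms: List[Dict],
    p_cutoff: float = 0.05,
    top_k: int = 10,
) -> Set[str]:
    """Collect proteins enriched in the top-ranked significant terms."""
    # Keep only terms that pass the significance threshold.
    significant = [t for t in terms if t["p_value"] < p_cutoff]
    # Rank by the number of enriched target proteins, largest first.
    significant.sort(key=lambda t: len(t["proteins"]), reverse=True)
    verified: Set[str] = set()
    for term in significant[:top_k]:
        verified.update(term["proteins"])
    return verified


# Placeholder enrichment records (hypothetical terms and values).
terms = [
    {"term": "GO_TERM_1", "p_value": 0.001, "proteins": ["P1", "P2", "P3"]},
    {"term": "KEGG_PATHWAY_1", "p_value": 0.03, "proteins": ["P2", "P4"]},
    {"term": "GO_TERM_2", "p_value": 0.20, "proteins": ["P5"]},
]
verified = verified_targets(terms)
```

Proteins that appear only in non-significant terms, such as `P5` here, are dropped from the verified set.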
Step 206, obtaining a quantitative structure-activity relationship model corresponding to the potential target protein.
A quantitative structure-activity relationship (QSAR) model is a mathematical model that describes the relationship between the molecular structure of a compound and one of its biological activities. Its basic assumption is that the molecular structure of a compound contains the information that determines its physical, chemical and biological properties, and that these physicochemical properties in turn determine the compound's biological activity. The molecular structure and property data of a compound should therefore correlate to some extent with its biological activity.
In one embodiment, the quantitative structure-activity relationship model comprises an integrated quantitative structure-activity relationship model.
Obtaining an integrated quantitative structure-activity relationship model corresponding to the potential target protein comprises: obtaining active molecules of the potential target protein; performing data preprocessing on the active molecules; constructing, from the multiple molecular descriptors of the preprocessed active molecules, a single quantitative structure-activity relationship model for each molecular descriptor; and integrating the single quantitative structure-activity relationship models to obtain the integrated quantitative structure-activity relationship model corresponding to the potential target protein.
The integrated quantitative structure-activity relationship model is a model obtained by integrating a plurality of single quantitative structure-activity relationship models.
In the embodiment, for each potential target protein, active molecules with different types of quantitative activity data (e.g., Ki values, IC50 values) are collected from the ChEMBL and BindingDB (The Binding Database) databases.
The number of active molecules of the target protein is large, and reasonable and effective active molecules can be screened out by preprocessing the active molecules of the target protein, so that data optimization is realized, and data interference is reduced.
In one embodiment, the data preprocessing of the active molecules comprises at least one of the following:
The first item: screening the active molecules according to preset quantitative activity data types.
The second item: removing undefined active molecules and active molecules with abnormal activity values.
The third item: standardizing the molecular structure format of the active molecules, and removing duplicate active molecules according to the standardized result.
The fourth item: performing molecular scaffold analysis on the active molecules, and removing redundant active molecules that share the same scaffold.
The fifth item: performing any one of metal ion removal, deprotonation and hydrogen addition on the active molecules.
It should be noted that the above preprocessing items may be executed in sequence, or one item or several items (two or more) may be selected for execution.
In one embodiment, the items are performed in sequence. The preprocessing specifically comprises the following steps:
(1) screening the active molecules of each target protein, and retaining only active molecules with a half-maximal inhibitory concentration (IC50), a half-maximal effective concentration (EC50) or an inhibition constant (Ki);
(2) removing active molecules with undefined structures and active molecules with no defined correspondence to the target;
(3) removing active molecules with undefined activity values or undefined activity value units;
(4) if the same active molecule has activity values of different types and these values differ by less than 30%, taking their arithmetic mean as the activity value; otherwise removing the molecule;
(5) uniformly converting molecular structures in multiple formats (e.g., mol and sdf) into the canonical SMILES format using OpenBabel (version 2.4.1), and using the generated SMILES to calculate the InChIKey in order to remove duplicate active molecules;
(6) in order to ensure the structural diversity of the active molecules and to avoid large prediction bias caused by an excessive proportion of a single scaffold, performing molecular scaffold analysis on the active molecules of each target protein using molecular modeling software (e.g., Molecular Operating Environment, MOE, 2018 edition), and randomly removing part of the redundant active molecules that share the same scaffold;
(7) preprocessing all active molecules using the wash module of the MOE software, including removal of metal ions, deprotonation, hydrogen addition, and the like.
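Steps (1) to (5) above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the record layout (`key`, `type`, `value_nm`) is hypothetical, `key` stands in for an InChIKey already computed by an external tool, all values are assumed to be expressed in one unit, and the "difference" in step (4) is assumed to mean the relative spread (max - min) / max. Steps (6) and (7) depend on MOE and are not covered.

```python
from statistics import mean

KEPT_TYPES = {"IC50", "EC50", "Ki"}  # activity types retained in step (1)


def preprocess(records, max_rel_diff=0.30):
    """Return {structure key: activity value} after steps (1)-(5).

    records: iterable of dicts with the hypothetical keys
             'key' (structure identifier), 'type', 'value_nm'.
    """
    by_key = {}
    for r in records:
        if r.get("type") not in KEPT_TYPES:          # step (1)
            continue
        if not r.get("key"):                         # step (2): undefined structure
            continue
        value = r.get("value_nm")
        if value is None or value <= 0:              # step (3): undefined value
            continue
        by_key.setdefault(r["key"], []).append(value)  # step (5): group duplicates

    cleaned = {}
    for key, values in by_key.items():
        lo, hi = min(values), max(values)
        # Step (4), assumed definition: relative spread below 30% -> average.
        if (hi - lo) / hi < max_rel_diff:
            cleaned[key] = mean(values)
    return cleaned
```

A molecule with a single measurement has zero spread and is therefore kept unchanged.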
The molecular descriptors are data describing the structure and various physicochemical properties of small-molecule compounds such as the active molecules. In the embodiment, three kinds of molecular descriptors are mainly used, namely CATS, MACCS and MOE2D, each of which describes the shape and physicochemical properties of the binding small molecule from a different aspect. Specifically, CATS (chemically advanced template search) is a topological 2D pharmacophore descriptor containing a total of 210 features built on six atom types, namely hydrogen bond donor (D), hydrogen bond acceptor (A), hydrophobic group (H), aromatic group (R), positive (+) and negative (-). MACCS uses an MDL key dictionary containing 166 common substructure features. MOE2D contains a total of 206 features, including 20 physical descriptors, 18 subdivided molecular surface area descriptors, 14 Hückel theory descriptors, 16 Kier & Hall connectivity and Kappa shape index descriptors, 42 atom count and bond count descriptors, 13 pharmacophore feature descriptors, 33 adjacency and distance matrix descriptors, and 50 partial charge descriptors.
Each type of molecular descriptor corresponds to a quantitative structure-activity relationship model.
In the embodiment, the quantitative structure-activity relationship model is constructed as follows: for each target protein, an activity value threshold is obtained (taking 10 μM as an example), and the data set of the corresponding active molecules is divided into a positive set and a negative set. Based on the 3 different descriptors (CATS, MACCS and MOE2D), 3 corresponding QSAR classification models are constructed using a machine learning algorithm, model performance is evaluated by cross validation, and the modeling process is repeated 50 times.
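The positive/negative split can be sketched as below. The function name `split_by_activity` is hypothetical, and two points are assumptions because the text does not state them: molecules at or below the threshold are treated as positives (a lower IC50/EC50/Ki meaning stronger activity), and the boundary value itself is counted as positive.

```python
def split_by_activity(activities_um, threshold_um=10.0):
    """Split molecules into positive and negative sets by activity value.

    activities_um: {molecule id: activity value in micromolar}
    Assumption: value <= threshold -> positive (active), otherwise negative.
    """
    positives = {m for m, v in activities_um.items() if v <= threshold_um}
    negatives = set(activities_um) - positives
    return positives, negatives


# Made-up values, for illustration only.
pos, neg = split_by_activity({"mol_a": 0.5, "mol_b": 10.0, "mol_c": 50.0})
```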
In order to obtain a more robust and reliable quantitative structure-activity relationship model, for each target protein the single classification prediction models based on the different descriptors are integrated, and an integrated QSAR model with averaged output is constructed to reduce the variability and volatility of the model and to improve its generalization capability. The robustness of each QSAR model is evaluated using evaluation indexes including the area under the receiver operating characteristic curve (AUC), the accuracy (ACC), the sensitivity (SE) and the specificity (SP).
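For reference, the four evaluation indexes can be computed from true labels and predicted probabilities as in the following sketch. The function name and the 0.5 decision threshold are assumptions; AUC is computed here from its rank interpretation (the probability that a randomly chosen positive scores above a randomly chosen negative, ties counting one half). The function expects at least one positive and one negative sample.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Return ACC, SE, SP and AUC for binary labels (1 = positive, 0 = negative)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    pos_scores = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg_scores = [p for t, p in zip(y_true, y_prob) if t == 0]
    # Pairwise comparison of every positive score with every negative score.
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in pos_scores
        for b in neg_scores
    )
    return {
        "ACC": (tp + tn) / len(y_true),
        "SE": tp / (tp + fn),   # sensitivity: recall on the positive set
        "SP": tn / (tn + fp),   # specificity: recall on the negative set
        "AUC": wins / (len(pos_scores) * len(neg_scores)),
    }


# Made-up labels and probabilities, for illustration only.
metrics = classification_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```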
In one example, three molecular descriptors, CATS, MACCS and MOE2D, are calculated for each active molecule to describe its shape and physicochemical properties from different aspects. For each target protein, a single QSAR model is constructed for each of the 3 descriptors using the random forest (RF) algorithm, model performance is evaluated by five-fold cross validation, and AUC, ACC, SE and SP are used as evaluation indexes. In machine learning, averaging the results of a plurality of QSAR models yields an integrated QSAR model that reduces the variability and volatility of the model and improves its prediction capability; compared with a single model, the integrated model captures the latent relationship between the structure of the binding small molecule and its biological activity value more efficiently. Accordingly, for each target protein, the results of the single prediction models constructed from the 3 different descriptors are averaged to construct an integrated QSAR model with averaged output. Compared with a single prediction model, the integrated QSAR model has lower variability and volatility and better generalization capability, is more stable, and gives more accurate and reliable target predictions.
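The averaging step can be illustrated with a short Python sketch. It assumes each single-descriptor model is available as a callable returning a probability; the stand-in lambdas below replace trained RF classifiers and return made-up values, and the name `ensemble_probability` is hypothetical.

```python
from statistics import mean


def ensemble_probability(models, features):
    """Average the probabilities of the single-descriptor models.

    models:   {descriptor name: callable(feature vector) -> probability}
    features: {descriptor name: feature vector of the query molecule}
    """
    return mean([models[name](features[name]) for name in models])


# Stand-ins for three trained classifiers (made-up outputs, illustration only).
models = {
    "CATS": lambda x: 0.8,
    "MACCS": lambda x: 0.6,
    "MOE2D": lambda x: 0.7,
}
features = {"CATS": [0] * 210, "MACCS": [0] * 166, "MOE2D": [0.0] * 206}
probability = ensemble_probability(models, features)
```

An unweighted mean is used because the text describes an averaged output; a weighted scheme would be a different design.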
Step 208, inputting the target molecule into the quantitative structure-activity relationship model to obtain the binding activity probability of the target molecule and the potential target protein.
The target molecule is input into the quantitative structure-activity relationship model and analyzed by the model to obtain the binding activity probability of the target molecule and the potential target protein, where the binding activity probability takes values in the range [0, 1]. The larger the binding activity probability, the more likely it is that the potential target protein is a target of the target molecule, and vice versa.
Step 210, screening key target proteins matched with the target molecule from the potential target proteins according to the binding activity probability and a preset probability threshold.
The preset probability threshold is the condition for screening the potential target proteins. In an embodiment, the preset probability threshold may be 0.5. The quantitative structure-activity relationship model corresponding to each potential target protein outputs a binding activity probability; the binding activity probabilities not less than the preset probability threshold are selected, and the potential target proteins corresponding to the selected probabilities are the key target proteins matched with the target molecule.
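The threshold screening of step 210 reduces to a simple filter, sketched below. The function name and the sample probabilities are hypothetical; the comparison is "not less than", as stated above.

```python
def screen_key_targets(binding_probs, threshold=0.5):
    """Return targets whose binding activity probability is >= threshold,
    ordered from the highest probability to the lowest.

    binding_probs: {potential target protein: probability in [0, 1]}
    """
    ranked = sorted(binding_probs.items(), key=lambda kv: -kv[1])
    return [target for target, prob in ranked if prob >= threshold]


# Made-up probabilities, for illustration only.
key_targets = screen_key_targets({"T1": 0.9, "T2": 0.5, "T3": 0.3})
```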
In the above screening method of molecular target proteins, target prediction is performed on the target molecule using different target prediction modes, so that predicted target proteins from a plurality of different data sources are obtained. The predicted target proteins are screened according to the number of prediction modes corresponding to each predicted target protein to obtain the potential target proteins corresponding to the target molecule, which achieves an optimized screening of the target proteins. The binding activity probability of the target molecule and each potential target protein is then analyzed based on the quantitative structure-activity relationship model corresponding to that potential target protein, and the key target proteins matched with the target molecule are screened from the potential target proteins based on the binding activity probability and a preset probability threshold. By combining a plurality of different target prediction modes, the method avoids the large prediction bias caused by excessive reliance on a single target prediction method; combined with the binding activity analysis of the quantitative structure-activity relationship model, it further achieves accurate and reliable screening of target proteins.
In one embodiment, as shown in fig. 3, after the key target proteins matching the target molecule are screened from the potential target proteins according to the binding activity probability, steps 302 to 308 are further included.
Step 302, identifying, among the potential target proteins, target proteins of interest that have three-dimensional structures.
Step 304, acquiring a target-specific re-scoring function model corresponding to the three-dimensional structure of the target protein of interest.
Step 306, inputting the target molecule into the target-specific re-scoring function model to obtain the binding activity probability of the target molecule and the target protein of interest, and screening out verification target proteins whose binding activity probability meets a preset probability requirement.
Step 308, feeding back a verification result indicating that the key target proteins pass verification when the key target proteins include every verification target protein.
In order to verify the screened potential target proteins, a target-specific re-scoring function model based on a machine learning method is used as an auxiliary method to study in depth the target proteins that have three-dimensional structures. First, the target proteins having three-dimensional structures are downloaded from the PDB (Protein Data Bank) protein database. Before molecular docking, the target protein is preprocessed and the docking parameters are set. For the data set of each target protein, positive and negative sets are divided using an activity value such as 10 μM as the threshold, and a target-specific re-scoring model is established using a machine learning algorithm. The model is validated by cross validation, with the commonly used ACC, AUC, SE and SP as evaluation indexes.
In particular, although the three-dimensional structures of many proteins have been resolved, the three-dimensional structures of a large number of target proteins remain unknown. Therefore, the molecular docking-based target prediction method is used here as an auxiliary screening method to verify the screened proteins that have three-dimensional structures. If the predictions of the scoring function model established from the proteins with available three-dimensional structures show the same trend as the predictions of the integrated QSAR model, the predictions of the integrated QSAR model are considered accurate and reliable.
In one embodiment, obtaining a target-specific re-scoring function model corresponding to the three-dimensional structure of the target protein of interest comprises: performing molecular docking of the target molecule and the target protein of interest according to the three-dimensional structure of the target protein of interest; re-scoring the optimal docking conformation generated by docking the target protein with its corresponding binding small molecules to obtain the energy terms of a plurality of scoring functions; and combining the energy terms of the scoring functions to obtain the target-specific re-scoring function model.
In a specific embodiment, MOE software (Molecular Operating Environment, an integrated software package for molecular simulation and drug design) is used to dock the target molecule with the target protein, and the optimal docking conformation generated by docking the screened target protein with its corresponding binding small molecules is re-scored so as to better describe the interaction between the target protein and the target molecule. This yields 21 energy terms from the five scoring functions GBVI/WSA dG, ASE, Affinity dG, Alpha HB and London dG. Among these, the GBVI/WSA dG scoring function is a force field-based evaluation method combining strain energy, desolvation energy, van der Waals interaction energy, and the like. The ASE scoring function measures affinity by evaluating ligand-receptor atom pair interactions, and the Affinity dG scoring function measures affinity by evaluating hydrophobic, ionic and hydrogen bond interactions. The Alpha HB scoring function measures affinity through shape fit and hydrogen bond interactions. The London dG scoring function is an evaluation method combining molecular strain energy, atomic desolvation energy, hydrogen bond energy, ionic bond energy, and the like. Each scoring function in molecular docking uses different physicochemical parameters to describe the binding affinity of the complex, and a single scoring function cannot adequately assess binding affinity. Fusing multiple scoring functions allows a more complete assessment of the binding affinity between a target protein and a binding small molecule than a single scoring function, so combining multiple scoring functions is an effective means of improving prediction accuracy.
In order to improve the prediction accuracy of the re-scoring model, 21 energy terms of the five scoring functions are combined, and a target-specific re-scoring function model is constructed based on an RF algorithm. In MOE software, molecular docking-based methods typically rely on predictive screening of the energy term S values in the default scoring function GBVI/WSA dG. Therefore, a prediction model based on the S value of the scoring function GBVI/WSA dG is constructed at the same time for performance comparison. From the evaluation index AUC of all models, the AUC values of the models based on the five scoring functions were significantly improved compared to the single prediction model. From the overall trend, the target specificity re-scoring function model based on the machine learning method is more robust and reliable, and is superior to a single model in predicting the binding affinity between a target molecule and a target protein.
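Before a classifier can be trained on the combined energy terms, the terms of the five scoring functions have to be laid out as one fixed-order feature vector per docking pose. The sketch below shows one way to do that. The function name, the input layout and the term names in the example ("S", "E_a") are hypothetical, since the text gives only the total of 21 terms and not the individual term names; the numeric values are made up.

```python
SCORING_FUNCTIONS = ("GBVI/WSA dG", "ASE", "Affinity dG", "Alpha HB", "London dG")


def build_feature_vector(pose_terms):
    """Flatten the energy terms of all scoring functions into one vector.

    pose_terms: {scoring function name: {energy term name: value}}
    Returns (feature names, feature values) in a deterministic order so that
    every pose yields the same column layout.
    """
    missing = [f for f in SCORING_FUNCTIONS if f not in pose_terms]
    if missing:
        raise ValueError(f"missing scoring functions: {missing}")
    names, values = [], []
    for function in SCORING_FUNCTIONS:
        for term in sorted(pose_terms[function]):
            names.append(f"{function}:{term}")
            values.append(float(pose_terms[function][term]))
    return names, values


# Made-up pose with hypothetical term names, for illustration only.
pose = {
    "GBVI/WSA dG": {"S": -7.1, "E_a": 1.0},
    "ASE": {"S": -20.0},
    "Affinity dG": {"S": -5.0},
    "Alpha HB": {"S": -60.0},
    "London dG": {"S": -9.0},
}
names, values = build_feature_vector(pose)
```

The resulting vectors, labeled with the positive and negative sets, would then be passed to the RF algorithm mentioned above; the baseline model for comparison would use only the S value of GBVI/WSA dG.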
In a specific application example, the target molecule is neferine (Nef); its targets are identified in order to reveal the pharmacological mechanism of this active ingredient. Lotus (Nelumbo nucifera) is an aquatic herb of the genus Nelumbo of the family Nymphaeaceae, and lotus plumule (Plumula Nelumbinis) is the dried young leaves and embryo of its mature seed. Lotus plumule is particularly effective at enhancing the anti-tumor effect of anti-tumor drugs when used in combination with them. Lotus plumule contains a plurality of pharmacologically active ingredients such as flavonoids and alkaloids, of which the alkaloids are the most studied; neferine (methyl liensinine) accounts for a large proportion of the alkaloids and is the main active ingredient through which lotus plumule exerts its pharmacological action. Nef is a bisbenzylisoquinoline alkaloid, and research shows that bisbenzylisoquinoline alkaloids have various pharmacological activities, including reversal of multidrug resistance and anti-arrhythmic, sedative-hypnotic, antithrombotic, antifibrotic and antihypertensive effects. Among them, the effect of Nef against multidrug resistance is significant. Domestic and foreign studies indicate that multidrug resistance is the leading cause of chemotherapy failure, and in recent years the incidence and mortality of tumors have risen every year. Therefore, as a representative active component of the traditional Chinese medicine lotus plumule, Nef has good application prospects for enhancing the anti-tumor effect of chemotherapeutic drugs in combination therapy, and has attracted wide attention worldwide. The process of screening for protein targets is as follows:
Referring to fig. 4, the present application provides a layer-by-layer target screening strategy comprising three main steps. The first step is prediction and preliminary screening using a plurality of ligand-based target prediction tools. The second step is activity prediction and screening based on a consensus QSAR model, which is an integrated model combining three molecular descriptors. The third step is verification of the screening result of the consensus QSAR model based on a target-specific re-scoring function model, which combines the five scoring functions GBVI/WSA dG, ASE, Affinity dG, Alpha HB and London dG. Further, the key targets can be mapped to their corresponding biological functions by bioinformatics analysis, and a Nef-target-pathway network is drawn using Cytoscape 3.5.1 software. The pathways are the KEGG pathways in which the 6 screened target proteins were involved in the preceding enrichment analysis, and the network is used to further explore the correlation between the key target proteins and the pharmacological activity of Nef.
Specifically, as the first step of the combined target screening strategy, Nef is input separately into the ligand-based prediction platforms SEA, SwissTargetPrediction, HitPickV2 and PPB2, in order to narrow the target space of Nef quickly and efficiently and obtain its potential targets. The list of predicted targets returned by each prediction platform is first given a preliminary treatment according to the screening reference rules of that platform. After all targets have been annotated in detail, the 20 top-ranked predicted targets (or all targets, where a platform returns fewer than 20) are selected from the results of each prediction platform and pooled as the predicted targets of Nef. This finally gives 47 predicted target proteins of Nef, as shown in Table 1.
Table 1. 47 predicted target proteins and genes
Figure BDA0002598241890000151
Figure BDA0002598241890000161
In order to verify the accuracy and reliability of the target protein prediction, all biological processes and KEGG pathways in which the target proteins may participate are obtained through GO and KEGG enrichment analysis, and the KEGG pathways are screened with P < 0.05 as the threshold. For an enriched GO biological process or KEGG pathway, the more predicted targets participate in it, the more closely it is related to the pharmacological activity of Nef. The enriched GO biological processes and KEGG pathways are first ranked by the number of enriched target proteins, the top-ranked GO biological processes and KEGG pathways are selected, and verification is performed by checking whether the functions they mediate concentrate on the main pharmacological activities of Nef. The results show that the enriched biological processes and pathways are closely related to the main pharmacological activities of Nef, such as reversal of multidrug resistance and anti-arrhythmic, sedative-hypnotic, antithrombotic, antifibrotic and antihypertensive effects, which indicates that the target space obtained from the ligand-based target prediction platforms is reliable and accurate. Finally, in order to obtain the key functional targets of Nef, the target proteins predicted simultaneously by at least two prediction platforms are selected as potential target proteins for in-depth screening, and 14 target proteins meeting this condition are obtained.
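The pooling of top-ranked predictions and the "at least two platforms" rule can be sketched as follows. The function name is hypothetical and the target identifiers in the example are placeholders, not actual prediction results; each platform is assumed to return its targets already ranked from best to worst.

```python
from collections import Counter


def consensus_targets(platform_predictions, top_n=20, min_platforms=2):
    """Pool the top_n targets of each platform and keep the consensus ones.

    platform_predictions: {platform name: ranked list of predicted targets}
    Returns (pooled targets, targets predicted by >= min_platforms platforms).
    """
    counts = Counter()
    for ranked_targets in platform_predictions.values():
        # set() so that a target listed twice by one platform counts once.
        counts.update(set(ranked_targets[:top_n]))
    pooled = sorted(counts)
    potential = sorted(t for t, n in counts.items() if n >= min_platforms)
    return pooled, potential


# Placeholder target names, for illustration only.
predictions = {
    "SEA": ["A", "B", "C"],
    "SwissTargetPrediction": ["B", "D"],
    "HitPickV2": ["C", "B"],
    "PPB2": ["E"],
}
pooled, potential = consensus_targets(predictions)
```

Slicing with `[:top_n]` also covers the case where a platform returns fewer than 20 targets, since the whole list is then used.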
In order to study in a targeted manner the 14 target proteins obtained from the ligand-based target prediction platforms, the key targets of Nef are further predicted and screened using the integrated QSAR model method. The bioactivity data of the binding small molecules of the 14 screened target proteins are first downloaded from the ChEMBL and BindingDB databases, and data preprocessing is performed. For each binding small molecule, three molecular descriptors, CATS, MACCS and MOE2D, are calculated to describe its shape and physicochemical properties from different aspects. For each target protein, a single QSAR classification prediction model is constructed for each of the 3 descriptors using the RF algorithm, model performance is evaluated by five-fold cross validation, and AUC, ACC, SE and SP are used as evaluation indexes. The experimental results show that the values of the evaluation indexes are essentially all above 0.8, which indicates that the classification models constructed from a single descriptor have high prediction accuracy and reliability. In machine learning, averaging the results of a plurality of QSAR models yields an integrated QSAR model that reduces the variability and volatility of the model and improves its prediction capability; compared with a single model, the integrated model captures the latent relationship between the structure of the binding small molecule and its biological activity value more efficiently. Therefore, in order to reduce the variability and volatility of the model and improve its generalization capability, for each target protein the results of the single prediction models constructed from the 3 different descriptors are averaged to construct an integrated QSAR model with averaged output. Compared with a single prediction model, the overall volatility and variability are clearly reduced, the integrated QSAR model is more stable, and the target prediction accuracy and reliability are higher.
The 14 potential target proteins of Nef are studied in depth using the integrated QSAR model method in order to predict and screen the key targets through which Nef exerts its pharmacological activity. The prediction results of the integrated QSAR model are output as probability values, and a probability value of 0.5 is used as the screening threshold: if the predicted probability value of a target is greater than 0.5, the target protein is a potential target of Nef; otherwise the target protein is removed. In the prediction results for the potential target proteins of Nef, 6 of the 14 target proteins have a predicted probability value greater than 0.5, namely orexin type 2 receptor (hypocretin receptor 2, HCRTR2), ABCB1, β-1 adrenergic receptor (adrenoceptor beta 1, ADRB1), fibroblast growth factor 2 (FGF2), tissue factor (coagulation factor III, F3), and phosphodiesterase 3A (PDE3A). These target proteins are further verified as the key target proteins through which Nef exerts its pharmacological activity.
In the verification process, a target-specific re-scoring function model is used as an auxiliary method to verify the predicted key targets. Although the three-dimensional structures of many proteins have been resolved, the three-dimensional structures of a large number of target proteins remain unknown. Therefore, the molecular docking-based target prediction method is used here as an auxiliary screening method to verify the screened proteins that have three-dimensional structures. If the predictions of the scoring function model established from the proteins with available three-dimensional structures show the same trend as the predictions of the integrated QSAR model, the predictions of the integrated QSAR model are considered accurate and reliable.
Of the 14 target proteins screened using the ligand-based target prediction platforms, 7 have three-dimensional structures, which are downloaded from the PDB protein database. They are orexin type 1 receptor (hypocretin receptor 1, HCRTR1, PDB ID: 4ZJ8), HCRTR2 (PDB ID: 4S0V), solute carrier family 2 (facilitated glucose transporter) member 1 (SLC2A1, PDB ID: 5EQG), acetylcholinesterase (ACHE, PDB ID: 4M0E), dipeptidyl peptidase 4 (DPP4, PDB ID: 4A5S), dopamine receptor D3 (DRD3, PDB ID: 3PBL) and phosphodiesterase 3B (PDE3B). MOE software is then used to dock Nef with each target protein, and the optimal docking conformation generated by docking the screened target protein with its corresponding binding small molecules is re-scored so as to better describe the interaction between the target protein and Nef. Finally, 21 energy terms of the five scoring functions GBVI/WSA dG, ASE, Affinity dG, Alpha HB and London dG are obtained. Each scoring function in molecular docking uses different physicochemical parameters to describe the binding affinity of the complex, and a single scoring function cannot adequately assess binding affinity. Fusing multiple scoring functions allows a more complete assessment of the binding affinity between a target protein and a binding small molecule than a single scoring function, so combining multiple scoring functions is an effective means of improving prediction accuracy.
In order to improve the prediction accuracy of the re-scoring model, the 21 energy terms of the five scoring functions are combined, and a target-specific re-scoring function model is constructed based on the RF algorithm. In MOE software, molecular docking-based methods typically rely on the S value of the default scoring function GBVI/WSA dG for predictive screening. Therefore, a prediction model based only on the S value of the scoring function GBVI/WSA dG is constructed at the same time for performance comparison. Across all models, the AUC of the models based on the five scoring functions is markedly higher than that of the single prediction model. From the overall trend, the target-specific re-scoring function model based on the machine learning method is more robust and reliable, and is superior to a single model in predicting the binding affinity between Nef and a target protein.
Referring to Table 2, the re-scoring function model based on the five scoring functions is used to predict the targets of Nef that have known three-dimensional structures. The prediction results are output as probability values, and a probability value of 0.5 is again used as the screening threshold. HCRTR2, ABCB1, ADRB1, FGF2, F3 and PDE3A are the potential targets screened by the integrated QSAR model method, and their predicted probability values are all greater than 0.5. Of these, HCRTR2 has a three-dimensional protein structure and can be subjected to molecular docking; its predicted probability value from the target-specific re-scoring function model is 0.62, which also indicates that HCRTR2 is a potential target of Nef. The predicted probability values of the remaining 6 proteins with three-dimensional structures, HCRTR1, SLC2A1, ACHE, DPP4, DRD3 and PDE3B, are all less than 0.5 under the re-scoring function model, in agreement with the integrated QSAR model. The re-scoring function model and the integrated QSAR model therefore show the same prediction trend for the proteins with available three-dimensional structures, which supports the conclusion that the predictions of the integrated QSAR model are accurate and reliable. After verification with the re-scoring function model, HCRTR2, ABCB1, ADRB1, FGF2, F3 and PDE3A are taken as the key targets through which Nef exerts its pharmacological activity.
Table 2. 14 potential target proteins of Nef
Figure BDA0002598241890000191
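The "same prediction trend" check and the pass condition of step 308 can be written down as a short Python sketch. Only the above/below-0.5 calls stated in the text are encoded, as booleans; no probability values other than those given in the text are assumed. The function name `compare_calls` is hypothetical.

```python
# Targets with integrated-QSAR probability > 0.5, as listed in the text.
qsar_positive = {"HCRTR2", "ABCB1", "ADRB1", "FGF2", "F3", "PDE3A"}

# Re-scoring model calls for the 7 proteins with three-dimensional structures:
# True means probability > 0.5 (HCRTR2: 0.62), False means < 0.5.
rescoring_call = {
    "HCRTR2": True,
    "HCRTR1": False,
    "SLC2A1": False,
    "ACHE": False,
    "DPP4": False,
    "DRD3": False,
    "PDE3B": False,
}


def compare_calls(qsar_positive, rescoring_call):
    """Compare the two models on the structurally characterized targets.

    Returns (per-target agreement, whether verification passes), where
    verification passes if every target verified by re-scoring is also
    a key target of the integrated QSAR model (step 308).
    """
    agreement = {
        target: (target in qsar_positive) == call
        for target, call in rescoring_call.items()
    }
    verified = {target for target, call in rescoring_call.items() if call}
    return agreement, verified <= qsar_positive


agreement, passed = compare_calls(qsar_positive, rescoring_call)
```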
In other embodiments, after the key target proteins are obtained through analysis, the reliability of the screened key targets may be further verified. For example, a surface plasmon resonance (SPR) experiment may be used to verify the binding affinity of the query molecule and the potential target protein; SPR has the advantages of fast response, label-free detection and high sensitivity. Further experimental studies may also be carried out on the query molecule and its predicted targets, for example cell experiments, Western blot experiments, CCK-8 experiments and immunohistochemical experiments, with the experimental scheme selected according to the characteristics of the screened protein, so as to verify the reliability of the screened key targets.
It should be understood that, although the steps in the flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of performing the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in each flowchart may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a screening apparatus for molecular target proteins, comprising: a target protein prediction module 502, a potential target protein screening module 504, a model acquisition module 506, a binding activity probability analysis module 508, and a key target protein screening module 510, wherein:
The target protein prediction module 502 is configured to obtain a target molecule and to perform target prediction on the target molecule by using different target prediction modes to obtain predicted target proteins.
The potential target protein screening module 504 is configured to screen the predicted target proteins according to the number of prediction modes corresponding to each predicted target protein, to obtain potential target proteins corresponding to the target molecule.
The model acquisition module 506 is configured to obtain a quantitative structure-activity relationship model corresponding to each potential target protein.
The binding activity probability analysis module 508 is configured to input the target molecule into the quantitative structure-activity relationship model to obtain the binding activity probability between the target molecule and the potential target protein.
The key target protein screening module 510 is configured to screen out, from the potential target proteins, the key target proteins matching the target molecule according to the binding activity probability and a preset probability threshold.
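As an illustration of the final screening step performed by module 510, the following is a minimal sketch only. The function name `screen_key_targets`, the default threshold of 0.5, the use of "greater than or equal to" as the comparison, and the example probabilities are all assumptions made for illustration; the specification only requires that the binding activity probability be compared with a preset probability threshold.

```python
from typing import Dict, List


def screen_key_targets(binding_probability: Dict[str, float],
                       threshold: float = 0.5) -> List[str]:
    """Return potential targets whose binding probability reaches the threshold.

    The threshold value and the >= comparison are illustrative assumptions.
    Results are ordered from the highest to the lowest probability.
    """
    selected = [t for t, p in binding_probability.items() if p >= threshold]
    return sorted(selected, key=lambda t: binding_probability[t], reverse=True)


# Hypothetical QSAR outputs for one query molecule against three potential targets.
probabilities = {"TARGET_A": 0.91, "TARGET_B": 0.42, "TARGET_C": 0.67}
key_targets = screen_key_targets(probabilities, threshold=0.6)
print(key_targets)
```

With the hypothetical values above, only the two targets whose probability is at least 0.6 are retained as key target proteins.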
In one embodiment, the target protein prediction module is further configured to input the target molecule into each of a plurality of ligand-based target prediction platforms to obtain the predicted target proteins corresponding to each target prediction platform. The potential target protein screening module is further configured to obtain, for each predicted target protein, the number of target prediction platforms that predicted it; to screen out the predicted target proteins for which that number is not less than 2; and to mark the predicted target proteins obtained by screening as potential target proteins corresponding to the target molecule.
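The platform-count screening described in this embodiment can be sketched as a simple vote count. This is a hedged illustration: the platform names, the protein identifiers, and the function name `consensus_targets` are hypothetical, and only the "not less than 2" rule comes from the text.

```python
from collections import Counter
from typing import Dict, List, Set


def consensus_targets(predictions: Dict[str, Set[str]],
                      min_platforms: int = 2) -> List[str]:
    """Keep proteins predicted by at least `min_platforms` platforms."""
    # Each platform contributes at most one vote per protein because
    # its predictions are held in a set.
    counts = Counter(t for targets in predictions.values() for t in targets)
    return sorted(t for t, n in counts.items() if n >= min_platforms)


# Hypothetical outputs of three ligand-based target prediction platforms.
predictions = {
    "platform_1": {"P1", "P2", "P3"},
    "platform_2": {"P2", "P4"},
    "platform_3": {"P2", "P3", "P5"},
}
potential_targets = consensus_targets(predictions)
print(potential_targets)
```

Here P2 is supported by three platforms and P3 by two, so both are marked as potential target proteins, while proteins predicted by a single platform are discarded.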
In one embodiment, the screening device for molecular target proteins further comprises a predicted target protein verification module, configured to import the predicted target proteins into a preset functional annotation tool; perform GO and KEGG enrichment analysis on the predicted target proteins based on the functional annotation tool to obtain enriched GO biological processes and KEGG pathways; perform reliability verification on the predicted target proteins according to the enriched GO biological processes and KEGG pathways; and update the predicted target proteins according to the verified predicted target proteins.
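The specification delegates the enrichment analysis to a functional annotation tool and does not state which statistic that tool applies. As an assumption for illustration only, the sketch below uses a one-sided hypergeometric test, a common choice for over-representation analysis of GO terms and KEGG pathways. The function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int,
                       sample: int, hits: int) -> float:
    """One-sided hypergeometric p-value P(X >= hits).

    population: number of background proteins
    annotated:  background proteins annotated with the term or pathway
    sample:     number of predicted target proteins submitted
    hits:       submitted proteins annotated with the term or pathway
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 background proteins, 5 annotated with a pathway,
# 5 predicted targets submitted, all 5 of them annotated with that pathway.
p = enrichment_p_value(population=10, annotated=5, sample=5, hits=5)
print(p)
```

A small p-value suggests that the pathway is over-represented among the predicted target proteins; in practice a multiple-testing correction would normally be applied across all tested terms.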
In one embodiment, the quantitative structure-activity relationship model comprises an integrated quantitative structure-activity relationship model; the model acquisition module is further configured to obtain active molecules of the potential target protein; perform data preprocessing on the active molecules; calculate a plurality of molecular descriptors based on the preprocessed active molecules; construct a single quantitative structure-activity relationship model corresponding to each molecular descriptor; and perform integration processing on the single quantitative structure-activity relationship models to obtain an integrated quantitative structure-activity relationship model corresponding to the potential target protein.
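The relationship between the single models and the integrated model can be sketched as follows. This is a minimal illustration under stated assumptions: averaging the single-model probabilities is only one possible integration rule (the text does not fix one), and the toy descriptors and toy models below are hypothetical stand-ins for real molecular descriptors and trained classifiers.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

Descriptor = Callable[[str], Sequence[float]]
Model = Callable[[Sequence[float]], float]


def integrated_probability(smiles: str,
                           descriptors: Dict[str, Descriptor],
                           models: Dict[str, Model]) -> float:
    """Average the probabilities of one single QSAR model per descriptor."""
    # Each single model only sees the descriptor it was built on.
    single = [models[name](descriptors[name](smiles)) for name in descriptors]
    return mean(single)


# Toy descriptors (hypothetical): string length and nitrogen count.
descriptors = {
    "length": lambda s: [float(len(s))],
    "n_count": lambda s: [float(s.count("N"))],
}
# Toy single models (hypothetical): fixed rules instead of trained classifiers.
models = {
    "length": lambda x: 0.8 if x[0] > 5 else 0.2,
    "n_count": lambda x: 0.6 if x[0] >= 1 else 0.4,
}

probability = integrated_probability("CCN(CC)CC", descriptors, models)
print(probability)
```

Other integration rules, such as weighted averaging or stacking a second-level model on the single-model outputs, would fit the same interface.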
In one embodiment, the model acquisition module is further configured to perform at least one of the following processes: screening active molecules according to a preset quantitative activity data type, and quantifying the activity data type; removing undefined active molecules and active molecules with abnormal activity values from the active molecules; standardizing the molecular structure format of the active molecules, and removing duplicate active molecules according to the format standardization results; performing molecular skeleton analysis on the active molecules, and removing redundant active molecules with the same skeleton; and performing any one of metal ion removal, deprotonation, and hydrogenation on the active molecules.
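Three of the preprocessing operations listed above (screening by activity data type, removing undefined or abnormal activity values, and removing duplicates after format standardization) can be sketched as below. The allowed activity types, the record layout, the rule that non-positive values are abnormal, and the `standardize` placeholder are all assumptions for illustration; a real pipeline would canonicalize structures with a cheminformatics toolkit.

```python
from typing import Dict, Iterable, List

# Assumed set of quantitative activity data types; not taken from the text.
ALLOWED_TYPES = {"IC50", "Ki", "Kd"}


def standardize(structure: str) -> str:
    """Placeholder standardization: only trims surrounding whitespace."""
    return structure.strip()


def preprocess(records: Iterable[Dict]) -> List[Dict]:
    """Filter by activity type, drop bad values, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec.get("type") not in ALLOWED_TYPES:
            continue  # not a quantitative activity data type
        value = rec.get("value")
        if value is None or value <= 0:
            continue  # undefined or abnormal activity value (assumed rule)
        key = standardize(rec["structure"])
        if key in seen:
            continue  # duplicate after format standardization
        seen.add(key)
        cleaned.append({**rec, "structure": key})
    return cleaned


# Hypothetical activity records for one potential target protein.
records = [
    {"structure": "CCO ", "type": "IC50", "value": 120.0},
    {"structure": "CCO", "type": "IC50", "value": 95.0},
    {"structure": "CCN", "type": "Inhibition", "value": 40.0},
    {"structure": "CCC", "type": "Ki", "value": None},
    {"structure": "c1ccccc1", "type": "Kd", "value": 15.0},
]
cleaned = preprocess(records)
print([r["structure"] for r in cleaned])
```

This sketch keeps the first record seen for each standardized structure; merging duplicate measurements, for example by taking a median, is an equally plausible policy.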
In one embodiment, the screening device for molecular target proteins further comprises a key target protein verification module, configured to identify, among the potential target proteins, the target proteins of interest that have a three-dimensional structure; obtain a target-specific re-scoring function model corresponding to the three-dimensional structure of each target protein of interest; input the target molecule into the target-specific re-scoring function model to obtain the binding activity probability between the target molecule and the target protein of interest, and screen out the verification target proteins whose binding activity probability meets a preset probability requirement; and, when the key target proteins include every verification target protein, feed back a verification result indicating that the key target proteins have passed verification.
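The verification rule in this embodiment amounts to a subset check: verification passes when every verification target protein is also a key target protein. The sketch below illustrates this under stated assumptions; the function name, the 0.5 requirement, and the protein identifiers are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def verify_key_targets(key_targets: Iterable[str],
                       rescoring_probability: Dict[str, float],
                       required: float = 0.5) -> Tuple[Set[str], bool]:
    """Return the verification targets and whether verification passes."""
    verification = {
        t for t, p in rescoring_probability.items() if p >= required
    }
    # Passes only if every verification target is among the key targets.
    return verification, verification <= set(key_targets)


key_targets = ["T1", "T2", "T3"]

# Hypothetical re-scoring outputs for targets with a three-dimensional structure.
verification, passed = verify_key_targets(key_targets, {"T1": 0.8, "T2": 0.3})
print(verification, passed)

# T4 meets the requirement but was not selected as a key target, so the check fails.
_, passed_with_extra = verify_key_targets(key_targets, {"T1": 0.8, "T4": 0.9})
print(passed_with_extra)
```

Note that in this sketch an empty set of verification targets passes trivially; an implementation may prefer to treat that case as inconclusive.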
In one embodiment, the key target protein verification module is further configured to perform molecular docking processing on the target molecule and the target protein according to the three-dimensional structure of the target protein; re-scoring is carried out according to the optimal docking conformation generated by docking the target protein and the corresponding binding small molecules to obtain energy terms of a plurality of scoring functions; and combining the energy terms of the scoring functions to obtain a target specificity re-scoring function model.
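How the energy terms are combined is not specified in this passage. As one hypothetical possibility, the sketch below combines them as a weighted sum passed through a logistic function so that the result can be read as a binding activity probability. The term names, weights, and bias are invented for illustration and would in practice be fitted per target on docked known binders and non-binders.

```python
import math
from typing import Dict


def rescoring_probability(energy_terms: Dict[str, float],
                          weights: Dict[str, float],
                          bias: float = 0.0) -> float:
    """Logistic combination of docking energy terms (illustrative only)."""
    z = bias + sum(weights[name] * value for name, value in energy_terms.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical energy terms (kcal/mol) from re-scoring the best docking pose.
terms = {"vdw": -6.0, "electrostatic": -2.0, "hbond": -1.5}
# Hypothetical weights: more negative energies increase the probability.
weights = {"vdw": -0.3, "electrostatic": -0.2, "hbond": -0.4}

probability = rescoring_probability(terms, weights, bias=-2.0)
print(probability)
```

Any supervised model that maps the vector of energy terms to a probability could take the place of the logistic function used here.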
For the specific limitations of the screening device for the molecular target protein, reference may be made to the limitations of the screening method for the molecular target protein above, which are not repeated here. The modules in the screening device for the molecular target protein may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store the screened target protein data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for screening a molecular target protein.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method for screening a molecular target protein. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution is applied; a particular computer device may include more or fewer components than shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above-described method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of the present specification.
The above examples express only several embodiments of the present application. Although they are described specifically and in detail, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the scope of protection of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of screening for a molecular target protein, the method comprising:
obtaining target molecules, and performing target prediction on the target molecules by adopting different target prediction modes to obtain predicted target proteins;
screening the predicted target protein according to the number of prediction modes corresponding to the predicted target protein to obtain a potential target protein corresponding to the target molecule;
obtaining a quantitative structure-activity relationship model corresponding to the potential target protein;
inputting the target molecule into the quantitative structure-activity relationship model to obtain the binding activity probability of the target molecule and the potential target protein;
and screening out key target proteins matched with the target molecules from the potential target proteins according to the binding activity probability and a preset probability threshold.
2. The method of claim 1, wherein the obtaining of the target molecule, the target prediction of the target molecule using different target prediction modes, and the screening of the potential target protein corresponding to the target molecule comprises:
inputting the target molecules into a plurality of ligand-based target prediction platforms respectively to obtain predicted target proteins corresponding to the target prediction platforms;
respectively obtaining the number of target prediction platforms corresponding to the same predicted target protein, and screening the predicted target proteins of which the number of the corresponding target prediction platforms is not less than 2;
and marking the predicted target protein obtained by screening as a potential target protein corresponding to the target molecule.
3. The method of claim 1, further comprising, after obtaining the target molecule and performing target prediction on the target molecule by using different target prediction modes to obtain a predicted target protein:
introducing the predicted target protein into a preset functional annotation tool;
performing GO and KEGG enrichment treatment on the predicted target protein based on the functional annotation tool to obtain an enriched GO biological process and a KEGG pathway;
performing reliability verification on the predicted target protein according to the enriched GO biological process and the KEGG pathway;
and updating the predicted target protein according to the verified predicted target protein.
4. The method of claim 1, wherein the quantitative structure-activity relationship model comprises an integrated quantitative structure-activity relationship model;
the obtaining of the integrated quantitative structure-activity relationship model corresponding to the potential target protein comprises:
obtaining active molecules of the potential target protein;
performing data preprocessing on the active molecules;
calculating a plurality of molecular descriptors based on the preprocessed active molecules;
constructing a single quantitative structure-activity relationship model corresponding to each molecular descriptor;
and carrying out integrated processing on the single quantitative structure-activity relationship model to obtain an integrated quantitative structure-activity relationship model corresponding to the potential target protein.
5. The method of claim 4, wherein the pre-processing of the data on the active molecules comprises at least one of:
screening the active molecules according to a preset quantitative activity data type, wherein the quantitative activity data type is obtained;
removing undefined active molecules and active molecules with abnormal activity values from the active molecules;
carrying out molecular structure format standardization on the active molecules, and carrying out active molecule duplication elimination according to format standardization results;
performing molecular skeleton analysis on the active molecules, and removing redundant active molecules with the same skeleton;
and performing any one of metal ion removal, deprotonation, and hydrogenation on the active molecules.
6. The method of claim 1, further comprising, after the screening the potential target proteins for a key target protein matching the target molecule according to the binding activity probability:
identifying a target protein of interest having a three-dimensional structure among the potential target proteins;
obtaining a target specificity re-scoring function model corresponding to the three-dimensional structure of the target protein;
inputting the target molecules into the target specificity re-scoring function model to obtain the binding activity probability of the target molecules and the target protein, and screening the verification target protein of which the binding activity probability meets the requirement of a preset probability;
when the key target protein comprises each item of the verification target protein, feeding back a verification result that the verification of the key target protein is passed.
7. The method of claim 6, wherein obtaining a target-specific re-scoring function model corresponding to the three-dimensional structure of the target protein of interest comprises:
performing molecular docking treatment on the target molecule and the target protein according to the three-dimensional structure of the target protein;
re-scoring is carried out according to the optimal docking conformation generated by docking the target protein and the corresponding binding small molecules to obtain energy terms of a plurality of scoring functions;
and combining the energy terms of the scoring functions to obtain the target specificity re-scoring function model.
8. An apparatus for screening molecular target proteins, the apparatus comprising:
the target protein prediction module is used for obtaining target molecules and performing target prediction on the target molecules by adopting different target prediction modes to obtain predicted target proteins;
the potential target protein screening module is used for screening the predicted target protein according to the number of prediction modes corresponding to the predicted target protein to obtain the potential target protein corresponding to the target molecule;
the model acquisition module is used for acquiring a quantitative structure-activity relationship model corresponding to the potential target protein;
the binding activity probability analysis module is used for inputting the target molecules into the quantitative structure-activity relationship model to obtain the binding activity probability of the target molecules and the potential target proteins;
and the key target protein screening module is used for screening key target proteins matched with the target molecules from the potential target proteins according to the binding activity probability and a preset probability threshold.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010716290.5A CN112053742A (en) 2020-07-23 2020-07-23 Method and device for screening molecular target protein, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112053742A true CN112053742A (en) 2020-12-08

Family

ID=73602551

Country Status (1)

Country Link
CN (1) CN112053742A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination