CN111292800A - Molecular characterization based on predicted protein affinity and application thereof - Google Patents

Publication number
CN111292800A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010069615.5A
Other languages
Chinese (zh)
Inventor
曹东升
刘璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30: Drug targeting using structural data; Docking or binding prediction
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50: Molecular design, e.g. of drugs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70: Machine learning, data mining or chemometrics

Abstract

The invention discloses a molecular characterization based on predicted protein affinity, constructed as follows: collecting protein targets and their activity data; for each protein target, computing a plurality of different descriptors and selecting a plurality of different machine learning algorithms; pairing each computed descriptor with each machine learning algorithm to form a plurality of different single models; averaging the probability values output by the single models and using the average as the binding strength value between the molecule and the protein target, thereby forming a consensus model; and inputting the molecules to be tested into the consensus model, which outputs a calculated target spectrum that serves as the molecular characterization. The invention interprets molecular characterization from the perspective of biological space, integrates the binding affinity between the compound and the target into the characterization, and obtains new activity or biological information through prediction at the level of the whole organism; because the results are computed, the method is faster and more efficient than experiment-based characterization of bioactive molecules.

Description

Molecular characterization based on predicted protein affinity and application thereof
Technical Field
The invention relates to the field of computer-aided drug design, in particular to a molecular characterization based on predicted protein affinity and application thereof.
Background
In current drug design, computer-aided molecular design has become an essential tool for shortening the research period and controlling research and development costs. Computer-aided molecular design can be divided into ligand-based and receptor-based approaches. Most ligand-based approaches rest on the assumption that compounds with similar structures have similar biological activities, so the key question is how to characterize the relationship between molecules and biological activity.
Existing solutions and their disadvantages: existing molecular characterizations can be classified as chemical structure-based or biological activity-based. Conventional structure-based characterization dissects the structure of a molecule into features such as substructure fragments, topology, pharmacophore features, and physicochemical properties. Commonly used examples include MOE2D, CATS, Atompair, MACCS, the ECFP series, and the FP series. Although such characterization captures the link between compounds and biological activity to some extent, the link is ambiguous and cannot explain biological activity in depth. As technology has evolved, data from various biological tests have accumulated, and researchers have attempted to characterize the association between compounds and biological activity using biological assay data. For example, Lamb et al [1] captured the relationship between compounds and disease by using the changes in gene-expression signals after perturbation by the compounds. Drug side effects and biological information such as the experimentally measured binding affinity between a molecule and a protein target have also been used to construct characterizations of bioactive molecules. However, assay-based molecular characterization applies only to molecules that have assay information, which greatly limits its application. To widen the range of application, researchers usually assemble experimental data obtained by different groups into one large data set, which inevitably introduces errors. Even so, molecular characterization based on experimental biological activity data is incomplete: not all molecules have been tested broadly, and pharmaceutical companies and research organizations test according to their own drug development schedules, so that some molecules have only a few test results and only a few molecules have a complete test spectrum.
The deficiencies of such an incomplete data matrix are passed on to the molecules through the characterization, so that the association between the molecules and biological activity is incomplete. Some studies set missing values to 0 by default, meaning that the molecule does not bind the tested protein or lacks the biological activity in question, which produces false negatives. Computer-based characterization of bioactive molecules is therefore needed, one that overcomes the current limitations of test-based characterization and obtains highly accurate results under conditions that are as simple as possible.
Disclosure of Invention
To solve the above problems and address the shortcomings of the prior art, the present invention aims to provide a molecular characterization, and applications thereof, that characterizes the relationship between a compound and biological activity directly and simply.
In order to achieve the above object, the present invention provides a molecular characterization (i.e. a calculated target spectrum) based on predicted protein affinity, wherein the calculated target spectrum is constructed by the following method:
(1) data collection: collecting protein targets and information thereof;
(2) molecular characterization results: calculating the binding affinity value between each protein target and the compound by a computational method.
Further, in the above calculated target spectrum, the computational method in step (2) includes a SAR model and docking.
Further, the above calculated target spectrum is constructed by the following method:
(1) data collection: collecting protein targets and activity data thereof;
(2) parameter determination: calculating a plurality of different descriptors for each protein target and selecting a plurality of different machine learning algorithms;
(3) constructing a model: pairing each computed descriptor with each machine learning algorithm to form a plurality of different single models;
(4) constructing a consensus model: calculating the average probability value of the single model, and taking the average probability value as the binding strength value of the molecule and the protein target to form a consensus model;
(5) forming a calculated target spectrum: and inputting the molecules to be detected in the consensus model, and outputting a calculated target spectrum.
Further, in the above calculated target spectrum, the protein targets in step (1) include enzymes, receptors, transporters, and ion channels; the activity data include IC50, EC50, and Ki values.
Further, in the above calculated target spectrum, the descriptors in step (2) are one or more of the MOE2D, CATS2, GpidAPH3, TGT, TGD, MACCS, Atompair, Estate, FP series, ECFP series, and ECFC series descriptors; the machine learning algorithms are one or more of the RF, XGBoost, DL, ADA, GBM, KNN and SVM algorithms.
Further, the above calculated target spectrum is constructed by the following method:
(1) data collection: collecting protein targets and activity data thereof;
(2) parameter determination: calculating three different descriptors for each protein target, namely MOE2D, CATS, and MACCS, and selecting three different machine learning algorithms, namely RF, ADA and GBM;
(3) constructing a model: pairing each of the three descriptors with each of the three machine learning algorithms to form nine different single models;
(4) constructing a consensus model: calculating the average probability value of the nine different single models to serve as the binding strength value of the molecule and the protein target, forming a consensus model;
(5) forming a calculated target spectrum: and inputting the molecules to be detected in the consensus model, and outputting a calculated target spectrum.
Based on a general technical concept, the invention provides an application of the above calculated target spectrum in framework transition (scaffold hopping), and the application method comprises the following steps:
S1-1, inputting the molecules in the active molecule database and in the screening database into the consensus model, which outputs their calculated target spectra as molecular characterizations; and calculating chemical structure-based descriptors, including MOE2D, MACCS, CATS, ECFP4, Atompair, FP2, GpidAPH3, TGT and TGD;
S1-2, selecting the active molecules of each target in the active molecule database in turn as query molecules, and combining the remaining active molecules with the molecules in the screening database to form the query database;
S1-3, calculating the similarity between the query molecule and each molecule in the query database using the Tanimoto coefficient;
S1-4, ranking by similarity value and calculating, for the top 100, 200, 500, 1000, 2000 and 5000 compounds, the ratio of the number of active molecular frameworks found to the number of all active molecular frameworks, to obtain the enrichment of active molecules whose frameworks differ from that of the query molecule.
Based on a general technical concept, the invention provides an application of the above calculated target spectrum in predicting a protein target, and the application method comprises the following steps:
S2-1, inputting the active molecules into the consensus model, which outputs their calculated target spectra as molecular characterizations;
S2-2, ranking the targets by the binding affinity values predicted in the molecular characterization of the active molecule, to predict the targets of the active molecule.
Based on a general technical concept, the invention provides an application of the above-mentioned calculated target spectrum in QSAR model prediction of hepatotoxicity, and the application method comprises the following steps:
S3-1, collecting human hepatotoxicity data and non-hepatotoxicity data from existing databases;
S3-2, dividing the data into a training set and a test set using different data splitting strategies;
S3-3, inputting the hepatotoxicity and non-hepatotoxicity data into the consensus model, which outputs the calculated target spectrum as the CADes molecular characterization; and calculating the chemical structure-based MOE2D, CATS and MACCS descriptors;
S3-4, constructing a model from the CADes molecular characterization and the XGBoost algorithm; and constructing separate models from each of the chemical structure-based MOE2D, CATS and MACCS descriptors and the XGBoost algorithm;
S3-5, combining the model of the calculated target spectrum with the models of the chemical structure descriptors to construct a consensus model;
S3-6, inputting the molecules to be tested into the consensus model to obtain hepatotoxicity predictions.
Further, in the above application, the different splitting strategies include: random splitting, Bemis-Murcko framework splitting, and carbon framework splitting.
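The difference between a random split and a framework-based split can be sketched as follows. This is a minimal illustration, not the procedure used in the invention: it assumes that a framework key (for example a Bemis-Murcko or carbon framework string) has already been computed for every molecule, and the function names, the `test_fraction` value, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def random_split(molecules, test_fraction=0.2, seed=0):
    """Shuffle molecules and cut off a test set, ignoring frameworks."""
    shuffled = list(molecules)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def framework_split(framework_of, test_fraction=0.2, seed=0):
    """Assign whole framework groups to either the training or the test set.

    framework_of maps molecule id -> precomputed framework key (assumed input).
    No framework ever appears on both sides of the split.
    """
    groups = defaultdict(list)
    for molecule, framework in framework_of.items():
        groups[framework].append(molecule)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test_groups = max(1, round(len(keys) * test_fraction))
    test = [m for k in keys[:n_test_groups] for m in groups[k]]
    train = [m for k in keys[n_test_groups:] for m in groups[k]]
    return train, test


# Toy data: 20 hypothetical molecules spread over 5 hypothetical frameworks.
framework_of = {f"mol_{i}": f"framework_{i % 5}" for i in range(20)}
rand_train, rand_test = random_split(framework_of)
fw_train, fw_test = framework_split(framework_of)
```

With a framework split, every test molecule has a framework the model never saw during training, which is why it is the harder and more realistic setting.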
Based on a general technical concept, the invention provides an application of the calculated target spectrum in predicting drug-drug interaction, and the application method comprises the following steps:
S4-1, inputting drugs collected from the literature into the consensus model, which outputs their calculated target spectra as molecular characterizations;
S4-2, dividing the drug pairs into positive samples and negative samples, wherein the positive samples are real drug-drug interaction pairs and the negative samples are random non-interacting pairs of drugs;
S4-3, combining the molecular characterizations of the two drugs in each drug-drug interaction pair of the positive samples to form the molecular characterization of the drug pair, and constructing a model with the RF algorithm;
S4-4, inputting the drugs to be tested into the model constructed in S4-3 to predict whether the drugs interact.
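Steps S4-1 to S4-4 can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions: the target spectra are random placeholders rather than consensus-model output, the interaction pairs are invented, and "combining" the two characterizations is implemented as simple concatenation, which the text does not specify. The helper names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_drugs, n_targets = 12, 8
# Placeholder target spectra (one row per drug); real spectra would come
# from the consensus model.
spectra = rng.random((n_drugs, n_targets))

# Hypothetical known interaction pairs (positive samples).
positive_pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9), (10, 11)]


def sample_negative_pairs(n_drugs, positives, n_samples, rng):
    """Draw random drug pairs that are not among the known interactions."""
    known = {tuple(sorted(p)) for p in positives}
    negatives = set()
    while len(negatives) < n_samples:
        a, b = sorted(int(i) for i in rng.choice(n_drugs, size=2, replace=False))
        if (a, b) not in known:
            negatives.add((a, b))
    return sorted(negatives)


def pair_features(pairs, spectra):
    """Concatenate the two drug spectra (an assumed way of combining them)."""
    return np.array([np.concatenate([spectra[a], spectra[b]]) for a, b in pairs])


negative_pairs = sample_negative_pairs(n_drugs, positive_pairs, len(positive_pairs), rng)
X = np.vstack([pair_features(positive_pairs, spectra),
               pair_features(negative_pairs, spectra)])
y = np.array([1] * len(positive_pairs) + [0] * len(negative_pairs))

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Predicted interaction probability for a new, hypothetical drug pair.
prob = model.predict_proba(pair_features([(0, 2)], spectra))[:, 1]
```

Concatenation makes the feature vector depend on the order of the two drugs; an order-independent combination (for example an element-wise sum) would be an equally plausible reading of the step.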
Based on a general technical concept, the invention provides an application of the above calculated target spectrum in determining a drug action mode network, and the application method comprises the following steps:
S5-1, inputting drugs whose chemical structures are obtained from the literature into the consensus model, which outputs their calculated target spectra as molecular characterizations;
S5-2, calculating the Pearson correlation coefficient between every two drugs and visualizing the drug network according to the Anatomical Therapeutic Chemical (ATC) classification, wherein each node corresponds to one drug and two nodes are connected by an edge if their similarity coefficient is larger than a predefined threshold; the width of an edge is proportional to the similarity between the two drugs it connects.
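The network construction in S5-2 can be sketched as follows. The spectra, drug names, and the threshold of 0.7 are hypothetical values chosen for illustration; the text only states that a predefined threshold is used. The returned weight is what an edge width would be scaled by when drawing the network.

```python
import numpy as np


def build_drug_network(spectra, names, threshold):
    """Return edges (drug_a, drug_b, weight) for drug pairs whose Pearson
    correlation between target spectra exceeds the threshold."""
    corr = np.corrcoef(spectra)  # rows are drugs, columns are targets
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if corr[i, j] > threshold:
                edges.append((names[i], names[j], float(corr[i, j])))
    return edges


# Toy spectra for three hypothetical drugs over four hypothetical targets.
names = ["drug_A", "drug_B", "drug_C"]
spectra = np.array([
    [0.9, 0.1, 0.8, 0.2],
    [0.8, 0.2, 0.9, 0.1],
    [0.1, 0.9, 0.2, 0.8],
])
edges = build_drug_network(spectra, names, threshold=0.7)
```

Here only drug_A and drug_B are connected, because their spectra rise and fall together while drug_C has the opposite pattern.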
Compared with the prior art, the invention has the following beneficial effects:
(1) The present invention provides a calculated target spectrum that characterizes the relationship between a compound and biological activity simply and directly. It interprets molecular characterization from the perspective of biological space, integrates the binding affinity between the compound and the target into the characterization, and obtains new activity or biological information through prediction at the level of the whole organism. The results are computed, so the method is faster and more efficient than test-based characterization of bioactive molecules. Only chemical structure information is required, so the method is not limited to molecules that already have biological test information. The range of application is wide: the molecular characterization can be used in a QSAR model to predict biological activity or toxicity, and it can enrich active molecules and active frameworks in virtual screening.
(2) The invention provides an application of a calculated target spectrum in framework transition. Compared with the traditional molecular characterization for framework transition, the computed target spectrum introduces the binding strength relationship between the molecule and the protein target to characterize the molecule, and weakens the association with a chemical structure, so that the computed target spectrum has excellent framework transition performance.
(3) The invention provides an application of a calculated target spectrum in predicting a protein target. The method is simple in requirement, efficient and rapid, and can accurately calculate the potential target of the compound.
(4) The invention provides an application of the calculated target spectrum in QSAR model prediction of hepatotoxicity. The calculated target spectrum represents the relationship between the compound and biological activity; this information and the information in chemical structure descriptors complement each other, allowing a robust and accurate hepatotoxicity model to be constructed.
(5) The invention provides an application of the calculated target spectrum in predicting drug-drug interaction. The calculated target spectrum re-characterizes the drug in terms of biological activity, enabling accurate capture of drug-drug interactions.
(6) The invention provides application of a calculation target spectrum in determining a drug action mode network, wherein the calculation target spectrum explains an action mechanism of a drug in a biological organism by representing a binding affinity relation between a compound and a protein target, and the mechanism relation is visualized, so that a new research idea can be brought to a researcher.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the construction of the calculated target spectrum in Example 1 of the present invention.
FIG. 2 shows the similarity search framework for applying the calculated target spectrum to framework transition in Example 2 of the present invention.
FIG. 3 shows the results of applying the calculated target spectrum to framework transition in Example 2 of the present invention.
FIG. 4 is a network diagram of the application of the calculated target spectrum to drug action modes in Example 6 of the present invention.
Detailed Description
The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention. Modifications or substitutions to methods, procedures, or conditions of the invention may be made without departing from the spirit and scope of the invention.
Unless otherwise specified, the technical means used in the examples are conventional means well known to those skilled in the art; all reagents used in the examples are commercially available unless otherwise specified.
The percentage "%" referred to in the present invention means mass% unless otherwise specified; but the percentage of the solution, unless otherwise specified, refers to the grams of solute contained in 100ml of the solution.
The weight parts in the invention can be the weight units known in the art such as mu g, mg, g, kg, and the like, and can also be multiples thereof, such as 1/10, 1/100, 10, 100, and the like.
Example 1:
A calculated target spectrum is constructed as shown in FIG. 1. The specific construction steps are as follows:
(1) Data collection: target proteins and their activity data were collected.
A plurality of protein targets were collected from databases and the literature, and the activity data of each protein target were then downloaded from the ChEMBL database. The activity data comprise: IC50 (half-maximal inhibitory concentration, the concentration of drug or inhibitor required to inhibit a given biological process by half), EC50 (half-maximal effective concentration, the concentration of drug that produces 50% of the maximal biological effect after a given exposure time), and the inhibition constant (Ki value, the half-maximal dissociation concentration of the drug). The final activity value used in this example is the mean of the above activity data.
To obtain a more accurate and stable model, protein targets with fewer than 200 activity data points were removed, leaving 832 protein targets: 239 kinases, 270 other enzymes, 192 receptors, 46 ion channels, 30 transporters, and 55 targets of other classes. The specific protein targets are shown in Table 1.
Table 1: protein targets for use in calculating target profiles in example 1
[Table 1 appears as images in the original publication: Figures BDA0002376971310000061, BDA0002376971310000071, BDA0002376971310000081, and BDA0002376971310000091.]
(2) Data preprocessing and splitting: the activity data are preprocessed and divided into positive and negative sets according to the activity value.
To remove noise from the modeling data, the activity data were preprocessed with the Molecular Operating Environment (MOE, version 2018): the wash function was used to remove metal ions and salts, add hydrogens, and set the protonation states of strong acids and strong bases. For each protein target, data with activity values below 10 μM were taken as the positive set and data with activity values above 10 μM as the negative set. If the negative set was too small, data were drawn from other protein targets to supplement it, reducing the bias caused by data imbalance.
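The labelling rule in step (2) can be sketched in a few lines of Python. This is a minimal sketch, not the actual pipeline: the function names and toy records are hypothetical, the treatment of a value of exactly 10 μM as negative is an assumption (the text does not cover that case), and balancing the sets to a 1:1 ratio is likewise assumed.

```python
import random

THRESHOLD_UM = 10.0  # activity cutoff from the text, in micromolar


def label_activities(records, threshold=THRESHOLD_UM):
    """Split (molecule_id, activity_uM) records into positive and negative sets.

    Values below the threshold are positives. A value of exactly 10 uM is
    treated as negative here, which is an assumption.
    """
    positives = [m for m, value in records if value < threshold]
    negatives = [m for m, value in records if value >= threshold]
    return positives, negatives


def pad_negatives(positives, negatives, other_target_molecules, seed=0):
    """Borrow molecules from other targets until the two sets are balanced.

    The 1:1 target ratio is an assumption; the text only says that data from
    other protein targets are used when the negative set is insufficient.
    """
    shortfall = len(positives) - len(negatives)
    if shortfall <= 0:
        return list(negatives)
    used = set(positives) | set(negatives)
    pool = [m for m in other_target_molecules if m not in used]
    borrowed = random.Random(seed).sample(pool, min(shortfall, len(pool)))
    return list(negatives) + borrowed


# Hypothetical records for one target.
records = [("mol_a", 0.05), ("mol_b", 3.2), ("mol_c", 25.0), ("mol_d", 9.9)]
pos, neg = label_activities(records)
neg = pad_negatives(pos, neg, ["mol_x", "mol_y", "mol_z", "mol_a"])
```

Molecules already labelled for the target are excluded from the borrowing pool, so a known active is never reused as a decoy.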
(3) Constructing a model: descriptors and machine learning algorithms are selected to construct the models.
3.1 Descriptors characterizing the physicochemical properties, pharmacophores, and structural fragments of the molecules were selected: MOE2D, CATS, and MACCS.
3.2 Machine learning algorithms based on trees, connections, and kernels were selected: RF, ADA, and GBM.
3.3 The three descriptors and the three machine learning algorithms were combined pairwise, giving 9 different single models for each protein target: MOE2D & RF, MOE2D & ADA, MOE2D & GBM, CATS & RF, CATS & ADA, CATS & GBM, MACCS & RF, MACCS & ADA, and MACCS & GBM. In total there are 832 × 9 models. To obtain more accurate results, the outputs of the nine single models for each protein target were integrated: their average probability value was taken as the binding strength between the molecule and that protein target, forming a consensus model.
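The consensus construction for a single protein target can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the three descriptor blocks are random placeholders of arbitrary width rather than real MOE2D, CATS, or MACCS values, the hyperparameters are arbitrary, and scikit-learn's `RandomForestClassifier`, `AdaBoostClassifier`, and `GradientBoostingClassifier` stand in for RF, ADA, and GBM.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

rng = np.random.default_rng(0)
n_train, n_query = 120, 5
y_train = rng.integers(0, 2, size=n_train)
y_train[:2] = [0, 1]  # make sure both classes are present

# Random placeholders standing in for the three descriptor blocks.
descriptor_dims = {"MOE2D": 20, "CATS": 15, "MACCS": 30}
X_train = {name: rng.normal(size=(n_train, d)) + 0.5 * y_train[:, None]
           for name, d in descriptor_dims.items()}
X_query = {name: rng.normal(size=(n_query, d))
           for name, d in descriptor_dims.items()}


def make_algorithms():
    """Fresh, unfitted instances of the three learners."""
    return {
        "RF": RandomForestClassifier(n_estimators=50, random_state=0),
        "ADA": AdaBoostClassifier(n_estimators=50, random_state=0),
        "GBM": GradientBoostingClassifier(n_estimators=50, random_state=0),
    }


def fit_single_models(X_by_descriptor, y):
    """Fit one model per (descriptor, algorithm) pair: 3 x 3 = 9 models."""
    models = {}
    for desc_name, X in X_by_descriptor.items():
        for algo_name, model in make_algorithms().items():
            models[(desc_name, algo_name)] = model.fit(X, y)
    return models


def consensus_probability(models, X_by_descriptor):
    """Average the positive-class probability over all single models."""
    probs = [model.predict_proba(X_by_descriptor[desc])[:, 1]
             for (desc, _), model in models.items()]
    return np.mean(probs, axis=0)


models = fit_single_models(X_train, y_train)
binding_strength = consensus_probability(models, X_query)
```

Repeating this for each of the 832 targets and stacking the resulting values per molecule would give the 832-element calculated target spectrum.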
(4) Model evaluation: the constructed models were evaluated by five-fold cross-validation.
In five-fold cross-validation, the modeling data for each protein target are divided into five subsets of equal size; the model is trained on four subsets and tested on the remaining one. This process is repeated five times, with a different subset used for testing each time, so that each model yields five results and there are 832 × 9 × 5 results in total. The performance of the constructed models was evaluated with two indexes: ACC (accuracy, the proportion of correctly classified samples) and AUC (area under the receiver operating characteristic curve, an index of the overall performance of a binary classification model).
The results show that the ACC value and AUC value of the model are respectively in the range of 70-99.8% and 75.9-100%. This assessment indicates that the model is effective in predicting new molecules.
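The evaluation in step (4) can be sketched for one single model as follows. This is a minimal sketch on synthetic data: the features and labels are random placeholders, the random forest settings are arbitrary, and the use of stratified folds is an assumption (the text only requires five subsets of equal size).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(1)
y = np.array([0, 1] * 50)                    # placeholder labels
X = rng.normal(size=(100, 10)) + y[:, None]  # placeholder descriptors

# Five folds; each fold is used exactly once as the test set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    cv=cv,
    scoring={"ACC": "accuracy", "AUC": "roc_auc"},
)
acc_per_fold = scores["test_ACC"]  # five accuracy values
auc_per_fold = scores["test_AUC"]  # five ROC AUC values
```

Each single model therefore contributes five ACC values and five AUC values, matching the count of 832 × 9 × 5 results given above.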
(5) Forming a calculated target spectrum: the molecules to be tested are input into the consensus model, which outputs the calculated target spectrum.
Example 2:
An application of the calculated target spectrum of Example 1 to framework transition (scaffold hopping) is shown in FIG. 2. The specific steps are as follows:
(1) Active molecule database: the database constructed in Vogt, M.; Stumpfe, D.; Geppert, H.; Bajorath, J. Scaffold Hopping Using Two-Dimensional Fingerprints: True Potential, Black Magic, or a Hopeless Endeavor? Guidelines for Virtual Screening. J Med Chem 2010, 53 (15), 5707-5715 was used. It comprises active molecules for 17 targets. Only molecules with Ki or IC50 below 1 μM, with 10 to 50 heavy atoms, and with fewer non-hydrogen atoms in the side chains than in the backbone were selected. Each target has at least 10 different Bemis-Murcko backbones (proposed by Bemis and Murcko and defined as the scaffold extracted from a molecule by removing the R-groups while retaining the linkers between ring systems), with 5 representative molecules per backbone.
(2) Screening databases: ChEMBL (85644), SPECS (212863), Natural Products (120686), Chemblock (198643), and ASINEX (567106) were prepared as 5 screening databases of different design and size, to reduce the influence of the choice of database on the results. The molecules in these databases were preprocessed in the same way as those in the active molecule database.
(3) The molecules in the active molecule database and the screening databases were input into the consensus model of Example 1, which output the binding affinity strength value of each molecule for each of the 832 protein targets. Each molecule thus has an 832 × 1 data matrix, which serves as its calculated target spectrum.
To facilitate similarity searching, the result was converted to fingerprint form using a threshold of 0.5. For comparison, chemical structure-based molecular characterizations of the compounds in the active and screening databases, including MOE2D, MACCS, CATS, ECFP4, Atompair, FP2, GpidAPH3, TGT, and TGD, were also calculated.
(4) The active molecules of each target in the active molecule database were selected in turn as query molecules; the remaining active compounds and the screening database formed the query database.
(5) The Tanimoto coefficient was used to calculate the similarity between the query molecule and each molecule in the query database.
(6) The compounds were ranked by similarity value, and the recovery rate of active frameworks (the ratio of the number of active molecular frameworks found to the number of all active molecular frameworks) in the top 100, 200, 500, 1000, 2000, and 5000 compounds was calculated, giving the enrichment of active molecules whose frameworks differ from that of the query molecule.
(7) The results are shown in FIG. 3. All molecular characterizations show some framework transition capability. Ranked by overall enrichment capability, the order is: calculated target spectrum > MOE2D > ECFP4, GpidAPH3 > others. In the top 100 compounds ranked by similarity, the calculated target spectrum recovered on average 38.0% of the active scaffolds in ChEMBL (NP: 43.7%, Chemblock: 24.4%, SPECS: 23.3%, ASINEX: 40.2%). Its average active scaffold recovery across the five databases was 33.9%, meaning that about one-third of the active scaffold types were found in the top 100 compounds. Taking the minimum of 10 scaffold types per target among the 17 targets in the study as the baseline, the calculated target spectrum enriched at least 3.3 active scaffolds per target. For the top 5000 compounds, the active scaffold recoveries of the calculated target spectrum in the five screening databases were 69.5%, 65.3%, 52.1%, 50.6%, and 59.7%, respectively. Recovery exceeded 50% in every database and reached 69.5%, demonstrating the strong ability of the calculated target spectrum to recover active scaffolds from screening databases. Most importantly, the active scaffold recovery of the calculated target spectrum was the highest, or only slightly below the highest, in all five databases. This shows that the calculated target spectrum has better overall framework transition performance than the other molecular characterizations used in this study.
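Steps (3) to (6) of this example can be sketched in plain Python as follows. This is an illustrative sketch on toy data: the four-element spectra, the scaffold labels, and the function names are hypothetical, setting a bit when the probability is at least 0.5 (rather than strictly above) is an assumption, and so is the exclusion of the query's own scaffold from the recovery count.

```python
def to_fingerprint(spectrum, threshold=0.5):
    """Binarize a target spectrum: keep the indices of targets at or above
    the threshold (the handling of exactly 0.5 is an assumption)."""
    return frozenset(i for i, p in enumerate(spectrum) if p >= threshold)


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two bit sets: |A and B| / |A or B|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def scaffold_recovery(query_fp, query_scaffold, database, top_n):
    """Fraction of active scaffolds (other than the query's) found in the
    top_n most similar database entries.

    database is a list of (fingerprint, scaffold) pairs; scaffold is None for
    screening compounds that are not known actives.
    """
    ranked = sorted(database, key=lambda e: tanimoto(query_fp, e[0]), reverse=True)
    all_active = {s for _, s in database if s is not None and s != query_scaffold}
    found = {s for _, s in ranked[:top_n] if s is not None and s != query_scaffold}
    return len(found) / len(all_active) if all_active else 0.0


# Toy spectra over four hypothetical targets.
query_fp = to_fingerprint([0.9, 0.8, 0.1, 0.7])
raw_database = [
    ([0.9, 0.7, 0.2, 0.6], "scaffold_B"),
    ([0.8, 0.1, 0.1, 0.9], "scaffold_C"),
    ([0.1, 0.1, 0.9, 0.2], "scaffold_D"),
    ([0.6, 0.9, 0.6, 0.1], None),
    ([0.2, 0.3, 0.8, 0.9], None),
]
database = [(to_fingerprint(s), scaffold) for s, scaffold in raw_database]

recovery_top2 = scaffold_recovery(query_fp, "scaffold_A", database, top_n=2)
recovery_top5 = scaffold_recovery(query_fp, "scaffold_A", database, top_n=5)
```

In this toy case two of the three active scaffolds rank in the top 2, and all three are recovered once the whole list is considered.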
Example 3
The application of the calculated target spectrum of Example 1 to predicting protein targets proceeds as follows:
(1) The drug bromocriptine is input into the consensus model, which outputs the calculated target spectrum of bromocriptine as its molecular characterization.
(2) The predicted probability values in the calculated target spectrum are taken as the binding affinity values between the compound and the protein targets, and the targets are ranked by these values.
(3) The calculated target spectrum indicated that bromocriptine might interact with a new protein target, the D(1A) dopamine receptor (UniProt ID: Q95136). Bromocriptine mesylate is a semisynthetic ergot alkaloid derivative with strong dopaminergic activity and is used in the treatment of parkinsonism. A search of the relevant literature found a reported binding affinity between bromocriptine and the D(1A) dopamine receptor (Ki = 1.444 μM). In addition, most approved targets of bromocriptine ranked in the top 30, suggesting that the calculated target spectrum can predict potential targets of a drug to some extent.
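The ranking in step (2) amounts to sorting the calculated target spectrum by predicted probability. A minimal sketch follows; the target identifiers and probability values are invented for illustration and are not results from the invention, and `rank_targets` is a hypothetical helper name.

```python
def rank_targets(target_spectrum, top_n=30):
    """Sort targets by predicted binding probability, highest first."""
    ranked = sorted(target_spectrum.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical spectrum: target id -> predicted probability.
spectrum = {"target_A": 0.81, "target_B": 0.92, "target_C": 0.35, "target_D": 0.10}
top_targets = rank_targets(spectrum, top_n=3)
```

Checking whether a drug's approved targets fall within the first 30 entries of such a list is the comparison described in step (3).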
Example 4
An application of the calculated target spectrum of embodiment 1 in the QSAR model liver toxicity prediction includes the following specific construction steps:
(1) hepatotoxicity and non-hepatotoxicity data based on humans were collected from the SIDER, LiverTox and drug databases and literature and were pre-processed using the wash function in MOE.
(2) Dividing the preprocessed data into a training set and a testing set by adopting different data splitting strategies, and specifically comprising the following steps: random split (ratio 80% to 20%), Bemis-Murcko skeleton split (training and testing sets are divided according to Bemis-Murcko skeleton categories, where the training and testing sets differ in skeleton by 75% -85% to 15% -25%) and carbon skeleton split (training and testing sets are divided according to carbon skeleton categories, where the training and testing sets differ in skeleton by 75% -85% to 15% -25%). The purpose of setting two skeleton splits is to study and calculate the liver toxicity prediction performance of a target spectrum under real conditions.
(3) The hepatotoxicity and non-hepatotoxicity data were input into the consensus model of Example 1, and the calculated target spectrum was output as the CADes molecular characterization. Descriptors based on chemical structure, including MOE2D, CATS and MACCS, were also calculated.
(4) A model was constructed from the CADes molecular characterization of step (3) with the XGBoosting algorithm. Models were likewise constructed from each of the chemical structure descriptors MOE2D, CATS and MACCS with the XGBoosting algorithm. The models were then evaluated using ACC (accuracy), AUC (area under the ROC curve), SPE (specificity, the ratio of correctly predicted negative samples to all true negative samples) and SEN (sensitivity, the ratio of correctly predicted positive samples to all true positive samples). The results are shown in Table 2.
Table 2: Comparison of the calculated target spectrum with other molecular characterizations in hepatotoxicity prediction
Figure BDA0002376971310000121
Figure BDA0002376971310000131
From the results in Table 2, it can be seen that, under the random split, the CADes model of this application performs similarly to the models trained on MACCS and MOE2D, with an ACC close to 70.0%, which shows that CADes predicts hepatotoxicity about as well as other well-established chemical structure descriptors. Under the Bemis-Murcko scaffold split, the molecular scaffolds in the training and test sets are completely different, so the evaluation parameters are lower than under the random split, with the average accuracy decreasing from 69.4% to 66.1%. However, the CADes-based model performed best, with 67.0% ACC, 76.2% SEN, 53.0% SPE and 71.9% AUC, because the CADes model is good at capturing biological information and depends less on chemical structure. The training and test sets were also partitioned according to the carbon skeleton definition, and the results show that the model based on the calculated target spectrum again performs better than the other models. These results indicate that the calculated target spectrum is better suited to hepatotoxicity prediction in practical applications.
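The idea behind the scaffold-based splits of step (2), namely that no scaffold appears in both the training set and the test set, can be illustrated with a grouped split. The following Python sketch rests on several assumptions: the scaffold labels are taken to be precomputed by a cheminformatics toolkit, the label strings are placeholders, and the function name `scaffold_split`, the seed and the 20% share of scaffolds assigned to the test set are illustrative choices rather than details given in the patent.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def scaffold_split(
    scaffolds: Sequence[Hashable], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split molecule indices so that no scaffold is shared by train and test."""
    # Group molecule indices by their (precomputed) scaffold label.
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Shuffle the scaffold classes reproducibly, then assign whole classes.
    keys = sorted(groups, key=str)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_fraction * len(keys)))
    test_keys = set(keys[:n_test])

    train = sorted(i for key in keys if key not in test_keys for i in groups[key])
    test = sorted(i for key in keys if key in test_keys for i in groups[key])
    return train, test


# Placeholder scaffold labels, one per molecule (assumed, for illustration only).
scaffolds = ["s1", "s1", "s2", "s3", "s3", "s4", "s4", "s5", "s5", "s5"]
train_idx, test_idx = scaffold_split(scaffolds, test_fraction=0.2, seed=1)
print(train_idx, test_idx)
```

Because whole scaffold classes are assigned to one side, the test set only contains scaffolds the model has never seen, which is the harder and more realistic setting described above.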
(5) The model trained on the calculated target spectrum and the models trained on the chemical structure descriptors in step (4) were combined to construct consensus models. The specific combinations were: CATS + CADes, MACCS + CADes and MOE2D + CADes.
(6) The test set data were input into the consensus models of step (5) to examine their performance in hepatotoxicity prediction. The results are shown in Table 3.
Table 3: Results of applying consensus models that combine the calculated target spectrum with other molecular characterizations to hepatotoxicity prediction
Figure BDA0002376971310000132
From the results in Table 3, it can be seen that the consensus models generally predict hepatotoxicity better than models trained on a single descriptor, because the calculated target spectrum and the chemical structure descriptors carry complementary information, so that compounds a single model cannot distinguish are correctly classified. However, there are also compounds that a single model predicts correctly but the consensus model predicts incorrectly. Overall, the benefits of the consensus models outweigh the drawbacks, and integrating the calculated target spectrum with chemical structure descriptors can indeed improve hepatotoxicity prediction performance.
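The evaluation metrics of step (4) and the complementarity argument above can be illustrated with a small Python sketch. It assumes that two models are combined by averaging their predicted probabilities, in line with the average-probability consensus used for the target models; this example does not state how the CADes and descriptor models are actually combined. The labels and probabilities below are made-up toy values, not results from the patent, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def evaluate(y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Compute ACC, SEN and SPE; assumes both classes occur in y_true."""
    tp = tn = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1  # hepatotoxic compound predicted as hepatotoxic
        elif label == 1:
            fn += 1  # hepatotoxic compound missed
        elif predicted == 0:
            tn += 1  # non-hepatotoxic compound predicted as non-hepatotoxic
        else:
            fp += 1  # non-hepatotoxic compound flagged as hepatotoxic
    return {
        "ACC": (tp + tn) / len(y_true),
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
    }


def consensus(*model_probs: Sequence[float]) -> List[float]:
    """Average the per-compound probabilities of several models (assumed rule)."""
    return [sum(values) / len(values) for values in zip(*model_probs)]


# Toy data: 1 = hepatotoxic, 0 = non-hepatotoxic.
y_true = [1, 1, 1, 0, 0, 0]
cades_prob = [0.9, 0.4, 0.8, 0.3, 0.6, 0.2]  # hypothetical CADes model output
maccs_prob = [0.7, 0.8, 0.3, 0.1, 0.2, 0.6]  # hypothetical MACCS model output

for name, probs in [("CADes", cades_prob), ("MACCS", maccs_prob),
                    ("MACCS + CADes", consensus(cades_prob, maccs_prob))]:
    print(name, evaluate(y_true, probs))
```

In this toy case each single model misclassifies different compounds, so the averaged probabilities classify all six correctly. Real data will also contain the opposite case noted above, where averaging turns a correct single-model prediction into an incorrect one.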
Example 5
Application of the calculated target spectrum of Example 1 to the prediction of drug-drug interactions. The construction process is as follows:
(1) 1125 drugs and 6743 drug-drug interaction pairs were collected according to the literature (Yao, Z.J.; Dong, J.; Che, Y.J.; Zhu, M.F.; Wen, M.; Wang, N.N.; Wang, S.; TargetNet: a web service for predicting potential drug-target interaction profiling via multi-target SAR models. J Comput Aided Mol Des. 2016, 30, 413-24). The 1125 drugs were input into the consensus model of Example 1, and the calculated target spectra were output to form the molecular characterization.
(2) The drug pairs were divided into positive and negative samples: the positive samples are all true drug-drug interaction pairs, and the negative samples are random non-interacting pairs among the 1125 drugs. The negative samples were drawn at random in the same number as the positive samples, to avoid the bias caused by imbalanced data.
(3) For each drug pair, the calculated target spectra of the two drugs were combined and used as the descriptor, and a model was constructed with the RF algorithm.
(4) The model was evaluated using three different levels of test sets:
Level 1: the training set contains DDI associations of both drugs.
Level 2: the training set contains DDI associations of only one drug.
Level 3: the training set contains DDI associations of neither drug.
The model was evaluated by leave-one-out cross-validation.
(5) The results are shown in Table 4.
Table 4: Results of applying the molecular characterization to drug-drug interaction prediction
Figure BDA0002376971310000141
From the results in Table 4, it can be seen that the level 1 model achieved the best prediction performance, with 91.8% ACC, 92.8% SEN and SPE, respectively. The ACC, SEN and SPE of the level 2 model were 71.6%, 74.5% and 68.6%, respectively. The ACC, SEN and SPE of the level 3 model were 68.9%, 62.9% and 75.1%, respectively. The accuracy ranking is level 1 > level 2 > level 3, because the information available at each level decreases in turn; this suggests that the training data should contain as much information as possible. The high overall accuracy of the model indicates that the calculated target spectrum can serve as an effective molecular characterization for assessing drug-drug interactions during clinical trials and drug development.
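The data preparation in steps (2) to (4) can be sketched in Python as follows. Several points are assumptions rather than details given in the text: the two target spectra are combined by simple concatenation, a pair is treated as unordered, and the function names, drug names and spectrum values are hypothetical placeholders.

```python
import random
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

Pair = Tuple[str, str]


def sample_negatives(drugs: Sequence[str], positives: Set[FrozenSet[str]], seed: int = 0) -> List[Pair]:
    """Randomly draw as many non-interacting pairs as there are positive pairs."""
    candidates = [pair for pair in combinations(sorted(drugs), 2)
                  if frozenset(pair) not in positives]
    return random.Random(seed).sample(candidates, k=len(positives))


def pair_descriptor(spectra: Dict[str, List[float]], pair: Pair) -> List[float]:
    """Concatenate the target spectra of the two drugs (assumed combination rule)."""
    first, second = pair
    return spectra[first] + spectra[second]


def evaluation_level(pair: Pair, training_pairs: Sequence[Pair]) -> int:
    """Level 1: both drugs occur in training pairs; level 2: one; level 3: none."""
    seen = {drug for training_pair in training_pairs for drug in training_pair}
    n_seen = sum(drug in seen for drug in pair)
    return {2: 1, 1: 2, 0: 3}[n_seen]


# Toy data with made-up two-target spectra.
spectra = {"A": [0.9, 0.1], "B": [0.2, 0.8], "C": [0.5, 0.5],
           "D": [0.7, 0.3], "E": [0.1, 0.6]}
positives = {frozenset(("A", "B")), frozenset(("C", "D"))}
negatives = sample_negatives(list(spectra), positives, seed=0)

print(negatives)
print(pair_descriptor(spectra, ("A", "B")))
print(evaluation_level(("A", "E"), [("A", "B"), ("C", "D")]))
```

The resulting descriptors and labels could then be passed to any random forest implementation. Sampling negatives in the same number as positives keeps the two classes balanced, as described in step (2).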
Example 6
Application of the calculated target spectrum of Example 1 to determining a drug mode-of-action network. The construction process is as follows:
(1) 909 approved drugs with chemical structures were collected from DrugBank and input into the consensus model of Example 1, and the calculated target spectra were output to obtain the molecular characterization.
(2) Pearson correlation coefficients between any two drugs were calculated, and the drug network was visualized according to the Anatomical Therapeutic Chemical (ATC) classification system. Each node corresponds to a drug, and two nodes are connected by an edge if the corresponding similarity coefficient is greater than a predefined threshold (0.97 for this network). The width of an edge is proportional to the similarity between the two drugs it connects.
(3) The results are shown in Figure 4. In the network, drugs with similar modes of action are linked or located in the same community. First, we examined whether the first-level ATC categories of two linked drugs were the same. It can be clearly seen that 327 of the 835 edges connect drugs with the same first-level ATC category, indicating that the first ATC level may reflect the drug mode of action to some extent. For example, flunisolide and fluocinonide have similar calculated target spectra, with a similarity coefficient of 0.987. We found that both drugs act by activating the glucocorticoid receptor. However, flunisolide is a corticosteroid commonly used for the treatment of allergic rhinitis, whereas fluocinonide is a topical glucocorticoid used for the treatment of eczema. In addition, medrysone and methylprednisolone also have very similar calculated target spectra, with a similarity coefficient of 0.993. Medrysone is used for the treatment of allergic conjunctivitis, vernal conjunctivitis, scleritis and epinephrine sensitivity, while methylprednisolone is used for short-term adjuvant treatment of rheumatoid arthritis. According to the literature, both act by inducing phospholipase A2 inhibitor proteins (collectively known as lipocortins). It is speculated that these proteins control the biosynthesis of potent mediators of inflammation, such as prostaglandins and leukotrienes, by inhibiting the release of their common precursor, arachidonic acid. These results demonstrate the accuracy of the calculated target spectrum in predicting the drug mode-of-action network. The network can be predicted from similarity measurements on the calculated target spectra alone, without knowing the drug modes of action in advance, which greatly promotes the development and application of mode-of-action analysis in drug research.
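The network construction in step (2) can be sketched in Python as follows. Only the 0.97 threshold comes from the text; the drug names, the four-target spectra and the function names are hypothetical, and the sketch assumes that no spectrum is constant (otherwise the correlation is undefined).

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def build_network(spectra: Dict[str, List[float]], threshold: float = 0.97) -> List[Tuple[str, str, float]]:
    """Return edges (drug_a, drug_b, similarity) for pairs above the threshold."""
    edges = []
    for (name_a, spec_a), (name_b, spec_b) in combinations(sorted(spectra.items()), 2):
        similarity = pearson(spec_a, spec_b)
        if similarity > threshold:
            # The similarity can later be mapped to the drawn edge width.
            edges.append((name_a, name_b, similarity))
    return edges


# Made-up calculated target spectra for three placeholder drugs.
spectra = {
    "drug_1": [0.90, 0.10, 0.80, 0.20],
    "drug_2": [0.85, 0.15, 0.75, 0.25],
    "drug_3": [0.10, 0.90, 0.20, 0.70],
}
for drug_a, drug_b, similarity in build_network(spectra):
    print(drug_a, drug_b, round(similarity, 3))
```

Storing the similarity with each edge allows a plotting tool to draw the edge width in proportion to it, as the step describes.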
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. A molecular characterization based on predicted protein affinity, which is constructed by the following method:
(1) data collection: collecting protein targets and activity data thereof;
(2) molecular characterization results: calculating the binding affinity value between each protein target and the compound by the computational methods of SAR modeling and docking.
2. The predicted protein affinity based molecular characterization according to claim 1, wherein the molecular characterization is constructed by the following method:
(1) data collection: collecting protein targets and activity data thereof;
(2) parameter determination: calculating a plurality of different descriptors for each protein target and selecting a plurality of different machine learning algorithms;
(3) model construction: combining the calculated descriptors and the machine learning algorithms pairwise to form a plurality of different single models;
(4) consensus model construction: calculating the average probability value of the single models and taking it as the binding strength value between the molecule and the protein target, to form a consensus model;
(5) molecular characterization results: inputting the molecule to be tested into the consensus model and outputting the calculated target spectrum, to form the molecular characterization.
3. The predicted protein affinity based molecular characterization of claim 2, wherein in step (1) the protein targets comprise enzymes, receptors, transporters and ion channels, and the activity data comprise IC50, EC50 and Ki values;
and/or in step (2), the descriptors are one or more of the MOE2D, CATS2, GpiDAPH3, TGT, TGD, MACCS, Atompair, Estate, FP series, ECFP series and ECFC series descriptors, and the machine learning algorithms are one or more of the RF, XGBoosting, DL, ADA, GBM, KNN and SVM algorithms.
4. The predicted protein affinity based molecular characterization according to claim 2, wherein the molecular characterization is constructed by the following method:
(1) data collection: collecting protein targets and activity data thereof;
(2) parameter determination: calculating three different descriptors for each protein target, namely MOE2D, CATS and MACCS, and selecting three different machine learning algorithms, namely RF, ADA and GBM;
(3) model construction: combining the three descriptors and the three machine learning algorithms pairwise to form nine different single models;
(4) consensus model construction: integrating the nine single models and calculating the average probability value as the binding strength value between the molecule and the protein target, to form a consensus model;
(5) molecular characterization: inputting the molecule to be tested into the consensus model and outputting the calculated target spectrum, to form the molecular characterization.
5. Use of the molecular characterization based on predicted protein affinity according to any one of claims 1 to 4 in scaffold hopping, by a method comprising:
S1-1, inputting the molecules of the active molecule database and of the screening database separately into the consensus model and outputting the calculated target spectra, to obtain the molecular characterizations of both databases; and calculating chemical-structure-based descriptors for comparison, including MOE2D, MACCS, CATS, ECFP4, Atompair, FP2, GpiDAPH3, TGT and TGD;
S1-2, selecting in turn the active molecules of the different targets in the active molecule database as query molecules, and combining the remaining active molecules with the molecules of the screening database to form the query database;
S1-3, calculating the similarity between the query molecule and each molecule in the query database using the Tanimoto coefficient;
S1-4, ranking the compounds by similarity value, and calculating the ratio of the number of active molecular scaffolds among the top 100, 200, 500, 1000, 2000 and 5000 compounds to the number of all active molecular scaffolds, to obtain the enrichment of active molecules whose scaffolds differ from that of the query molecule.
6. Use of the molecular characterization based on predicted protein affinity according to any one of claims 1 to 4 for predicting a protein target by a method comprising:
S2-1, inputting the molecule to be tested into the consensus model and outputting the calculated target spectrum, to obtain the molecular characterization;
S2-2, ranking the targets by the predicted binding affinity values in the molecular characterization, to predict the targets of the active molecule.
7. Use of the molecular characterization based on predicted protein affinity according to any one of claims 1 to 4 in hepatotoxicity prediction with QSAR models, by a method comprising:
S3-1, collecting human hepatotoxicity data and non-hepatotoxicity data from existing databases;
S3-2, dividing the data into a training set and a test set using different data splitting strategies;
S3-3, inputting the hepatotoxicity and non-hepatotoxicity data into the consensus model and outputting the calculated target spectrum, to obtain the CADes molecular characterization; and calculating the MOE2D, CATS and MACCS descriptors;
S3-4, constructing a model from the CADes molecular characterization with the XGBoosting algorithm, and constructing models from each of the chemical-structure-based MOE2D, CATS and MACCS descriptors with the XGBoosting algorithm;
S3-5, combining the model based on the calculated target spectrum with the models based on the chemical structure descriptors to construct a consensus model;
S3-6, inputting the molecule to be tested into the consensus model to obtain hepatotoxicity information for the molecule.
8. The use of claim 7, wherein the different data splitting strategies comprise: random split, Bemis-Murcko scaffold split and carbon skeleton split.
9. Use of the molecular characterization based on predicted protein affinity according to any one of claims 1 to 4 for predicting drug-drug interaction by a method comprising:
S4-1, inputting the drugs collected from the literature into the consensus model and outputting the calculated target spectra, to obtain the molecular characterization;
S4-2, dividing the drug pairs into positive samples and negative samples, wherein the positive samples are all true drug-drug interaction pairs and the negative samples are random non-interacting pairs among the drugs;
S4-3, combining the molecular characterizations of the two drugs of each drug-drug interaction pair in the positive samples as the molecular characterization of the drug pair, and constructing a model with the RF algorithm;
S4-4, inputting the molecules to be tested into the model constructed in S4-3, and predicting whether the drugs interact.
10. Use of the molecular characterization based on predicted protein affinity according to any one of claims 1 to 4 in determining a drug mode-of-action network, by a method comprising:
S5-1, inputting drugs with chemical structures obtained from the literature into the consensus model and outputting the calculated target spectra, to obtain the molecular characterization;
S5-2, calculating the Pearson correlation coefficient between any two of the drugs, and visualizing the drug network according to the Anatomical Therapeutic Chemical (ATC) classification system, wherein each node corresponds to one drug, and two nodes are connected by an edge if the corresponding similarity coefficient is greater than a predefined threshold; the width of an edge is proportional to the similarity between the two drugs it connects.
CN202010069615.5A 2020-01-21 2020-01-21 Molecular characterization based on predicted protein affinity and application thereof Pending CN111292800A (en)

Publications (1)

Publication Number Publication Date
CN111292800A true CN111292800A (en) 2020-06-16

Family

ID=71028411

Country Status (1)

Country Link
CN (1) CN111292800A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040117125A1 (en) * 1999-04-26 2004-06-17 Hao Chen Drug discovery method and apparatus
CN102930181A (en) * 2012-11-07 2013-02-13 四川大学 Protein-ligand affinity predicting method based on molecule descriptors
CN103150490A (en) * 2013-02-20 2013-06-12 浙江大学 Network pharmacology method used for finding active ingredients of traditional Chinese medicine and effect targets thereof

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GOVINDAN SUBRAMANIAN ET AL.: "Computational Modeling of β-Secretase 1 (BACE-1) Inhibitors Using Ligand Based Approaches" *
OLIVIER SPERANDIO ET AL.: "MED-SuMoLig: A New Ligand-Based Screening Tool for Efficient Scaffold Hopping" *
TAILONG LEI ET AL.: "ADMET evaluation in drug discovery: 15. Accurate prediction of rat oral acute toxicity using relevance vector machine and consensus modeling" *
ZHI-JIANG YAO ET AL.: "TargetNet: a web service for predicting potential drug–target interaction profiling via multi-target SAR models" *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053742A (en) * 2020-07-23 2020-12-08 中南大学湘雅医院 Method and device for screening molecular target protein, computer equipment and storage medium
CN112133447A (en) * 2020-08-14 2020-12-25 中南大学 Construction method of colloid screening model and colloid screening method
WO2022059021A1 (en) * 2020-09-18 2022-03-24 Peptris Technologies Private Limited System and method for predicting biological activity of chemical or biological molecules and evidence thereof
CN112466399A (en) * 2020-11-19 2021-03-09 大连理工大学 Method for predicting mutagenicity of chemicals through machine learning algorithm
CN112466399B (en) * 2020-11-19 2022-10-21 大连理工大学 Method for predicting mutagenicity of chemicals through machine learning algorithm
CN112466410A (en) * 2020-11-24 2021-03-09 江苏理工学院 Method and device for predicting protein and ligand molecule binding free energy
CN112466410B (en) * 2020-11-24 2024-02-20 江苏理工学院 Method and device for predicting binding free energy of protein and ligand molecule
CN112331261A (en) * 2021-01-05 2021-02-05 北京百度网讯科技有限公司 Drug prediction method, model training method, device, electronic device, and medium
CN113628698A (en) * 2021-06-04 2021-11-09 中山大学 Method for screening stomatitis clearing action target
CN113628698B (en) * 2021-06-04 2023-07-04 中山大学 Screening method for stomatitis clearing action target
CN113409884B (en) * 2021-06-30 2022-07-22 北京百度网讯科技有限公司 Training method of sequencing learning model, sequencing method, device, equipment and medium
CN113409884A (en) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 Training method of sequencing learning model, sequencing method, device, equipment and medium
CN114093436A (en) * 2021-11-22 2022-02-25 北京深势科技有限公司 Construction method and system of iterative binding affinity evaluation model
CN114702450A (en) * 2022-04-15 2022-07-05 大连理工大学 Compound acting on ABL1 tyrosine kinase and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination