CN110176279B - Lead compound virtual screening method and device based on small sample - Google Patents

Publication number
CN110176279B
Authority
CN
China
Prior art keywords
drug target
ligand
module
learning
virtual screening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910470488.7A
Other languages
Chinese (zh)
Other versions
CN110176279A (en)
Inventor
黄婉晴
吴建盛
胡海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910470488.7A
Publication of CN110176279A
Application granted
Publication of CN110176279B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50: Molecular design, e.g. of drugs
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a lead compound virtual screening method and device based on a small sample. The method comprises the following steps: constructing a virtual screening model by using ligand sample information of a drug target and a homologous drug target thereof and adopting sparse low-rank multi-task learning; learning common characteristics of the homologous drug target and ligand molecule actions by adopting a low-rank regularization term; learning the unique characteristics of the action of the new drug target and the ligand molecules by using a sparse regularization term; the constructed model is used for predicting the activity of the lead compound under a small sample and evaluating the performance of the lead compound. The invention also provides a device for realizing the method. The method and the device can realize the purposes of improving the screening efficiency and saving time and huge capital expenditure. Meanwhile, when the method is applied to the field of drug screening, the requirement of predicting the biological activity of the drug target ligand can be met.

Description

Lead compound virtual screening method and device based on small sample
Technical Field
The invention relates to computer-aided drug design, in particular to a lead compound virtual screening method and a device based on a small sample.
Background
A lead compound is a compound, obtained by various routes and means, that has a certain biological activity and chemical structure and serves as the basis for further structural modification and optimization; it is the starting point of modern new drug research. In new drug research, a biologically active lead compound obtained by compound activity screening is the basis of innovative drug research.
Traditional drug screening consumes a large amount of manpower and material resources and suffers from drawbacks such as long experimental cycles. With the rapid development of computers in the 21st century, virtual drug screening has been widely applied in drug development, and it plays a key role especially in the discovery of new drug targets and of lead compound structures for rare diseases.
Virtual screening, also called computer screening, simulates the interaction between a target and candidate drugs on a computer using molecular docking software before biological activity screening is carried out, and calculates the activity value between the target and each candidate, so as to reduce the number of compounds actually screened and improve the efficiency of lead compound discovery. Virtual screening falls into two categories: receptor-based virtual screening and ligand-based virtual screening.
Currently, drug development for known drug targets is approaching saturation. The development of new drugs for new drug targets or rare diseases has become a research hotspot in recent years, but the available sample information is insufficient, so a good ligand virtual screening model is difficult to obtain. Therefore, one technical problem that those skilled in the art urgently need to solve is how to provide an effective method for virtually screening lead compounds based on a small sample.
Disclosure of Invention
The purpose of the invention is as follows: in order to solve the problem of virtual screening of small molecule drugs of a new target under a small sample, one of the purposes of the invention is to provide an effective virtual screening method of a lead compound based on the small sample, and the other purpose of the invention is to provide a corresponding virtual screening device of the lead compound based on the small sample according to the method.
The technical scheme is as follows: the invention relates to a lead compound virtual screening method based on a small sample, which comprises the following steps:
(1) constructing a virtual screening model by using ligand sample information of a new drug target and a homologous drug target and adopting sparse low-rank multi-task learning;
(2) learning common characteristics of the homologous drug target and ligand molecule actions by adopting a low-rank regularization term;
(3) learning the unique characteristics of the action of the new drug target and the ligand molecules by using a sparse regularization term;
(4) the constructed model is used for predicting the activity of the lead compound under a small sample and evaluating the performance of the lead compound.
Preferably, in the step (1), the method for constructing the virtual screening model includes:
selecting homologous drug targets: there is no specific requirement on their number; for the sake of model performance, more than two homologous drug targets are generally selected where possible, and three are used in the specific embodiment of the invention. When selecting homologous drug targets, targets with a relatively large number of known binding ligand molecules are preferred. Homologous drug targets often share sequence homology and functional similarity, tend to interact with similar ligands, and often have similar interaction modes and mechanisms, so the abundant ligand sample information of the homologous drug targets can support the construction of a lead compound virtual screening model for a new drug target with insufficient sample information;
acquiring the required initial data set, which contains information on the new drug target and its homologous drug targets, including the SMILES strings of the ligand molecules and the activity values of their interaction with the targets;
the prediction of the activity values for one drug target is called a task, and the prediction for a plurality of homologous drug targets constitutes a plurality of tasks; from the initial data set, the SMILES strings of the n ligands bound by the drug targets in the t tasks are converted by a molecular fingerprint method into d-dimensional molecular features with entries 0 or 1, the n-th of which is denoted $x_n$; $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ is used as the feature matrix input for the plurality of tasks, where $x_1, x_2, \ldots, x_n$ are the molecular features of the 1st, 2nd, ..., n-th ligand, $\mathbb{R}^{n \times d}$ denotes the space of $n \times d$ matrices, and $T$ denotes transposition;
the activity values associated with the n-th ligand are denoted as the vector $y_n$, and $Y = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^{n \times t}$ is used as the activity value input for the plurality of tasks, where $y_1, y_2, \ldots, y_n$ are the activity values of the interaction of the 1st, 2nd, ..., n-th ligand, $\mathbb{R}^{n \times t}$ denotes the space of $n \times t$ matrices, and $T$ denotes transposition;
and constructing the loss function constraint using a multi-task learning method, and then solving for the weight matrix W to obtain the virtual screening model.
Preferably, learning the common characteristics of the interaction between the homologous drug targets and ligand molecules in step (2) specifically comprises: adding a low-rank regularization term to the loss function as a constraint to obtain the matrix P, which is a low-rank matrix and captures the correlation between tasks.
Preferably, learning the unique characteristics of the interaction between the new drug target and ligand molecules in step (3) specifically comprises: adding a sparse regularization term to the loss function as a constraint to obtain the matrix Q, which is a sparse matrix and selects the characteristics unique to each task.
In step (4), the constructed virtual screening model is used to predict activity values and to evaluate the performance of the model, given small-sample compounds in SMILES format whose biological activity values are known.
The invention also provides a virtual screening device of the lead compound based on the small sample, which comprises the following components:
the virtual screening module based on sparse low-rank multi-task learning is used for constructing a virtual screening model;
the common characteristic recognition module is used for learning the common characteristics of the interaction of the homologous drug target and the ligand molecule;
the unique characteristic recognition module of the new drug target and ligand molecule action is used for learning the unique characteristics of the new drug target and ligand molecule action;
and the lead compound activity prediction and performance evaluation module is used for predicting the activity of the lead compound based on the small molecule sample and evaluating the performance of the virtual screening model.
Wherein the virtual screening module based on sparse low rank multi-task learning comprises:
the homologous drug target selection module is used for assisting the selection of the drug target constructed by the model;
an initial module for obtaining from a database an initial data set comprising the SMILES strings of the ligand molecules and the activity values of their interaction;
the feature extraction module is used for processing the raw data to generate the feature matrix of the ligands;
the activity value generation module is used for collating the activity values of the interaction between the drug targets and the ligand molecules, and applying a -lg (negative base-10 logarithm) transform to the activity values so as to reduce their span;
and the multi-task learning module is used for constructing loss function constraints, further optimizing the loss function constraints to obtain a weight matrix, and learning the virtual screening model.
The common characteristic identification module is a low-rank regularization module and is used for learning the common characteristics to obtain a low-rank matrix representing the correlation between tasks.
The unique feature identification module is a sparse regularization module and is used for learning unique features and obtaining a sparse matrix representing uniqueness of each task.
The lead compound activity prediction and performance evaluation module comprises:
a prediction module for predicting an activity value of the effect;
and the evaluation module is used for obtaining an index for evaluating the performance of the model.
It will be understood by those skilled in the art that the homologous drug targets of the present invention are drug targets that have homology to the new drug target. Biologically, two or more structures are homologous if they have the same ancestry, i.e., they have evolved from a common ancestry. Homologous proteins are in particular evolutionarily related proteins, i.e.proteins of the same or similar function in different species or proteins with significant sequence homology. The homology of proteins is often determined by the similarity of their sequences, which refers to the ratio of the same amino acid residue sequence between the test sequence and the target sequence in the sequence alignment process; generally, when the degree of similarity is higher than 30%, it is usually presumed that the test sequence and the target sequence have functional homology.
Has the advantages that:
To address the lack of lead compound sample information for a new drug target, the invention uses the abundant sample information of homologous drug targets and, based on the characteristics of the interaction between homologous drug targets and compound molecules, constructs a virtual screening model by sparse low-rank multi-task learning. Because the interaction modes and mechanisms of homologous drug targets and compound molecules are often similar, a low-rank regularization term is used to learn the common characteristics of the interaction between homologous drug targets and ligand molecules; because the interaction between the new drug target and compound molecules also has characteristics of its own, a sparse regularization term is used to learn these unique characteristics. Both the common and the unique characteristics among homologous drug targets are thus taken into account during model construction, which effectively mitigates the problem that a virtual screening model is difficult to construct, or performs poorly, when the number of samples and the available information are insufficient. This allows better virtual screening and is of great help in researching lead compounds for new drug targets and rare diseases. Experiments show that in the specific example a good modeling result is still achieved with fewer than one hundred samples.
The method and the device can realize the purposes of improving the screening efficiency and saving time and huge capital expenditure. Meanwhile, when the method is applied to the field of drug screening, the requirement of predicting the biological activity of the drug target ligand can be met.
Drawings
FIG. 1 is a flow chart of a method for virtual screening of lead compounds based on small samples according to the present invention;
FIG. 2 is a flowchart of step 101 of FIG. 1;
FIG. 3 is a flow chart of step 104 in an embodiment of the method of the present invention;
FIG. 4 is a block diagram of an embodiment of the apparatus for virtual screening of lead compounds based on small samples according to the present invention;
FIG. 5 is a block diagram of a module 401 in an embodiment of the apparatus;
FIG. 6 is a block diagram illustrating the structure of the module 404 in an embodiment of the present invention.
Detailed Description
The present invention will be further explained with reference to the following embodiments.
Fig. 1 shows a flow of the method for virtual screening of lead compounds based on small samples in this embodiment, which may include the following steps:
Step 101: construct a virtual screening model using sparse low-rank multi-task learning.
Specifically, referring to fig. 2, which is a flowchart of step 101 in practical application, step 101 may include:
Step 201: select the homologous drug targets.
Table 1 lists the drug targets selected in this example: three G protein-coupled receptors, designated GPCR1, GPCR2, and GPCR3 (in this example, one of them is treated as the new drug target and the other two as its homologous drug targets), together with their ID numbers in the database.
TABLE 1
[Table 1 is provided as an image in the original publication (Figure GDA0003711561310000051).]
Step 202: the required initial data set is acquired.
The initial data set contains information on the new drug target and its homologous drug targets, including the SMILES strings of the ligand molecules and the activity values of their interaction. Taking Table 2 as an example, the initial data set includes:
canonical smiles: the SMILES string of the ligand molecule, used to generate the molecular features of the ligand;
standard value: the activity value of the interaction of each ligand;
standard units: the units of the activity value.
TABLE 2
[Table 2 is provided as an image in the original publication (Figure GDA0003711561310000052).]
The prediction of the activity value of each drug target is called a task, and the prediction of a plurality of homologous drug targets is called a plurality of tasks; for each task, an initial data set is obtained from the database.
Step 203: generate the ligand molecular features from the initial data set.
From the initial data set, the SMILES string of each ligand molecule is converted into features by a molecular fingerprint method; specifically, the SMILES strings of the n ligands bound by the drug targets in the t tasks are converted by the PubChem molecular fingerprint method into d-dimensional molecular features with entries 0 or 1, the n-th of which is denoted $x_n$.
Considering the tasks together, the input feature matrix is $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$.
Here $x_1, x_2, \ldots, x_n$ are the molecular features of the 1st, 2nd, ..., n-th ligand, $\mathbb{R}^{n \times d}$ denotes the space of $n \times d$ matrices, and $T$ denotes transposition.
Taking Table 3 as an example, only 14 feature dimensions are shown:
TABLE 3
0 0 0 0 0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 0 0 0 0 1 1 1 1 1
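As a minimal sketch of how fingerprint rows such as those in Table 3 could be stacked into the feature matrix X, the following uses only the four 14-bit rows displayed above. The function name `build_feature_matrix` is hypothetical, and computing the PubChem fingerprints themselves is assumed to happen elsewhere.

```python
# Rows copied from Table 3: each string holds one ligand's fingerprint bits,
# truncated to the 14 dimensions displayed in the table.
TABLE_3_ROWS = [
    "0 0 0 0 0 0 0 0 0 1 1 1 1 1",
    "0 0 0 0 0 0 0 0 0 1 1 1 1 1",
    "0 0 0 0 0 0 0 0 0 1 1 1 1 1",
    "0 0 0 0 0 0 0 0 0 1 1 1 1 1",
]


def build_feature_matrix(rows):
    """Parse whitespace-separated 0/1 strings into an n x d list of lists."""
    matrix = [[int(bit) for bit in row.split()] for row in rows]
    if not matrix:
        raise ValueError("at least one fingerprint row is required")
    d = len(matrix[0])
    for vec in matrix:
        if len(vec) != d:
            raise ValueError("all fingerprints must have the same length d")
        if any(bit not in (0, 1) for bit in vec):
            raise ValueError("fingerprint entries must be 0 or 1")
    return matrix


X = build_feature_matrix(TABLE_3_ROWS)
n, d = len(X), len(X[0])
print(n, d)  # 4 14
```

Each row of X is one ligand's feature vector, so n counts ligands and d counts fingerprint bits.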
Step 204: obtain the activity value of each ligand interaction from the initial data set and apply a -lg (negative base-10 logarithm) transform to it.
The activity values associated with the n-th ligand are denoted as the vector $y_n$, and the activity value input for the plurality of tasks is $Y = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^{n \times t}$.
Here $y_1, y_2, \ldots, y_n$ are the activity values of the interaction of the 1st, 2nd, ..., n-th ligand, $\mathbb{R}^{n \times t}$ denotes the space of $n \times t$ matrices, and $T$ denotes transposition.
Taking Table 4 as an example, only 8 values are shown:
TABLE 4
Value
1.523
0.959
-0.653
0.456
0.357
0.699
1.046
1.046
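The -lg transform of step 204 can be sketched as follows. The raw values are hypothetical and chosen only to show how the transform compresses a wide range; the function name `neg_lg` is likewise an assumption.

```python
import math


def neg_lg(value):
    """Return -log10(value); raw activity values must be strictly positive."""
    if value <= 0:
        raise ValueError("activity value must be positive")
    return -math.log10(value)


# Hypothetical raw activity values (illustrative only, not from the source).
raw_values = [0.001, 0.2, 1000.0]
transformed = [round(neg_lg(v), 3) for v in raw_values]
print(transformed)  # [3.0, 0.699, -3.0]
```

Raw values spanning six orders of magnitude map to the interval from -3 to 3, which is the reduction in span that the step describes.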
Step 205: construct the loss function constraint using a multi-task learning method.
In machine learning, an objective function generally consists of a loss function and a regularization term.
Objective function: $\min_{W} \; L(W) + \Omega(W)$;
In the objective function, the first term $L(W)$ is the loss term, which measures the error between the true value and the value predicted by the model for each sample; the second term $\Omega(W)$ is the regularization term, which comprises a low-rank regularization term and a sparse regularization term: the low-rank term is used to learn the common characteristics of the interaction between homologous drug targets and ligand molecules, and the sparse term is used to learn the unique characteristics of the interaction between the new drug target and ligand molecules.
The loss function typically has the form $L(w) = \sum_i L(y_i - f(x_i; w))$;
where $L(y_i - f(x_i; w))$ measures, for the i-th sample, the error between the predicted value $f(x_i; w)$, computed by the model from the feature vector $x_i$ and the weight vector $w$, and the true label $y_i$;
Because the model should fit the training samples as closely as possible, training drives this loss toward its minimum on the training data.
As a result, overfitting may occur.
To avoid overfitting, a sparse low-rank regularization term, also called the penalty term $\Omega(W)$, is added.
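The text does not state the exact form of the penalty. One common way to write a sparse low-rank multi-task objective, given here only as an assumed sketch, splits the weight matrix into a low-rank part P and a sparse part Q (the decomposition used in steps 102, 103 and 301) and penalizes them with the nuclear norm and the entrywise $\ell_1$ norm. The squared loss, the choice of norms, and the weights $\lambda_1, \lambda_2$ are assumptions, not details taken from the source.

```latex
\min_{P,\,Q \in \mathbb{R}^{d \times t}} \;
  \underbrace{\lVert X(P + Q) - Y \rVert_F^2}_{L(W)}
  \;+\;
  \underbrace{\lambda_1 \lVert P \rVert_* + \lambda_2 \lVert Q \rVert_1}_{\Omega(W)},
\qquad W = P + Q,
\quad X \in \mathbb{R}^{n \times d},\; Y \in \mathbb{R}^{n \times t}
```

Under these assumptions, a larger $\lambda_1$ pushes P toward lower rank (more structure shared across tasks), and a larger $\lambda_2$ leaves fewer nonzero task-specific entries in Q.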
Step 102: learn the common characteristics of the interaction between the homologous drug targets and ligand molecules.
A low-rank regularization term is added to the loss function as a constraint to obtain the matrix P, which is a low-rank matrix and captures the correlation between tasks.
Table 5 shows 8 dimensions (rows) of the P matrix:
TABLE 5
0.2654 0.0190 0.0089
-0.1584 -0.0428 -0.0029
0.0199 0 -0.0341
-0.1170 -0.0465 -0.0067
0.0158 0.0478 -0.1856
-0.0610 0.0636 -0.0191
0.0755 -0.0562 -0.2221
-0.0126 -0.1671 -0.3132
Step 103: learn the unique characteristics of the interaction between the new drug target and ligand molecules.
A sparse regularization term is added to the loss function as a constraint to obtain the matrix Q, which is a sparse matrix and selects the characteristics unique to each task.
Table 6 shows 8 dimensions (rows) of the Q matrix:
TABLE 6
[Table 6 is provided as images in the original publication (Figures GDA0003711561310000071 and GDA0003711561310000081).]
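The source does not say which optimizer produces the sparse matrix Q. As an illustration only, if the sparse term is assumed to be an $\ell_1$ penalty, its standard proximal operator is entrywise soft-thresholding, which sets small entries exactly to zero. The function names and the numbers below are hypothetical.

```python
def soft_threshold(value, lam):
    """Proximal operator of lam * |value|: shrink toward zero by lam."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def sparsify(matrix, lam):
    """Apply soft-thresholding to every entry of a list-of-lists matrix."""
    return [[soft_threshold(v, lam) for v in row] for row in matrix]


# Hypothetical dense intermediate values for Q (illustrative numbers only).
Q_dense = [
    [0.004, -0.020, 0.001],
    [-0.003, 0.002, -0.0196],
]
Q_sparse = sparsify(Q_dense, lam=0.01)
nonzero = sum(1 for row in Q_sparse for v in row if v != 0.0)
print(nonzero)  # 2
```

Only the two entries whose magnitude exceeds the threshold survive, which is how a sparse penalty singles out a few task-specific features.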
Step 104: use the constructed model to predict the activity of lead compounds under small-sample conditions and evaluate the performance of the model.
Specifically, referring to fig. 3, which is a flowchart of step 104 in practical application, step 104 may include:
Step 301: prediction.
As shown in Table 7, the weight matrix W is obtained from the matrices P and Q of steps 102 and 103 through the constraint $W = P + Q$.
Here P is the low-rank matrix and Q is the sparse matrix; their sum is the weight matrix W.
The rows shown correspond to those of Tables 5 and 6:
TABLE 7
0.2654 0.0190 0.0089
-0.1584 -0.0428 -0.0029
0.0199 0 -0.0341
-0.1170 -0.0465 -0.0067
0.0158 0.0478 -0.1952
-0.0610 0.0636 -0.0191
0.0755 -0.0562 -0.2221
-0.0126 -0.1671 -0.3132
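Since $W = P + Q$, the sparse part implied by the displayed rows can be recovered as $Q = W - P$. The sketch below does this with the values of Tables 5 and 7; it covers only the eight rows shown, and the function name `implied_sparse_part` is hypothetical.

```python
# Rows copied from Table 5 (P) and Table 7 (W).
P_ROWS = [
    [0.2654, 0.0190, 0.0089],
    [-0.1584, -0.0428, -0.0029],
    [0.0199, 0.0, -0.0341],
    [-0.1170, -0.0465, -0.0067],
    [0.0158, 0.0478, -0.1856],
    [-0.0610, 0.0636, -0.0191],
    [0.0755, -0.0562, -0.2221],
    [-0.0126, -0.1671, -0.3132],
]
W_ROWS = [
    [0.2654, 0.0190, 0.0089],
    [-0.1584, -0.0428, -0.0029],
    [0.0199, 0.0, -0.0341],
    [-0.1170, -0.0465, -0.0067],
    [0.0158, 0.0478, -0.1952],
    [-0.0610, 0.0636, -0.0191],
    [0.0755, -0.0562, -0.2221],
    [-0.0126, -0.1671, -0.3132],
]


def implied_sparse_part(W, P, digits=4):
    """Return Q = W - P entrywise, rounded to the precision of the tables."""
    return [
        [round(w - p, digits) for w, p in zip(w_row, p_row)]
        for w_row, p_row in zip(W, P)
    ]


Q_ROWS = implied_sparse_part(W_ROWS, P_ROWS)
nonzero = [
    (i, j, q)
    for i, row in enumerate(Q_ROWS)
    for j, q in enumerate(row)
    if q != 0.0
]
print(nonzero)  # [(4, 2, -0.0096)]
```

Within the displayed rows, W and P differ in a single entry (fifth row, third column), which is consistent with Q being sparse.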
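Preferably, the predicted values of the samples are obtained from the linear relationship $Y = WX$; with the dimensions defined above ($X \in \mathbb{R}^{n \times d}$, $Y \in \mathbb{R}^{n \times t}$), this corresponds to the matrix product $XW$ with $W \in \mathbb{R}^{d \times t}$.
Here Y is the matrix formed by the activity value vectors, W is the weight matrix, and X is the feature matrix.
Table 8 shows the predicted values obtained for the samples corresponding to Table 4:
TABLE 8
[Table 8 is provided as an image in the original publication (Figure GDA0003711561310000091).]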
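The linear prediction step can be sketched as a plain matrix product. The toy matrices below are hypothetical and far smaller than a real fingerprint matrix; the orientation (X as n x d, W as d x t) is an assumption made so that the dimensions match those given for X and Y.

```python
def predict(X, W):
    """Return the n x t matrix XW for X (n x d) and W (d x t)."""
    d = len(W)
    if any(len(x) != d for x in X):
        raise ValueError("each feature vector must have length d = len(W)")
    t = len(W[0])
    return [
        [sum(x[k] * W[k][j] for k in range(d)) for j in range(t)]
        for x in X
    ]


# Hypothetical toy data: 2 ligands, 3 fingerprint bits, 2 tasks.
X_toy = [[1, 0, 1],
         [0, 1, 1]]
W_toy = [[0.5, -0.25],
         [0.25, 0.0],
         [-0.5, 1.0]]
Y_pred = predict(X_toy, W_toy)
print(Y_pred)  # [[0.0, 0.75], [-0.25, 1.0]]
```

Row i of the result holds the predicted activity values of ligand i, one per task.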
The evaluation index is the correlation coefficient $r^2$, the metric used in the drug activity prediction challenge organized by Merck on Kaggle in 2012:
[The formula for $r^2$ is provided as an image in the original publication (Figure GDA0003711561310000092).]
where $y_i$ is the true activity value, $\bar{y}$ is the mean of the true activity values, $\hat{y}_i$ is the predicted activity value, $\bar{\hat{y}}$ is the mean of the predicted activity values, and n is the number of samples. The larger the value of $r^2$, the better the model.
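The formula itself is not reproduced in the text. A minimal sketch follows, assuming $r^2$ is the squared Pearson correlation between true and predicted activity values, which is consistent with the quantities defined above (the two means and the sample count n). The true values are those of Table 4; the predictions are hypothetical.

```python
def squared_pearson(y_true, y_pred):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mean_t) ** 2 for a in y_true)
    var_p = sum((b - mean_p) ** 2 for b in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("r^2 is undefined for constant inputs")
    return cov * cov / (var_t * var_p)


# True activity values from Table 4; the predictions are hypothetical.
y_true = [1.523, 0.959, -0.653, 0.456, 0.357, 0.699, 1.046, 1.046]
y_pred = [2.0 * v + 1.0 for v in y_true]  # perfectly linear in y_true
r2 = squared_pearson(y_true, y_pred)
print(round(r2, 6))  # 1.0
```

Because the metric is a squared correlation, it rewards predictions that rank and scale linearly with the true values, even when they are offset from them.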
To eliminate the influence of the random selection of control samples on the results, 5 groups of control samples are randomly selected for each drug target data set, a model for predicting the biological activity of ligand interaction is constructed for each group, and the mean ± variance over the groups is reported as the final result.
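Aggregating the five runs can be sketched as below. The scores are hypothetical, and because the text does not say whether the population or the sample variance is meant, the population variance is assumed here.

```python
import statistics

# Hypothetical r^2 scores from 5 randomly selected control groups.
scores = [0.50, 0.54, 0.46, 0.52, 0.48]

mean_score = statistics.mean(scores)
# Assumption: population variance; use statistics.variance for the sample form.
var_score = statistics.pvariance(scores)
print(f"{mean_score:.3f} ± {var_score:.4f}")  # 0.500 ± 0.0008
```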
To verify the effectiveness of the method, 7 algorithms are selected for comparison: SVR, GBDT, RF, LASSO, RR, MTLa and MTGLA.
SVR is support vector regression;
GBDT is the gradient boosted decision tree; the decision trees (DT) in GBDT are regression trees;
RF is random forest;
LASSO is lasso regression, whose regularization term is the $L_1$ norm;
RR is ridge regression, whose regularization term is the $L_2$ norm;
MTLa is multi-task lasso regression;
MTGLA is multi-task group lasso, whose regularization term is the $L_{2,1}$ norm;
The results are shown in Table 9, in which the 8th algorithm is the one adopted in the invention.
As Table 9 shows, the adopted algorithm performs better than the other 7 algorithms; GPCR1, GPCR2, and GPCR3 denote the 3 drug targets selected in this example.
TABLE 9
[Table 9 is provided as an image in the original publication (Figure GDA0003711561310000101).]
Corresponding to the method provided by the above-mentioned method embodiment of virtual screening of lead compounds based on small samples, referring to fig. 4, the present application also provides an embodiment of an apparatus for virtual screening of lead compounds based on small samples, and in this embodiment, the apparatus may include:
the virtual screening module 401 based on sparse low-rank multi-task learning is used for constructing a virtual screening model;
referring to fig. 5, fig. 5 is a schematic structural diagram of the virtual screening module 401 based on sparse low rank multi-task learning, which specifically includes:
a homologous drug target selection module 501, configured to assist in selecting a drug target for model construction;
an initial module 502 for obtaining from a database an initial data set containing SMILES strings and activity value information;
the feature extraction module 503 is configured to process the raw data to generate the feature matrix of the ligands;
an activity value generation module 504, configured to collate the activity values of the interaction between the drug targets and the ligand molecules, and to apply a -lg (negative base-10 logarithm) transform to the activity values to reduce their span;
and the multi-task learning module 505 is used for constructing loss function constraints, further optimizing the loss function constraints to obtain a weight matrix, and learning a virtual screening model.
A common characteristic recognition module 402 for the interaction of the homologous drug target and the ligand molecule, for learning the common characteristic of the interaction of the homologous drug target and the ligand molecule;
and adding low-rank regularization for learning common features to obtain a low-rank matrix representing the correlation between tasks.
A unique feature recognition module 403 for the action of the new drug target and the ligand molecule, which is used for learning the unique features of the action of the new drug target and the ligand molecule;
and adding sparse regularization for learning unique features to obtain a sparse matrix representing the uniqueness of each task.
And a lead compound activity prediction module 404, configured to predict the activity of the lead compound based on the small molecule sample and evaluate the model performance.
Referring to fig. 6, fig. 6 is a schematic structural diagram of the lead compound activity prediction module 404, which specifically includes:
a prediction module 601 for predicting an activity value of the effect;
and the evaluation module 602 is configured to obtain an index for evaluating the performance of the model.
The method and the device for virtually screening the lead compound based on the small sample provided by the application are introduced in detail, and the principle and the implementation mode of the application are explained by applying specific examples, and the description of the above examples is only used for helping to understand the method and the core idea of the application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (7)

1. A lead compound virtual screening method based on a small sample is characterized by comprising the following steps:
(1) constructing a virtual screening model by using ligand sample information of a new drug target and a homologous drug target and adopting sparse low-rank multi-task learning;
(2) learning common characteristics of the homologous drug target and ligand molecule actions by adopting a low-rank regularization term;
(3) learning the unique characteristics of the action of the new drug target and the ligand molecules by using a sparse regularization term;
(4) predicting the activity of the lead compound under a small sample by using the constructed model and evaluating the performance of the lead compound;
the step (1) comprises the following steps:
selecting a homologous drug target;
acquiring the required initial data set, which contains information on the new drug target and its homologous drug targets, including the SMILES strings of the ligand molecules and the activity values of their interaction with the targets;
the prediction of the activity values for one drug target is called a task, and the prediction for a plurality of homologous drug targets constitutes a plurality of tasks; from the initial data set, the SMILES strings of the n ligands bound by the drug targets in the t tasks are converted by a molecular fingerprint method into d-dimensional molecular features with entries 0 or 1, the n-th of which is denoted $x_n$; $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ is used as the feature matrix input for the plurality of tasks, where $x_1, x_2, \ldots, x_n$ are the molecular features of the 1st, 2nd, ..., n-th ligand, $\mathbb{R}^{n \times d}$ denotes the space of $n \times d$ matrices, and $T$ denotes transposition;
the activity values associated with the n-th ligand are denoted as the vector $y_n$, and $Y = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^{n \times t}$ is used as the activity value input for the plurality of tasks, where $y_1, y_2, \ldots, y_n$ are the activity values of the interaction of the 1st, 2nd, ..., n-th ligand, $\mathbb{R}^{n \times t}$ denotes the space of $n \times t$ matrices, and $T$ denotes transposition; and constructing the loss function constraint using a multi-task learning method, and then solving for the weight matrix W to obtain the virtual screening model.
2. The method for virtually screening lead compounds based on a small sample according to claim 1, wherein the step (2) comprises: and adding a low-rank regularization term into the loss function to constrain to obtain a P matrix, wherein the P matrix is a low-rank matrix.
3. The method for virtually screening lead compounds based on a small sample according to claim 1, wherein the step (3) comprises: adding a sparse regularization term to the loss function as a constraint to obtain the matrix Q, which is a sparse matrix and selects the characteristics unique to each task.
4. A small-sample-based lead compound virtual screening device, comprising:
a virtual screening module based on sparse low-rank multi-task learning, used for constructing a virtual screening model;
a common feature identification module, used for learning the common features of the interaction between homologous drug targets and ligand molecules;
a unique feature identification module, used for learning the unique features of the interaction between the new drug target and ligand molecules;
a lead compound activity prediction and performance evaluation module, used for predicting the activity of the lead compound based on a small molecule sample and evaluating the performance of the virtual screening model;
wherein the virtual screening module based on sparse low-rank multi-task learning comprises:
a homologous drug target selection module, used for selecting the drug targets that assist model construction;
an initial module, used for obtaining from a database an initial data set comprising the SMILES of the ligand molecules and the activity values of their interaction with the drug targets;
a feature extraction module, used for processing the raw data to generate the feature matrix of the ligands;
an activity value generation module, used for collating the activity values of the interaction between the drug targets and the ligand molecules, and applying a -lg (negative base-10 logarithm) transform to the activity values to reduce their span;
and a multi-task learning module, used for constructing loss function constraints and optimizing them to obtain a weight matrix, thereby learning the virtual screening model.
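The -lg step of the activity value generation module can be illustrated as follows. This is a minimal sketch that assumes the raw activity values are concentrations reported in nanomolar, which the claim does not specify; the function name `neg_log10_activity` and the sample values are hypothetical.

```python
import math
from typing import List


def neg_log10_activity(value_nm: float) -> float:
    """Apply the -lg transform to one activity value.

    Assumption: value_nm is a concentration in nanomolar, so it is first
    converted to mol/L before taking the negative base-10 logarithm.
    """
    if value_nm <= 0:
        raise ValueError("activity value must be positive")
    return -math.log10(value_nm * 1e-9)


# Placeholder raw values spanning four orders of magnitude (in nM).
raw_values: List[float] = [1.0, 100.0, 10000.0]
transformed = [neg_log10_activity(v) for v in raw_values]
```

With these placeholder inputs the raw values range from 1 to 10000, while the transformed values fall roughly between 5 and 9, which is the reduction in span that the claim refers to.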
5. The small-sample-based lead compound virtual screening device according to claim 4, wherein the common feature identification module is a low-rank regularization module, used for learning the common features to obtain a low-rank matrix representing the correlations between tasks.
6. The small-sample-based lead compound virtual screening device according to claim 4, wherein the unique feature identification module is a sparse regularization module, used for learning the unique features to obtain a sparse matrix representing the uniqueness of each task.
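One standard way to obtain a sparse matrix from a sparse regularization term is elementwise soft-thresholding, which is the proximal operator of the $\ell_1$ norm. The sketch below assumes an $\ell_1$ penalty and a proximal-style solver, neither of which is stated in the claims; `soft_threshold` and `tau` are hypothetical names.

```python
from typing import List


def soft_threshold(Q: List[List[float]], tau: float) -> List[List[float]]:
    """Shrink every entry of Q toward zero by tau.

    Entries whose magnitude is at most tau become exactly 0.0, which is
    what makes the resulting matrix sparse.
    """
    result = []
    for row in Q:
        new_row = []
        for q in row:
            if abs(q) > tau:
                sign = 1.0 if q > 0 else -1.0
                new_row.append(sign * (abs(q) - tau))
            else:
                new_row.append(0.0)
        result.append(new_row)
    return result


# Placeholder task-specific weights: rows are features, columns are tasks.
Q_dense = [[0.5, -0.2], [1.5, -2.0]]
Q_sparse = soft_threshold(Q_dense, tau=0.5)
```

The entries that survive thresholding mark the features that remain specific to an individual task.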
7. The small-sample-based lead compound virtual screening device according to claim 4, wherein the lead compound activity prediction and performance evaluation module comprises:
a prediction module, used for predicting the activity value of the interaction;
and an evaluation module, used for obtaining indices for evaluating the performance of the model.
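The prediction and evaluation modules can be sketched as below. The claim does not name the evaluation indices, so root-mean-square error and the Pearson correlation coefficient are assumed here as common choices for activity regression; the function names and the tiny inputs are hypothetical.

```python
import math
from typing import List, Sequence


def predict(X: Sequence[Sequence[float]],
            W: Sequence[Sequence[float]]) -> List[List[float]]:
    """Predict activity values as the matrix product X W (n x d times d x t)."""
    columns = list(zip(*W))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in X]


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error between measured and predicted values."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Placeholder inputs: 3 ligands, 2 binary features, 1 task.
X_new = [[1, 0], [0, 1], [1, 1]]
W_learned = [[2.0], [3.0]]
Y_pred = predict(X_new, W_learned)
```

Each column of `Y_pred` holds the predictions for one task, so the two metrics would be computed per task against the measured activity values.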
CN201910470488.7A 2019-05-31 2019-05-31 Lead compound virtual screening method and device based on small sample Active CN110176279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910470488.7A CN110176279B (en) 2019-05-31 2019-05-31 Lead compound virtual screening method and device based on small sample

Publications (2)

Publication Number Publication Date
CN110176279A CN110176279A (en) 2019-08-27
CN110176279B CN110176279B (en) 2022-08-26

Family

ID=67696133

Country Status (1)

Country Link
CN (1) CN110176279B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112086143B (en) * 2020-08-24 2022-07-29 Nanjing University of Posts and Telecommunications Small molecule drug virtual screening method and device based on unsupervised domain adaptation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222178A (en) * 2011-03-31 2011-10-19 Graduate School at Shenzhen, Tsinghua University Method for screening and/or designing drugs against multiple targets
CN107862173A (en) * 2017-11-15 2018-03-30 Nanjing University of Posts and Telecommunications Lead compound virtual screening method and device
CN108399316A (en) * 2018-03-02 2018-08-14 Nanjing University of Posts and Telecommunications Ligand molecule feature screening device and screening method in drug design
CN108536999A (en) * 2018-03-21 2018-09-14 Nanjing University of Posts and Telecommunications Method and device for screening key substructures of small-molecule ligands

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant