CN113299339B - Deep learning-based drug efficacy prediction method, device, equipment and storage medium - Google Patents

Info

Publication number
CN113299339B
Authority
CN
China
Prior art keywords
drug
protein
text expression
subsequence
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110592915.6A
Other languages
Chinese (zh)
Other versions
CN113299339A (en
Inventor
王俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110592915.6A priority Critical patent/CN113299339B/en
Publication of CN113299339A publication Critical patent/CN113299339A/en
Application granted granted Critical
Publication of CN113299339B publication Critical patent/CN113299339B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Data Mining & Analysis (AREA)
  • Analytical Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medicinal Chemistry (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Molecular Biology (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention provides a drug efficacy prediction method, device, equipment and storage medium based on deep learning, wherein the method comprises the following steps: acquiring a first protein sequence corresponding to the medicine; dividing the first protein sequence to obtain a plurality of first subsequences; analyzing each first subsequence to obtain each first text expression; calculating the matching degree of each first text expression and each second text expression; determining an action target point of the drug matched with the targeting protein based on the matching degree; predicting the drug efficacy of the drug on the targeting protein based on each of the action targets. The invention has the beneficial effects that: the method realizes the rapid automatic detection of the action target point of the medicine, predicts the curative effect of the medicine and saves experimental resources.

Description

Deep learning-based drug efficacy prediction method, device, equipment and storage medium
Technical Field
The invention relates to the field of digital medical treatment, in particular to a drug curative effect prediction method, device and equipment based on deep learning and a storage medium.
Background
Drug discovery is the process of identifying new candidate compounds with potential therapeutic effects, and predicting drug-target interactions (DTI) between drug molecules and targeting proteins is an essential step in that process. The efficacy of a drug molecule depends on its affinity for the target protein or receptor. A drug molecule with no interaction with or affinity for the target protein will not produce a therapeutic response. At present, drug-target interactions can only be determined experimentally by hand, which is time-consuming and resource-intensive.
Disclosure of Invention
The main object of the invention is to provide a deep learning-based drug efficacy prediction method, device, equipment and storage medium, so as to solve the problem that drug-target interactions (DTI) can only be measured experimentally by hand, which wastes time and resources.
The invention provides a drug efficacy prediction method based on deep learning, which is applied to target proteins and comprises the following steps:
Acquiring a first protein sequence corresponding to the medicine;
dividing the first protein sequence to obtain a plurality of first subsequences corresponding to the first protein sequence, wherein the number of amino acid molecules of each first subsequence is the same;
Analyzing each first subsequence to obtain a first text expression corresponding to each first subsequence;
inputting each first text expression and each second text expression corresponding to the targeting protein into a word2vec model trained in advance to obtain the matching degree of each first text expression and each second text expression; wherein the second text expressions are obtained by dividing a second protein sequence of the targeting protein to obtain a plurality of second subsequences corresponding to the second protein sequence, and analyzing each second subsequence to obtain a second text expression corresponding to each second subsequence; the number of amino acid molecules of each second subsequence is the same, and the second subsequence corresponding to each second text expression is a target point;
determining an action target point of the drug matched with the targeting protein based on the matching degree;
predicting the drug efficacy of the drug on the targeting protein based on each of the action targets.
The invention also provides a drug efficacy prediction device based on deep learning, which is applied to target proteins and comprises the following components:
The acquisition module is used for acquiring a first protein sequence corresponding to the medicine;
the segmentation module is used for segmenting the first protein sequence to obtain a plurality of first subsequences corresponding to the first protein sequence, wherein the number of amino acid molecules of each first subsequence is the same;
the analysis module is used for analyzing each first subsequence to obtain a first text expression corresponding to each first subsequence;
The input module is used for inputting each first text expression and each second text expression corresponding to the targeting protein into a word2vec model trained in advance to obtain the matching degree of each first text expression and each second text expression; wherein the second text expressions are obtained by dividing a second protein sequence of the targeting protein to obtain a plurality of second subsequences corresponding to the second protein sequence, and analyzing each second subsequence to obtain a second text expression corresponding to each second subsequence; the number of amino acid molecules of each second subsequence is the same, and the second subsequence corresponding to each second text expression is a target point;
The determining module is used for determining an action target point of the drug matched with the targeting protein based on the matching degree;
and the prediction module is used for predicting the drug curative effect of the drug on the targeting protein based on each action target point.
The invention also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of any of the preceding claims.
The invention has the beneficial effects that: the amino acid composition structure of the drug is obtained and split into a plurality of first subsequences, the first subsequences are converted into corresponding first text expressions for matching calculation, and then the drug curative effect of the drug on the target protein is predicted according to the matching condition. Therefore, the method realizes rapid automatic detection of the action target point of the medicine, predicts the curative effect of the medicine and saves experimental resources.
Drawings
FIG. 1 is a flow chart of a method for predicting therapeutic effects of a drug based on deep learning according to an embodiment of the invention;
FIG. 2 is a schematic block diagram of a device for predicting therapeutic effects of a drug based on deep learning according to an embodiment of the present invention;
Fig. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, in the embodiments of the present invention, all directional indicators (such as up, down, left, right, front, and back) are merely used to explain the relative positional relationship, movement conditions, and the like between the components in a specific posture (as shown in the drawings), if the specific posture is changed, the directional indicators correspondingly change, and the connection may be a direct connection or an indirect connection.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may represent: A exists alone, A and B exist together, or B exists alone.
Furthermore, terms such as "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or the number of the technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when the technical solutions are contradictory or cannot be realized, the combination should be considered not to exist and falls outside the scope of protection claimed by the present invention.
Referring to FIG. 1, the invention provides a drug efficacy prediction method based on deep learning, which is applied to a target protein and comprises the following steps:
S1: acquiring a first protein sequence corresponding to the medicine;
S2: dividing the first protein sequence to obtain a plurality of first subsequences corresponding to the first protein sequence, wherein the number of amino acid molecules of each first subsequence is the same;
S3: analyzing each first subsequence to obtain a first text expression corresponding to each first subsequence;
S4: inputting each first text expression and each second text expression corresponding to the targeting protein into a word2vec model trained in advance to obtain the matching degree of each first text expression and each second text expression; wherein the second text expressions are obtained by dividing a second protein sequence of the targeting protein to obtain a plurality of second subsequences corresponding to the second protein sequence, and analyzing each second subsequence to obtain a second text expression corresponding to each second subsequence; the number of amino acid molecules of each second subsequence is the same, and the second subsequence corresponding to each second text expression is a target point;
S5: determining an action target point of the drug matched with the targeting protein based on the matching degree;
S6: predicting the drug efficacy of the drug on the targeting protein based on each of the action targets.
As described in step S1 above, the first protein sequence corresponding to the drug is acquired. The first protein sequence is the amino acid composition structure of the drug; it may be received as input or transmitted from another device. That is, when researchers characterize a drug molecule they analyze its amino acid composition structure, so this structure is available. The first protein sequence includes at least the amino acid composition structure, i.e., the ordering of the amino acids. It may also include the spatial structure of the molecule, i.e., the spatial arrangement of the amino acids, which makes it possible to check later whether that spatial structure can bind to the target protein.
As described in step S2 above, the first protein sequence is divided to obtain a plurality of first subsequences corresponding to the first protein sequence. The division may be made by number of amino acids, generally with 3 amino acids per group, yielding a plurality of first subsequences. Note that if the last group contains fewer than 3 amino acids, 0 may be added as a placeholder amino acid so that every first subsequence contains 3 amino acids. Alternatively, groups of 3 amino acid molecules may be taken with the starting position advanced by 1 each time; for example, if ABCDE denotes 5 amino acid molecules arranged in order, the first subsequences obtained are ABC, BCD and CDE. This approach avoids the problem of an incomplete final group.
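The two division schemes described for step S2 can be sketched as follows. This is only an illustration: the function names `split_fixed` and `split_sliding` are hypothetical, and amino acids are assumed to be encoded as single characters of a string.

```python
def split_fixed(sequence, size=3, pad="0"):
    """Split into non-overlapping groups of `size`, padding the last group with `pad`."""
    groups = [sequence[i:i + size] for i in range(0, len(sequence), size)]
    if groups and len(groups[-1]) < size:
        # Fill the incomplete final group with the placeholder character.
        groups[-1] = groups[-1].ljust(size, pad)
    return groups


def split_sliding(sequence, size=3):
    """Split into overlapping groups of `size`, advancing one position at a time."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


print(split_fixed("ABCDE"))    # ['ABC', 'DE0']
print(split_sliding("ABCDE"))  # ['ABC', 'BCD', 'CDE']
```

The sliding variant never needs padding, at the cost of producing more (overlapping) subsequences.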
As described in step S3 above, each first subsequence is analyzed to obtain the first text expression corresponding to each first subsequence. The analysis can be performed by a model trained in advance on first subsequences and their corresponding words, so that a text expression is obtained for each first subsequence. The N-gram model is a language model (LM), a probability-based discriminative model: after any first subsequence is input, the first text expression with its corresponding probabilities is obtained. For example, suppose that for A( )C there are three possible middle residues, giving ABC, AQC and AXC, and that the N-gram model learns probabilities of 80%, 10% and 10% for them respectively. Then the first text expression of ABC is ABC:AQC:AXC = 8:1:1, and the first text expression corresponding to AQC is AQC:ABC:AXC = 1:8:1, so that a unique text expression is formed for each first subsequence.
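The A( )C example can be reproduced with a small count-based sketch. It assumes that the probabilities are estimated as relative frequencies of trigrams sharing the same first and last residue; the patent does not state how its N-gram model is trained, and the names `middle_distribution` and `text_expression` are hypothetical.

```python
from collections import Counter


def middle_distribution(corpus_trigrams, first, last):
    """Relative frequency of each trigram that starts with `first` and ends with `last`."""
    counts = Counter(t for t in corpus_trigrams if t[0] == first and t[2] == last)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def text_expression(subsequence, corpus_trigrams):
    """List the subsequence first, then its alternatives, each with its probability."""
    dist = middle_distribution(corpus_trigrams, subsequence[0], subsequence[2])
    others = sorted(t for t in dist if t != subsequence)
    return [(subsequence, dist.get(subsequence, 0.0))] + [(t, dist[t]) for t in others]


# Toy corpus (an assumption) chosen to give the 80% / 10% / 10% split from the text.
corpus = ["ABC"] * 8 + ["AQC", "AXC"]
print(text_expression("ABC", corpus))
print(text_expression("AQC", corpus))
```

Placing the queried subsequence first is what makes the expression for ABC differ from the one for AQC even though both are built from the same distribution.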
As described in steps S4-S5 above, each first text expression and each second text expression corresponding to the target protein are input into a pre-trained word2vec model to obtain the matching degree of each first text expression and each second text expression. The word2vec model is an unsupervised model and includes two pre-training methods, called Skip-Gram and continuous Bag-of-Words (CBOW). Skip-Gram is used to predict the context from a given word, and CBOW is used to predict a word from its context. Combining Skip-Gram and CBOW, word2vec can ultimately map words to low-dimensional real-valued vectors. With this mechanism, the action targets at which a first text expression matches a second text expression can be obtained. Here, matching of a first text expression and a second text expression means word-to-word matching, i.e., whether the first subsequence corresponding to the first text expression can bind to the second subsequence corresponding to the second text expression. This shows whether each first subsequence can bind to the target protein, which yields the action targets, and the drug efficacy of the drug can then be judged based on those action targets. The second text expressions are obtained in the same way as the first text expressions, and the second protein sequence corresponding to the target protein is divided in the same way as the first protein sequence, so neither is described again here. UniProt is a database that centrally records protein resources and is cross-linked with other resources; it also has the most extensive protein sequence catalogue and the most comprehensive functional annotation recorded to date. The data in UniProt may be used as training data for training the word2vec model.
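A minimal sketch of how steps S4-S5 might turn embedding vectors into action targets is given below. The patent does not specify how the matching degree is computed from the word2vec output, so cosine similarity, the threshold of 0.9, the function names and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_action_targets(drug_vectors, target_vectors, threshold=0.9):
    """Return (target, best matching degree) for targets matched by any drug subsequence."""
    matches = []
    for target, tv in target_vectors.items():
        best = max((cosine(dv, tv) for dv in drug_vectors.values()), default=0.0)
        if best >= threshold:
            matches.append((target, best))
    return matches


# Hypothetical embeddings for drug subsequences and target-protein subsequences.
drug = {"ABC": [1.0, 0.0], "BCD": [0.0, 1.0]}
targets = {"XYZ": [2.0, 0.0], "QRS": [1.0, 1.0]}
print(find_action_targets(drug, targets))
```

In this toy data only XYZ clears the threshold, so it is the only second subsequence reported as an action target.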
As described in step S6 above, the drug efficacy of the drug on the target protein is predicted based on each action target. The prediction may be based on the matching results, for example on the predicted number of binding sites between the drug and the target protein. Alternatively, it may be determined whether the targets (second subsequences of the target protein) that carry the pathogenic factor of the target protein are bound, and such targets are given a higher weight when the drug efficacy is calculated, as described in detail below.
In one implementation, the step S6 of predicting the therapeutic effect of the drug on the target protein based on each of the action targets includes:
S601: calculating the action score of each action target point according to the preset weight of each target point of the targeting protein, and summing the obtained action scores to obtain the corresponding curative effect score of the medicine;
S602: acquiring the drug efficacy corresponding to the efficacy score according to a preset correspondence between drug efficacy and efficacy score.
As described in step S601 above, the action target of each first text expression is obtained by analyzing the matching results. An action target is a binding site of the drug on the targeting protein; if the action target is a site of the targeting protein that is mainly responsible for pathogenicity, the drug can be considered to have a certain efficacy and that target should be given a higher weight. The action score of each action target is calculated according to the preset weight of each target of the targeting protein, and the action scores are summed to obtain the efficacy score of the drug. Because the targeting protein and the drug may bind at several action targets, the weight of each second subsequence (target) of the targeting protein needs to be recorded in advance when the targeting protein is analyzed, with pathogenic sites given higher weights, so that the weight of each bound target can be looked up directly and the efficacy score calculated. The weights may be set by researchers: after studying the pathogenic intensity of each target on the target protein, they assign each target a weight value based on that intensity.
As described in step S602 above, the drug efficacy corresponding to the efficacy score is obtained according to the preset correspondence between drug efficacy and efficacy score. That is, the score contributed by each action target is obtained from its weight value, the scores are summed to give the efficacy score of the drug, and a preset relation between efficacy score and drug efficacy then yields the drug efficacy.
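Steps S601 and S602 amount to a weighted sum followed by a table lookup, which can be sketched as below. The weights, the score bands and their labels are hypothetical values chosen only for illustration; the patent leaves them to be preset by researchers.

```python
def efficacy_score(action_targets, target_weights):
    """S601: sum the preset weights of the targets that the drug acts on."""
    return sum(target_weights.get(target, 0) for target in action_targets)


def efficacy_label(score, bands):
    """S602: map a score to a label using (minimum score, label) pairs, highest first."""
    for minimum, label in bands:
        if score >= minimum:
            return label
    return "none"


# Hypothetical weights: the pathogenic site XYZ carries the highest weight.
weights = {"XYZ": 5, "QRS": 3, "LMN": 1}
bands = [(5, "strong"), (2, "moderate"), (1, "weak")]

score = efficacy_score(["XYZ", "LMN"], weights)
print(score, efficacy_label(score, bands))  # 6 strong
```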
In one embodiment, the step S3 of analyzing each of the first subsequences to obtain the first text expression corresponding to each of the first subsequences includes:
S301: inputting each first subsequence into a Skip-Gram model for processing to obtain a real-valued vector corresponding to each first subsequence; wherein the dimensions of the real-valued vectors are the same;
S302: acquiring, for each real-valued vector, the real-valued vectors corresponding to a preset number of its context words as target vectors;
S303: updating each real-valued vector by a stochastic gradient ascent method to obtain the first text expression corresponding to each real-valued vector.
As described in step S301 above, each first subsequence is input into a Skip-Gram model for processing to obtain the real-valued vector corresponding to each first subsequence. The dimensions of the real-valued vectors are the same; assume they are V-dimensional vectors.
As described in step S302 above, for each real-valued vector, the real-valued vectors corresponding to a preset number of its context words are obtained as target vectors. Note that if there are not enough real-valued vectors on the preceding side or on the following side, the shortfall may be drawn from the other side, so that the same number of context vectors is extracted for every real-valued vector; assume that 2c words are extracted.
As described in step S303 above, the real-valued vectors are updated by a stochastic gradient ascent method to obtain the first text expression corresponding to each real-valued vector. The procedure is as follows. Each of the 2c extracted vectors is multiplied by the input weight matrix W (a V×N matrix), and the results are accumulated and averaged to give the hidden layer vector (1×N), where N is a preset dimension. The hidden layer vector is multiplied by the output weight matrix W' (an N×V matrix) to obtain a vector (1×V). This vector is processed with an activation function (softmax) to obtain a V-dimensional probability distribution, in which the word indicated by the index with the highest probability is the predicted middle word w. With the goal of maximizing the log-likelihood function $\mathcal{L} = \sum_{w} \log p(w \mid \mathrm{Context}(w))$, the model is iterated continuously until the first text expression corresponding to each real-valued vector is obtained; this expression characterizes each real-valued vector, so that more accurate word vectors are obtained. In the log-likelihood function, w represents the selected actual real-valued vector, Context(w) represents the 2c context words of the selected actual real-valued vector, and $p(w \mid \mathrm{Context}(w))$ represents the probability that the context words match the selected actual real-valued vector.
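The forward pass and log-likelihood described for step S303 can be sketched in plain Python. The sketch assumes that each context word is represented by its index in the vocabulary, so that multiplying a one-hot vector by W reduces to selecting a row of W; the function names and the tiny matrices are hypothetical.

```python
import math


def cbow_forward(context_ids, W_in, W_out):
    """Average the context rows of W_in (V x N), project with W_out (N x V), apply softmax."""
    n = len(W_in[0])
    # Hidden layer (1 x N): mean of the input-matrix rows of the context words.
    hidden = [sum(W_in[i][k] for i in context_ids) / len(context_ids) for k in range(n)]
    v = len(W_out[0])
    scores = [sum(hidden[k] * W_out[k][j] for k in range(n)) for j in range(v)]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(samples, W_in, W_out):
    """Sum of log p(w | Context(w)) over (word index, context indices) pairs."""
    return sum(math.log(cbow_forward(ctx, W_in, W_out)[w]) for w, ctx in samples)


# Toy example with V = 3 words and N = 2 hidden dimensions (values are arbitrary).
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_out = [[0.5, -0.5, 0.0], [0.0, 0.5, -0.5]]
probs = cbow_forward([0, 2], W_in, W_out)
print(probs.index(max(probs)))  # index of the predicted middle word
print(log_likelihood([(1, [0, 2])], W_in, W_out))
```

Stochastic gradient ascent would repeatedly adjust `W_in` and `W_out` in the direction that increases `log_likelihood`; the update step itself is omitted here.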
In one embodiment, before the step S4 of inputting each first text expression and each second text expression corresponding to the targeting protein into the pre-trained word2vec model to obtain the matching degree between each first text expression and each second text expression, the method further includes:
S311: based on the target category to which the target protein belongs, acquiring initial parameters corresponding to the target category from a parameter database; and
Acquiring training data of a corresponding category based on the targeting category;
S312: inputting the initial parameters into a word2vec initial model, and inputting the training data for training to obtain the pre-trained word2vec model.
This realizes the training of the word2vec initial model: the initial parameters of the corresponding category are acquired first, and further training is then carried out on the training data, which reduces the training time of the word2vec model and speeds up training.
As described in step S311 above, the correspondence between initial parameters and target categories may be stored in advance. Note that the training data keeps growing and the latest training data should be used for training, so the initial parameters only need to come from a preliminary training for the target category: a preset number of training data items in the target category may be selected and trained on in advance, and the resulting parameters used as the initial parameters of that target category.
Training data of the corresponding category is obtained based on the target category, namely training data of the corresponding category is obtained from a corresponding database, and the data can be obtained from UniProt.
As described in step S312 above, the initial parameters are input into the word2vec initial model, and the training data is input for training, so as to obtain the pre-trained word2vec model. That is, the model is trained again to further optimize the initial parameters.
In one embodiment, the step S312 of inputting the initial parameters into the word2vec initial model and inputting the training data for training to obtain the pre-trained word2vec model includes:
S3121: splitting the training data into a plurality of training sets;
S3122: inputting each training set and the initial parameters into different word2vec initial models for training, and obtaining the respective trained intermediate parameters of each word2vec initial model after training is completed;
S3123: calculating a loss value of each word2vec initial model by using a gradient descent method, and optimizing the corresponding word2vec initial model based on the loss value to obtain optimization parameters corresponding to the word2vec initial models;
S3124: inputting each optimized parameter into a meta-optimization formula for calculation to obtain a target parameter;
S3125: inputting the target parameters into the word2vec initial model to obtain the pre-trained word2vec model.
In this way, the word2vec model is obtained from the training data.
As described in step S3121 above, the training data is split into a plurality of training sets. The split may be uniform or non-uniform; note that each resulting training set should contain enough training data to avoid large errors.
As described in step S3122 above, each training set is input, together with the initial parameters, into a different word2vec initial model for training, and after training is completed the intermediate parameters trained by each word2vec initial model are obtained. Feeding each training set into its own word2vec initial model yields separately trained intermediate parameters for the subsequent calculation.
As described in step S3123 above, a gradient descent method is used to calculate the loss value of each word2vec initial model, and the corresponding word2vec initial model is optimized based on the loss value to obtain the optimization parameters corresponding to each word2vec initial model. The training formula is $\theta'_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{T_i}(f_{\theta})$, where $\theta'_i$ is the optimized parameter of task $T_i$, $\theta$ is the initial parameter, $\alpha$ is a hyper-parameter, $\nabla_{\theta} \mathcal{L}_{T_i}(f_{\theta})$ is the gradient of task $T_i$, and $T_i$ denotes the task of the i-th training set in the i-th word2vec initial model.
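As described in step S3124 above, the meta-optimization formula is $\theta \leftarrow \theta - \beta \sum_{i} \nabla_{\theta'_i} \mathcal{L}_{T_i}(f_{\theta'_i})$, where $\theta$ is the initial parameter, $\beta$ is a hyper-parameter, $\nabla_{\theta'_i} \mathcal{L}_{T_i}(f_{\theta'_i})$ is the gradient of each new task $T_i$ with respect to the parameter $\theta'_i$, and $f_{\theta'_i}$ is the model with the optimized parameters obtained by the i-th model.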
As described in step S3125 above, the obtained target parameters are input into the word2vec initial model to obtain the pre-trained word2vec model.
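The per-task update of step S3123 and the meta-update of step S3124 can be illustrated with a toy sketch. It is not a word2vec trainer: each task is assumed to have a scalar parameter and a quadratic loss (theta - target)^2, the outer gradient is taken with respect to the adapted parameter of each task (a first-order form), and all names and hyper-parameter values are hypothetical.

```python
def grad(theta, target):
    """Gradient of the toy task loss (theta - target) ** 2 with respect to theta."""
    return 2.0 * (theta - target)


def inner_update(theta, target, alpha):
    """S3123: one gradient descent step on a single task, giving theta'_i."""
    return theta - alpha * grad(theta, target)


def meta_update(theta, targets, alpha, beta):
    """S3124: move theta against the summed task gradients evaluated at each theta'_i."""
    adapted = [inner_update(theta, t, alpha) for t in targets]
    return theta - beta * sum(grad(a, t) for a, t in zip(adapted, targets))


theta = 0.0
for _ in range(200):
    theta = meta_update(theta, [1.0, 3.0], alpha=0.1, beta=0.05)
print(round(theta, 6))  # 2.0
```

With two tasks whose optima are 1.0 and 3.0, the shared parameter settles midway between them, which is the starting point from which either task can be reached in few further steps.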
In one embodiment, before the step S2 of dividing the amino acid composition structure into a plurality of first subsequences, the method further comprises:
S201: acquiring a three-dimensional structure of the drug and a targeting three-dimensional structure of the targeting protein based on SWISS-MODEL;
S202: inputting the three-dimensional structure of the drug and the targeting three-dimensional structure into a preset protein structure matching model to obtain the matching degree of the three-dimensional structure of the drug and the targeting three-dimensional structure; wherein the protein structure matching model is a convolutional neural network model;
S203: judging whether the drug can act on the target protein according to the matching degree;
S204: if so, performing the step of dividing the amino acid composition structure into a plurality of first subsequences.
This realizes detection of the three-dimensional shape of the drug.
As described in step S201 above, the drug three-dimensional structure of the drug and the targeting three-dimensional structure of the targeting protein are acquired based on SWISS-MODEL. SWISS-MODEL is an existing tool for predicting protein structure; from the amino acid sequence of a protein it yields the corresponding protein structure.
As described in step S202 above, the three-dimensional structure of the drug and the targeting three-dimensional structure are input into a preset protein structure matching model to obtain the matching degree between the two structures. The two structures can be matched by a convolutional neural network, which is trained with the predicted binding conditions of different protein segments as input and the actual binding conditions of the corresponding protein segments as output, thereby yielding the protein matching model.
As described in steps S203 to S204 above, whether the drug can act on the target protein is judged according to the matching degree. If the spatial structures of the drug and the target protein cannot bind, the drug cannot be considered to have a therapeutic effect on the target protein even if a site of the target protein could be bound by the drug; the drug efficacy is therefore evaluated further only when the spatial structures can bind.
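By way of a non-limiting illustration, the control flow of steps S202 to S204 can be sketched as a gate placed in front of the sequence segmentation. Everything below is assumed for illustration: the threshold value, the representation of a structure as a set of occupied grid cells, and the function `placeholder_match_model`, which merely stands in for the trained convolutional neural network and has no physical meaning.

```python
def placeholder_match_model(drug_cells, target_cells):
    """Stand-in for the CNN matching model: returns a score in [0, 1].

    Here the score is simply the overlap ratio of two sets of grid cells;
    a real model would be learned from binding data.
    """
    union = drug_cells | target_cells
    return len(drug_cells & target_cells) / len(union) if union else 0.0

def structure_gate(drug_cells, target_cells, match_model=placeholder_match_model,
                   threshold=0.5):
    """S203: decide from the matching degree whether the drug can act on the target."""
    matching_degree = match_model(drug_cells, target_cells)
    return matching_degree >= threshold

drug = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
target = {(0, 0, 0), (0, 0, 1), (1, 1, 1)}

if structure_gate(drug, target):
    next_step = "divide into first subsequences"   # S204
else:
    next_step = "stop: structures cannot bind"
```

Passing the model in as a parameter keeps the gate independent of how the matching degree is actually produced.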
In one embodiment, after step S6, the predicting the therapeutic effect of the drug on the target protein based on each of the action targets further includes:
S701: acquiring the actual drug efficacy of the drug, and calculating the similarity between the actual drug efficacy and the predicted drug efficacy based on a similarity calculation formula;
S702: judging whether the similarity is smaller than a preset similarity;
S703: if the similarity is smaller than the preset similarity, calculating an efficacy loss value of the drug efficacy;
S704: inputting the efficacy loss value into the word2vec model for retraining.
By retraining the word2vec model, self-learning is realized and subsequent model identification becomes more accurate.
As described in step S701 above, the actual drug efficacy of the drug is acquired, and the similarity between the actual drug efficacy and the predicted drug efficacy is calculated based on a similarity calculation formula. The similarity calculation formula may be any calculation formula in the prior art and is not described in detail here.
As described in steps S702 to S703 above, the preset similarity is a similarity value set in advance. If the similarity is greater than the preset similarity, the prediction result of the word2vec model is sufficiently accurate and no additional training is needed; if the similarity is smaller than the preset similarity, the efficacy loss value of the drug efficacy is calculated to facilitate subsequent retraining.
As described in step S704 above, the efficacy loss value is input into the word2vec model for retraining. Specifically, the efficacy loss value and the actual drug efficacy value are input into the word2vec model; the efficacy loss value serves as a reference for the magnitude of the parameter adjustment in the word2vec model, the actual drug efficacy value serves as the final output, and the word2vec model is retrained.
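By way of a non-limiting illustration, the check of steps S701 to S703 can be sketched as follows. Since the text leaves the similarity formula open, the relative-difference similarity, the squared-error loss, the threshold of 0.8, and the function names below are all assumptions chosen only to make the sketch concrete.

```python
def efficacy_similarity(predicted, actual):
    """Assumed similarity in [0, 1] between two non-negative efficacy scores."""
    largest = max(predicted, actual)
    if largest == 0:
        return 1.0
    return 1.0 - abs(predicted - actual) / largest

def check_prediction(predicted, actual, preset_similarity=0.8):
    """Return (similarity, loss); loss is None when no retraining is needed."""
    similarity = efficacy_similarity(predicted, actual)        # S701
    if similarity < preset_similarity:                         # S702
        loss = (predicted - actual) ** 2                       # S703, assumed squared error
        return similarity, loss
    return similarity, None

close_similarity, close_loss = check_prediction(predicted=0.9, actual=0.85)
far_similarity, far_loss = check_prediction(predicted=0.9, actual=0.3)
# close_loss is None (prediction accepted); far_loss would be passed on for retraining (S704)
```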
The invention has the beneficial effects that: the amino acid composition structure of the drug is obtained and split into a plurality of first subsequences, the first subsequences are converted into corresponding first text expressions for matching calculation, and then the drug curative effect of the drug on the target protein is predicted according to the matching condition. Therefore, the method realizes rapid automatic detection of the action target point of the medicine, predicts the curative effect of the medicine and saves experimental resources.
Referring to fig. 2, the embodiment of the application further provides a drug efficacy prediction device based on deep learning, which is applied to a target protein, and comprises:
An acquisition module 10, configured to acquire a first protein sequence corresponding to a drug;
A partitioning module 20, configured to partition the first protein sequence to obtain a plurality of first subsequences corresponding to the first protein sequence, where the number of amino acid molecules in each of the first subsequences is the same;
An analysis module 30, configured to analyze each of the first subsequences to obtain a first text expression corresponding to each of the first subsequences;
The input module 40 is configured to input each first text expression and each second text expression corresponding to the target protein into a word2vec model trained in advance, so as to obtain a matching degree between each first text expression and each second text expression; the second text expression is obtained by dividing a second protein sequence of the target protein to obtain a plurality of second subsequences corresponding to the second protein sequence, and analyzing the second protein to obtain a second text expression corresponding to each second subsequence; the number of amino acid molecules of each second subsequence is the same, and each second subsequence corresponding to each second text expression is a target point;
A determining module 50 for determining an action target point of the drug matching the targeting protein based on the matching degree;
a prediction module 60, configured to predict a therapeutic effect of the drug on the target protein based on each of the action targets.
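By way of a non-limiting illustration, the partitioning module 20 and the determining module 50 can be sketched as follows. The sketch assumes non-overlapping windows with the trailing remainder dropped, cosine similarity as the matching degree, and a threshold of 0.9; the vectors are made-up stand-ins for text expressions produced by a trained model, and all function names are hypothetical.

```python
import math

def split_fixed(sequence, k):
    """Split into non-overlapping subsequences of exactly k residues (remainder dropped)."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]

def cosine(u, v):
    """Cosine similarity, used here as an assumed matching degree."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def action_targets(first_vectors, second_vectors, threshold=0.9):
    """Indices of target sites whose best match with any drug subsequence reaches the threshold."""
    hits = []
    for index, target_vec in enumerate(second_vectors):
        best = max(cosine(drug_vec, target_vec) for drug_vec in first_vectors)
        if best >= threshold:
            hits.append(index)
    return hits

subsequences = split_fixed("MKTAYIAKQR", 3)          # every subsequence has 3 residues

first_vectors = [[1.0, 0.0], [0.6, 0.8]]             # assumed drug text expressions
second_vectors = [[1.0, 0.1], [0.0, -1.0], [0.5, 0.9]]  # assumed target-site text expressions
hits = action_targets(first_vectors, second_vectors)
```

Dropping the remainder is one way to guarantee that every subsequence contains the same number of amino acids; an overlapping sliding window would satisfy the same constraint.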
In one embodiment, prediction module 60 includes:
The action score calculation submodule is used for calculating the action score of each action target point according to the preset weight of each target point of the target protein, and summing the obtained action scores to obtain the curative effect score corresponding to the medicine;
and the medicine curative effect obtaining submodule is used for obtaining the medicine curative effect of the curative effect score according to the corresponding relation between the preset medicine curative effect and the medicine curative effect score.
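By way of a non-limiting illustration, the two sub-modules of the prediction module 60 can be sketched as a weighted sum followed by a table lookup. The sketch assumes that the action score of a target is its preset weight multiplied by its matching degree; the weights, score bands, and labels are invented for the example.

```python
def efficacy_score(matches, target_weights):
    """Sum of action scores; each action score is assumed to be weight * matching degree."""
    return sum(target_weights[site] * degree for site, degree in matches.items())

def efficacy_label(score, bands):
    """Map a score to a label using (minimum_score, label) pairs sorted from high to low."""
    for minimum, label in bands:
        if score >= minimum:
            return label
    return "none"

target_weights = {"site_a": 0.5, "site_b": 0.3, "site_c": 0.2}   # preset weights (assumed)
matches = {"site_a": 0.9, "site_c": 0.8}                         # action targets and degrees
bands = [(0.6, "high"), (0.3, "moderate"), (0.0, "low")]         # preset correspondence (assumed)

score = efficacy_score(matches, target_weights)
label = efficacy_label(score, bands)
```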
In one embodiment, analysis module 30 includes:
The subsequence input sub-module is used for inputting each first subsequence into a Skip-Gram model for processing to obtain a real-valued vector corresponding to each first subsequence; wherein the dimensions of the real-valued vectors are the same;
The target vector acquisition sub-module is used for acquiring, as target vectors, the real-valued vectors corresponding to a preset number of context words of each real-valued vector;
The updating sub-module is used for updating each real-valued vector by a stochastic gradient ascent method to obtain the first text expression corresponding to each real-valued vector.
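By way of a non-limiting illustration, the three sub-modules of the analysis module 30 can be sketched with a tiny Skip-Gram update. The sketch assumes a negative-sampling objective, which the text does not specify, and uses hand-picked three-dimensional vectors; `context_pairs`, `objective`, and `sga_step` are hypothetical names.

```python
import math

def context_pairs(tokens, window=2):
    """Pair each token with up to `window` neighbours on each side (its context words)."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(center, context, negatives):
    """log sigmoid(c.o) plus the sum of log sigmoid(-c.n) over negatives; higher is better."""
    value = math.log(sigmoid(dot(center, context)))
    return value + sum(math.log(sigmoid(-dot(center, n))) for n in negatives)

def sga_step(center, context, negatives, lr=0.1):
    """One stochastic gradient ascent step; all vectors are updated in place."""
    grad_center = [0.0] * len(center)
    for vec, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = label - sigmoid(dot(center, vec))   # derivative of the objective w.r.t. c.vec
        for d in range(len(center)):
            grad_center[d] += g * vec[d]        # accumulate gradient for the center vector
            vec[d] += lr * g * center[d]        # ascend on the context or negative vector
    for d in range(len(center)):
        center[d] += lr * grad_center[d]

pairs = context_pairs(["MKT", "AYI", "AKQ"], window=1)

center, context, negatives = [0.1, -0.2, 0.3], [0.2, 0.1, -0.1], [[0.3, 0.3, 0.1]]
before = objective(center, context, negatives)
sga_step(center, context, negatives)
after = objective(center, context, negatives)   # the objective rises after the ascent step
```

All vectors share the same dimension, matching the requirement that the real-valued vectors have identical dimensions; after enough such steps over all pairs, the center vectors would serve as the text expressions.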
In one embodiment, the deep learning-based drug efficacy prediction device further comprises:
The data acquisition module is used for acquiring initial parameters corresponding to the target category from a parameter database based on the target category to which the target protein belongs; and
The training data acquisition module is used for acquiring training data of the corresponding category based on the targeting category;
And the model training module is used for inputting the initial parameters into a word2vec initial model, and then inputting the training data for training to obtain the pre-trained word2vec model.
In one embodiment, the input module 40 includes:
The splitting module is used for splitting the training data into a plurality of training sets;
The parameter input sub-module is used for inputting each training set and the initial parameters into different word2vec initial models for training, and obtaining the respective trained intermediate parameters of each word2vec initial model after training is completed;
The loss value calculation sub-module is used for calculating the loss value of each word2vec initial model by using a gradient descent method, and optimizing the corresponding word2vec initial model based on the loss value to obtain the corresponding optimization parameters of each word2vec initial model;
the target parameter calculation sub-module is used for inputting each optimized parameter into a meta-optimization formula for calculation to obtain a target parameter;
And the target parameter input sub-module is used for inputting the target parameters into the word2vec initial model to obtain the pre-trained word2vec model.
In one embodiment, the deep learning-based drug efficacy prediction device further comprises:
The drug three-dimensional structure acquisition module is used for acquiring the drug three-dimensional structure of the drug and the targeting three-dimensional structure of the targeting protein based on SWISS-MODEL;
the structure input module is used for inputting the three-dimensional structure of the drug and the targeting three-dimensional structure into a preset protein structure matching model to obtain the matching degree of the three-dimensional structure of the drug and the targeting three-dimensional structure; wherein the protein matching model is a convolutional neural network model;
The medicine judging module is used for judging whether the medicine can act on the target protein according to the matching degree;
And the segmentation module is used for executing, if so, the step of segmenting the first protein sequence to obtain a plurality of first subsequences corresponding to the first protein sequence.
In one embodiment, the deep learning-based drug efficacy prediction device further comprises:
The actual drug efficacy obtaining module is used for obtaining the actual drug efficacy of the drug and calculating the similarity between the actual drug efficacy and the drug efficacy based on a similarity calculation formula;
The similarity judging module is used for judging whether the similarity is smaller than a preset similarity or not;
the curative effect loss value calculation module is used for calculating the curative effect loss value of the curative effect of the medicine if the similarity is smaller than the preset similarity;
and the retraining module is used for inputting the curative effect loss value into the word2vec model for retraining.
Referring to fig. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for running the operating system and the computer programs stored in the non-volatile storage medium. The database of the computer device is used to store the second protein sequences of the various targeting proteins, and so on. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, may implement the deep learning-based drug efficacy prediction method according to any of the above embodiments.
It will be appreciated by those skilled in the art that the architecture shown in fig. 3 is merely a block diagram of a portion of the architecture in connection with the present inventive arrangements and is not intended to limit the computer devices to which the present inventive arrangements are applicable.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, can implement the deep learning-based drug efficacy prediction method according to any one of the above embodiments.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program instructing the relevant hardware, where the computer program is stored on a non-volatile computer readable storage medium and, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium provided by the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that comprises the element.
Blockchains are novel application modes of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, encryption algorithms, and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The blockchain underlying platform may include processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for identity information management of all blockchain participants, including maintaining public and private key generation (account management), key management, and maintaining the correspondence between a user's real identity and blockchain address (authority management); where authorized, it also supervises and audits the transactions of certain real identities and provides rule configuration for risk control (risk control audit). The basic service module is deployed on all blockchain node devices to verify the validity of service requests and to record valid requests to storage once consensus on them is reached; for a new service request, the basic service first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits the encrypted information completely and consistently to the shared ledger (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution; a developer can define contract logic in a programming language and publish it to the blockchain (contract registration), after which execution is triggered by keys or other events according to the logic of the contract terms to complete the contract logic, and the module also provides a function for upgrading registered contracts. The operation monitoring module is mainly responsible for deployment during product release, modification of configuration, contract setting, cloud adaptation, and visual output of real-time status during product operation, for example: alarms, monitoring network conditions, and monitoring the health status of node devices.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (8)

1. A drug efficacy prediction method based on deep learning is applied to a target protein and is characterized by comprising the following steps:
Acquiring a first protein sequence corresponding to the medicine;
dividing the first protein sequence to obtain a plurality of first subsequences corresponding to the first protein sequence, wherein the number of amino acid molecules of each first subsequence is the same;
analyzing each first subsequence based on an N-gram model to obtain a first text expression corresponding to each first subsequence;
Inputting each first text expression and each second text expression corresponding to the targeting protein into a word2vec model trained in advance to obtain the matching degree of each first text expression and each second text expression; the second text expression is obtained by dividing a second protein sequence of the target protein to obtain a plurality of second subsequences corresponding to the second protein sequence, and analyzing each second subsequence based on the N-gram model to obtain a second text expression corresponding to each second subsequence; the number of amino acid molecules of each second subsequence is the same, and each second subsequence corresponding to each second text expression is a target point;
determining an action target point of the drug matched with the targeting protein based on the matching degree;
Predicting the drug efficacy of the drug on the targeting protein based on each of the action targets;
the step of predicting the therapeutic effect of the drug on the targeting protein based on each of the action targets comprises the following steps:
calculating the action score of each action target point according to the preset weight of each target point of the targeting protein, and summing the obtained action scores to obtain the corresponding curative effect score of the medicine;
Acquiring the medicine curative effect of the curative effect score according to the preset corresponding relation between the medicine curative effect and the medicine curative effect score;
The step of analyzing each first subsequence to obtain a first text expression corresponding to each first subsequence includes:
inputting each first subsequence to a Skip-Gram model for processing to obtain real value vectors corresponding to each first subsequence; wherein the dimensions of each real value vector are the same;
Acquiring real value vectors corresponding to the context words of the preset number of the real value vectors respectively as target vectors;
and updating each real value vector by a stochastic gradient ascent method to obtain the first text expression corresponding to each real value vector.
2. The method for predicting therapeutic effects of a deep learning-based drug according to claim 1, wherein before the step of inputting each first text expression and each second text expression corresponding to the target protein into a word2vec model trained in advance to obtain the matching degree between each first text expression and each second text expression, the method further comprises:
Based on the target category to which the target protein belongs, acquiring initial parameters corresponding to the target category from a parameter database; and
Acquiring training data of a corresponding category based on the targeting category;
and inputting the initial parameters into a word2vec initial model, and then inputting the training data for training to obtain the pre-trained word2vec model.
3. The deep learning-based drug efficacy prediction method as set forth in claim 2, wherein the step of inputting the initial parameters into a word2vec initial model and inputting the training data for training to obtain the pre-trained word2vec model comprises:
splitting the training data into a plurality of training sets;
Inputting each training set and the initial parameters into different word2vec initial models for training, and obtaining the respective trained intermediate parameters of each word2vec initial model after training is completed;
calculating a loss value of each word2vec initial model by using a gradient descent method, and optimizing the corresponding word2vec initial model based on the loss value to obtain optimization parameters corresponding to the word2vec initial models;
inputting each optimized parameter into a meta-optimization formula for calculation to obtain a target parameter;
And inputting the target parameters into the word2vec initial model to obtain the pre-trained word2vec model.
4. The deep learning-based drug efficacy prediction method according to claim 1, wherein the step of dividing the first protein sequence to obtain a plurality of first subsequences corresponding to the first protein sequence is preceded by the step of:
Acquiring a three-dimensional structure of the drug and a targeting three-dimensional structure of the targeting protein based on SWISS-MODEL;
Inputting the three-dimensional structure of the drug and the targeting three-dimensional structure into a preset protein structure matching model to obtain the matching degree of the three-dimensional structure of the drug and the targeting three-dimensional structure; wherein the protein matching model is a convolutional neural network model;
judging whether the drug can act on the target protein according to the matching degree;
if yes, executing the step of dividing the first protein sequence to obtain a plurality of first subsequences corresponding to the first protein sequence.
5. The deep learning-based drug efficacy prediction method of claim 1, further comprising, after the step of predicting the drug efficacy of the drug on the targeting protein based on each of the action targets:
acquiring the actual drug efficacy of the drug, and calculating the similarity between the actual drug efficacy and the predicted drug efficacy based on a similarity calculation formula;
Judging whether the similarity is smaller than a preset similarity or not;
if the similarity is smaller than the preset similarity, calculating a curative effect loss value of the curative effect of the medicine;
and inputting the curative effect loss value into the word2vec model for retraining.
6. A deep learning-based drug efficacy prediction device applied to a targeting protein for implementing the method of any one of claims 1 to 5, comprising:
The acquisition module is used for acquiring a first protein sequence corresponding to the medicine;
the segmentation module is used for segmenting the first protein sequence to obtain a plurality of first subsequences corresponding to the first protein sequence, wherein the number of amino acid molecules of each first subsequence is the same;
the analysis module is used for analyzing each first subsequence to obtain a first text expression corresponding to each first subsequence;
The input module is used for inputting each first text expression and each second text expression corresponding to the targeting protein into a word2vec model trained in advance to obtain the matching degree of each first text expression and each second text expression; the second text expression is obtained by dividing a second protein sequence of the target protein to obtain a plurality of second subsequences corresponding to the second protein sequence, and analyzing the second protein to obtain a second text expression corresponding to each second subsequence; the number of amino acid molecules of each second subsequence is the same, and each second subsequence corresponding to each second text expression is a target point;
The determining module is used for determining an action target point of the drug matched with the targeting protein based on the matching degree;
and the prediction module is used for predicting the drug curative effect of the drug on the targeting protein based on each action target point.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
CN202110592915.6A 2021-05-28 2021-05-28 Deep learning-based drug efficacy prediction method, device, equipment and storage medium Active CN113299339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110592915.6A CN113299339B (en) 2021-05-28 2021-05-28 Deep learning-based drug efficacy prediction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110592915.6A CN113299339B (en) 2021-05-28 2021-05-28 Deep learning-based drug efficacy prediction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113299339A CN113299339A (en) 2021-08-24
CN113299339B true CN113299339B (en) 2024-05-07

Family

ID=77325947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110592915.6A Active CN113299339B (en) 2021-05-28 2021-05-28 Deep learning-based drug efficacy prediction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113299339B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215737A (en) * 2018-09-30 2019-01-15 东软集团股份有限公司 Protein characteristic extracts, functional mode generates, the method and device of function prediction
CN112086133A (en) * 2020-08-24 2020-12-15 南京邮电大学 Drug target feature learning method and device based on text implicit information

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190018924A1 (en) * 2015-12-31 2019-01-17 Cyclica Inc. Methods for proteome docking to identify protein-ligand interactions
CN107563150B (en) * 2017-08-31 2021-03-19 深圳大学 Method, device, equipment and storage medium for predicting protein binding site
US20200411137A1 (en) * 2019-06-25 2020-12-31 Guangzhou University Drug Recommendation Method and System

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Predicting activatory and inhibitory drug-target interactions based on mol2vec and genetically perturbed transcriptomes";Won-Yung Lee et al;《bioRxiv》;第1-21页 *
"SPVec:A Word2vec-Inspired Feature Representation Method for Drug-Target Interaction Prediction";Yu-Fang Zhang et al;《Frontiers in Chemistry》(第7期);第1-11页 *
"基于计算的中药靶点预测研究探讨与实验分析";孟志昌等;《世界科学技术-中医药现代化》;第16卷(第11期);第2296-2303页 *

Also Published As

Publication number Publication date
CN113299339A (en) 2021-08-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant