CN112530515A - Novel deep learning model for predicting protein affinity of compound, computer equipment and storage medium - Google Patents
- Publication number
- CN112530515A CN112530515A CN202011502118.6A CN202011502118A CN112530515A CN 112530515 A CN112530515 A CN 112530515A CN 202011502118 A CN202011502118 A CN 202011502118A CN 112530515 A CN112530515 A CN 112530515A
- Authority
- CN
- China
- Prior art keywords
- model
- input
- protein
- compound
- bigru
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
Abstract
The invention discloses a novel deep learning model for predicting compound-protein affinity. The model comprises bidirectional gated recurrent unit (BiGRU) models, a graph convolutional network (GCN) model and a convolutional neural network (CNN) model; the overall network architecture is BiGRU/BiGRU/GCN-CNN. A bidirectional gated recurrent unit model is a sequence-processing model composed of two gated recurrent units (GRUs), one receiving the input in the forward direction and the other in reverse; it is a bidirectional recurrent neural network whose units have only an update gate and a reset gate. The model's inputs are the compound's one-dimensional SMILES sequence, the protein sequence and the compound's two-dimensional molecular graph, which are fed into the BiGRU, BiGRU and GCN sub-models respectively. The BiGRU/BiGRU/GCN stage outputs a feature vector representing the compound and a feature vector representing the protein. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and takes the compound and protein feature vectors as input; the final output of the BiGRU/BiGRU/GCN-CNN model is the root-mean-square error between the predicted and measured compound-protein affinity values.
Description
Technical Field
The invention relates to the field of compound and protein molecular feature extraction, and in particular to a novel deep learning model, computer device and storage medium for predicting compound-protein affinity.
Background
Successful identification of compound-protein interactions is a key step in discovering new uses for existing compounds. As new compounds are discovered the field keeps expanding, and the reuse of existing compounds and the discovery of new interactions attract numerous researchers. Drug repositioning, i.e., finding new uses for approved drugs, can greatly shorten the time needed to develop new drugs, and has likewise drawn wide attention. Using statistical and machine-learning models to predict the strength of drug-target interactions is therefore an important alternative to relying only on interactions measured in clinical trials. Models such as support vector machines, logistic regression, random forests and shallow neural networks can predict drug-target binding affinity to some extent.
Deep learning has proven to be among the best approaches for predicting drug-target binding affinity. Its main advantage is that, by performing a nonlinear transformation in each layer, the network can better represent the raw data and thus more easily learn the patterns hidden in it. However, many models represent a compound only by a molecular fingerprint or a single SMILES string. The encoded compound features then lose important information inherent in the molecule, making the final prediction of the compound's protein affinity inaccurate.
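To make the point concrete, the toy example below contrasts what a string-only model sees with what the two-dimensional molecular graph provides. The molecule (ethanol) and the encodings are illustrative stand-ins, not the patent's actual featurization.

```python
# A SMILES string alone is a flat token sequence; the 2-D molecular graph
# keeps the bond topology explicit. Hand-built toy example (ethanol, CCO).
smiles = "CCO"
token_seq = list(smiles)                  # what a string-only model sees

atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]                  # bond list from the 2-D graph
adjacency = [[0] * len(atoms) for _ in atoms]
for i, j in bonds:
    adjacency[i][j] = adjacency[j][i] = 1

# The graph makes each atom's neighbourhood directly available --
# information a fingerprint or raw string encodes only implicitly.
neighbours_of_middle_carbon = [atoms[j] for j in range(3) if adjacency[1][j]]
```

The middle carbon's neighbourhood (`["C", "O"]`) is a single adjacency-row lookup here, whereas a sequence model must infer the same relation from token order.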
Disclosure of Invention
Embodiments of the invention provide a novel deep learning model, computer device and storage medium for predicting compound-protein affinity. By combining the two-dimensional molecular-graph structure of a compound molecule with its one-dimensional SMILES string, more information about the molecule can be extracted, and deep learning is used to improve the accuracy of compound-protein affinity prediction.
According to a first aspect of embodiments of the present invention, there is provided a novel deep learning model for predicting protein affinity of a compound.
In some optional embodiments, the deep model includes a bidirectional gated recurrent unit (BiGRU) model, a graph convolutional network (GCN) model and a convolutional neural network (CNN) model; the overall network architecture is BiGRU/BiGRU/GCN-CNN. The bidirectional gated recurrent unit model is a sequence-processing model composed of two gated recurrent units (GRUs), one receiving the input in the forward direction and the other in reverse; it is a bidirectional recurrent neural network whose units have only an update gate and a reset gate. The model's inputs are the compound's one-dimensional SMILES sequence, the protein sequence and the compound's two-dimensional molecular graph, which are fed into the BiGRU, BiGRU and GCN sub-models respectively. The BiGRU/BiGRU/GCN stage outputs a feature vector representing the compound and a feature vector representing the protein. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and takes the compound and protein feature vectors as input; the final output of the BiGRU/BiGRU/GCN-CNN model is the root-mean-square error between the predicted and measured compound-protein affinity values.
Optionally, the bidirectional gated recurrent unit (BiGRU) model reads the data in both the forward and backward directions, so that the information at each step includes sequence information from both earlier and later steps; this enriches the sequence context available to the network at any particular step and makes full use of the historical data, yielding more accurate predictions. The basic idea of the BiGRU is to present each training sequence forward and backward to two separate hidden layers, both connected to the same output layer, so that the output layer has the complete past and future context for every point in the input sequence. The gated recurrent unit (GRU) itself performs thorough feature extraction on the multivariate time series and continually learns its long-term dependencies. Specifically, the GRU first computes two gating signals from the previously transmitted state and the current input: a reset gate, which controls resetting, and an update gate. After the gating signals are obtained, the reset gate produces the reset state, which is concatenated with the current input and squashed into the range -1 to 1 by a hyperbolic tangent to form a candidate state; finally, the update gate, whose values lie between 0 and 1, blends the previous state with this candidate. The closer the update-gate signal is to 1, the more of the previous information is retained.
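The gating arithmetic described above can be sketched as follows. This is a minimal NumPy illustration of one GRU step and a bidirectional pass, with random stand-in weights rather than the patent's trained parameters.

```python
import numpy as np

def gru_step(x, h_prev, W, U, b):
    """One GRU step. W, U, b stack the reset (r), update (z) and
    candidate (n) parameters along the first axis. Minimal sketch."""
    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))
    Wr, Wz, Wn = W          # input-to-hidden weights
    Ur, Uz, Un = U          # hidden-to-hidden weights
    br, bz, bn = b
    r = sigmoid(Wr @ x + Ur @ h_prev + br)        # reset gate in (0, 1)
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)        # update gate in (0, 1)
    n = np.tanh(Wn @ x + Un @ (r * h_prev) + bn)  # candidate in (-1, 1)
    # z close to 1 keeps more of the previous state (more "memory")
    return z * h_prev + (1.0 - z) * n

def bigru(seq, h0, params_f, params_b):
    """A BiGRU runs one GRU forward and one backward over the sequence,
    then concatenates the two final states: past + future context."""
    hf, hb = h0.copy(), h0.copy()
    for x in seq:
        hf = gru_step(x, hf, *params_f)
    for x in reversed(seq):
        hb = gru_step(x, hb, *params_b)
    return np.concatenate([hf, hb])
```

Because the hidden state is always a convex blend of the previous state and a tanh candidate, every component stays strictly inside (-1, 1) when started from zeros.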
Optionally, the core algorithm of the graph convolutional network (GCN) model computes, for each atom node, the information of its neighboring nodes, finally forming an atom vector that contains this neighborhood information.
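A minimal sketch of this neighbor aggregation, assuming a mean-over-neighbors rule with self-loops (one common GCN variant; the patent does not fix the exact propagation rule):

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution layer over a molecular graph: each atom
    vector is updated with the mean of its neighbours' vectors.
    A: adjacency matrix of the 2-D molecular graph, X: atom features."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)    # node degrees
    H = (A_hat / deg) @ X                     # average neighbour information
    return np.maximum(H @ W, 0.0)             # linear map + ReLU

# Toy 3-atom chain (e.g. C-C-O): atoms 0-1 and 1-2 bonded
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)                 # one-hot atom features
W = np.eye(3)                 # identity weights for illustration
H = gcn_layer(A, X, W)
compound_vec = H.mean(axis=0)  # graph-level compound vector by pooling
```

Stacking such layers lets each atom vector absorb progressively larger neighborhoods before the atom vectors are pooled into a single compound feature vector.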
Optionally, the convolutional neural network (CNN) model is composed of convolution, activation and pooling structures. The CNN output is a feature space specific to the compound-protein pair; this feature space is then used as the input to a fully connected layer (fully connected network, FCN), which completes the mapping from the input compound and protein feature vectors to the affinity value.
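The convolution-pooling-fully-connected mapping can be sketched as below; the kernel size, pooling width and vector lengths are illustrative choices, not the patent's.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid-mode 1-D convolution (cross-correlation) of x with kernel."""
    n, k = len(x), len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(n - k + 1)])

def predict_affinity(compound_vec, protein_vec, kernel, w_fc):
    """Convolution + ReLU, width-2 max pooling, then a fully connected
    layer mapping the pooled features to a single affinity estimate.
    Weights are random stand-ins, not trained parameters."""
    x = np.concatenate([compound_vec, protein_vec])  # fused input
    feat = np.maximum(conv1d(x, kernel), 0.0)        # convolution + ReLU
    pooled = feat.reshape(-1, 2).max(axis=1)         # max pool, width 2
    return float(pooled @ w_fc)                      # fully connected output
```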
Optionally, the model takes three selected input variables: the protein structure attribute sequence derived from the UniRef database, the compound SMILES string from the STITCH database, and the compound's two-dimensional molecular graph. The protein structure attribute sequence encodes the protein's secondary structure, the length of its amino-acid sequence, its physicochemical properties (polarity/non-polarity, acidity/basicity) and its solvent accessibility.
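A hypothetical sketch of such a per-residue structure-attribute encoding follows; the category tables and tuple layout are invented for illustration and are not the patent's actual scheme.

```python
# Each residue maps to a small code combining its secondary-structure
# state, polarity/charge class and solvent accessibility.
SS_STATES = {"H": 0, "E": 1, "C": 2}      # helix / strand / coil
POLAR = set("STNQYCDEKRH")                # rough polar-residue set
ACIDIC, BASIC = set("DE"), set("KRH")

def encode_residue(aa, ss, accessible):
    polarity = 1 if aa in POLAR else 0
    charge = 1 if aa in ACIDIC else (2 if aa in BASIC else 0)
    return (SS_STATES[ss], polarity, charge, 1 if accessible else 0)

def encode_protein(seq, ss_seq, access):
    """Zip the amino-acid sequence with per-residue structure annotations."""
    return [encode_residue(a, s, acc) for a, s, acc in zip(seq, ss_seq, access)]

codes = encode_protein("MKDE", "CHHE", [True, True, False, True])
```

Each tuple is then embedded and fed to the protein BiGRU in place of a plain one-hot amino-acid code.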
Optionally, the model is trained on measured affinity values for a large number of compound-protein pairs, yielding refined model parameters.
According to a second aspect of an implementation of the present invention, a computer device is provided.
In some optional embodiments, the computer device includes a memory, a graphics card, a central processing unit, and an executable program that is stored in the memory and can be processed in parallel by the central processing unit and the graphics card, wherein the central processing unit executes the program to implement the following steps: constructing a target detection and target prediction model comprising a feature extraction network and a prediction network. First, the feature extraction network extracts features from the input compound SMILES sequence and the protein structure attribute sequence. The extracted feature-vector matrices are then passed to the target prediction model, which applies convolution, pooling and fully connected operations to them and outputs the root-mean-square error between the predicted and actual binding-affinity values.
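The reported output, the root-mean-square error between predicted and actual affinities, is computed as:

```python
import numpy as np

def rmse(pred, actual):
    """Root-mean-square error between predicted and measured affinities."""
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Illustrative values only, not data from the patent
err = rmse([5.1, 6.8, 7.2], [5.0, 7.0, 7.5])
```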
Optionally, the bidirectional gated recurrent unit (BiGRU) model reads the data in both the forward and backward directions, so that the information at each step includes sequence information from both earlier and later steps; this enriches the sequence context available to the network at any particular step and makes full use of the historical data, yielding more accurate predictions. The basic idea of the BiGRU is to present each training sequence forward and backward to two separate hidden layers, both connected to the same output layer, so that the output layer has the complete past and future context for every point in the input sequence. The gated recurrent unit (GRU) itself performs thorough feature extraction on the multivariate time series and continually learns its long-term dependencies. Specifically, the GRU first computes two gating signals from the previously transmitted state and the current input: a reset gate, which controls resetting, and an update gate. After the gating signals are obtained, the reset gate produces the reset state, which is concatenated with the current input and squashed into the range -1 to 1 by a hyperbolic tangent to form a candidate state; finally, the update gate, whose values lie between 0 and 1, blends the previous state with this candidate. The closer the update-gate signal is to 1, the more of the previous information is retained.
Optionally, the core algorithm of the graph convolutional network (GCN) model computes, for each atom node, the information of its neighboring nodes, finally forming an atom vector that contains this neighborhood information.
Optionally, the convolutional neural network (CNN) model is composed of convolution, activation and pooling structures. The CNN output is a feature space specific to the compound-protein pair; this feature space is then used as the input to a fully connected layer (fully connected network, FCN), which completes the mapping from the input compound and protein feature vectors to the affinity value.
Optionally, the model takes three selected input variables: the protein structure attribute sequence derived from the UniRef database, the compound SMILES string from the STITCH database, and the compound's two-dimensional molecular graph. The protein structure attribute sequence encodes the protein's secondary structure, the length of its amino-acid sequence, its physicochemical properties (polarity/non-polarity, acidity/basicity) and its solvent accessibility.
Optionally, the model is trained on measured affinity values for a large number of compound-protein pairs, yielding refined model parameters.
Intelligently processing spatio-temporal sequence data in the medical field with artificial-intelligence technology can ease the high cost, long timelines and safety problems of new-drug development. The ability to screen for new drugs and therapeutic targets among old drugs, and among discontinued compounds already established as safe, is changing the landscape of drug development and establishing a drug-repositioning model for new-drug discovery.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
FIG. 1 is a specific flow diagram of a bidirectional GRU of the present invention
FIG. 2 is an overall system architecture diagram of the present invention
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. The examples merely typify possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in or substituted for those of others. The scope of embodiments of the invention encompasses the full ambit of the claims, as well as all available equivalents of the claims. Embodiments may be referred to herein, individually or collectively, by the term "invention" merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed. Herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, or apparatus. The use of the phrase "including a" does not exclude the presence of other, identical elements in a process, method or device that includes the recited elements, unless expressly stated otherwise. The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. 
Since the products and devices disclosed in the embodiments correspond to the methods disclosed in the embodiments, their description is kept brief; for relevant details, refer to the description of the method parts.
Fig. 1 shows an optional implementation architecture of the novel deep-learning-based encoding approach for predicting compound-protein affinity.
In this optional example, the deep model includes a bidirectional gated recurrent unit (BiGRU) model, a graph convolutional network (GCN) model and a convolutional neural network (CNN) model; the overall network architecture is BiGRU/BiGRU/GCN-CNN. The bidirectional gated recurrent unit model is a sequence-processing model composed of two gated recurrent units (GRUs), one receiving the input in the forward direction and the other in reverse; it is a bidirectional recurrent neural network whose units have only an update gate and a reset gate. The model's inputs are the compound's one-dimensional SMILES sequence, the protein sequence and the compound's two-dimensional molecular graph, which are fed into the BiGRU, BiGRU and GCN sub-models respectively. The BiGRU/BiGRU/GCN stage outputs a feature vector representing the compound and a feature vector representing the protein. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and takes the compound and protein feature vectors as input; the final output of the BiGRU/BiGRU/GCN-CNN model is the root-mean-square error between the predicted and measured compound-protein affinity values.
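The data flow of the three-branch architecture can be sketched end to end. The encoders below are random-projection stand-ins (not BiGRU/GCN implementations); only the wiring, three inputs producing two feature vectors that feed one prediction head, mirrors the description.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_smiles_bigru(smiles, dim=8):
    """Stand-in for the compound SMILES BiGRU branch."""
    x = np.array([ord(c) for c in smiles], dtype=float)
    return np.tanh(rng.normal(size=(dim, len(x))) @ x / len(x))

def encode_graph_gcn(adjacency, dim=8):
    """Stand-in for the molecular-graph GCN branch."""
    deg = adjacency.sum(axis=1)
    return np.tanh(rng.normal(size=(dim, len(deg))) @ deg / len(deg))

def encode_protein_bigru(seq, dim=8):
    """Stand-in for the protein-sequence BiGRU branch."""
    x = np.array([ord(c) for c in seq], dtype=float)
    return np.tanh(rng.normal(size=(dim, len(x))) @ x / len(x))

# Compound: fuse SMILES (BiGRU) and molecular-graph (GCN) representations
A = np.array([[0, 1], [1, 0]], dtype=float)        # toy 2-atom graph
compound = np.concatenate([encode_smiles_bigru("CO"), encode_graph_gcn(A)])
protein = encode_protein_bigru("MKLV")
features = np.concatenate([compound, protein])     # input to the CNN head
affinity = float(np.tanh(rng.normal(size=len(features)) @ features))
```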
Optionally, the bidirectional gated recurrent unit (BiGRU) model reads the data in both the forward and backward directions, so that the information at each step includes sequence information from both earlier and later steps; this enriches the sequence context available to the network at any particular step and makes full use of the historical data, yielding more accurate predictions. The basic idea of the BiGRU is to present each training sequence forward and backward to two separate hidden layers, both connected to the same output layer, so that the output layer has the complete past and future context for every point in the input sequence. The gated recurrent unit (GRU) itself performs thorough feature extraction on the multivariate time series and continually learns its long-term dependencies. Specifically, the GRU first computes two gating signals from the previously transmitted state and the current input: a reset gate, which controls resetting, and an update gate. After the gating signals are obtained, the reset gate produces the reset state, which is concatenated with the current input and squashed into the range -1 to 1 by a hyperbolic tangent to form a candidate state; finally, the update gate, whose values lie between 0 and 1, blends the previous state with this candidate. The closer the update-gate signal is to 1, the more of the previous information is retained.
Optionally, the core algorithm of the graph convolutional network (GCN) model computes, for each atom node, the information of its neighboring nodes, finally forming an atom vector that contains this neighborhood information.
Optionally, the convolutional neural network (CNN) model is composed of convolution, activation and pooling structures. The CNN output is a feature space specific to the compound-protein pair; this feature space is then used as the input to a fully connected layer (fully connected network, FCN), which completes the mapping from the input compound and protein feature vectors to the affinity value.
Optionally, the model takes three selected input variables: the protein structure attribute sequence derived from the UniRef database, the compound SMILES string from the STITCH database, and the compound's two-dimensional molecular graph. The protein structure attribute sequence encodes the protein's secondary structure, the length of its amino-acid sequence, its physicochemical properties (polarity/non-polarity, acidity/basicity) and its solvent accessibility.
Optionally, the model is trained on measured affinity values for a large number of compound-protein pairs, yielding refined model parameters.
Optionally, the model further includes a training process for the bidirectional gated recurrent unit model; a specific embodiment of this training process is provided below.
In this embodiment, the target detection and target prediction models are trained as follows. First, the compound's one-dimensional SMILES sequence is fed into one BiGRU model, the compound's two-dimensional molecular graph into the GCN model, and the protein sequence into another BiGRU model. The two compound representations are fused into a single compound feature vector, which, together with the protein feature vector, is fed into the CNN model, forming the training data. During training, the numbers of units of the compound BiGRU and the protein BiGRU are set to 128 and 256 cells respectively. The BiGRU/BiGRU/GCN stage and the CNN model are then trained jointly; to reduce model complexity, the BiGRU/BiGRU/GCN stage is pre-trained and its parameters fixed, after which it is trained together with the CNN model to determine the CNN parameters. The initial learning rate of the whole model is 0.0001 and the loss function is the mean absolute error (MAE) loss. During training, the error between predicted and true values is computed, the Adam optimizer adjusts the network parameters and the weights of the model, and the loss value keeps falling through continued iteration until the network finally converges.
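The training recipe (MAE loss, Adam, initial learning rate 1e-4) can be illustrated on a linear stand-in for the full network. The data are synthetic; only the optimization procedure mirrors the description, and the Adam update follows the standard formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))          # fused compound/protein features
w_true = rng.normal(size=8)
y = X @ w_true                        # surrogate affinity labels

def mae(w):
    """Mean absolute error of the linear stand-in model."""
    return float(np.mean(np.abs(X @ w - y)))

w = np.zeros(8)
m, v = np.zeros(8), np.zeros(8)       # Adam first/second moments
lr, b1, b2, eps = 1e-4, 0.9, 0.999, 1e-8
loss_start = mae(w)
for t in range(1, 2001):
    g = X.T @ np.sign(X @ w - y) / len(y)   # MAE subgradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
loss_end = mae(w)
```

With the small learning rate the loss falls gradually over the iterations, matching the "continuously reduced loss until convergence" behavior described above.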
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium, such as a memory, comprising instructions executable by a processor to perform the following steps: constructing a bidirectional gated recurrent unit (BiGRU) model, a graph convolutional network (GCN) model and a convolutional neural network (CNN) model, the overall network architecture being BiGRU/BiGRU/GCN-CNN. The bidirectional gated recurrent unit model is a sequence-processing model composed of two gated recurrent units (GRUs), one receiving the input in the forward direction and the other in reverse; it is a bidirectional recurrent neural network whose units have only an update gate and a reset gate. The model's inputs are the compound's one-dimensional SMILES sequence, the protein sequence and the compound's two-dimensional molecular graph, which are fed into the BiGRU, BiGRU and GCN sub-models respectively. The BiGRU/BiGRU/GCN stage outputs a feature vector representing the compound and a feature vector representing the protein. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and takes the compound and protein feature vectors as input; the final output of the BiGRU/BiGRU/GCN-CNN model is the root-mean-square error between the predicted and measured compound-protein affinity values.
Optionally, the bidirectional gating and circulating unit (BiGRU) model enables data to be input from both positive and negative directions, so that information at each moment includes sequence information of previous and subsequent moments, that is, sequence information of a network at a certain specific moment is increased, and information of historical data is fully utilized, thereby enabling prediction to be more accurate. The basic idea of BiGRU is to present each training sequence forward and backward to two separate hidden layers, both connected to the same output layer. The output layer has the complete past and future information for each point in the input sequence. Wherein, the gating cycle unit (GRU) carries out sufficient feature extraction to the multivariate time sequence, continuously learns the long-term dependence of the multivariate time sequence, and the gating cycle unit specifically comprises: firstly, two gating states are obtained through the last transmitted state and the input of the current node, namely a reset gate (reset gate) for controlling reset and a update gate (update gate), after a gating signal is obtained, data after reset is obtained through the reset gate, the data and the input of the current node are spliced, the data are shrunk to the range of-1 to 1 through a hyperbolic tangent function, finally, the states are updated to the range of 0 to 1 through the update gate, and the more gating signals are close to 1, the more data are represented to be memorized.
Optionally, the main algorithm flow of the graph convolution neural network (GCN) model is to calculate neighbor information of each atomic node, and finally form an atomic vector containing the neighbor information.
Optionally, the convolutional neural network (CNN) model is composed of convolution, activation and pooling structures. The CNN output is a feature space specific to the compound-protein pair; this feature space is then used as the input of a fully connected layer (or fully connected network, FCN), which completes the mapping from the input compound feature vector and protein feature vector to the affinity value.
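The convolution, activation, pooling and fully connected stages can be sketched for a single fused feature vector as follows; kernel sizes and dimensions are hypothetical, and a real model would learn these weights:

```python
import numpy as np

def conv1d(x, kernels):
    """Valid 1-D convolution of a feature vector with several kernels."""
    k = kernels.shape[1]
    windows = np.stack([x[i:i + k] for i in range(len(x) - k + 1)])
    return windows @ kernels.T                      # (positions, n_kernels)

rng = np.random.default_rng(3)
features = rng.normal(size=32)                      # fused compound+protein vector
kernels = rng.normal(size=(4, 5))                   # 4 kernels of width 5
conv_out = np.maximum(0, conv1d(features, kernels))  # activation (ReLU)
pooled = conv_out.max(axis=0)                       # global max pooling
w_fc = rng.normal(size=4)                           # fully connected layer weights
affinity = float(pooled @ w_fc)                     # scalar affinity estimate
print(conv_out.shape, pooled.shape)
```

The fully connected stage reduces whatever feature map the convolution and pooling produce down to the single scalar affinity output.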
Optionally, three input variables are selected for the model: a protein structure attribute sequence from the UniRef database, a compound SMILES string from the STITCH database, and a two-dimensional compound molecular graph. The protein structure attribute sequence is encoded from the secondary structure of the protein, the length of the protein amino acid sequence, the physicochemical properties of the protein (polarity/non-polarity, acidity/basicity) and the solvent accessibility of the protein.
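As an illustration of such a per-residue encoding (the exact coding scheme is not specified here, so the one-hot layout, the polarity and acidity sets, and the H/E/C secondary structure codes below are all assumptions):

```python
import numpy as np

SS_CODES = {"H": 0, "E": 1, "C": 2}       # helix / strand / coil (assumed coding)
POLAR = set("STYNQCDEKRH")                # polar amino acids (approximate set)
ACIDIC = set("DE")                        # acidic amino acids

def encode_residue(aa, ss, sasa):
    """One residue -> [ss one-hot(3), polar flag, acidic flag, solvent accessibility]."""
    v = np.zeros(6)
    v[SS_CODES[ss]] = 1.0
    v[3] = 1.0 if aa in POLAR else 0.0
    v[4] = 1.0 if aa in ACIDIC else 0.0
    v[5] = sasa                           # relative solvent-accessible surface area
    return v

seq = "MKD"                               # toy 3-residue protein
ss = "HHC"                                # its secondary structure string
sasa = [0.2, 0.5, 0.9]                    # per-residue accessibility values
X = np.stack([encode_residue(a, s, r) for a, s, r in zip(seq, ss, sasa)])
print(X.shape)
```

The resulting matrix (residues by attribute channels) is the kind of structure attribute sequence the BiGRU protein branch would consume.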
Optionally, the model is trained on known affinity values of a large number of compound-protein pairs, yielding refined model parameters.
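Training against known affinity values with a root-mean-square-error objective can be illustrated with a toy linear model fitted by gradient descent; the synthetic data below stands in for the full BiGRU/BiGRU/GCN-CNN training loop:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 16))                 # fused compound/protein features
w_true = rng.normal(size=16)
y = X @ w_true + rng.normal(scale=0.1, size=200)  # "measured" affinity values

w = np.zeros(16)
for _ in range(500):                           # plain gradient descent on MSE
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.05 * grad

rmse = float(np.sqrt(np.mean((X @ w - y) ** 2)))  # the evaluation metric in the text
print(round(rmse, 3))
```

The RMSE computed on held-out pairs is what the description uses to judge prediction quality; here it converges toward the injected noise level of the synthetic labels.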
The non-transitory computer readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic tape, an optical storage device, and the like.
Using artificial intelligence to process spatio-temporal sequences in the medical field can alleviate the high cost, long duration and safety problems of new drug development. The ability to screen for new indications and therapeutic targets among old drugs, and among discontinued compounds already established as safe, is changing the landscape of drug development and creating a drug repositioning model for it.
Those of skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments disclosed herein, it should be understood that the disclosed methods, articles of manufacture (including but not limited to devices, apparatuses, etc.) may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
It should be understood that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. The present invention is not limited to the procedures and structures that have been described above and shown in the drawings, and various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (10)
1. A novel deep learning model for predicting compound-protein affinity, characterized by comprising a bidirectional gated recurrent unit (BiGRU) model, a graph convolutional neural network (GCN) model and a convolutional neural network (CNN) model, the overall network architecture being BiGRU/BiGRU/GCN-CNN. The bidirectional gated recurrent unit model comprises a sequence processing model consisting of two gated recurrent units (GRUs), one receiving the input in the forward direction and the other in the reverse direction; the GRU can be regarded as a simplified recurrent unit in which the input gate and forget gate are merged into a single update gate. The inputs to the model are a one-dimensional compound SMILES sequence, a protein sequence and a two-dimensional compound molecular graph, which are fed into the BiGRU, BiGRU and GCN branches respectively. The BiGRU/GCN branches output a feature vector representing the compound and a feature vector representing the protein. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and takes the compound feature vector and the protein feature vector as input; the final output of the BiGRU/BiGRU/GCN-CNN model is the predicted compound-protein affinity value, evaluated by the root mean square error against the measured value.
2. The novel deep model of claim 1, wherein the bidirectional gated recurrent unit (BiGRU) model feeds the data in simultaneously from both the forward and the reverse direction, so that the representation at each time step incorporates sequence information from both earlier and later time steps; this enriches the sequence information available to the network at any given moment, makes full use of the historical data, and thereby yields more accurate predictions. The basic idea of BiGRU is to present each training sequence forward and backward to two separate hidden layers, both connected to the same output layer, so that the output layer has complete past and future context for every point in the input sequence. The gated recurrent unit (GRU) performs thorough feature extraction on a multivariate time series and continually learns its long-term dependencies. A GRU step proceeds as follows: two gating states, a reset gate and an update gate, are first computed from the previously transmitted state and the current input; the reset gate is applied to the previous state, the result is combined with the current input, and a hyperbolic tangent function squashes the candidate state into the range -1 to 1; finally the update gate, whose values lie between 0 and 1, blends the previous state with the candidate state, where a gate value closer to 1 means more information is retained.
3. The feature extraction model of claim 2, wherein the main algorithmic flow of the graph convolutional neural network (GCN) model is to aggregate, for each atomic node, the information of its neighboring atoms, ultimately forming an atom vector that encodes this neighborhood information.
4. The compound feature extraction model of claim 3, wherein the convolutional neural network (CNN) model is composed of convolution, activation and pooling structures. The CNN output is a feature space specific to the compound-protein pair; this feature space is then used as the input of a fully connected layer (or fully connected network, FCN), which completes the mapping from the input compound feature vector and protein feature vector to the affinity value.
5. The entire model of claim 4, wherein three input variables are selected for the model: a protein structure attribute sequence from the UniRef database, a compound SMILES string from the STITCH database, and a two-dimensional compound molecular graph. The protein structure attribute sequence is encoded from the secondary structure of the protein, the length of the protein amino acid sequence, the physicochemical properties of the protein (polarity/non-polarity, acidity/basicity) and the solvent accessibility of the protein.
6. The novel deep model of claim 5, wherein the model is trained on known affinity values of a plurality of compound-protein pairs, yielding refined model parameters.
7. A computer device comprising a memory, a graphics card, a central processing unit, and an executable program stored in the memory and capable of being processed in parallel by the central processing unit and the graphics card, wherein the central processing unit implements the following steps when executing the program: constructing a target detection and target prediction model comprising a feature extraction network and a prediction network; first performing feature extraction on the input one-dimensional compound SMILES sequence, two-dimensional molecular graph sequence and protein structure attribute sequence using the feature extraction network; then feeding the extracted feature vector matrix into the target prediction model, which applies convolution, pooling and fully connected operations to the feature vector matrix and outputs the root mean square error between the predicted and the actual affinity value.
8. The computer device of claim 7, wherein the bidirectional gated recurrent unit model feeds the data in simultaneously from both the forward and the reverse direction, so that the representation at each time step incorporates sequence information from both earlier and later time steps; this enriches the sequence information available to the network at any given moment, makes full use of the historical data, and thereby yields more accurate predictions. The gated recurrent unit performs thorough feature extraction on a multivariate time series and continually learns its long-term dependencies; specifically, two gating states, one controlling reset and one controlling update, are first computed from the previously transmitted state and the current input; the reset gate is applied to the previous state, the result is combined with the current input, and a hyperbolic tangent function squashes the candidate state into the range -1 to 1; finally the update gate, whose values lie between 0 and 1, performs both forgetting and memorizing when blending the previous state with the candidate state, where a gate value closer to 1 means more information is retained.
9. The computer device of claim 7, wherein three input variables are selected for the model: a protein structure attribute sequence from the UniRef database, a compound SMILES string from the STITCH database, and a two-dimensional compound molecular graph. The protein structure attribute sequence is encoded from the secondary structure of the protein, the length of the protein amino acid sequence, the physicochemical properties of the protein (polarity/non-polarity, acidity/basicity) and the solvent accessibility of the protein.
10. A storage medium storing a computer program which, when executed by a central processing unit, performs steps comprising: constructing a bidirectional gated recurrent unit (BiGRU) model, a graph convolutional neural network (GCN) model and a convolutional neural network (CNN) model, the overall network architecture being BiGRU/BiGRU/GCN-CNN. The bidirectional gated recurrent unit model comprises a sequence processing model consisting of two gated recurrent units (GRUs), one receiving the input in the forward direction and the other in the reverse direction; the GRU can be regarded as a simplified recurrent unit in which the input gate and forget gate are merged into a single update gate. The inputs to the model are a one-dimensional compound SMILES sequence, a protein sequence and a two-dimensional compound molecular graph, which are fed into the BiGRU, BiGRU and GCN branches respectively. The BiGRU/GCN branches output a feature vector representing the compound and a feature vector representing the protein. The CNN model consists of a convolutional layer, a pooling layer and a fully connected layer, and takes the compound feature vector and the protein feature vector as input; the final output of the BiGRU/BiGRU/GCN-CNN model is the predicted compound-protein affinity value, evaluated by the root mean square error against the measured value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011502118.6A CN112530515A (en) | 2020-12-18 | 2020-12-18 | Novel deep learning model for predicting protein affinity of compound, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112530515A true CN112530515A (en) | 2021-03-19 |
Family
ID=75001351
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011502118.6A Pending CN112530515A (en) | 2020-12-18 | 2020-12-18 | Novel deep learning model for predicting protein affinity of compound, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112530515A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113160894A (en) * | 2021-04-23 | 2021-07-23 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for predicting interaction between medicine and target |
CN113160894B (en) * | 2021-04-23 | 2023-10-24 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for predicting interaction between medicine and target |
WO2023029351A1 (en) * | 2021-08-30 | 2023-03-09 | 平安科技(深圳)有限公司 | Self-supervised learning-based method, apparatus and device for predicting properties of drug small molecules |
CN113744799A (en) * | 2021-09-06 | 2021-12-03 | 中南大学 | End-to-end learning-based compound and protein interaction and affinity prediction method |
CN113782094A (en) * | 2021-09-06 | 2021-12-10 | 中科曙光国际信息产业有限公司 | Modification site prediction method, modification site prediction device, computer device, and storage medium |
CN113744799B (en) * | 2021-09-06 | 2023-10-13 | 中南大学 | Method for predicting interaction and affinity of compound and protein based on end-to-end learning |
CN113889179A (en) * | 2021-10-13 | 2022-01-04 | 山东大学 | Compound-protein interaction prediction method based on multi-view deep learning |
CN113889179B (en) * | 2021-10-13 | 2024-06-11 | 山东大学 | Compound-protein interaction prediction method based on multi-view deep learning |
CN114464270A (en) * | 2022-01-17 | 2022-05-10 | 北京工业大学 | Universal method for designing medicines aiming at different target proteins |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112530515A (en) | Novel deep learning model for predicting protein affinity of compound, computer equipment and storage medium | |
CN112530514A (en) | Novel depth model, computer device, storage medium for predicting compound protein interaction based on deep learning method | |
Wayne et al. | Unsupervised predictive memory in a goal-directed agent | |
US12014257B2 (en) | Domain specific language for generation of recurrent neural network architectures | |
CN112562781A (en) | Novel coding scheme, computer device and storage medium for predicting compound protein affinity based on deep learning | |
CN112542211A (en) | Method for predicting protein affinity of compound based on single attention mechanism, computer device and storage medium | |
CN112582020A (en) | Method for predicting compound protein affinity based on edge attention mechanism, computer device and storage medium | |
CN112231489B (en) | Knowledge learning and transferring method and system for epidemic prevention robot | |
CN112652358A (en) | Drug recommendation system, computer equipment and storage medium for regulating and controlling disease target based on three-channel deep learning | |
CN112562791A (en) | Drug target action depth learning prediction system based on knowledge graph, computer equipment and storage medium | |
CN108369661A (en) | Neural network programming device | |
CN116206775A (en) | Multi-dimensional characteristic fusion medicine-target interaction prediction method | |
CN115346372B (en) | Multi-component fusion traffic flow prediction method based on graph neural network | |
CN114707655B (en) | Quantum line conversion method, quantum line conversion system, storage medium and electronic equipment | |
CN114357319A (en) | Network request processing method, device, equipment, storage medium and program product | |
AU2022392233A1 (en) | Method and system for analysing medical images to generate a medical report | |
CN109731338B (en) | Artificial intelligence training method and device in game, storage medium and electronic device | |
CN116208399A (en) | Network malicious behavior detection method and device based on metagraph | |
CN114463596A (en) | Small sample image identification method, device and equipment of hypergraph neural network | |
CN113705402B (en) | Video behavior prediction method, system, electronic device and storage medium | |
CN114743590A (en) | Drug-target affinity prediction system based on graph convolution neural network, computer device and storage medium | |
CN110781968B (en) | Extensible class image identification method based on plastic convolution neural network | |
CN116312856A (en) | Medicine interaction prediction method and system based on substructure | |
WO2020190887A1 (en) | Methods and systems for de novo molecular configurations | |
CN114399901A (en) | Method and equipment for controlling traffic system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20210319 |