CN114093435A - Chemical molecule related water solubility prediction method based on deep learning - Google Patents
- Publication number
- CN114093435A (application CN202111228584.4A)
- Authority
- CN
- China
- Prior art keywords
- deep learning
- model
- chemical
- solubility
- learning model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
  - G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    - G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
      - G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
        - G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
        - G16C20/70—Machine learning, data mining or chemometrics
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
      - G06N3/00—Computing arrangements based on biological models
        - G06N3/02—Neural networks
          - G06N3/04—Architecture, e.g. interconnection topology
            - G06N3/044—Recurrent networks, e.g. Hopfield networks
          - G06N3/08—Learning methods
Abstract
The invention discloses a deep-learning-based method for predicting the water solubility of chemical molecules. The method comprises the following steps: constructing a deep learning model, wherein the deep learning model is built on a bidirectional time-series prediction model and an attention mechanism and learns the correspondence between a chemical molecule's structure sequence and its water-solubility attribute; and training the deep learning model with the objective of minimizing a set loss function, wherein the training process takes character-sequence encodings representing chemical molecular structures as input and produces water-solubility attribute information related to the chemical molecules as output. The deep learning model trained by the method can accurately predict water solubility and other related attributes.
Description
Technical Field
The invention relates to the technical field of molecular water-solubility analysis, and in particular to a deep-learning-based method for predicting the water solubility of chemical molecules.
Background
In recent years, deep learning has been applied successfully to object detection and image segmentation, providing a useful tool for processing large amounts of data and making useful predictions in the sciences. However, applying deep learning frameworks to molecular property prediction remains a challenging research problem. The use of deep learning in drug discovery has been further facilitated by new experimental techniques and the dramatic increase in available compound activity and biomedical data; examples include predicting molecular interactions during drug design in pharmaceutical companies, predicting drug-target interactions, searching for chemical synthesis and retrosynthesis routes, and predicting chemical properties.
Deep learning is anticipated to become ever more involved in the field of drug discovery. Water solubility, an important physicochemical molecular property, has been studied intensively for many years in the history of drug discovery. Various representations of chemical information and various deep learning architectures have been applied to the solubility prediction problem. The choice of representation depends on the model; the most common combinations include molecular fingerprints with fully connected neural networks, SMILES representations with recurrent neural networks, and molecular graphs with graph neural networks. In existing water-solubility prediction architectures, the size of the training dataset ranges from roughly 100 to 10,000 molecules. The reported performance varies greatly with the dataset used, and many challenges remain, such as dataset noise and the complex spatial structure of molecules.
In conclusion, building a stable and robust deep learning model that performs well on molecular water-solubility prediction would save both time and economic cost in drug development.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a chemical molecule related water solubility prediction method based on deep learning.
According to a first aspect of the present invention, a chemical molecule-related water solubility prediction method based on deep learning is provided. The method comprises the following steps:
constructing a deep learning model, wherein the deep learning model is constructed on the basis of a bidirectional time series prediction model and an attention mechanism and is used for learning the corresponding relation between a chemical molecular structure sequence and a water-solubility attribute;
and training the deep learning model with the objective of minimizing a set loss function, wherein the training process takes character-sequence encodings representing chemical molecular structures as input and produces water-solubility attribute information related to the chemical molecules as output.
According to a second aspect of the present invention, a chemical molecule-related water solubility prediction method is provided. The method comprises the following steps:
acquiring a character sequence code representing a chemical molecular structure to be detected;
inputting the character sequence code into the trained deep learning model obtained according to the first aspect of the invention to obtain the water-solubility attribute information related to the chemical molecule.
Compared with the prior art, the advantage of the invention is that a data-driven, end-to-end deep learning model (BCSA) is provided and applied to the prediction of molecular water solubility. The proposed model is simple, does not depend on additional auxiliary knowledge, and can also be used to predict other physicochemical and ADMET properties.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is an architectural diagram of an end-to-end deep learning model according to one embodiment of the invention;
FIG. 2 is a schematic diagram of the variation of R² on the validation set and the test set during training, according to one embodiment of the present invention;
FIG. 3 is a predicted effect scatter plot of four different models according to one embodiment of the present invention;
FIG. 4 is a scatter plot of predicted outcomes over a test set, according to one embodiment of the invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In short, the deep-learning-based method for predicting the water solubility of chemical molecules provided by the invention comprises a pre-training process and an actual prediction process for the deep learning model. The pre-training process comprises: constructing a deep learning model built on a bidirectional time-series prediction model and an attention mechanism, which learns the correspondence between the chemical molecular structure sequence and the water-solubility attribute; and training the deep learning model with the objective of minimizing a set loss function, where training takes character-sequence encodings representing chemical molecular structures as input and outputs the related water-solubility attribute information. The bidirectional time-series prediction model may be, for example, a bidirectional long short-term memory network (BILSTM) or a bidirectional gated recurrent unit (BIGRU). The character sequence characterizing a chemical molecular structure may be in the SMILES format, a specification that describes molecular structure explicitly as an ASCII string, or in another format. For clarity, the BILSTM model and SMILES are used as examples below.
In the invention, the BCSA model architecture is built on BILSTM, channel attention and spatial attention, using the SMILES {Weininger, 1988#86} molecular representation. To address the non-uniqueness of SMILES representations, the data are amplified with a SMILES enhancement technique to obtain a larger effective labeled dataset as model input, and the average over each molecule's amplified variants is taken as the final prediction, giving the model stronger generalization ability. Several common graph neural network models are then compared with the invention on the same dataset, to explore the performance advantages of the proposed model under different molecular representations.
Hereinafter, the data preprocessing process, the model architecture, and the evaluation result will be described in detail.
First, representation and preprocessing of molecular data sets
In one embodiment, the dataset used is derived from the 2020 work of Cui {Cui, 2020#69} et al. and contains 9943 non-redundant compounds. The molecules are presented in the SMILES (Simplified Molecular-Input Line-Entry System) format. This symbolic format describes a molecule's atoms and covalent bonds as a single line of text. From the perspective of formal language theory, both atoms and covalent bonds are treated as symbol labels, and a SMILES string is simply a sequence of symbols. This representation has been used to predict biochemical properties. To encode SMILES, the invention tokenizes each string using the regular expressions in {Schwaller, 2018#64}, with tokens separated by white space; for example, "c1c(C)cccc1" is processed into "c 1 c ( C ) c c c c 1". Next, a word2vec-like method is employed to embed the input. Furthermore, the dataset is expanded by SMILES enumeration, and each SMILES string is padded with a "pad" token to a fixed length of 150 characters; text beyond this length is discarded. Finally, the dataset is randomly divided into a training set (80%), a validation set (10%) and a test set (10%).
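As an illustration of this preprocessing, the following sketch tokenizes and pads a SMILES string. The regular expression follows the atom-level pattern in the style of {Schwaller, 2018#64}, and the vocabulary with its "\<pad\>" and "\<unk\>" symbols is an assumption, not something specified in the text.

```python
import re

# Atom-level SMILES tokenization pattern in the style of Schwaller et al.;
# whether the patent uses exactly this pattern is an assumption.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

MAX_LEN = 150  # fixed padded length from the description


def tokenize(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_REGEX.findall(smiles)


def encode(smiles, vocab):
    """Map tokens to integer ids, truncate to MAX_LEN, pad with '<pad>'."""
    tokens = tokenize(smiles)[:MAX_LEN]
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
    ids += [vocab["<pad>"]] * (MAX_LEN - len(ids))
    return ids


print(" ".join(tokenize("c1c(C)cccc1")))  # -> "c 1 c ( C ) c c c c 1"
```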
Second, deep learning model architecture
Referring to fig. 1, the deep learning model body includes a BILSTM, a channel attention module and a spatial attention module, which are used for learning the corresponding relationship between the chemical molecule structure sequence and the water-solubility attribute.
BILSTM is mainly used to capture the sequence information of SMILES. Drawing on the strong ability of RNN (recurrent neural network) models to handle long-range relations in sequences, demonstrated in natural language processing, the invention uses BILSTM, a variant of the LSTM model, in batch mode to acquire the context information of the SMILES sequence. A BILSTM combines an LSTM that processes the sequence forward with one that processes it backward, which allows it to use features not only from the past but also from the future. The BILSTM takes the SMILES sequence encoding $x = \{x_1, x_2, \ldots, x_T\}$ as input; each time step $t$ outputs a forward hidden state $\overrightarrow{h_t}$ and a backward hidden state $\overleftarrow{h_t}$, and the output of the hidden layer of the BILSTM at time $t$ is the concatenation of the two states:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \qquad (1)$$

Further, the processing of the BILSTM can be summarized as:

$$C = f(W_e x_i, h_{t-1}) \qquad (2)$$

where $f$ denotes a multilayer BILSTM and $W_e$ is the learned weight of the embedding vector; in simplified form, the output is:

$$C = \{h_1, h_2, \ldots, h_T\} \qquad (3)$$
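A minimal PyTorch sketch of this encoder follows; the embedding and hidden sizes are placeholders rather than the tuned hyper-parameters of Table 1.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Embeds SMILES token ids and runs a multilayer BILSTM, returning
    C = {h_1, ..., h_T}, where each h_t concatenates the forward and
    backward hidden states as in equations (1)-(3)."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)  # weights W_e
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        x = self.embedding(token_ids)   # (batch, T, embed_dim)
        C, _ = self.bilstm(x)           # (batch, T, 2 * hidden_dim)
        return C
```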
For the attention mechanism, the embodiment of the invention embeds a CBAM (Convolutional Block Attention Module) mechanism into the forward-propagation sequence neural network model. CBAM comprises two sub-modules, a channel attention map $M_c$ and a spatial attention map $M_s$, which obtain the salient information along the channel axis and the spatial (sequence) axis, respectively. The whole attention output process can be expressed as:

$$\tilde{C} = \sigma(M_c(C)) \otimes C, \qquad C' = \sigma(M_s(\tilde{C})) \otimes \tilde{C} \qquad (4)$$

where $\otimes$ denotes the element-wise product, $\sigma$ denotes the sigmoid activation function, and $C'$ is the final output.
Specifically, the channel attention module focuses mainly on what the SMILES character content is. The spatial information of the BILSTM output matrix is first aggregated by average-pooling and max-pooling operations to obtain two different spatial context descriptors $C_{avg}$ and $C_{max}$, representing the average-pooled and max-pooled output information, respectively. The two descriptors are then fed separately into a shared 2-layer MLP network, and the channel attention output vector is obtained by summation. The whole process is formalized as:

$$M_c(C) = \mathrm{MLP}(\mathrm{AvgPool1d}(C)) + \mathrm{MLP}(\mathrm{MaxPool1d}(C)) = W_1(\sigma(W_0(C_{avg}))) + W_1(\sigma(W_0(C_{max}))) \qquad (5)$$

To reduce network overhead, $\sigma$ here denotes, for example, a relu activation function, and $W_0$ and $W_1$ are the learned weights of the first and second layers of the shared MLP (multi-layer perceptron), respectively.
The spatial attention module focuses mainly on the SMILES character sequence information. In one embodiment, it is realized with a two-layer one-dimensional convolutional network with kernel size 7:

$$M_s(C) = \mathrm{Conv1d}_{7,1}(\sigma(\mathrm{Conv1d}_{7,16}(C))) \qquad (6)$$

where $\sigma$ denotes the relu activation function and $\mathrm{Conv1d}_{7,x}$ denotes a 1-dimensional convolutional layer with kernel size 7 and $x$ filters. The final overall attention network module is expressed as:

$$O = \mathrm{AvgPool1d}(\sigma(M_s(\tilde{C})) \otimes \tilde{C}) \qquad (7)$$

where $\otimes$ denotes the element-wise product and $O$ is the hidden-state mapping vector obtained after the attention-weighted states are aggregated by the average-pooling operation.
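The spatial sub-module and the aggregation of equation (7) might be sketched as follows; channel-first tensors (the BILSTM output transposed to (batch, channels, T)) and a padding of 3 to preserve the sequence length are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention1d(nn.Module):
    """Equation (6): Conv1d_{7,16} then Conv1d_{7,1}, relu in between,
    sigmoid from equation (4) on the resulting map."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, 16, kernel_size=7, padding=3)
        self.conv2 = nn.Conv1d(16, 1, kernel_size=7, padding=3)

    def forward(self, C):
        # C: (batch, channels, T) -> attention map (batch, 1, T)
        return torch.sigmoid(self.conv2(torch.relu(self.conv1(C))))


def attend_and_pool(C, channel_attn, spatial_attn):
    """Apply channel then spatial attention weighting and average-pool
    over the sequence axis to obtain the vector O of equation (7)."""
    C = C * channel_attn(C)    # channel weighting: sigma(M_c(C)) * C
    C = C * spatial_attn(C)    # spatial weighting: sigma(M_s(.)) * .
    return C.mean(dim=2)       # O: (batch, channels)
```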
In the present invention, the last part of the regression task delivers the trained vector $O$ to a two-layer fully connected layer to predict the final attribute value. For example, relu, commonly used in deep learning research, can serve as the intermediate activation function, and dropout can be used to mitigate overfitting. During training, MSE (mean squared error) is used as the loss function:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2 \qquad (8)$$

where $N$ denotes the size of the training data, $\hat{y}_i$ denotes the predicted value, and $y_i$ denotes the experimental true value.
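A minimal sketch of this regression head and training criterion; the hidden width and dropout rate are placeholders, not the tuned values of Table 1.

```python
import torch.nn as nn

class RegressionHead(nn.Module):
    """Two fully connected layers mapping the attended vector O to the
    predicted property value, with relu and dropout as described; the
    width 128 and dropout 0.2 are assumed values."""

    def __init__(self, in_dim, hidden=128, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, O):
        return self.net(O).squeeze(-1)


# Equation (8) as the training criterion:
loss_fn = nn.MSELoss()  # loss = loss_fn(y_pred, y_true)
```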
Third, selection of hyper-parameters
In the model provided by the invention, several parameters influence training and architecture, and model performance differs under different parameter settings. In one embodiment, Bayesian optimization {Bergstra, 2011#92} is employed to explore the best hyper-parameter choices, taking

$$1 - R^2 = \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i (y_i - \bar{y})^2}$$

as the acquisition function's minimization target, where $\hat{y}_i$ denotes the predicted value, $y_i$ the experimental true value, and $\bar{y}$ the mean of the experimental true values. During optimization, a probabilistic model is built from past results using the TPE (Tree-structured Parzen Estimator) algorithm. Training is carried out on the training set: a total of 100 models are generated, each trained for 60 epochs, and an early-stopping strategy (patience of 20) is added to speed up training. Finally, the best training hyper-parameters, shown in Table 1, are selected by the best prediction performance on the validation set. The model is then further trained for 30 epochs on the enumeration-augmented training set in anticipation of improving the final accuracy.
Table 1: hyper-parameter selection space and optimal hyper-parameters
The model framework is implemented using PyTorch, and all computation and model training run on a Linux server (openSUSE) with an Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00 GHz and an NVIDIA GeForce RTX 2080 Ti graphics card with 11 GB of memory.
Fourth, evaluation criteria
In one embodiment, the provided model is evaluated using four performance indicators commonly used in regression tasks: the coefficient of determination R-squared (R²), the Spearman correlation coefficient, RMSE and MAE. R² and the Spearman coefficient help observe whether the model fits the data well overall: the closer the value is to 1, the better the fit, and vice versa. RMSE and MAE measure the difference between predicted and true values: the closer the value is to 0, the better the prediction, and vice versa.
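A minimal sketch of computing these four indicators:

```python
import numpy as np
from scipy.stats import spearmanr

def regression_metrics(y_true, y_pred):
    """R^2, Spearman correlation, RMSE and MAE for a set of predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "Spearman": spearmanr(y_true, y_pred).correlation,
        "RMSE": float(np.sqrt(np.mean((y_true - y_pred) ** 2))),
        "MAE": float(np.mean(np.abs(y_true - y_pred))),
    }
```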
Fifth, verification results for water solubility
The invention aims to develop a deep learning model using the self-encoding of the molecular SMILES sequence, in order to explore how well a deep neural network based on SMILES molecular sequence descriptors predicts molecular solubility. For example, the original dataset comprises 7955 training, 996 validation and 995 test molecules. Using the optimal hyper-parameters of Table 1, a BILSTM model was built, and the BCSA model was built on top of it. FIG. 2 shows the trend of the fitting metric R² on the validation set and the test set over 400 training epochs, with a curve smoothness of 0.8. The figure clearly shows that the model of the invention has a stronger fitting effect and generalization ability than the BILSTM model on both the validation set and the test set.
In deep learning, more training samples generally yield a better-trained model with stronger generalization ability. Data enhancement is possible and necessary here because the model of the invention is based on the sequence encoding of SMILES molecules, and a molecule has many different valid SMILES strings, i.e., many sequence encodings. Preferably, the original split datasets are further amplified with the SMILES enhancement technique, and BCSA models with 20-fold (20 SMILES per molecule) and 40-fold (40 SMILES per molecule) molecular enhancement are trained, respectively; structurally simple molecules may produce repeated SMILES. To prevent any influence on the training results, repeated data are removed, and the finally obtained training, validation and test sets contain (134454:19881:16834) and (239260:30042:39800) amplified data, respectively. In the experiment, the model with the best validation-set R² during training is used, and the average over a molecule's amplified variants in the test set is taken as the final prediction, to measure the model's ability to extract molecular sequence information; the results are shown in Table 2. The verification results show that the stability and generalization ability of the enhanced-data model improve markedly, and the model achieves the best effect on the SMILES40 dataset, indicating that the enhanced model better attends to a molecule's different sequence encodings. On the test set, the model achieves R² = 0.83–0.88 and RMSE = 0.79–0.95. Compared with the deeper-net model (R² = 0.72–0.79, RMSE = 0.988–1.151) originally developed by Cui on this dataset using molecular fingerprints, the invention shows better prediction performance.
Table 2: prediction statistics for training and test sets
To better demonstrate the competitiveness of the model of the invention, a series of graph-neural-network baseline models, GCN {Kipf, 2016#3}, MPNN {Gilmer, 2017#50} and AttentiveFP {Pérez Santín, 2021#53}, were further built to study the influence of molecular-enhancement-based sequence descriptors versus molecular graph descriptors on solubility prediction. These models were implemented with DGL-LifeSci, the life-science Python package released by the DGL team. FIG. 3 shows scatter plots of predicted versus true solubility values for the different models on the same test set. As can be seen from the figure, the molecular-enhancement-based BCSA model achieves the best molecular solubility prediction and predicts well over data in different ranges. The model of the invention therefore has a clear competitive advantage.
Sixth, prediction for other related attributes
In the experiment, the BCSA (SMILES40) model was also used to predict the related oil-water partition coefficients logP and logD (pH 7.4). The logP dataset is again based on the dataset of Cui {Cui, 2020#69} et al. As can be seen in the left panel of FIG. 4, good results were obtained on the test dataset, with an R² of 0.99 and an RMSE of 0.29; the scatter plot shows a good fit over data in every range. In addition, the logD (pH 7.4) training dataset is from Wang et al. and was randomly divided 8:1:1. Training data were obtained using 40x SMILES enumeration, finally yielding a 40x dataset with a 31290:3858:4031 (training:validation:test) split. The average prediction per molecule was selected as the final prediction. As can be seen from the right panel of FIG. 4, the test set achieved an R² of 0.93 and an RMSE of 0.36. Compared with the reported SVM model of Wang {Wang, 2015#97}, which attains R² = 0.89 and RMSE = 0.56 on its test set and R² = 0.92 and RMSE = 0.51 on its training set, the test-set prediction of the model provided by the invention even exceeds the training-set performance of Wang. The invention thus also exhibits better performance in oil-water-related predictions and can provide reliable and robust predictions.
In conclusion, since accurate prediction of water solubility is a challenging task in drug discovery, the invention provides an end-to-end deep learning model framework based on molecular enhancement and an attention mechanism fused with LSTM. It exploits the advantage of long short-term memory networks in sequence processing, adds channel attention and spatial attention modules to extract the parts of the SMILES sequence important for water-solubility prediction, and uses Bayesian optimization, so that the provided model is simple, independent of additional auxiliary knowledge (such as a molecule's complex spatial structure), and can also be used to predict other physicochemical and ADMET characteristics (absorption, distribution, metabolism, excretion and toxicity).
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.
Claims (10)
1. A chemical molecule related water solubility prediction method based on deep learning comprises the following steps:
constructing a deep learning model, wherein the deep learning model is constructed on the basis of a bidirectional time series prediction model and an attention mechanism and is used for learning the corresponding relation between a chemical molecular structure sequence and a water-solubility attribute;
and training the deep learning model with the objective of minimizing a set loss function, wherein the training process takes character-sequence encodings representing chemical molecular structures as input and produces water-solubility attribute information related to the chemical molecules as output.
2. The method of claim 1, wherein the deep learning model is a bidirectional long short-term memory network that embeds a channel attention module and a spatial attention module in the forward propagation, for obtaining salient information along the channel and spatial axes, respectively.
3. The method of claim 2, wherein the character sequence code characterizing the chemical molecular structure is a SMILES sequence code, wherein the SMILES sequence code $x = \{x_1, x_2, \ldots, x_T\}$ is used as the input of the bidirectional long short-term memory network, which outputs a forward hidden state $\overrightarrow{h_t}$ and a backward hidden state $\overleftarrow{h_t}$ at each time step $t$; the output of the hidden layer of the bidirectional long short-term memory network at time $t$ is the concatenation of the two states, expressed as $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$; and the processing of the bidirectional long short-term memory network is represented as:

$$C = f(W_e x_i, h_{t-1})$$

where $f$ denotes a multilayer bidirectional long short-term memory network and $W_e$ is the learned weight of the embedding vector.
4. A method according to claim 3, wherein the channel attention module is adapted to characterize the SMILES character content by performing the steps of:

aggregating the spatial information of the bidirectional long short-term memory network output matrix by average pooling and max pooling to obtain two different spatial context descriptors $C_{avg}$ and $C_{max}$;

feeding the two descriptors $C_{avg}$ and $C_{max}$ respectively into a shared multilayer perceptron, and obtaining the output vector of channel attention by summation;

wherein $C_{avg}$ and $C_{max}$ denote the average-pooling output information and the max-pooling output information, respectively.
5. The method of claim 4, wherein the shared multilayer perceptron is a 2-layer shared perceptron, and the processing of the channel attention module is represented as:

$$M_c(C) = \mathrm{MLP}(\mathrm{AvgPool1d}(C)) + \mathrm{MLP}(\mathrm{MaxPool1d}(C)) = W_1(\sigma(W_0(C_{avg}))) + W_1(\sigma(W_0(C_{max})))$$

where $\sigma$ denotes the relu activation function, and $W_0$ and $W_1$ are the learned weights of the first and second layers of the shared multilayer perceptron, respectively.
6. The method of claim 5, wherein the spatial attention module is configured to characterize the SMILES character sequence information using a two-layer one-dimensional convolutional network with kernel size 7, represented as:

$$M_s(C) = \mathrm{Conv1d}_{7,1}(\sigma(\mathrm{Conv1d}_{7,16}(C)))$$

where $\sigma$ denotes the relu activation function and $\mathrm{Conv1d}_{7,x}$ denotes a 1-dimensional convolutional layer with kernel size 7 and $x$ filters; and the overall attention mechanism is represented as:

$$O = \mathrm{AvgPool1d}(\sigma(M_s(\tilde{C})) \otimes \tilde{C}), \qquad \tilde{C} = \sigma(M_c(C)) \otimes C$$

where $\otimes$ denotes the element-wise product and $O$ is the hidden-state vector aggregated after attention weighting.
7. The method of claim 6, wherein the obtained vector $O$ is delivered to a two-layer fully connected layer to predict the corresponding value of the water-solubility attribute related to the chemical molecule.
9. A method for predicting water solubility associated with a chemical molecule, comprising the steps of:
acquiring a character sequence code representing a chemical molecular structure to be detected;
inputting the character sequence code into a trained deep learning model obtained according to the method of any one of claims 1 to 8, and obtaining the water solubility attribute information related to the chemical molecule.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8 or 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111228584.4A CN114093435A (en) | 2021-10-21 | 2021-10-21 | Chemical molecule related water solubility prediction method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114093435A true CN114093435A (en) | 2022-02-25 |
Family
ID=80297311
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111228584.4A Pending CN114093435A (en) | 2021-10-21 | 2021-10-21 | Chemical molecule related water solubility prediction method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114093435A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200176087A1 (en) * | 2018-12-03 | 2020-06-04 | Battelle Memorial Institute | Method for simultaneous characterization and expansion of reference libraries for small molecule identification |
CN109741797A (en) * | 2018-12-10 | 2019-05-10 | 中国药科大学 | A method of small molecule compound water solubility grade is predicted using depth learning technology |
CN111710375A (en) * | 2020-05-13 | 2020-09-25 | 中国科学院计算机网络信息中心 | Molecular property prediction method and system |
CN113128360A (en) * | 2021-03-30 | 2021-07-16 | 苏州乐达纳米科技有限公司 | Driver driving behavior detection and identification method based on deep learning |
CN113241128A (en) * | 2021-04-29 | 2021-08-10 | 天津大学 | Molecular property prediction method based on molecular space position coding attention neural network model |
CN113436115A (en) * | 2021-07-30 | 2021-09-24 | 西安热工研究院有限公司 | Image shadow detection method based on depth unsupervised learning |
Non-Patent Citations (1)
Title |
---|
SANGHYUN WOO ET AL.: "CBAM: Convolutional Block Attention Module", Proceedings of the European Conference on Computer Vision (ECCV), 31 December 2018 (2018-12-31), pages 1-17 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115171807A (en) * | 2022-09-07 | 2022-10-11 | 合肥机数量子科技有限公司 | Molecular coding model training method, molecular coding method and molecular coding system |
CN115171807B (en) * | 2022-09-07 | 2022-12-06 | 合肥机数量子科技有限公司 | Molecular coding model training method, molecular coding method and molecular coding system |
CN116386753A (en) * | 2023-06-07 | 2023-07-04 | 烟台国工智能科技有限公司 | Reverse synthesis reaction template applicability filtering method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |