CN111640471A - Method and system for predicting activity of drug micromolecules based on two-way long-short memory model - Google Patents

Method and system for predicting activity of drug micromolecules based on two-way long-short memory model

Info

Publication number
CN111640471A
Authority
CN
China
Prior art keywords
short memory
smiles
model
molecules
activity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010464590.9A
Other languages
Chinese (zh)
Inventor
Niu Zhangming (牛张明)
Wade Menpes-Smith (韦德·门佩斯-史密斯)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou derizhi Pharmaceutical Technology Co.,Ltd.
Original Assignee
Wade Menpes-Smith
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wade Menpes-Smith
Priority to CN202010464590.9A priority Critical patent/CN111640471A/en
Publication of CN111640471A publication Critical patent/CN111640471A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B 15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/70 Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • General Physics & Mathematics (AREA)
  • Biotechnology (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medicinal Chemistry (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for predicting the activity of small drug molecules based on a two-way long-short memory model, which comprises the following steps: acquiring a data set; preprocessing the data set, including representing all compound molecules in the data set by SMILES, standardizing the SMILES expressions of all molecules, unifying the encoding modes and order of atoms, bonds and connection relations in the molecular SMILES expressions, and performing de-duplication using the InChIKey of each molecule; encoding the preprocessed data set, wherein each single element, single number, single symbol and whole square-bracketed group of a SMILES sequence is treated as one sequence token for one-hot encoding, each token has chemical meaning and directionality, and any combination of tokens conforms to chemical rules; constructing a bidirectional long-short memory core segment recognition model; inputting the encoded data into the bidirectional long-short memory core segment recognition model to obtain a hidden state matrix; and evaluating the two-way long-short memory core segment recognition model.

Description

Method and system for predicting activity of drug micromolecules based on two-way long-short memory model
Technical Field
The present invention relates to the fields of chemical informatics and bioinformatics. In particular, the invention relates to a method and a system for predicting the activity of a small drug molecule based on a two-way long-short memory model.
Background
Elucidating the relationship between molecular structure and biological activity has long been an important issue in the field of medicinal chemistry. However, as experimental data grow explosively, it becomes increasingly difficult for methods based on empirical measurements and heuristic rules to elucidate such relationships.
Chemical informatics is an active area of research that predicts biological activity from molecular structure by means of high-performance computing and machine learning methods. In recent decades, with the advent of deep learning methods, machine learning has received increasing attention from the scientific community. Data-driven analysis has become a routine procedure in many chemical and pharmaceutical applications, including virtual screening, chemical property prediction and de novo molecular design. In many of these applications, machine learning shows great potential to compete with, and even surpass, conventional approaches.
The Merck Molecular Activity Challenge set off a trend of training deep learning networks on molecular fingerprints and other descriptors. The winning team used a multi-task model containing a large number of pre-computed molecular descriptors, which improved performance by 15% over the random forest baseline. Using the same training strategy, Andreas and colleagues achieved the most accurate toxicity predictions in the Tox21 challenge. Although many studies have shown that a large-scale multi-task network trained on a large number of molecular descriptors can significantly improve the predictive ability of traditional models for virtual screening and property prediction, its inherent black-box nature has been heavily criticized by the modeling community, since such models make the relationships between properties and structures more difficult to interpret.
Therefore, learning molecular properties of compounds directly from the topology of the molecule, rather than from predefined fingerprints or descriptors, has attracted increasing interest in both the chemistry and machine learning fields. Duvenaud and coworkers introduced neural fingerprints (NFPs), which attempt to extract data-driven features from molecules rather than hand-crafted features; the architecture is based on a generalization of fingerprints so that it can be learned by back-propagation. Later, Kearnes and colleagues proposed molecular graph convolutions using undirected graphs to represent small molecules, and researchers have since proposed several improved graph convolutional networks (GCNs) for dynamically extracting molecular features and predicting target properties. Despite their rather high predictive performance, inherent deficiencies of GCNs, such as limited information propagation across the entire graph and non-intuitive feature extraction, indicate that these models still have room for improvement.
In addition to graph representations, researchers have paid increasing attention to linear molecular representations as generative models have grown in popularity. Many unsupervised learning techniques with different generative models are used for novel molecular design. Most of them use SMILES (Simplified Molecular Input Line Entry System) as input to generate new molecules with specific properties. Furthermore, Vidal and colleagues suggested that simple SMILES string fragments could be used directly to calculate molecular similarity and predict lipid-water partition coefficients. These studies demonstrate that linear molecular representations can be used directly in SAR studies. It is also easier to feed linear structural notations into a sequence-based network than into connection-table (CT)-based methods. However, there has so far been no study that directly inputs SMILES into a sequence-based deep learning model for biological activity prediction.
Disclosure of Invention
In order to solve the above problems, the invention adopts a bidirectional long short-term memory (BiLSTM) model, drawing on sequence learning methods from NLP, to obtain convenient modeling and considerable prediction performance. The accuracy and applicable scope of prediction with this method are greatly improved. Based on a deep learning model, the method can effectively extract features of the input information, including many previously undiscovered feature rules, and provides more accurate prediction results.
According to one aspect of the invention, there is provided a method for predicting activity based on semantic analysis of molecular SMILES expressions with two-way long-short memory core fragment recognition, comprising the following steps:
acquiring a data set;
preprocessing the data set, including representing all compound molecules in the data set by SMILES, standardizing the SMILES expressions of all molecules, unifying the encoding modes and order of atoms, bonds and connection relations in the molecular SMILES expressions, and performing de-duplication using the InChIKey of each molecule;
encoding the preprocessed data set, wherein each single element, single number, single symbol and whole square-bracketed group of a SMILES sequence is treated as one sequence token for one-hot encoding, each token has chemical meaning and directionality, and any combination of tokens conforms to chemical rules;
constructing a bidirectional long-short memory core segment recognition model;
inputting the encoded data into the bidirectional long-short memory core segment recognition model to obtain a hidden state matrix; and
evaluating the two-way long-short memory core segment recognition model.
In one embodiment of the invention, the data set comprises three open-source data sets.
In one embodiment of the invention, the de-duplication process using the molecular InChIKey includes converting each SMILES expression into the molecule's unique InChIKey and, by comparing InChIKeys, directly removing the SMILES whose InChIKeys are completely identical, and
the preprocessing of the data sets further comprises randomly dividing each data set into a training set, a validation set and a test set in a certain proportion.
In an embodiment of the present invention, the method for predicting activity based on the semantic analysis of the molecular SMILES expression identified by the two-way long-short memory core fragment further includes converting the positive integer sequence corresponding to each token into a vector, and converting the SMILES sequence into a word embedding matrix S:
S = (w_1, w_2, ..., w_L)^T
where each w is a d-dimensional row vector.
In one embodiment of the invention, the word embedding matrix S is input into the two-way long-short memory core segment recognition model, and from the current input x_t and the hidden state h_{t-1} passed from the previous step, four states z, z_i, z_f and z_o are obtained through training with different weights,
where z is converted into a value between -1 and 1 by a tanh activation function, and z_i, z_f and z_o are converted by the activation function into values between 0 and 1 to serve as gating states:
z = tanh(W · [x_t, h_{t-1}])
z_i = σ(W_i · [x_t, h_{t-1}])
z_f = σ(W_f · [x_t, h_{t-1}])
z_o = σ(W_o · [x_t, h_{t-1}])
where σ is the sigmoid activation function and W is the network weight.
The input passed from the previous node is then selectively forgotten through z_f and selectively memorized through z_i; the resulting cell state c_t, unlike h_t in an RNN, changes little from node to node and is passed on slowly. Finally, the resulting hidden state h_t is selectively output through z_o:
c_t = z_f · c_{t-1} + z_i · z
h_t = z_o · tanh(c_t).
In an embodiment of the present invention, the bidirectional long-short memory core segment recognition model includes two recurrent neural networks that acquire information in two different directions, both layers being connected to the same input layer, wherein one layer propagates information forward over the time steps to update the information of all hidden layers, the other layer propagates information in the direction opposite to the previous layer, and the encoded hidden state vectors in the different directions are finally spliced into a matrix after the hidden layer values in the different directions are obtained by computing the output layer.
In one embodiment of the invention, the hidden states h_t in the two directions are
h_t^(f) = LSTM_f(w_t)
h_t^(b) = LSTM_b(w_t)
where t denotes the time step. h_t^(f) and h_t^(b) are spliced to form the hidden state h_t at time t, i.e.
h_t = [h_t^(f), h_t^(b)]
If the number of hidden units in each direction of the LSTM is set to u, the dimension of h_t is 1 × 2u, and all time steps are then spliced to obtain a hidden state matrix H
H = (h_1, h_2, ..., h_L)^T
where the dimension of H is L × 2u.
In an embodiment of the present invention, the core recognition fragment unit originally created in the model can make the model focus on different partial regions of the hidden state matrix; the principle is to assign different weight values to these regions, with the following formulas:
C = softmax(W_b tanh(W_a H^T))
SubCore = C · H
where W_a and W_b are trainable matrices whose dimensions are trainable model hyper-parameters. The matrix C obtained from the formulas represents the model focusing on several specific regions of the SMILES sequence, and finally C is combined with the previous hidden state matrix H to obtain the final core fragment SubCore vector values.
According to another embodiment, there is provided a system for predicting activity based on semantic analysis of molecular SMILES expressions with bidirectional long-short memory core fragment recognition, comprising:
a data preprocessing unit;
a data encoding unit;
a bidirectional long and short memory core segment identification unit; and
a classification regressor,
wherein the system is adapted to perform the above method.
In another embodiment of the invention, the encoded training set and validation set data are loaded to the bidirectional long and short memory core segment identification unit, and the bidirectional long and short memory core segment identification unit is subjected to large-scale training and validation.
Drawings
To further clarify the above and other advantages and features of embodiments of the present invention, a more particular description of embodiments of the invention will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. In the drawings, the same or corresponding parts will be denoted by the same or similar reference numerals for clarity.
Fig. 1 illustrates a system for predicting activity based on molecular SMILES expression semantic analysis of two-way long-short memory core fragment recognition according to one embodiment of the present invention.
Fig. 2 shows a flowchart of a method for predicting activity based on molecular SMILES expression semantic analysis of two-way long-short memory core fragment recognition according to an embodiment of the present invention.
Fig. 3 shows an example of one-hot encoding according to the present invention.
Detailed Description
In the following description, the invention is described with reference to various embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details, or with other alternative and/or additional methods, materials, or components. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of embodiments of the invention. Similarly, for purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the embodiments of the invention. However, the invention may be practiced without specific details. Further, it should be understood that the embodiments shown in the figures are illustrative representations and are not necessarily drawn to scale.
Reference in the specification to "one embodiment" or "the embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
The invention adopts a bidirectional long short-term memory (BiLSTM) model, drawing on sequence learning methods from NLP, to obtain convenient modeling and considerable prediction performance. The accuracy and applicable scope of prediction with this method are greatly improved. Based on a deep learning model, the method can effectively extract features of the input information, including many previously undiscovered feature rules, and provides more accurate prediction results.
The invention provides a method for analyzing molecular SMILES expressions and predicting molecular bioactivity tendencies based on a bidirectional long-short memory core fragment recognition technique, which proceeds as follows. Three different public data sets are first preprocessed: DUD-E active and decoy compound data, HIV inhibitor data and anti-Plasmodium falciparum compound data are used as the raw model data sets, and activity indices (e.g., the half-maximal effective concentration EC50) are used as label values. The molecules of each data set are represented by SMILES, standardized and de-duplicated; a self-created word segmentation technique is applied to the SMILES data and a corresponding vocabulary is constructed; each sample data set is then randomly divided into a training set, a validation set and a test set in a certain proportion; with the help of the vocabulary, the SMILES are converted into vector form by word-vector embedding and input into the bidirectional long-short memory core fragment recognition network for training. Finally, the deep learning model built on the bidirectional long-short memory core fragment recognition technique is loaded, validated and compared with other baseline models on different evaluation indices, so that more accurate activity prediction results can be provided, offering a practical and effective new analysis method for structure-activity relationship studies.
Fig. 1 illustrates a system for predicting activity based on semantic analysis of molecular SMILES expressions with two-way long-short memory core fragment recognition according to one embodiment of the present invention. The system comprises a data preprocessing unit 101, a data encoding unit 102, a bidirectional long-short memory core fragment recognition unit 103 and a classification regressor 104. The specific functions of these units will be described below in conjunction with the method for predicting activity based on semantic analysis of molecular SMILES expressions with bidirectional long-short memory core fragment recognition.
Fig. 2 shows a flowchart of a method for predicting activity based on molecular SMILES expression semantic analysis of two-way long-short memory core fragment recognition according to an embodiment of the present invention.
First, at step 110, a data set is acquired.
In embodiments of the invention, the data set may comprise three open-source data sets: the DUD-E sample data set disclosed in "Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking", J. Med. Chem. 2012, 55, 6582-6594, DOI: 10.1021/jm300687e; the Drug efficacy activity data set disclosed in "Thousands of chemical starting points for antimalarial lead identification", Nature 2010, 465, 305-310, DOI: 10.1038/nature09107; and the HIV activity data set derived from the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen. The data set details are shown in Table 1 below:
TABLE 1 Basic information of the three public data sets (provided as an image in the original document)
The above embodiments give three examples of data sets; it should be clear to those skilled in the art that other data sets can also be used with the present invention.
In general, a compound is a positive sample for a biological activity prediction task as long as the corresponding biological activity of that compound has been reported in the literature. The DUD-E data set contains multiple different biological activities and therefore requires establishing multiple prediction tasks, i.e., a multi-classification task, while the Drug efficacy data, which carry explicit EC50 values, are set up as a regression task. Table 1 details the total amount of data and the distribution of positive and negative samples for all data sets.
According to the data set details, the three data sets are assigned to three different modeling tasks and modes, namely multi-classification, binary classification and regression, and training and prediction are carried out for each.
The three data sets are each divided into a Training set, a Validation set and a Test set in a certain proportion. The corresponding model is first trained using the Training set and the Validation set, and then evaluated using the Test set. In this process it is guaranteed that no data leakage occurs, since leakage would artificially inflate the test results. Specifically, the following relations are ensured to hold:
Training set ∩ Test set = Φ
Validation set ∩ Test set = Φ
where Φ represents the empty set.
To ensure that both are true, the entire data set is preprocessed at step 120. In an embodiment of the invention, the preprocessing of the entire data set includes a normalization process and a de-duplication process.
The data processing flow will be described in detail below based on the above.
First, all compound molecules in the data set are represented by SMILES for subsequent analysis. The molecules of each data set are expressed as specific linear SMILES strings. Following the ideas of graph theory and with the help of the open-source cheminformatics toolkit RDKit and the open-source data processing tool KNIME, the SMILES expressions of all molecules are standardized, unifying the encoding modes and order of atoms, bonds and connection relations in the molecular SMILES expressions. This operation ensures that all molecules use a uniform representation. Next, de-duplication is performed using the molecular InChIKey. This removes redundancy and ensures that the validation and test sets contain only data that never appear in the training set, improving the generalization ability of the model and the reliability of the results. To make this operation accurate, each SMILES expression is converted into the molecule's unique InChIKey (a hashed, 27-character compressed version of the InChI, commonly used for internet and database searching/indexing); by comparing InChIKeys, the SMILES whose InChIKeys are completely identical can be removed directly. Finally, each sample data set is randomly divided into a training set, a validation set and a test set in a certain proportion and then encoded. The division is random, with the ratio training set : validation set : test set = 7 : 1 : 2, and the random seed is controlled so that a previous division, and hence the whole data processing, can be reproduced.
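By way of illustration, the preprocessing described above can be sketched in Python as follows. The use of RDKit's canonical SMILES and InChIKey functions follows the description, but the function name, column handling and split logic are assumptions for the sketch, not the actual code of the invention.

```python
import random
from rdkit import Chem

def preprocess(smiles_list, labels, seed=42):
    """Canonicalize SMILES, de-duplicate by InChIKey, and split 7:1:2 (illustrative)."""
    seen_keys, records = set(), []
    for smi, label in zip(smiles_list, labels):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:                                  # drop unparsable entries
            continue
        canonical = Chem.MolToSmiles(mol)                # unified atom/bond ordering
        key = Chem.MolToInchiKey(mol)                    # 27-character hashed InChI
        if key in seen_keys:                             # de-duplicate identical molecules
            continue
        seen_keys.add(key)
        records.append((canonical, label))

    random.Random(seed).shuffle(records)                 # fixed seed -> reproducible split
    n = len(records)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    train = records[:n_train]
    valid = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]
    return train, valid, test
```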
Finally, the positive and negative samples in the data set to be predicted are assigned specific label values (or continuous EC50 values). The problem is thus transformed as follows: for classification, the model should output for each compound molecule a floating-point value in the interval [0, 1] indicating whether the molecule has a given biological activity; for regression, the model should predict the EC50 value of a particular biological activity for each compound molecule. In both cases the input is the SMILES sequence expression of the molecule, and the output is the predicted label or EC50 value.
Next, at step 130, the preprocessed data set is digitally encoded. In the embodiment of the invention, the SMILES sequences input to the bidirectional long-short memory core fragment recognition model need to be digitally encoded. The invention uses an improved One-Hot Encoding, and this type of encoding requires a dictionary as an index. The SMILES sequences are encoded by improving the original one-hot encoding scheme, in which a vocabulary must first be constructed, generally by extracting single characters directly from an analysis of the SMILES. The invention instead creates a novel word segmentation method that comprehensively considers chemical and informatics knowledge: a single element (such as C, c, etc.), a single number (such as 1, 2, etc.), a single symbol (such as ( and ) ), and a whole square-bracketed group (such as [nH]) are each treated as one sequence token. Each token has chemical meaning and directionality, and any combination of tokens conforms to chemical rules, which guarantees the authenticity and reliability of the subsequent exploration of expression composition rules.
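As an illustrative sketch only, the described word segmentation can be implemented with a regular expression. The pattern below is an assumption modeled on common SMILES tokenizers, not the patent's actual segmentation code.

```python
import re

# Bracketed atoms such as [nH] are kept whole; two-letter elements, single atoms,
# ring-closure digits and bond/branch symbols each become one token.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]"              # whole bracketed groups, e.g. [nH], [O-]
    r"|Br|Cl"                   # two-letter elements
    r"|%[0-9]{2}"               # two-digit ring closures
    r"|[A-Za-z]"                # single-letter / aromatic atoms
    r"|[0-9]"                   # ring-closure digits
    r"|[=#\-\+\(\)/\\@\.])"     # bonds, branches, charges, stereo marks
)

def tokenize(smiles):
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "tokenization must reproduce the input"
    return tokens

# Example: tokenize("CC(=O)Oc1ccccc1C(=O)O") ->
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', ...]
```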
Finally, after statistics over the entire data set, more than 70 kinds of tokens are found in the SMILES, and these tokens are used as the basic vocabulary. To facilitate the input of the later model and the use of the test set, the "<GO>" character is also used as a front padding word and the "<EOS>" character as a rear padding word, giving 76 words in total. Each token in a SMILES is assigned a positive integer value according to the word list; this value is then used as an index into a 76-dimensional vector, with the number "1" set at the indexed position and the number "0" at all other positions of the vector. The whole sequence is thus converted into an L × 76 matrix, where L is the padded SMILES length. The complete digital encoding is shown in Table 2 below.
TABLE 2 Tokens and their corresponding numerical codes (provided as images in the original document)
After the tokens of a SMILES are encoded, one-hot encoding converts the positive integer corresponding to each token into a vector whose dimension d is the size of the dictionary; in this embodiment d is 80, corresponding to positions 0 through 79. According to the positive integer code of each token, the value "1" is set at the corresponding position and the value "0" at all other positions. Therefore, if the original length of a SMILES sequence is L_0, it becomes a sequence of length L after equal-length padding, and after one-hot encoding it finally becomes an L × 80 matrix; in this experiment the padded equal length is set to L = 13.
Fig. 3 shows an example of a one-hot encoding according to the present invention, assuming in this example the equal length L of the padding is 13 for convenience of illustration.
In general, after one-hot encoding, the SMILES sequence is converted into a word embedding (Word embedding) matrix S:
S = (w_1, w_2, ..., w_L)^T    (3)
where each w is a d-dimensional row vector corresponding to a one-hot vector, and the dimension of the word embedding matrix S is therefore L × d.
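A minimal sketch of this encoding step is given below; the vocabulary, padding words and dimensions are illustrative placeholders, while the full 76/80-entry vocabulary is given in Table 2.

```python
import numpy as np

def encode_one_hot(tokens, vocab, length):
    """Map tokens to integers via the vocabulary, pad with <GO>/<EOS>, expand to one-hot rows."""
    d = len(vocab)
    padded = ["<GO>"] + tokens + ["<EOS>"] * (length - len(tokens) - 1)
    S = np.zeros((length, d), dtype=np.float32)      # word-embedding matrix, L x d
    for i, tok in enumerate(padded[:length]):
        S[i, vocab[tok]] = 1.0                       # "1" at the token's index, "0" elsewhere
    return S

# Illustrative mini-vocabulary (placeholder for the Table 2 word list).
vocab = {tok: i for i, tok in enumerate(["<GO>", "<EOS>", "C", "c", "O", "N", "(", ")", "=", "1"])}
S = encode_one_hot(["C", "C", "O"], vocab, length=13)    # shape (13, 10)
```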
Next, in step 140, a two-way long-short memory core fragment recognition model is constructed.
In the embodiment of the invention, aiming at different data set compositions, a regression prediction model can be established for the same target according to different activity indexes, and a multi-target prediction model can also be established for all target data.
The DUD-E data set contains activity data for a plurality of targets, and one molecule can have different activities against different targets, so a multi-target prediction model is constructed. The activities of the same molecule against different targets are represented in parallel by a one-hot (one-dimensional one-hot) code. In total, 10 prediction tasks are established, and according to the activity index of a molecule for each prediction task, a specific tag value is set at the corresponding state index position of the vector, for example: "1" for active ("Positive") and "0" for inactive ("Negative"). Similarly, the HIV data set is tagged directly with the activity index of the sample molecules.
The anti-Plasmodium falciparum compound data, with the half-maximal effective concentration EC50 as the label value, are used to construct a regression prediction model, and the structure-activity relationship of the compound molecules is explored by taking EC50 as the regression value.
In order to capture correlations among the tokens within each SMILES sequence, the word embedding matrix S is input into the bidirectional long-short memory core fragment recognition network, and the hidden state propagated through the network is obtained via a series of gated adjustments and transformations. First, from the current input x_t and the hidden state h_{t-1} passed from the previous step, four states z, z_i, z_f and z_o are obtained through training with different weights. Here z is converted into a value between -1 and 1 by a tanh activation function, while z_i, z_f and z_o are converted by the activation function into values between 0 and 1 to serve as gating states.
z = tanh(W · [x_t, h_{t-1}])    (4)
z_i = σ(W_i · [x_t, h_{t-1}])    (5)
z_f = σ(W_f · [x_t, h_{t-1}])    (6)
z_o = σ(W_o · [x_t, h_{t-1}])    (7)
where σ is the sigmoid activation function and W is the network weight.
The input passed from the previous node is then selectively forgotten through z_f and selectively memorized through z_i; the resulting cell state c_t, unlike h_t in an RNN, changes little from node to node and is passed on slowly. Finally, the resulting hidden state is selectively output through z_o.
c_t = z_f · c_{t-1} + z_i · z    (8)
h_t = z_o · tanh(c_t)    (9)
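A minimal numpy sketch of one gated update step, following equations (4)-(9), is given below. Bias terms are omitted as in the formulas, and the weight shapes and sigmoid gates are assumptions consistent with the stated (0, 1) range.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, W_i, W_f, W_o):
    """One gated update; each weight matrix acts on the concatenated [x_t, h_{t-1}]."""
    v = np.concatenate([x_t, h_prev])        # [x_t, h_{t-1}]
    z   = np.tanh(W @ v)                     # candidate state, in (-1, 1)      eq. (4)
    z_i = sigmoid(W_i @ v)                   # input gate,  in (0, 1)            eq. (5)
    z_f = sigmoid(W_f @ v)                   # forget gate, in (0, 1)            eq. (6)
    z_o = sigmoid(W_o @ v)                   # output gate, in (0, 1)            eq. (7)
    c_t = z_f * c_prev + z_i * z             # slowly-changing cell state        eq. (8)
    h_t = z_o * np.tanh(c_t)                 # hidden state passed onward        eq. (9)
    return h_t, c_t
```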
The bidirectional long-short memory core fragment recognition network constructed in this way not only uses the bidirectional architecture of a conventional long short-term memory network, but also originally creates a core fragment recognition unit, which greatly improves the model results. The model first acquires information in two different directions by constructing two recurrent neural networks, both connected to the same input layer. This structure can provide complete context information for each unit structure in the previous layer. One layer propagates information forward over the time steps, updating the information of all hidden layers; the other layer propagates information in the direction opposite to the previous layer. The hidden layer values in the two directions are obtained by first computing the output layer, and the encoded hidden state vectors from the two directions are finally spliced into a matrix. Since the transfer is bidirectional, hidden states in both directions are finally obtained:
h_t^(f) = LSTM_f(w_t)    (10)
h_t^(b) = LSTM_b(w_t)    (11)
where t denotes the time step.
The next step is to splice h_t^(f) and h_t^(b) to form the hidden state h_t at time t, i.e.
h_t = [h_t^(f), h_t^(b)]
If the number of hidden units in each direction of the LSTM is set to u, the dimension of h_t is 1 × 2u, and all time steps are then spliced together to obtain a hidden state matrix H.
H = (h_1, h_2, ..., h_L)^T    (12)
where the dimension of H is L × 2u.
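For illustration, the bidirectional layer and the resulting L × 2u hidden state matrix H can be sketched with a standard PyTorch BiLSTM; the dimensions below are illustrative assumptions, not the patent's tuned hyper-parameters.

```python
import torch
import torch.nn as nn

L, d, u = 100, 80, 64                        # padded length, one-hot dim, hidden units per direction
bilstm = nn.LSTM(input_size=d, hidden_size=u, batch_first=True, bidirectional=True)

S = torch.randn(1, L, d)                     # stand-in for one encoded SMILES matrix
H, _ = bilstm(S)                             # forward and backward states, concatenated per step
print(H.shape)                               # torch.Size([1, 100, 128])  ->  H is L x 2u
```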
In addition, the core recognition fragment unit originally created in the model disclosed in the embodiment of the invention enables the model to focus on different partial regions of the hidden state matrix. The principle is to assign different weight values to these regions, with the following formulas:
C = softmax(W_b tanh(W_a H^T))    (13)
SubCore = C · H    (14)
where W_a and W_b are trainable matrices whose dimensions are trainable model hyper-parameters. The matrix C obtained from the formulas represents the model focusing on several specific regions of the SMILES sequence. Finally, C is combined with the previous hidden state matrix H to obtain the final core fragment SubCore vector values.
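A hedged PyTorch sketch of the core-fragment unit in equations (13)-(14) follows; the attention dimension d_a and the number of attention rows r are illustrative hyper-parameters, not values fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoreFragmentAttention(nn.Module):
    """C = softmax(W_b · tanh(W_a · H^T)); SubCore = C · H."""
    def __init__(self, hidden_dim, d_a=64, r=10):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim, d_a, bias=False)   # trainable matrix W_a
        self.W_b = nn.Linear(d_a, r, bias=False)             # trainable matrix W_b

    def forward(self, H):                       # H: (batch, L, 2u)
        C = F.softmax(self.W_b(torch.tanh(self.W_a(H))), dim=1)  # weights over sequence positions
        C = C.transpose(1, 2)                   # (batch, r, L)
        return C @ H                            # SubCore: (batch, r, 2u)
```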
For the final optimization, the last layer uses a fully connected layer: a linear layer (Linear) converts the hidden state matrix into an output of the set dimension. The formula is as follows:
O_r = Linear(H)    (15)
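For completeness, a minimal sketch of the final fully connected mapping in equation (15) is shown below; the output dimension and the flattening of the input are assumptions that depend on the task (1 for EC50 regression, the number of targets for multi-label tasks).

```python
import torch
import torch.nn as nn

r, two_u, n_out = 10, 128, 1                     # illustrative dimensions
linear = nn.Linear(r * two_u, n_out)             # maps the attention output to the task output

sub_core = torch.randn(4, r, two_u)              # stand-in for a batch of SubCore matrices
out = linear(sub_core.flatten(start_dim=1))      # shape: (4, 1)
```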
The model disclosed in this invention is primarily concerned with the hyper-parameters listed in Table 3 below; other parameters may be found in the actual code.
TABLE 3 Main hyper-parameters tuned in the model (provided as an image in the original document)
At step 150, the model is evaluated. Since the task of the invention is regression, the index used for evaluation is the Mean Squared Error (MSE), which measures the deviation between the predicted and true values of the regression model. MSE is the most common regression loss function; it is computed as the mean of the squared distances between the predicted values and the true values, with the following formula:
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²
where y is the true label of the sample, ŷ is the result predicted by the model, and n is the total number of samples.
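A minimal sketch of this evaluation metric (the values shown are purely illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between predicted and true label values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([0.2, 0.8, 0.5], [0.25, 0.7, 0.55]))   # 0.004166...
```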
The invention provides a method for analyzing molecular SMILES expressions and predicting molecular biological activity based on a bidirectional long short-term memory core fragment recognition technique. By using a deep learning model, the invention can effectively extract features of the input information, including many previously undiscovered hidden feature rules. The invention is broadly applicable to any activity that can be predicted for compound molecules. Compared with conventional SAR analysis or activity prediction models of the same kind, the time required for prediction is greatly reduced and the results are more accurate, so users can obtain prediction results more quickly. In addition, the method can rapidly feed back the core substructure fragments, which provides a certain degree of chemical guidance.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various combinations, modifications, and changes can be made thereto without departing from the spirit and scope of the invention. Thus, the breadth and scope of the present invention disclosed herein should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (10)

1. A method for predicting the activity of a small drug molecule based on a two-way long-short memory model, comprising the following steps:
acquiring a data set;
preprocessing the data set, including representing all compound molecules in the data set by SMILES, standardizing the SMILES expressions of all molecules, unifying the encoding modes and order of atoms, bonds and connection relations in the molecular SMILES expressions, and performing de-duplication using the InChIKey of each molecule;
encoding the preprocessed data set, wherein each single element, single number, single symbol and whole square-bracketed group of a SMILES sequence is treated as one sequence token for one-hot encoding, each token has chemical meaning and directionality, and any combination of tokens conforms to chemical rules;
constructing a bidirectional long-short memory core segment recognition model;
inputting the encoded data into the bidirectional long-short memory core segment recognition model to obtain a hidden state matrix; and
evaluating the two-way long-short memory core segment recognition model.
2. The method for predicting the activity of a small molecule of a drug based on a two-way long-short memory model according to claim 1, wherein the data set comprises three open-source data sets.
3. The method of claim 1, wherein the de-duplication process using the molecular InChIKey includes converting each SMILES expression into the molecule's unique InChIKey and, by comparing InChIKeys, directly removing the SMILES whose InChIKeys are completely identical, and
the preprocessing of the data sets further comprises the step of randomly dividing each data set into a training set, a verification set and a test set according to a certain proportion.
4. The method for predicting the activity of a small drug molecule based on the two-way long-short memory model as claimed in claim 1, further comprising converting the sequence of positive integers corresponding to each token into a vector, and converting the sequence of SMILES into a word-embedding matrix S:
S = (w_1, w_2, ..., w_L)^T
where each w is a d-dimensional row vector.
5. The method for predicting the activity of small molecules of drugs based on the two-way long-short memory model as claimed in claim 4, wherein the word embedding matrix S is input into the two-way long-short memory core segment recognition model, and from the current input x_t and the hidden state h_{t-1} passed from the previous step, four states z, z_i, z_f and z_o are obtained through training with different weights,
where z is converted into a value between -1 and 1 by a tanh activation function, and z_i, z_f and z_o are converted by the activation function into values between 0 and 1 to serve as gating states,
z = tanh(W · [x_t, h_{t-1}])
z_i = σ(W_i · [x_t, h_{t-1}])
z_f = σ(W_f · [x_t, h_{t-1}])
z_o = σ(W_o · [x_t, h_{t-1}])
where σ is the sigmoid activation function and W is the network weight,
the input passed from the previous node is then selectively forgotten through z_f and selectively memorized through z_i, so that the cell state c_t, unlike h_t in an RNN, changes little from node to node and is passed on slowly, and finally the resulting hidden state h_t is selectively output through z_o:
c_t = z_f · c_{t-1} + z_i · z
h_t = z_o · tanh(c_t).
6. The method for predicting the activity of small drug molecules based on the two-way long-short memory model as claimed in claim 5, wherein the two-way long-short memory core segment recognition model comprises two recurrent neural networks for obtaining information in two different directions, both layers being connected to the same input layer, wherein one layer propagates information forward over the time steps to update the information of all hidden layers, the other layer propagates information in the direction opposite to the previous layer, and the encoded hidden state vectors in the different directions are spliced into a matrix after the hidden layer values in the different directions are obtained by computing the output layer.
7. The method for predicting the activity of small drug molecules based on the two-way long-short memory model as claimed in claim 6, wherein the hidden states h_t in the two directions are
h_t^(f) = LSTM_f(w_t)
h_t^(b) = LSTM_b(w_t)
where t denotes the time step,
h_t^(f) and h_t^(b) are spliced to form the hidden state h_t at time t, i.e.
h_t = [h_t^(f), h_t^(b)]
and if the number of hidden units in each direction of the LSTM is set to u, the dimension of h_t is 1 × 2u, and all time steps are then spliced to obtain a hidden state matrix H
H = (h_1, h_2, ..., h_L)^T
where the dimension of H is L × 2u.
8. The method for predicting the activity of small molecules of a drug based on the two-way long-short memory model of claim 7, wherein the core recognition fragment unit originally created in the model enables the model to focus on different partial regions of the hidden state matrix, the principle being to assign different weight values C to these regions, with the following formulas:
C = softmax(W_b tanh(W_a H^T))
SubCore = C · H
where W_a and W_b are trainable matrices whose dimensions are trainable model hyper-parameters, the matrix C obtained from the formulas represents the model focusing on several specific regions of the SMILES sequence, and finally the weights C are combined with the previous hidden state matrix H to obtain the final core segment SubCore vector values.
9. A system for predicting the activity of small drug molecules based on a two-way long-short memory model comprises:
a data preprocessing unit;
a data encoding unit;
a bidirectional long and short memory core segment identification unit; and
a classification regressor,
wherein the system is configured to perform the method of any one of claims 1 to 8.
10. The system for predicting the activity of small molecules of a drug based on a two-way long-short memory model as claimed in claim 9, wherein the encoded training set and validation set data are loaded to the two-way long-short memory core segment recognition unit, and the two-way long-short memory core segment recognition unit is trained and validated on a large scale.
CN202010464590.9A 2020-05-27 2020-05-27 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model Pending CN111640471A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010464590.9A CN111640471A (en) 2020-05-27 2020-05-27 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010464590.9A CN111640471A (en) 2020-05-27 2020-05-27 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model

Publications (1)

Publication Number Publication Date
CN111640471A true CN111640471A (en) 2020-09-08

Family

ID=72329534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010464590.9A Pending CN111640471A (en) 2020-05-27 2020-05-27 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model

Country Status (1)

Country Link
CN (1) CN111640471A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164427A (en) * 2020-09-23 2021-01-01 常州微亿智造科技有限公司 Method and device for predicting activity of small drug molecule target based on deep learning
CN112164426A (en) * 2020-09-22 2021-01-01 常州微亿智造科技有限公司 Drug small molecule target activity prediction method and device based on TextCNN
CN112786120A (en) * 2021-01-26 2021-05-11 云南大学 Method for synthesizing chemical material with assistance of neural network
CN112786108A (en) * 2021-01-21 2021-05-11 北京百度网讯科技有限公司 Molecular understanding model training method, device, equipment and medium
CN114049922A (en) * 2021-11-09 2022-02-15 四川大学 Molecular design method based on small-scale data set and generation model
CN114187978A (en) * 2021-11-24 2022-03-15 中山大学 Compound optimization method based on deep learning connection fragment
WO2023065220A1 (en) * 2021-10-21 2023-04-27 深圳阿尔法分子科技有限责任公司 Chemical molecule related water solubility prediction method based on deep learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491680A (en) * 2018-03-07 2018-09-04 安庆师范大学 Drug relationship abstracting method based on residual error network and attention mechanism
CN109408812A (en) * 2018-09-30 2019-03-01 北京工业大学 A method of the sequence labelling joint based on attention mechanism extracts entity relationship

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491680A (en) * 2018-03-07 2018-09-04 安庆师范大学 Drug relationship abstracting method based on residual error network and attention mechanism
CN109408812A (en) * 2018-09-30 2019-03-01 北京工业大学 A method of the sequence labelling joint based on attention mechanism extracts entity relationship

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHUANGJIA ZHENG ET AL: "Identifying Structure−Property Relationships through SMILES Syntax Analysis with Self-Attention Mechanism" *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164426A (en) * 2020-09-22 2021-01-01 常州微亿智造科技有限公司 Drug small molecule target activity prediction method and device based on TextCNN
CN112164427A (en) * 2020-09-23 2021-01-01 常州微亿智造科技有限公司 Method and device for predicting activity of small drug molecule target based on deep learning
CN112786108A (en) * 2021-01-21 2021-05-11 北京百度网讯科技有限公司 Molecular understanding model training method, device, equipment and medium
CN112786108B (en) * 2021-01-21 2023-10-24 北京百度网讯科技有限公司 Training method, device, equipment and medium of molecular understanding model
CN112786120A (en) * 2021-01-26 2021-05-11 云南大学 Method for synthesizing chemical material with assistance of neural network
CN112786120B (en) * 2021-01-26 2022-07-05 云南大学 Method for synthesizing chemical material with assistance of neural network
WO2023065220A1 (en) * 2021-10-21 2023-04-27 深圳阿尔法分子科技有限责任公司 Chemical molecule related water solubility prediction method based on deep learning
CN114049922A (en) * 2021-11-09 2022-02-15 四川大学 Molecular design method based on small-scale data set and generation model
CN114187978A (en) * 2021-11-24 2022-03-15 中山大学 Compound optimization method based on deep learning connection fragment

Similar Documents

Publication Publication Date Title
CN111640471A (en) Method and system for predicting activity of drug micromolecules based on two-way long-short memory model
Nadif et al. Unsupervised and self-supervised deep learning approaches for biomedical text mining
Bertolazzi et al. Learning to classify species with barcodes
Zhou et al. Time series forecasting and classification models based on recurrent with attention mechanism and generative adversarial networks
Karim et al. Toxicity prediction by multimodal deep learning
Asgari et al. DeepPrime2Sec: deep learning for protein secondary structure prediction from the primary sequences
Lawrence et al. Evolving deep architecture generation with residual connections for image classification using particle swarm optimization
Lu et al. Extracting chemical-protein interactions from biomedical literature via granular attention based recurrent neural networks
Yu et al. Perturbnet predicts single-cell responses to unseen chemical and genetic perturbations
Zhang et al. protein2vec: predicting protein-protein interactions based on LSTM
Zeng et al. Automatic melody harmonization via reinforcement learning by exploring structured representations for melody sequences
Al-Saffar et al. A sequential handwriting recognition model based on a Dynamically configurable CRNN
Rahman et al. IDMIL: an alignment-free Interpretable Deep Multiple Instance Learning (MIL) for predicting disease from whole-metagenomic data
Hu et al. Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series
Leng et al. Bi-level artificial intelligence model for risk classification of acute respiratory diseases based on Chinese clinical data
Stoean et al. Author identification using chaos game representation and deep learning
Fan et al. Distribution structure learning loss (DSLL) based on deep metric learning for image retrieval
Duong et al. Evaluating representations for gene ontology terms
Christou Feature extraction using latent dirichlet allocation and neural networks: a case study on movie synopses
Shen et al. Chinese knowledge base question answering by attention-based multi-granularity model
Tuggener et al. Design patterns for resource-constrained automated deep-learning methods
CN116313148A (en) Drug sensitivity prediction method, device, terminal equipment and medium
Tohti et al. Medical qa oriented multi-task learning model for question intent classification and named entity recognition
Dubois et al. Effective representations of clinical notes
Khosa et al. Unifying Sentence Transformer Embedding and Softmax Voting Ensemble for Accurate News Category Prediction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210318

Address after: Room 202, building 1, 366 Tongyun street, Liangzhu street, Yuhang District, Hangzhou City, Zhejiang Province

Applicant after: Hangzhou derizhi Pharmaceutical Technology Co.,Ltd.

Address before: 11 / F, building 15, Singapore Science Park, Qiantang New District, Hangzhou, Zhejiang 310000

Applicant before: Niu Zhangming

Applicant before: Wade Menpes Smith