CN111640471A - Method and system for predicting activity of drug micromolecules based on two-way long-short memory model - Google Patents

Method and system for predicting activity of drug micromolecules based on two-way long-short memory model

Info

Publication number
CN111640471A
Authority
CN
China
Prior art keywords
short memory
smiles
model
molecules
activity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010464590.9A
Other languages
Chinese (zh)
Inventor
Niu Zhangming (牛张明)
Wade Menpes-Smith (韦德·门佩斯-史密斯)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou derizhi Pharmaceutical Technology Co.,Ltd.
Original Assignee
Wade Menpes-Smith
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wade Menpes-Smith
Priority to CN202010464590.9A priority Critical patent/CN111640471A/en
Publication of CN111640471A publication Critical patent/CN111640471A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B 15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/70 Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • General Physics & Mathematics (AREA)
  • Biotechnology (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medicinal Chemistry (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for predicting the activity of small drug molecules based on a two-way long-short memory model, which comprises the following steps: acquiring a data set; preprocessing the data set, including representing all compound molecules in the data set by SMILES, standardizing the SMILES expressions of all molecules, unifying the encoding modes and order of atoms, bonds and connection relations in the molecular SMILES expressions, and performing de-duplication using the InChIKey of each molecule; encoding the preprocessed data set, wherein each single element, single number, single symbol and whole square-bracketed group of a SMILES sequence is treated as one sequence token for one-hot encoding, each token has chemical meaning and directionality, and any combination of tokens conforms to chemical rules; constructing a bidirectional long-short memory core segment recognition model; inputting the encoded data into the bidirectional long-short memory core segment recognition model to obtain a hidden state matrix; and evaluating the two-way long-short memory core segment recognition model.

Description

Method and system for predicting activity of drug micromolecules based on two-way long-short memory model
Technical Field
The present invention relates to the fields of chemical informatics and bioinformatics. In particular, the invention relates to a method and a system for predicting the activity of a small drug molecule based on a two-way long-short memory model.
Background
Elucidating the relationship between molecular structure and biological activity has long been an important issue in the field of medicinal chemistry. However, as experimental data grow explosively, it becomes increasingly difficult for methods based on empirical measurements and heuristic rules to elucidate such relationships.
Chemical informatics is an active area of research that predicts biological activity from molecular structure by means of high-performance computing and machine learning methods. In recent decades, with the advent of deep learning methods, machine learning has received increasing attention from the scientific community. Data-driven analysis has become a routine procedure in many chemical and pharmaceutical applications, including virtual screening, chemical property prediction and de novo molecular design. In many of these applications, machine learning shows great potential to compete with, and even surpass, conventional approaches.
The Merck Molecular Activity Challenge set off a trend of training deep learning networks on molecular fingerprints and other descriptors. The winning team used a multi-task model containing a large number of pre-computed molecular descriptors, which improved performance by 15% over the random forest baseline. Using the same training strategy, Andreas and colleagues achieved the most accurate toxicity predictions in the Tox21 challenge. Although many studies have shown that a large-scale multi-task network trained on a large number of molecular descriptors can significantly improve the predictive ability of traditional models for virtual screening and property prediction, its inherent black-box nature has been heavily criticized by the modeling community, since such models make the relationships between properties and structures more difficult to interpret.
Therefore, learning molecular properties of compounds directly from the topology of the molecule, rather than from predefined fingerprints or descriptors, has attracted increasing interest in both the chemistry and machine learning fields. Duvenaud and coworkers introduced neural fingerprints (NFPs), which attempt to extract data-driven features from molecules rather than hand-crafted features; the architecture is based on a generalization of fingerprints so that it can be learned by back-propagation. Later, Kearnes and colleagues proposed molecular graph convolutions using undirected graphs to represent small molecules, and researchers have since proposed several improved graph convolutional networks (GCNs) for dynamically extracting molecular features and predicting target properties. Despite their rather high predictive performance, inherent deficiencies of GCNs, such as limited information propagation across the entire graph and non-intuitive feature extraction, indicate that these models still have room for improvement.
In addition to graph representations, researchers have paid increasing attention to linear molecular representations as generative models have grown in popularity. Many unsupervised learning techniques with different generative models are used for novel molecular design. Most of them use SMILES (Simplified Molecular Input Line Entry System) as input to generate new molecules with specific properties. Furthermore, Vidal and colleagues suggested that simple SMILES string fragments could be used directly to calculate molecular similarity and predict lipid-water partition coefficients. These studies demonstrate that linear molecular representations can be used directly in SAR studies. It is also easier to feed linear structural notations into a sequence-based network than into connection-table (CT)-based methods. However, there has so far been no study that directly inputs SMILES into a sequence-based deep learning model for biological activity prediction.
Disclosure of Invention
In order to solve the above problems, the invention adopts a bidirectional long short-term memory (BiLSTM) model, drawing on sequence learning methods from NLP, to obtain convenient modeling and considerable prediction performance. The accuracy and applicable scope of prediction with this method are greatly improved. Based on a deep learning model, the method can effectively extract features of the input information, including many previously undiscovered feature rules, and provides more accurate prediction results.
According to one aspect of the invention, there is provided a method for predicting activity based on semantic analysis of molecular SMILES expressions with two-way long-short memory core fragment recognition, comprising the following steps:
acquiring a data set;
preprocessing the data set, including representing all compound molecules in the data set by SMILES, standardizing the SMILES expressions of all molecules, unifying the encoding modes and order of atoms, bonds and connection relations in the molecular SMILES expressions, and performing de-duplication using the InChIKey of each molecule;
encoding the preprocessed data set, wherein each single element, single number, single symbol and whole square-bracketed group of a SMILES sequence is treated as one sequence token for one-hot encoding, each token has chemical meaning and directionality, and any combination of tokens conforms to chemical rules;
constructing a bidirectional long-short memory core segment recognition model;
inputting the encoded data into the bidirectional long-short memory core segment recognition model to obtain a hidden state matrix; and
evaluating the two-way long-short memory core segment recognition model.
In one embodiment of the invention, the data set comprises three open-source data sets.
In one embodiment of the invention, the de-duplication process using the molecular InChIKey includes converting each SMILES expression into the molecule's unique InChIKey and, by comparing InChIKeys, directly removing the SMILES whose InChIKeys are completely identical, and
the preprocessing of the data sets further comprises randomly dividing each data set into a training set, a validation set and a test set in a certain proportion.
In an embodiment of the present invention, the method for predicting activity based on the semantic analysis of the molecular SMILES expression identified by the two-way long-short memory core fragment further includes converting the positive integer sequence corresponding to each token into a vector, and converting the SMILES sequence into a word embedding matrix S:
S = (w_1, w_2, ..., w_L)^T
where each w is a d-dimensional row vector.
In one embodiment of the invention, the word embedding matrix S is input into the two-way long-short memory core segment recognition model, and from the current input x_t and the hidden state h_{t-1} passed from the previous step, four states z, z_i, z_f and z_o are obtained through training with different weights,
where z is converted into a value between -1 and 1 by a tanh activation function, and z_i, z_f and z_o are converted by the activation function into values between 0 and 1 to serve as gating states:
z = tanh(W · [x_t, h_{t-1}])
z_i = σ(W_i · [x_t, h_{t-1}])
z_f = σ(W_f · [x_t, h_{t-1}])
z_o = σ(W_o · [x_t, h_{t-1}])
where σ is the sigmoid activation function and W is the network weight.
The input passed from the previous node is then selectively forgotten through z_f and selectively memorized through z_i; the resulting cell state c_t, unlike h_t in an RNN, changes little from node to node and is passed on slowly. Finally, the resulting hidden state h_t is selectively output through z_o:
c_t = z_f · c_{t-1} + z_i · z
h_t = z_o · tanh(c_t).
In an embodiment of the present invention, the bidirectional long-short memory core segment recognition model includes two recurrent neural networks that acquire information in two different directions, both layers being connected to the same input layer, wherein one layer propagates information forward over the time steps to update the information of all hidden layers, the other layer propagates information in the direction opposite to the previous layer, and the encoded hidden state vectors in the different directions are finally spliced into a matrix after the hidden layer values in the different directions are obtained by computing the output layer.
In one embodiment of the invention, the hidden states h_t in the two directions are
h_t^(f) = LSTM_f(w_t)
h_t^(b) = LSTM_b(w_t)
where t denotes the time step. h_t^(f) and h_t^(b) are spliced to form the hidden state h_t at time t, i.e.
h_t = [h_t^(f), h_t^(b)]
If the number of hidden units in each direction of the LSTM is set to u, the dimension of h_t is 1 × 2u, and all time steps are then spliced to obtain a hidden state matrix H
H = (h_1, h_2, ..., h_L)^T
where the dimension of H is L × 2u.
In an embodiment of the present invention, the core recognition fragment unit originally created in the model can make the model focus on different partial regions of the hidden state matrix; the principle is to assign different weight values to these regions, with the following formulas:
C = softmax(W_b tanh(W_a H^T))
SubCore = C · H
where W_a and W_b are trainable matrices whose dimensions are trainable model hyper-parameters. The matrix C obtained from the formulas represents the model focusing on several specific regions of the SMILES sequence, and finally C is combined with the previous hidden state matrix H to obtain the final core fragment SubCore vector values.
According to another embodiment, there is provided a system for predicting activity based on semantic analysis of molecular SMILES expressions with bidirectional long-short memory core fragment recognition, comprising:
a data preprocessing unit;
a data encoding unit;
a bidirectional long and short memory core segment identification unit; and
a classification regressor,
wherein the system is adapted to perform the above method.
In another embodiment of the invention, the encoded training set and validation set data are loaded to the bidirectional long and short memory core segment identification unit, and the bidirectional long and short memory core segment identification unit is subjected to large-scale training and validation.
Drawings
To further clarify the above and other advantages and features of embodiments of the present invention, a more particular description of embodiments of the invention will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. In the drawings, the same or corresponding parts will be denoted by the same or similar reference numerals for clarity.
Fig. 1 illustrates a system for predicting activity based on molecular SMILES expression semantic analysis of two-way long-short memory core fragment recognition according to one embodiment of the present invention.
Fig. 2 shows a flowchart of a method for predicting activity based on molecular SMILES expression semantic analysis of two-way long-short memory core fragment recognition according to an embodiment of the present invention.
Fig. 3 shows an example of one-hot encoding according to the present invention.
Detailed Description
In the following description, the invention is described with reference to various embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details, or with other alternative and/or additional methods, materials, or components. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of embodiments of the invention. Similarly, for purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the embodiments of the invention. However, the invention may be practiced without specific details. Further, it should be understood that the embodiments shown in the figures are illustrative representations and are not necessarily drawn to scale.
Reference in the specification to "one embodiment" or "the embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
The invention adopts a bidirectional long short-term memory (BiLSTM) model, drawing on sequence learning methods from NLP, to obtain convenient modeling and considerable prediction performance. The accuracy and applicable scope of prediction with this method are greatly improved. Based on a deep learning model, the method can effectively extract features of the input information, including many previously undiscovered feature rules, and provides more accurate prediction results.
The invention provides a method for analyzing molecular SMILES expressions and predicting molecular bioactivity tendencies based on a bidirectional long-short memory core fragment recognition technique, which proceeds as follows. Three different public data sets are first preprocessed: DUD-E active and decoy compound data, HIV inhibitor data and anti-Plasmodium falciparum compound data are used as the raw model data sets, and activity indices (e.g., the half-maximal effective concentration EC50) are used as label values. The molecules of each data set are represented by SMILES, standardized and de-duplicated; a self-created word segmentation technique is applied to the SMILES data and a corresponding vocabulary is constructed; each sample data set is then randomly divided into a training set, a validation set and a test set in a certain proportion; with the help of the vocabulary, the SMILES are converted into vector form by word-vector embedding and input into the bidirectional long-short memory core fragment recognition network for training. Finally, the deep learning model built on the bidirectional long-short memory core fragment recognition technique is loaded, validated and compared with other baseline models on different evaluation indices, so that more accurate activity prediction results can be provided, offering a practical and effective new analysis method for structure-activity relationship studies.
Fig. 1 illustrates a system for predicting activity based on semantic analysis of molecular SMILES expressions with two-way long-short memory core fragment recognition according to one embodiment of the present invention. The system comprises a data preprocessing unit 101, a data encoding unit 102, a bidirectional long-short memory core fragment recognition unit 103 and a classification regressor 104. The specific functions of these units will be described below in conjunction with the method for predicting activity based on semantic analysis of molecular SMILES expressions with bidirectional long-short memory core fragment recognition.
Fig. 2 shows a flowchart of a method for predicting activity based on molecular SMILES expression semantic analysis of two-way long-short memory core fragment recognition according to an embodiment of the present invention.
First, at step 110, a data set is acquired.
In embodiments of the invention, the data set may comprise three open-source data sets: the DUD-E sample data set disclosed in "Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking", J. Med. Chem. 2012, 55, 6582-6594, DOI: 10.1021/jm300687e; the Drug efficacy activity data set disclosed in "Thousands of chemical starting points for antimalarial lead identification", Nature 2010, 465, 305-310, DOI: 10.1038/nature09107; and the HIV activity data set derived from the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen. The data set details are shown in Table 1 below:
TABLE 1 Basic information of the three public data sets (provided as an image in the original document)
The above embodiments give three examples of data sets; it should be clear to those skilled in the art that other data sets can also be used with the present invention.
In general, a compound is a positive sample for a biological activity prediction task as long as the corresponding biological activity of that compound has been reported in the literature. The DUD-E data set contains multiple different biological activities and therefore requires establishing multiple prediction tasks, i.e., a multi-classification task, while the Drug efficacy data, which carry explicit EC50 values, are set up as a regression task. Table 1 details the total amount of data and the distribution of positive and negative samples for all data sets.
According to the data set details, the three data sets are assigned to three different modeling tasks and modes, namely multi-classification, binary classification and regression, and training and prediction are carried out for each.
The three data sets are each divided into a Training set, a Validation set and a Test set in a certain proportion. The corresponding model is first trained using the Training set and the Validation set, and then evaluated using the Test set. In this process it is guaranteed that no data leakage occurs, since leakage would artificially inflate the test results. Specifically, the following relations are ensured to hold:
Training set ∩ Test set = Φ
Validation set ∩ Test set = Φ
where Φ represents the empty set.
To ensure that both are true, the entire data set is preprocessed at step 120. In an embodiment of the invention, the preprocessing of the entire data set includes a normalization process and a de-duplication process.
The data processing flow will be described in detail below based on the above.
First, all compound molecules in the data set are represented by SMILES for subsequent analysis. The molecules of each data set are expressed as specific linear SMILES strings. Following the ideas of graph theory and with the help of the open-source cheminformatics toolkit RDKit and the open-source data processing tool KNIME, the SMILES expressions of all molecules are standardized, unifying the encoding modes and order of atoms, bonds and connection relations in the molecular SMILES expressions. This operation ensures that all molecules use a uniform representation. Next, de-duplication is performed using the molecular InChIKey. This removes redundancy and ensures that the validation and test sets contain only data that never appear in the training set, improving the generalization ability of the model and the reliability of the results. To make this operation accurate, each SMILES expression is converted into the molecule's unique InChIKey (a hashed, 27-character compressed version of the InChI, commonly used for internet and database searching/indexing); by comparing InChIKeys, the SMILES whose InChIKeys are completely identical can be removed directly. Finally, each sample data set is randomly divided into a training set, a validation set and a test set in a certain proportion and then encoded. The division is random, with the ratio training set : validation set : test set = 7 : 1 : 2, and the random seed is controlled so that a previous division, and hence the whole data processing, can be reproduced.
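By way of illustration, the preprocessing described above can be sketched in Python as follows. The use of RDKit's canonical SMILES and InChIKey functions follows the description, but the function name, column handling and split logic are assumptions for the sketch, not the actual code of the invention.

```python
import random
from rdkit import Chem

def preprocess(smiles_list, labels, seed=42):
    """Canonicalize SMILES, de-duplicate by InChIKey, and split 7:1:2 (illustrative)."""
    seen_keys, records = set(), []
    for smi, label in zip(smiles_list, labels):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:                                  # drop unparsable entries
            continue
        canonical = Chem.MolToSmiles(mol)                # unified atom/bond ordering
        key = Chem.MolToInchiKey(mol)                    # 27-character hashed InChI
        if key in seen_keys:                             # de-duplicate identical molecules
            continue
        seen_keys.add(key)
        records.append((canonical, label))

    random.Random(seed).shuffle(records)                 # fixed seed -> reproducible split
    n = len(records)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    train = records[:n_train]
    valid = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]
    return train, valid, test
```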
Finally, the positive and negative samples in the data set to be predicted are assigned specific label values (or continuous EC50 values). The problem is thus transformed as follows: for classification, the model should output for each compound molecule a floating-point value in the interval [0, 1] indicating whether the molecule has a given biological activity; for regression, the model should predict the EC50 value of a particular biological activity for each compound molecule. In both cases the input is the SMILES sequence expression of the molecule, and the output is the predicted label or EC50 value.
Next, at step 130, the preprocessed data set is digitally encoded. In the embodiment of the invention, the SMILES sequences input to the bidirectional long-short memory core fragment recognition model need to be digitally encoded. The invention uses an improved One-Hot Encoding, and this type of encoding requires a dictionary as an index. The SMILES sequences are encoded by improving the original one-hot encoding scheme, in which a vocabulary must first be constructed, generally by extracting single characters directly from an analysis of the SMILES. The invention instead creates a novel word segmentation method that comprehensively considers chemical and informatics knowledge: a single element (such as C, c, etc.), a single number (such as 1, 2, etc.), a single symbol (such as ( and ) ), and a whole square-bracketed group (such as [nH]) are each treated as one sequence token. Each token has chemical meaning and directionality, and any combination of tokens conforms to chemical rules, which guarantees the authenticity and reliability of the subsequent exploration of expression composition rules.
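As an illustrative sketch only, the described word segmentation can be implemented with a regular expression. The pattern below is an assumption modeled on common SMILES tokenizers, not the patent's actual segmentation code.

```python
import re

# Bracketed atoms such as [nH] are kept whole; two-letter elements, single atoms,
# ring-closure digits and bond/branch symbols each become one token.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]"              # whole bracketed groups, e.g. [nH], [O-]
    r"|Br|Cl"                   # two-letter elements
    r"|%[0-9]{2}"               # two-digit ring closures
    r"|[A-Za-z]"                # single-letter / aromatic atoms
    r"|[0-9]"                   # ring-closure digits
    r"|[=#\-\+\(\)/\\@\.])"     # bonds, branches, charges, stereo marks
)

def tokenize(smiles):
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "tokenization must reproduce the input"
    return tokens

# Example: tokenize("CC(=O)Oc1ccccc1C(=O)O") ->
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', ...]
```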
Finally, after statistics over the entire data set, more than 70 kinds of tokens are found in the SMILES, and these tokens are used as the basic vocabulary. To facilitate the input of the later model and the use of the test set, the "<GO>" character is also used as a front padding word and the "<EOS>" character as a rear padding word, giving 76 words in total. Each token in a SMILES is assigned a positive integer value according to the word list; this value is then used as an index into a 76-dimensional vector, with the number "1" set at the indexed position and the number "0" at all other positions of the vector. The whole sequence is thus converted into an L × 76 matrix, where L is the padded SMILES length. The complete digital encoding is shown in Table 2 below.
TABLE 2 Tokens and their corresponding numerical codes (provided as images in the original document)
After the tokens of a SMILES are encoded, one-hot encoding converts the positive integer corresponding to each token into a vector whose dimension d is the size of the dictionary; in this embodiment d is 80, corresponding to positions 0 through 79. According to the positive integer code of each token, the value "1" is set at the corresponding position and the value "0" at all other positions. Therefore, if the original length of a SMILES sequence is L_0, it becomes a sequence of length L after equal-length padding, and after one-hot encoding it finally becomes an L × 80 matrix; in this experiment the padded equal length is set to L = 13.
Fig. 3 shows an example of a one-hot encoding according to the present invention, assuming in this example the equal length L of the padding is 13 for convenience of illustration.
In general, after one-hot encoding, the SMILES sequence is converted into a word embedding (Word embedding) matrix S:
S = (w_1, w_2, ..., w_L)^T    (3)
where each w is a d-dimensional row vector corresponding to a one-hot vector, and the dimension of the word embedding matrix S is therefore L × d.
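A minimal sketch of this encoding step is given below; the vocabulary, padding words and dimensions are illustrative placeholders, while the full 76/80-entry vocabulary is given in Table 2.

```python
import numpy as np

def encode_one_hot(tokens, vocab, length):
    """Map tokens to integers via the vocabulary, pad with <GO>/<EOS>, expand to one-hot rows."""
    d = len(vocab)
    padded = ["<GO>"] + tokens + ["<EOS>"] * (length - len(tokens) - 1)
    S = np.zeros((length, d), dtype=np.float32)      # word-embedding matrix, L x d
    for i, tok in enumerate(padded[:length]):
        S[i, vocab[tok]] = 1.0                       # "1" at the token's index, "0" elsewhere
    return S

# Illustrative mini-vocabulary (placeholder for the Table 2 word list).
vocab = {tok: i for i, tok in enumerate(["<GO>", "<EOS>", "C", "c", "O", "N", "(", ")", "=", "1"])}
S = encode_one_hot(["C", "C", "O"], vocab, length=13)    # shape (13, 10)
```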
Next, in step 140, a two-way long-short memory core fragment recognition model is constructed.
In the embodiment of the invention, aiming at different data set compositions, a regression prediction model can be established for the same target according to different activity indexes, and a multi-target prediction model can also be established for all target data.
The DUD-E data set contains activity data for a plurality of targets, and one molecule can have different activities against different targets, so a multi-target prediction model is constructed. The activities of the same molecule against different targets are represented in parallel by a one-hot (one-dimensional one-hot) code. In total, 10 prediction tasks are established, and according to the activity index of a molecule for each prediction task, a specific tag value is set at the corresponding state index position of the vector, for example: "1" for active ("Positive") and "0" for inactive ("Negative"). Similarly, the HIV data set is tagged directly with the activity index of the sample molecules.
The anti-Plasmodium falciparum compound data, with the half-maximal effective concentration EC50 as the label value, are used to construct a regression prediction model, and the structure-activity relationship of the compound molecules is explored by taking EC50 as the regression value.
In order to capture correlations among the tokens within each SMILES sequence, the word embedding matrix S is input into the bidirectional long-short memory core fragment recognition network, and the hidden state propagated through the network is obtained via a series of gated adjustments and transformations. First, from the current input x_t and the hidden state h_{t-1} passed from the previous step, four states z, z_i, z_f and z_o are obtained through training with different weights. Here z is converted into a value between -1 and 1 by a tanh activation function, while z_i, z_f and z_o are converted by the activation function into values between 0 and 1 to serve as gating states.
z = tanh(W · [x_t, h_{t-1}])    (4)
z_i = σ(W_i · [x_t, h_{t-1}])    (5)
z_f = σ(W_f · [x_t, h_{t-1}])    (6)
z_o = σ(W_o · [x_t, h_{t-1}])    (7)
where σ is the sigmoid activation function and W is the network weight.
The input passed from the previous node is then selectively forgotten through z_f and selectively memorized through z_i; the resulting cell state c_t, unlike h_t in an RNN, changes little from node to node and is passed on slowly. Finally, the resulting hidden state is selectively output through z_o.
c_t = z_f · c_{t-1} + z_i · z    (8)
h_t = z_o · tanh(c_t)    (9)
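A minimal numpy sketch of one gated update step, following equations (4)-(9), is given below. Bias terms are omitted as in the formulas, and the weight shapes and sigmoid gates are assumptions consistent with the stated (0, 1) range.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, W_i, W_f, W_o):
    """One gated update; each weight matrix acts on the concatenated [x_t, h_{t-1}]."""
    v = np.concatenate([x_t, h_prev])        # [x_t, h_{t-1}]
    z   = np.tanh(W @ v)                     # candidate state, in (-1, 1)      eq. (4)
    z_i = sigmoid(W_i @ v)                   # input gate,  in (0, 1)            eq. (5)
    z_f = sigmoid(W_f @ v)                   # forget gate, in (0, 1)            eq. (6)
    z_o = sigmoid(W_o @ v)                   # output gate, in (0, 1)            eq. (7)
    c_t = z_f * c_prev + z_i * z             # slowly-changing cell state        eq. (8)
    h_t = z_o * np.tanh(c_t)                 # hidden state passed onward        eq. (9)
    return h_t, c_t
```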
The bidirectional long-short memory core fragment recognition network constructed in this way not only uses the bidirectional architecture of a conventional long short-term memory network, but also originally creates a core fragment recognition unit, which greatly improves the model results. The model first acquires information in two different directions by constructing two recurrent neural networks, both connected to the same input layer. This structure can provide complete context information for each unit structure in the previous layer. One layer propagates information forward over the time steps, updating the information of all hidden layers; the other layer propagates information in the direction opposite to the previous layer. The hidden layer values in the two directions are obtained by first computing the output layer, and the encoded hidden state vectors from the two directions are finally spliced into a matrix. Since the transfer is bidirectional, hidden states in both directions are finally obtained:
h_t^(f) = LSTM_f(w_t)    (10)
h_t^(b) = LSTM_b(w_t)    (11)
where t denotes the time step.
The next step is to splice h_t^(f) and h_t^(b) to form the hidden state h_t at time t, i.e.
h_t = [h_t^(f), h_t^(b)]
If the number of hidden units in each direction of the LSTM is set to u, the dimension of h_t is 1 × 2u, and all time steps are then spliced together to obtain a hidden state matrix H.
H = (h_1, h_2, ..., h_L)^T    (12)
where the dimension of H is L × 2u.
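For illustration, the bidirectional layer and the resulting L × 2u hidden state matrix H can be sketched with a standard PyTorch BiLSTM; the dimensions below are illustrative assumptions, not the patent's tuned hyper-parameters.

```python
import torch
import torch.nn as nn

L, d, u = 100, 80, 64                        # padded length, one-hot dim, hidden units per direction
bilstm = nn.LSTM(input_size=d, hidden_size=u, batch_first=True, bidirectional=True)

S = torch.randn(1, L, d)                     # stand-in for one encoded SMILES matrix
H, _ = bilstm(S)                             # forward and backward states, concatenated per step
print(H.shape)                               # torch.Size([1, 100, 128])  ->  H is L x 2u
```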
In addition, the core recognition fragment unit originally created in the model disclosed in the embodiment of the invention enables the model to focus on different partial regions of the hidden state matrix. The principle is to assign different weight values to these regions, with the following formulas:
C = softmax(W_b tanh(W_a H^T))    (13)
SubCore = C · H    (14)
where W_a and W_b are trainable matrices whose dimensions are trainable model hyper-parameters. The matrix C obtained from the formulas represents the model focusing on several specific regions of the SMILES sequence. Finally, C is combined with the previous hidden state matrix H to obtain the final core fragment SubCore vector values.
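A hedged PyTorch sketch of the core-fragment unit in equations (13)-(14) follows; the attention dimension d_a and the number of attention rows r are illustrative hyper-parameters, not values fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoreFragmentAttention(nn.Module):
    """C = softmax(W_b · tanh(W_a · H^T)); SubCore = C · H."""
    def __init__(self, hidden_dim, d_a=64, r=10):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim, d_a, bias=False)   # trainable matrix W_a
        self.W_b = nn.Linear(d_a, r, bias=False)             # trainable matrix W_b

    def forward(self, H):                       # H: (batch, L, 2u)
        C = F.softmax(self.W_b(torch.tanh(self.W_a(H))), dim=1)  # weights over sequence positions
        C = C.transpose(1, 2)                   # (batch, r, L)
        return C @ H                            # SubCore: (batch, r, 2u)
```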
For the final optimization, the last layer uses a fully connected layer: a linear layer (Linear) converts the hidden state matrix into an output of the set dimension. The formula is as follows:
O_r = Linear(H)    (15)
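For completeness, a minimal sketch of the final fully connected mapping in equation (15) is shown below; the output dimension and the flattening of the input are assumptions that depend on the task (1 for EC50 regression, the number of targets for multi-label tasks).

```python
import torch
import torch.nn as nn

r, two_u, n_out = 10, 128, 1                     # illustrative dimensions
linear = nn.Linear(r * two_u, n_out)             # maps the attention output to the task output

sub_core = torch.randn(4, r, two_u)              # stand-in for a batch of SubCore matrices
out = linear(sub_core.flatten(start_dim=1))      # shape: (4, 1)
```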
The model disclosed in this invention is primarily concerned with the hyper-parameters listed in Table 3 below; other parameters may be found in the actual code.
TABLE 3 Main hyper-parameters tuned in the model (provided as an image in the original document)
At step 150, the model is evaluated. Since the task of the invention is regression, the index used for evaluation is the Mean Squared Error (MSE), which measures the deviation between the predicted and true values of the regression model. MSE is the most common regression loss function; it is computed as the mean of the squared distances between the predicted values and the true values, with the following formula:
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²
where y is the true label of the sample, ŷ is the result predicted by the model, and n is the total number of samples.
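A minimal sketch of this evaluation metric (the values shown are purely illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between predicted and true label values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([0.2, 0.8, 0.5], [0.25, 0.7, 0.55]))   # 0.004166...
```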
The invention provides a method for analyzing molecular SMILES expressions and predicting molecular biological activity based on a bidirectional long short-term memory core fragment recognition technique. By using a deep learning model, the invention can effectively extract features of the input information, including many previously undiscovered hidden feature rules. The invention is broadly applicable to any activity that can be predicted for compound molecules. Compared with conventional SAR analysis or activity prediction models of the same kind, the time required for prediction is greatly reduced and the results are more accurate, so users can obtain prediction results more quickly. In addition, the method can rapidly feed back the core substructure fragments, which provides a certain degree of chemical guidance.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various combinations, modifications, and changes can be made thereto without departing from the spirit and scope of the invention. Thus, the breadth and scope of the present invention disclosed herein should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (10)

1. A method for predicting the activity of a small drug molecule based on a two-way long-short memory model, comprising the following steps:
acquiring a data set;
preprocessing the data set, including representing all compound molecules in the data set by SMILES, standardizing the SMILES expressions of all molecules, unifying the encoding modes and order of atoms, bonds and connection relations in the molecular SMILES expressions, and performing de-duplication using the InChIKey of each molecule;
encoding the preprocessed data set, wherein each single element, single number, single symbol and whole square-bracketed group of a SMILES sequence is treated as one sequence token for one-hot encoding, each token has chemical meaning and directionality, and any combination of tokens conforms to chemical rules;
constructing a bidirectional long-short memory core segment recognition model;
inputting the encoded data into the bidirectional long-short memory core segment recognition model to obtain a hidden state matrix; and
evaluating the two-way long-short memory core segment recognition model.
2. The method for predicting the activity of a small molecule of a drug based on a two-way long-short memory model according to claim 1, wherein the data set comprises three open-source data sets.
3. The method of claim 1, wherein the de-duplication process using the molecular InChIKey includes converting each SMILES expression into the molecule's unique InChIKey and, by comparing InChIKeys, directly removing the SMILES whose InChIKeys are completely identical, and
the preprocessing of the data sets further comprises the step of randomly dividing each data set into a training set, a verification set and a test set according to a certain proportion.
4. The method for predicting the activity of a small drug molecule based on the two-way long-short memory model as claimed in claim 1, further comprising converting the sequence of positive integers corresponding to each token into a vector, and converting the sequence of SMILES into a word-embedding matrix S:
S = (w_1, w_2, ..., w_L)^T
where each w is a d-dimensional row vector.
5. The method for predicting the activity of small molecules of drugs based on the two-way long-short memory model as claimed in claim 4, wherein the word embedding matrix S is input into the two-way long-short memory core segment recognition model, and from the current input x_t and the hidden state h_{t-1} passed from the previous step, four states z, z_i, z_f and z_o are obtained through training with different weights,
where z is converted into a value between -1 and 1 by a tanh activation function, and z_i, z_f and z_o are converted by the activation function into values between 0 and 1 to serve as gating states,
z = tanh(W · [x_t, h_{t-1}])
z_i = σ(W_i · [x_t, h_{t-1}])
z_f = σ(W_f · [x_t, h_{t-1}])
z_o = σ(W_o · [x_t, h_{t-1}])
where σ is the sigmoid activation function and W is the network weight,
the input passed from the previous node is then selectively forgotten through z_f and selectively memorized through z_i, so that the cell state c_t, unlike h_t in an RNN, changes little from node to node and is passed on slowly, and finally the resulting hidden state h_t is selectively output through z_o:
c_t = z_f · c_{t-1} + z_i · z
h_t = z_o · tanh(c_t).
6. The method for predicting the activity of small drug molecules based on the two-way long-short memory model as claimed in claim 5, wherein the two-way long-short memory core segment recognition model comprises two recurrent neural networks for obtaining information in two different directions, both layers being connected to the same input layer, wherein one layer propagates information forward over the time steps to update the information of all hidden layers, the other layer propagates information in the direction opposite to the previous layer, and the encoded hidden state vectors in the different directions are spliced into a matrix after the hidden layer values in the different directions are obtained by computing the output layer.
7. The method for predicting the activity of small drug molecules based on the two-way long-short memory model as claimed in claim 6, wherein the hidden states h_t in the two directions are
h_t^(f) = LSTM_f(w_t)
h_t^(b) = LSTM_b(w_t)
where t denotes the time step,
h_t^(f) and h_t^(b) are spliced to form the hidden state h_t at time t, i.e.
h_t = [h_t^(f), h_t^(b)]
and if the number of hidden units in each direction of the LSTM is set to u, the dimension of h_t is 1 × 2u, and all time steps are then spliced to obtain a hidden state matrix H
H = (h_1, h_2, ..., h_L)^T
where the dimension of H is L × 2u.
8. The method for predicting the activity of small molecules of a drug based on the two-way long-short memory model of claim 7, wherein the core recognition fragment unit originally created in the model enables the model to focus on different partial regions of the hidden state matrix, the principle being to assign different weight values C to these regions, with the following formulas:
C = softmax(W_b tanh(W_a H^T))
SubCore = C · H
where W_a and W_b are trainable matrices whose dimensions are trainable model hyper-parameters, the matrix C obtained from the formulas represents the model focusing on several specific regions of the SMILES sequence, and finally the weights C are combined with the previous hidden state matrix H to obtain the final core segment SubCore vector values.
9. A system for predicting the activity of small drug molecules based on a two-way long-short memory model comprises:
a data preprocessing unit;
a data encoding unit;
a bidirectional long and short memory core segment identification unit; and
a classification regressor,
wherein the system is configured to perform the method of any one of claims 1 to 8.
10. The system for predicting the activity of small molecules of a drug based on a two-way long-short memory model as claimed in claim 9, wherein the encoded training set and validation set data are loaded to the two-way long-short memory core segment recognition unit, and the two-way long-short memory core segment recognition unit is trained and validated on a large scale.
CN202010464590.9A 2020-05-27 2020-05-27 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model Pending CN111640471A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010464590.9A CN111640471A (en) 2020-05-27 2020-05-27 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010464590.9A CN111640471A (en) 2020-05-27 2020-05-27 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model

Publications (1)

Publication Number Publication Date
CN111640471A true CN111640471A (en) 2020-09-08

Family

ID=72329534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010464590.9A Pending CN111640471A (en) 2020-05-27 2020-05-27 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model

Country Status (1)

Country Link
CN (1) CN111640471A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164427A (en) * 2020-09-23 2021-01-01 常州微亿智造科技有限公司 Method and device for predicting activity of small drug molecule target based on deep learning
CN112164426A (en) * 2020-09-22 2021-01-01 常州微亿智造科技有限公司 Drug small molecule target activity prediction method and device based on TextCNN
CN112786120A (en) * 2021-01-26 2021-05-11 云南大学 Method for synthesizing chemical material with assistance of neural network
CN112786108A (en) * 2021-01-21 2021-05-11 北京百度网讯科技有限公司 Molecular understanding model training method, device, equipment and medium
CN114049922A (en) * 2021-11-09 2022-02-15 四川大学 Molecular design method based on small-scale data set and generation model
CN114187978A (en) * 2021-11-24 2022-03-15 中山大学 Compound optimization method based on deep learning connection fragment
WO2023065220A1 (en) * 2021-10-21 2023-04-27 深圳阿尔法分子科技有限责任公司 Chemical molecule related water solubility prediction method based on deep learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491680A (en) * 2018-03-07 2018-09-04 安庆师范大学 Drug relationship abstracting method based on residual error network and attention mechanism
CN109408812A (en) * 2018-09-30 2019-03-01 北京工业大学 A method of the sequence labelling joint based on attention mechanism extracts entity relationship

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491680A (en) * 2018-03-07 2018-09-04 安庆师范大学 Drug relationship abstracting method based on residual error network and attention mechanism
CN109408812A (en) * 2018-09-30 2019-03-01 北京工业大学 A method of the sequence labelling joint based on attention mechanism extracts entity relationship

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHUANGJIA ZHENG ET AL: "Identifying Structure−Property Relationships through SMILES Syntax Analysis with Self-Attention Mechanism" *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164426A (en) * 2020-09-22 2021-01-01 常州微亿智造科技有限公司 Drug small molecule target activity prediction method and device based on TextCNN
CN112164427A (en) * 2020-09-23 2021-01-01 常州微亿智造科技有限公司 Method and device for predicting activity of small drug molecule target based on deep learning
CN112786108A (en) * 2021-01-21 2021-05-11 北京百度网讯科技有限公司 Molecular understanding model training method, device, equipment and medium
CN112786108B (en) * 2021-01-21 2023-10-24 北京百度网讯科技有限公司 Training method, device, equipment and medium of molecular understanding model
CN112786120A (en) * 2021-01-26 2021-05-11 云南大学 Method for synthesizing chemical material with assistance of neural network
CN112786120B (en) * 2021-01-26 2022-07-05 云南大学 Method for synthesizing chemical material with assistance of neural network
WO2023065220A1 (en) * 2021-10-21 2023-04-27 深圳阿尔法分子科技有限责任公司 Chemical molecule related water solubility prediction method based on deep learning
CN114049922A (en) * 2021-11-09 2022-02-15 四川大学 Molecular design method based on small-scale data set and generation model
CN114187978A (en) * 2021-11-24 2022-03-15 中山大学 Compound optimization method based on deep learning connection fragment

Similar Documents

Publication Publication Date Title
CN111640471A (en) Method and system for predicting activity of drug micromolecules based on two-way long-short memory model
Nadif et al. Unsupervised and self-supervised deep learning approaches for biomedical text mining
Bertolazzi et al. Learning to classify species with barcodes
Zhou et al. Time series forecasting and classification models based on recurrent with attention mechanism and generative adversarial networks
Karim et al. Toxicity prediction by multimodal deep learning
Asgari et al. DeepPrime2Sec: deep learning for protein secondary structure prediction from the primary sequences
Lawrence et al. Evolving deep architecture generation with residual connections for image classification using particle swarm optimization
Lu et al. Extracting chemical-protein interactions from biomedical literature via granular attention based recurrent neural networks
Yu et al. Perturbnet predicts single-cell responses to unseen chemical and genetic perturbations
Zhang et al. protein2vec: predicting protein-protein interactions based on LSTM
Zeng et al. Automatic melody harmonization via reinforcement learning by exploring structured representations for melody sequences
Al-Saffar et al. A sequential handwriting recognition model based on a Dynamically configurable CRNN
Rahman et al. IDMIL: an alignment-free Interpretable Deep Multiple Instance Learning (MIL) for predicting disease from whole-metagenomic data
Hu et al. Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series
Leng et al. Bi-level artificial intelligence model for risk classification of acute respiratory diseases based on Chinese clinical data
Stoean et al. Author identification using chaos game representation and deep learning
Fan et al. Distribution structure learning loss (DSLL) based on deep metric learning for image retrieval
Duong et al. Evaluating representations for gene ontology terms
Christou Feature extraction using latent dirichlet allocation and neural networks: a case study on movie synopses
Shen et al. Chinese knowledge base question answering by attention-based multi-granularity model
Tuggener et al. Design patterns for resource-constrained automated deep-learning methods
CN116313148A (en) Drug sensitivity prediction method, device, terminal equipment and medium
Tohti et al. Medical qa oriented multi-task learning model for question intent classification and named entity recognition
Dubois et al. Effective representations of clinical notes
Khosa et al. Unifying Sentence Transformer Embedding and Softmax Voting Ensemble for Accurate News Category Prediction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210318

Address after: Room 202, building 1, 366 Tongyun street, Liangzhu street, Yuhang District, Hangzhou City, Zhejiang Province

Applicant after: Hangzhou derizhi Pharmaceutical Technology Co.,Ltd.

Address before: 11 / F, building 15, Singapore Science Park, Qiantang New District, Hangzhou, Zhejiang 310000

Applicant before: Niu Zhangming

Applicant before: Wade Menpes Smith