CN116612835A - Training method for compound property prediction model and prediction method for compound property - Google Patents

Training method for compound property prediction model and prediction method for compound property

Info

Publication number
CN116612835A
CN116612835A
Authority
CN
China
Prior art keywords
compound
training
property
prediction model
compounds
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310877635.9A
Other languages
Chinese (zh)
Other versions
CN116612835B (en)
Inventor
耿威 (Geng Wei)
李世博 (Li Shibo)
徐敏捷 (Xu Minjie)
吕川 (Lyu Chuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Micro Era Digital Technology Co ltd
Original Assignee
Micro Era Hefei Quantum Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Micro Era Hefei Quantum Technology Co ltd filed Critical Micro Era Hefei Quantum Technology Co ltd
Priority to CN202310877635.9A priority Critical patent/CN116612835B/en
Publication of CN116612835A publication Critical patent/CN116612835A/en
Application granted granted Critical
Publication of CN116612835B publication Critical patent/CN116612835B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0985 Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Mathematical Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a training method for a compound property prediction model and a method for predicting compound properties. The training method comprises the following steps: acquiring a first data set comprising a plurality of sets of training data, each set of training data comprising a first compound and a property of the first compound; encoding the plurality of first compounds in the training data set according to a preset feature database to obtain feature representations of the plurality of first compounds in the training data set; and training a preset neural network model with the encoded training data set to obtain a compound property prediction model. Because the first compounds input during training of the neural network model carry feature representations, the training task is lightened, which solves the technical problems that deep models are difficult to train and a large number of hyperparameters must be tuned.

Description

Training method for compound property prediction model and prediction method for compound property
Technical Field
The invention relates to the technical field of pharmaceutical chemistry, and in particular to a training method for a compound property prediction model and a method for predicting compound properties.
Background
The task of molecular property prediction is at the core of the drug discovery application scenario. In particular, after a long and expensive development process, a designed small-molecule drug may show excellent targeting or utility, yet nearly 70% of drug candidates still fail clinical trials because of the toxicity or biological activity of the small molecule. If a drug development team could assess the physicochemical properties of the small-molecule drugs to be designed at scale at the beginning of the design stage, drug development failures could be effectively avoided.
For predicting the physicochemical properties of small-molecule drugs, two main methods are currently recognized. The first is molecular modeling based on fundamental physics: the physicochemical properties of a compound are finally obtained from the configuration and simulation of energies and force fields. The second is modeling and prediction based on existing data. Deep Neural Networks (DNNs) are widely used for modeling in the field of drug discovery, but DNNs have many problems in practical applications, the biggest of which is that deep models are difficult to train and carry a large number of hyperparameters to tune.
Disclosure of Invention
Therefore, embodiments of the present invention provide a training method for a compound property prediction model and a method for predicting compound properties, so as to solve the problems that deep DNN models are difficult to train and require tuning a large number of hyperparameters.
According to a first aspect, an embodiment of the present invention provides a training method for a compound property prediction model, comprising the following steps: obtaining a first data set comprising a plurality of sets of training data, each set of training data comprising a first compound and a property of the first compound; encoding the plurality of first compounds in the training data set according to a preset feature database to obtain feature representations of the plurality of first compounds in the training data set; and training a preset neural network model with the encoded training data set to obtain a compound property prediction model, wherein the encoded training data set comprises a plurality of sets of training data, each set comprising a feature representation of a first compound and the property of that first compound.
Specifically, before encoding the plurality of first compounds in the training data set according to the preset feature database, the method further includes: acquiring a small-molecule drug database as a second data set, wherein the second data set comprises the molecular SMILES formulas of a plurality of small-molecule drugs serving as second compounds; and extracting features of the plurality of second compounds to obtain the feature database, wherein the feature database comprises a plurality of molecular features.
Specifically, extracting features of the plurality of second compounds to obtain the feature database includes: pre-training a preset Transformer model with the plurality of second compounds to obtain the feature database.
Specifically, the Transformer model comprises a first molecular attention converter, a first addition-normalization layer, a second molecular attention converter, a second addition-normalization layer, a feed-forward network layer, a third addition-normalization layer, a linear connection layer, and a normalization layer, connected in sequence.
Specifically, the first molecular attention converter and the second molecular attention converter each include an atomic characterization layer that calculates a modified attention representation based on the following attention formula:

$$\mathrm{Attention}(Q_i, K, V_i) = \left( \lambda_a \, \rho\!\left( \frac{Q_i K^{\top}}{\sqrt{d_k}} \right) + \lambda_d \, g(D) + \lambda_g \, A \right) V_i$$

where λ = (λ_a, λ_d, λ_g) is a set of hyperparameters; Q_i is the i-th matrix obtained by multiplying the input data of the Transformer model with the corresponding weight matrix, and K and V_i are likewise matrices obtained by multiplying the input data of the Transformer model with their corresponding weight matrices; ρ denotes the softmax function; D denotes the interatomic distance matrix; A denotes the domain (adjacency) matrix; the function g denotes a softmax-based normalization of the distance matrix; d_k is the matrix dimension of Q_i and K; and Attention, the modified attention representation, is the output of the atomic characterization layer.
According to a second aspect, an embodiment of the present invention further provides a method for predicting compound properties, comprising the following steps: obtaining the structural formula of a compound to be predicted; and inputting the structural formula into a compound property prediction model obtained by the training method of the compound property prediction model according to the first aspect or any one of its embodiments, so as to obtain the property of the compound to be predicted.
Specifically, the method for predicting compound properties further comprises: obtaining the property to be predicted of the compound to be predicted. Inputting the structural formula into the compound property prediction model obtained by the training method of the compound property prediction model according to the first aspect or any one of its embodiments then includes: searching, among the compound property prediction models obtained by the training method of the first aspect or any one of its embodiments, for the prediction model corresponding to the property to be predicted, and inputting the structural formula into the found prediction model.
According to a third aspect, an embodiment of the present invention further provides a training device for the compound property prediction model, comprising a first acquisition module, a processing module, and a training module. The first acquisition module is configured to acquire a first data set, wherein the first data set comprises a plurality of sets of training data, each set comprising a first compound and a property of the first compound; the processing module is configured to encode the plurality of first compounds in the training data set according to a preset feature database to obtain feature representations of the plurality of first compounds in the training data set; and the training module is configured to train a preset neural network model using the feature representations of the plurality of first compounds and the properties of the plurality of first compounds in the training data set to obtain a compound property prediction model.
According to a fourth aspect, an embodiment of the present invention further provides a device for predicting compound properties, comprising a second acquisition module and a prediction module. The second acquisition module is configured to acquire the structural formula of the compound to be predicted; the prediction module is configured to input the structural formula into the compound property prediction model obtained by the training method of the compound property prediction model according to the first aspect or any one of its embodiments, so as to obtain the property of the compound to be predicted.
According to a fifth aspect, an embodiment of the present invention further provides an electronic device comprising a memory and a processor communicatively connected to each other, the memory storing computer instructions; the processor executes the computer instructions to perform the training method of the compound property prediction model according to the first aspect or any one of its embodiments, or the method for predicting compound properties according to the second aspect or any one of its embodiments.
The scheme of the embodiment of the invention has the following beneficial effects:
according to the training method, the device and the electronic equipment for the compound property prediction model, the characteristic representations of the plurality of first compounds in the training data set are obtained by encoding the plurality of first compounds in the training data set according to the preset characteristic database, and then the training data set after encoding can be used for training the preset neural network model. Because the first compounds with characteristic representation are input during training of the neural network model, the training task is lightened during training of the neural network model, and the technical problems that a deep model is difficult to train and a large number of super parameters are required to be adjusted are solved.
Furthermore, the training of the compound property prediction model is divided into two parts: training the feature database and training the neural network model. Training the feature database only requires extracting molecular features and does not require physicochemical property data of small-molecule drugs, so a small-molecule drug database can be used, satisfying the data-volume requirement for training the feature database. Meanwhile, because the first compounds input during neural network training carry feature representations, the neural network model requires far less training data.
Further, a molecular attention converter is used in the Transformer model. Thus, not only the self-attention over the molecular structure is considered, but also the interatomic distances and the neighborhood of the atomic graph, so that the Transformer model can capture global information within a small molecule.
According to the method and device for predicting compound properties and the electronic equipment provided by the embodiments of the present invention, the property of a compound to be predicted can be obtained simply by inputting its structural formula, so that a drug development team can assess the physicochemical properties of candidate small-molecule drugs at scale at the beginning of the design stage, effectively avoiding drug development failures.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and should not be construed as limiting the invention in any way. In the drawings:
FIG. 1 is a schematic flow chart of a method for training a compound property prediction model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the network architecture of a Transformer model employing molecular attention converters;
FIG. 3 is a schematic flow chart of a method for predicting properties of a compound according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a device for training a model for predicting the properties of a compound according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a device for predicting properties of a compound according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an example of an electronic device.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
The embodiment of the invention provides a training method of a compound property prediction model. Fig. 1 is a flow chart of a training method of a compound property prediction model according to an embodiment of the present invention, as shown in fig. 1, the training method of a compound property prediction model according to an embodiment of the present invention includes the following steps:
s101: a first dataset is acquired, the first dataset comprising a plurality of sets of training data, each set of training data comprising a first compound and a property of the first compound.
In the embodiments of the present invention, the property of the first compound refers to a physicochemical property of the first compound, such as toxicity data, water-solubility data, side effects, or adverse drug reaction data.
The first data set may be selected from public data sets, such as: (1) the BBBP data set: blood-brain barrier permeability data; (2) the CLINTOX data set: toxicity data for FDA-approved drugs and clinical-failure drugs; (3) the HIV data set: data on whether compounds have HIV-inhibiting capability; (4) the MUV data set: validation data for virtual screening techniques; (5) the SIDER data set: side-effect and adverse drug reaction data; (6) the TOX21 data set: compound toxicology data; (7) the ESOL data set: compound water-solubility data; (8) the HPPB data set: thermodynamic solubility data; (9) the LIPO data set: drug lipophilicity data; (10) the PCBA data set: small-molecule bioactivity data; (11) the FreeSolv data set: experimental and calculated free-energy data for small molecules in water.
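As an illustration, assembling the first data set reduces to pairing each compound's SMILES formula with its measured property. A minimal sketch, assuming a local CSV export of one of these sets (the file name and column names are illustrative assumptions):

```python
# Build (first compound, property) training pairs from a CSV export,
# e.g., of the ESOL water-solubility data. Column names are assumptions.
import pandas as pd

df = pd.read_csv("esol.csv")  # assumed columns: "smiles", "solubility"
training_data = list(zip(df["smiles"], df["solubility"]))
print(training_data[:3])  # [(smiles, property), ...]
```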
S102: and encoding the plurality of first compounds in the training data set according to a preset characteristic database to obtain characteristic representations of the plurality of first compounds in the training data set.
In the embodiment of the invention, the preset feature database comprises a plurality of molecular features.
As a further embodiment, the method further includes constructing the feature database. That is, before encoding the plurality of first compounds in the training data set according to the preset feature database to obtain the feature representations of the plurality of first compounds, the training method of the compound property prediction model further includes: acquiring a small-molecule drug database as a second data set, wherein the second data set comprises the molecular SMILES formulas of a plurality of small-molecule drugs serving as second compounds; and extracting features of the plurality of second compounds to obtain the feature database.
The second data set may be the small-molecule drug database ZINC, which includes a plurality of second compounds, where each second compound may be the molecular SMILES (Simplified Molecular-Input Line-Entry System) formula of a small-molecule drug. The ZINC database contains a large amount of data, but these data do not necessarily include trusted physicochemical property data; they only contain structural data of small molecules confirmed to exist in specific environments, most of which are represented as SMILES. Here, the ZINC database is used not to predict the physicochemical properties of molecules but to obtain the feature database; the feature database may also be called a pre-trained model, and the process of deriving the feature database from the second data set is called pre-training. A small-molecule drug is an organic compound with a molecular weight below 1000, with advantages such as broad applicability and theoretical maturity. By some counts, small-molecule drugs may account for 98% of drugs in common use.
A popular explanation of pre-training is this: use a large amount of "inexpensive" training data (here, the small-molecule drug database ZINC), learn the commonalities of the data (here, molecular features) through a training process, and then transfer these "commonalities" to the prediction of particular physicochemical properties of compounds. As a result, only a small amount of labeled data is needed to complete the training task for a given physicochemical property prediction model.
As a specific embodiment, the feature database is obtained by pre-training a preset Transformer model with the plurality of second compounds.
The Transformer model is a milestone of absolute significance in the history of deep learning: it not only raised the level of natural language processing (NLP) exponentially, but was later also applied to the image domain and has been widely accepted by the industry.
The most central part of the classical Transformer model is the self-attention mechanism, which appears in both the encoding and decoding modules of the Transformer. In the self-attention structure, the matrices Q, K, and V are calculated, and the output Z is computed from them; these three matrices can be understood as the results of different linear transformations of the raw data.
Multi-Head attention consists of multiple self-attention layers; the Z matrices calculated by all heads are concatenated and then linearly transformed to obtain the final output Z'.
The output Z' then passes through a feed-forward network (Feed Forward), a normalization layer (Add & Norm), and so on, yielding the complete Transformer structure used for the pre-training module.
As a further embodiment of the above embodiment, a molecular attention converter (Molecule Attention Transformer) is employed in the Transformer model. As shown in FIG. 2, the Transformer model with the molecular attention converter modifies the conventional Transformer so that not only the self-attention over the molecular structure is considered but also the interatomic distances and the neighborhood of the atomic graph, allowing the Transformer model to capture global information within a small molecule. The molecular attention converter is used to pre-train on the ZINC data; through the encode-decode process and by minimizing a cross-entropy loss function, small-molecule features based on the ZINC data can finally be extracted.
The molecular attention converter derives from the network architecture of the classical Transformer model and improves on it. As shown in FIG. 2, the Transformer model includes a first molecular attention converter, a first addition-normalization layer, a second molecular attention converter, a second addition-normalization layer, a feed-forward network layer, a third addition-normalization layer, a linear connection layer, and a normalization layer, connected in sequence. The first addition-normalization layer adds the 26-dimensional input and the output of the first molecular attention converter and then normalizes the sum, using layer normalization (Layer Normalization). The second addition-normalization layer adds the input and output of the second molecular attention converter and then performs the same normalization. The third addition-normalization layer adds the input and output of the feed-forward network layer and then performs the same normalization.
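For illustration, an addition-normalization (Add & Norm) layer over the 26-dimensional input mentioned above can be sketched as follows (a PyTorch sketch under our own assumptions, not the patent's code):

```python
# Residual addition followed by layer normalization (Add & Norm).
import torch
import torch.nn as nn

norm = nn.LayerNorm(26)  # layer normalization over the 26-dim features

def add_norm(x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
    # Add the sublayer's input to its output, then normalize the sum.
    return norm(x + sublayer_out)

x = torch.rand(8, 26)                        # e.g., 8 atoms, 26-dim features
print(add_norm(x, torch.rand(8, 26)).shape)  # torch.Size([8, 26])
```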
The molecular attention converters (the first and the second) modify the Multi-Head part, i.e., the self-attention part, of the Transformer network architecture. The classical Transformer model can indeed be applied directly to the SMILES molecular description, but this would lose much spatial information. Therefore, an atomic characterization layer is embedded into the molecular attention converter to replace the original Multi-Head part, so that the first and second molecular attention converters each include an atomic characterization layer. The input of the atomic characterization layer comprises a 12-dimensional atom-type one-hot encoding, a 6-dimensional neighbor-count one-hot encoding, a 5-dimensional hydrogen-count one-hot encoding, a 1-dimensional formal charge, a 1-dimensional flag indicating whether the atom is in a ring, and a 1-dimensional flag indicating whether the atom is aromatic. These 26 dimensions are as follows (a featurization sketch follows the list):
Dimensions 0-11 are the atom-type one-hot encoding, i.e., a single-atom one-hot code covering B, N, C, O, F, P, S, Cl, Br, I, virtual nodes, and other atoms, where B, N, C, O, F, P, S, Cl, Br, and I are the chemical symbols of boron, nitrogen, carbon, oxygen, fluorine, phosphorus, sulfur, chlorine, bromine, and iodine;
Dimensions 12-17 are the neighbor-count one-hot encoding over 0-5. One-hot encoding represents a categorical value as a binary vector with a single 1; for example, the hydrogen-count one-hot encoding over 0-4 has five columns, each indicating that the core atom is linked to that many hydrogens;
Dimensions 18-22 are the hydrogen-count one-hot encoding over 0-4;
Dimension 23 is the formal charge;
Dimension 24 indicates whether the atom is in a ring, represented by 0 or 1;
Dimension 25 indicates whether the atom is aromatic, represented by 0 or 1.
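For illustration, this 26-dimensional per-atom featurization can be assembled with RDKit roughly as follows; a sketch in which the helper names, the fallback to "other", and the use of "*" for virtual nodes are our assumptions, not the patent's code:

```python
from rdkit import Chem

ATOM_TYPES = ["B", "N", "C", "O", "F", "P", "S", "Cl", "Br", "I", "*", "other"]  # "*": virtual node

def one_hot(value, choices):
    # Binary vector with a single 1 at the position of `value` (last slot is the fallback).
    vec = [0] * len(choices)
    vec[choices.index(value) if value in choices else len(choices) - 1] = 1
    return vec

def featurize_atom(atom):
    features = []
    features += one_hot(atom.GetSymbol(), ATOM_TYPES)                  # dims 0-11: atom type
    features += one_hot(min(atom.GetDegree(), 5), list(range(6)))      # dims 12-17: neighbor count
    features += one_hot(min(atom.GetTotalNumHs(), 4), list(range(5)))  # dims 18-22: hydrogen count
    features.append(atom.GetFormalCharge())                            # dim 23: formal charge
    features.append(1 if atom.IsInRing() else 0)                       # dim 24: in-ring flag
    features.append(1 if atom.GetIsAromatic() else 0)                  # dim 25: aromaticity flag
    return features  # length 26

def featurize_molecule(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return [featurize_atom(a) for a in mol.GetAtoms()]

print(featurize_molecule("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```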
The atomic characterization layer completely replaces the self-attention layer of the classical Transformer and calculates a modified attention representation from the self-attention representation, the interatomic distance matrix, and the domain matrix. Specifically, the atomic characterization layer calculates the modified attention representation based on the following attention formula:

$$\mathrm{Attention}(Q_i, K, V_i) = \left( \lambda_a \, \rho\!\left( \frac{Q_i K^{\top}}{\sqrt{d_k}} \right) + \lambda_d \, g(D) + \lambda_g \, A \right) V_i$$

where λ = (λ_a, λ_d, λ_g) is a set of hyperparameters, Q_i is the i-th matrix obtained by multiplying the input data of the Transformer model with the corresponding weight matrix, K and V_i are likewise obtained by multiplying the input data with their corresponding weight matrices, ρ denotes the softmax function, D denotes the interatomic distance matrix, A denotes the domain (adjacency) matrix, the function g denotes a softmax-based normalization, d_k is the matrix dimension of Q_i and K, and Attention, the modified attention representation, is the output of the atomic characterization layer.

Here D, A ∈ R^{N_atoms × N_atoms}, where N_atoms is the number of atoms. The function g normalizes the distance matrix, and the D matrix may be calculated using the RDKit package interface. λ is a hyperparameter vector used to control the influence of the distance and domain matrices on the overall attention. The first term of the expanded formula computes the self-attention representation: the matrix Q_i K^T relates the tokens of the raw data to each other, i.e., it measures the intensity of attention between tokens; the stronger the attention, the higher the association.
Further, the virtual node mentioned in dimensions 0-11 is a point representing an atom connected to nothing by an "edge"; its distance is set to 10^6. This setting allows the model to skip it when searching over the molecule, ignoring the effect of remote nodes on the molecule.
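A minimal NumPy sketch of this modified attention follows, under the assumption that g row-normalizes the negated distance matrix with a softmax; the λ weights and the toy shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def molecule_attention(Q, K, V, D, A, lam_a=0.5, lam_d=0.25, lam_g=0.25):
    d_k = Q.shape[-1]
    self_att = softmax(Q @ K.T / np.sqrt(d_k))  # rho(Q_i K^T / sqrt(d_k))
    dist_att = softmax(-D)                      # g(D): normalized distance matrix
    weights = lam_a * self_att + lam_d * dist_att + lam_g * A
    return weights @ V                          # output of the atomic characterization layer

n_atoms, d_model = 8, 26
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n_atoms, d_model)) for _ in range(3))
D = rng.uniform(0, 5, size=(n_atoms, n_atoms))                  # interatomic distances (e.g., via RDKit)
A = (rng.uniform(size=(n_atoms, n_atoms)) > 0.7).astype(float)  # adjacency (domain) matrix

print(molecule_attention(Q, K, V, D, A).shape)  # (8, 26)
```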
Pre-training the preset Transformer model with the plurality of second compounds to obtain the feature database includes the following. The molecular SMILES formulas of the small-molecule drugs in the ZINC data set, serving as second compounds, are input into the Transformer model employing the first and second molecular attention converters. After the SMILES input, the smiles2mol function in the open-source toolkit is called to convert the SMILES formula into an RDKit object carrying the 26-dimensional feature data, which, as described above, comprise the 12-dimensional atom-type one-hot encoding, the 6-dimensional neighbor-count one-hot encoding, the 5-dimensional hydrogen-count one-hot encoding, the 1-dimensional formal charge, the 1-dimensional in-ring flag, and the 1-dimensional aromaticity flag. The RDKit object is then processed through the computation of the Transformer model: the encoder-decoder process restores the 26-dimensional features at the pre-training output, yielding the feature data output by the Transformer model, and the model is evaluated with the cross-entropy loss. The goal of the pre-training is to minimize the cross-entropy loss function.
In the discrete case, the information entropy is H(p) = -∑ p(x) log p(x), where p is a distribution; the information entropy represents the uncertainty of the distribution p itself.
The relative entropy measures the difference between two distributions p and q and is also called the KL divergence, defined here as D_KL(p‖q) = ∑ p(x) log(p(x)/q(x)), which is greater than zero whenever p and q are not identically distributed. Thus, during pre-training, once the relative entropy between the original features and the features after the encode-decode process is small enough to meet the threshold, or the iteration count is reached, pre-training can be completed. Specifically, model evaluation with the cross-entropy loss includes the following (see the sketch after this list):
calculating the relative entropy between the RDKit object and the feature data output by the Transformer model;
and, when the relative entropy is less than or equal to a preset threshold, or when the number of pre-training iterations of the Transformer model reaches a preset number, completing the pre-training to obtain the feature database.
S103: and training a preset neural network model by utilizing the characteristic representations of the plurality of first compounds and the properties of the plurality of first compounds in the training data set to obtain a compound property prediction model.
That is, a compound property prediction model is obtained by training the preset neural network model with the encoded training data set, wherein the encoded training data set includes a plurality of sets of training data, each set comprising a feature representation of a first compound and the property of that first compound. In other words, once the feature database is obtained, feature-extraction results for small molecules can be obtained by loading it. When a specific neural network model is to be trained, for example one for solubility prediction, a small-scale data set such as FreeSolv can be used. Because the feature database contains extensive molecular feature-extraction results, the small molecules in the FreeSolv solubility data set can be encoded with the extracted features and thus carry information that the small-scale data set alone cannot provide. Therefore, only a very simple deep neural network is needed to complete the training task for a solubility prediction model.
Specifically, the following procedure may be used to train the neural network model: load the feature database; randomly divide the encoded training data set into a training subset and a testing subset at a ratio of 4:1; and finally map the output of a fully connected network comprising several hidden layers through a softmax layer to obtain a regression prediction result, or through a sigmoid function to obtain a classification result. Both regression and classification tasks use a generic DNN; only the activation function changes with the task: tanh for regression and sigmoid for classification.

In summary, according to the training method for the compound property prediction model provided by the embodiment of the invention, the plurality of first compounds in the training data set are encoded according to the preset feature database to obtain their feature representations, so the encoded training data set can be used to train the preset neural network model. Because the first compounds input during training carry feature representations, the training task of the neural network model is lightened, which solves the technical problems that deep models are difficult to train and a large number of hyperparameters must be tuned.
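A minimal sketch of this downstream training follows; the feature-database encoder is stubbed out with a placeholder, and the layer sizes, optimizer, and toy data are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, random_split

def encode_with_feature_db(smiles: str) -> torch.Tensor:
    # Placeholder for encoding a compound via the pre-trained feature database;
    # returns a deterministic fixed-size vector so the sketch runs end to end.
    g = torch.Generator().manual_seed(abs(hash(smiles)) % (2**31))
    return torch.randn(64, generator=g)

def make_head(in_dim: int, task: str) -> nn.Sequential:
    # Generic DNN head; only the output activation changes with the task:
    # tanh for regression, sigmoid for classification.
    out_act = nn.Tanh() if task == "regression" else nn.Sigmoid()
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, 64), nn.ReLU(),
                         nn.Linear(64, 1), out_act)

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"] * 20  # toy first compounds
labels = [0.1, -1.2, 0.4] * 20                     # toy property values

X = torch.stack([encode_with_feature_db(s) for s in smiles_list])
y = torch.tensor(labels).unsqueeze(1)
dataset = TensorDataset(X, y)
n_train = len(dataset) * 4 // 5                    # 4:1 train/test split
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])

model = make_head(64, task="regression")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for xb, yb in DataLoader(train_set, batch_size=16, shuffle=True):
    optimizer.zero_grad()
    loss_fn(model(xb), yb).backward()
    optimizer.step()
```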
In addition, physicochemical property data for existing small-molecule drugs are relatively scarce. Training a deep neural network directly, as in the prior art, requires a large amount of small-molecule physicochemical property data, so the data volume of existing physicochemical properties cannot meet the requirements of direct neural network training.
In the embodiment of the present invention, the training of the compound property prediction model is divided into two parts: training the feature database and training the neural network model. Because training the feature database only requires extracting molecular features and needs no physicochemical property data of small-molecule drugs, the small-molecule drug database can be used, satisfying the data-volume requirement for training the feature database. Meanwhile, because the first compounds input during neural network training carry feature representations, the neural network model requires far less training data.
Based on the training method of the compound property prediction model, the embodiment of the invention also provides a compound property prediction method. FIG. 3 is a schematic flow chart of a method for predicting properties of a compound according to an embodiment of the present invention, as shown in FIG. 3, the method for predicting properties of a compound according to an embodiment of the present invention includes the following steps:
s201: and obtaining the structural formula of the compound to be predicted.
Specifically, the structural formula of the compound to be predicted may include the SMILES formula, the compound name, and the 2D molecular diagram of the compound to be predicted.
S202: inputting the structural formula into a compound property prediction model obtained by the training method of the compound property prediction model to obtain the property of the compound to be predicted.
Here, the property of the compound to be predicted refers to a physicochemical property of the compound, such as toxicological data, water-solubility data, side effects, or adverse drug reaction data.
Further, the method for predicting compound properties further comprises: obtaining the property to be predicted of the compound to be predicted. Accordingly, inputting the structural formula into the compound property prediction model obtained by the above training method may proceed as follows: searching, among the compound property prediction models obtained by the training method, for the prediction model corresponding to the property to be predicted, and inputting the SMILES formula into the found prediction model.
That is, once the feature database has been trained, different compound property prediction models can be trained for different properties of compounds. For example, the ESOL data can be used to train a compound water-solubility prediction model, and the TOX21 data can be used to train a compound toxicology prediction model. When the property to be predicted is toxicity, the compound toxicity prediction model must be found among the plurality of compound property prediction models, and the SMILES formula of the compound to be predicted is input into the found toxicity prediction model.
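Continuing the previous sketch (reusing its hypothetical make_head and encode_with_feature_db helpers), looking up the prediction model that matches the property to be predicted might look like this:

```python
# Illustrative registry of per-property prediction models; in practice each
# entry would be a model trained as in the previous sketch.
water_solubility_model = make_head(64, task="regression")  # e.g., trained on ESOL
toxicity_model = make_head(64, task="classification")      # e.g., trained on TOX21
property_models = {
    "water_solubility": water_solubility_model,
    "toxicity": toxicity_model,
}

def predict_property(smiles: str, prop: str) -> float:
    model = property_models[prop]       # find the model matching the predicted property
    x = encode_with_feature_db(smiles)  # encode with the feature database
    return model(x.unsqueeze(0)).item()

print(predict_property("CC(=O)Oc1ccccc1C(=O)O", "toxicity"))  # aspirin
```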
Corresponding to the training method of the compound property prediction model, the embodiment of the invention also provides a training device of the compound property prediction model. Fig. 4 is a schematic structural diagram of a training device for a compound property prediction model according to an embodiment of the present invention, and as shown in fig. 4, the training device for a compound property prediction model includes a first obtaining module 41, a processing module 42, and a training module 43.
Specifically, the first obtaining module 41 is configured to obtain a first data set, where the first data set includes multiple sets of training data, and each set of training data includes a first compound and a property of the first compound;
a processing module 42, configured to encode the plurality of first compounds in the training data set according to a preset feature database to obtain feature representations of the plurality of first compounds in the training data set;
the training module 43 is configured to train a preset neural network model by using a coded training data set to obtain a compound property prediction model, where the coded training data set includes multiple sets of training data, and each set of training data includes a feature representation of a first compound and a property of the first compound.
The specific details of the device for training the model for predicting the properties of the compound may be understood correspondingly with reference to the corresponding relevant descriptions and effects in the embodiments shown in fig. 1 to 3, which are not repeated here.
Corresponding to the method for predicting the properties of the compound, the embodiment of the invention also provides a device for predicting the properties of the compound. Fig. 5 is a schematic structural diagram of a device for predicting properties of a compound according to an embodiment of the present invention, and as shown in fig. 5, the device for predicting properties of a compound according to an embodiment of the present invention includes a second obtaining module 51 and a predicting module 52.
A second obtaining module 51, configured to obtain a structural formula of the compound to be predicted;
The prediction module 52 is configured to input the structural formula into the compound property prediction model obtained by the training method of the compound property prediction model, so as to obtain the property of the compound to be predicted.
Specific details of the device for predicting properties of compounds may be understood with reference to the corresponding relevant descriptions and effects in the embodiments shown in fig. 1 to 3, and will not be repeated here.
The embodiment of the present invention also provides an electronic device, as shown in fig. 6, which may include a processor 61 and a memory 62, where the processor 61 and the memory 62 may be connected by a bus or other means.
The processor 61 may be a central processing unit (Central Processing Unit, CPU). Processor 61 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or a combination of the above.
The memory 62, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the training method of the compound property prediction model in the embodiments of the present invention (e.g., the first acquisition module 41, the processing module 42, and the training module 43 shown in FIG. 4) or the program instructions/modules corresponding to the method for predicting compound properties (e.g., the second acquisition module 51 and the prediction module 52 shown in FIG. 5). The processor 61 executes the non-transitory software programs, instructions, and modules stored in the memory 62 to perform various functional applications and data processing, i.e., to implement the training method of the compound property prediction model or the method for predicting compound properties in the above method embodiments.
Memory 62 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created by the processor 61, etc. In addition, the memory 62 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 62 may optionally include memory located remotely from processor 61, which may be connected to processor 61 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 62 that, when executed by the processor 61, perform the training method of the compound property prediction model in the embodiment shown in fig. 1-2 or the prediction method of the compound property in the embodiment shown in fig. 3.
As shown in FIG. 7, to implement the training method of the compound property prediction model and the method for predicting compound properties, the electronic device includes the following modules:
(1) ZINC data set storage module: this module comprises a database and a ZINC data set loading script. Because of the large size of the ZINC data set, the module's data is stored in a non-writable state. The required ZINC data set, or subsets of it, can be loaded selectively through the script for personalized pre-training; where hardware permits, the whole data set can also be loaded in batches for large-scale pre-training and hyperparameter tuning.
(2) Molecular attention converter pre-training module: the main function of this module is to pre-train on the ZINC data loaded into the system so as to extract features. By loading ZINC data sets of different sizes, pre-training can be performed under different conditions as needed, meeting actual industrial requirements.
(3) Public data set storage module: this module contains the 11 public data sets mentioned earlier. These data sets are not large-scale and are therefore stored read-only in file form within the module. The module also provides a script through which different data sets can be loaded as needed for downstream task-specific training.
(4) Pre-trained model management module: the system can perform pre-training on ZINC data sets of different scales, and this module stores the pre-trained models of different data scales for loading by downstream tasks.
(5) Compound property prediction model management module: this module stores the different downstream task models, which can be loaded singly or in combination for later prediction work. If a new prediction model needs to be trained with another small-scale data set, the system accepts user-written scripts to train the new downstream task.
(6) Prediction output module: according to the loaded downstream task model and the molecular SMILES formula to be predicted, this module outputs the physicochemical property prediction result, stored in a JSON file. The module provides JSON file download and can serve as a back-end service interface for the web front end.
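Continuing the same sketches, the JSON result produced by this module might look as follows; the field names and file name are illustrative assumptions:

```python
import json

result = {
    "smiles": "CC(=O)Oc1ccccc1C(=O)O",
    "property": "water_solubility",
    "prediction": predict_property("CC(=O)Oc1ccccc1C(=O)O", "water_solubility"),
}
with open("prediction.json", "w") as f:
    json.dump(result, f, indent=2)  # downloadable JSON for the web front end
```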
It will be appreciated by those skilled in the art that all or part of the flows of the above method embodiments may be implemented by a computer program instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the flows of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD), or a Solid-State Drive (SSD); the storage medium may also comprise a combination of the above types of memory.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations are within the scope of the invention as defined by the appended claims.

Claims (10)

1. A method for training a compound property prediction model, comprising:
obtaining a first dataset comprising a plurality of sets of training data, each set of training data comprising a first compound and a property of the first compound;
encoding the plurality of first compounds in the training data set according to a preset feature database to obtain feature representations of the plurality of first compounds in the training data set;
training a preset neural network model with the encoded training data set to obtain a compound property prediction model, wherein the encoded training data set comprises a plurality of sets of training data, each set comprising a feature representation of the first compound and the property of the first compound.
2. The training method of claim 1, further comprising, before encoding the plurality of first compounds in the training data set according to the preset feature database to obtain the feature representations of the plurality of first compounds in the training data set:
acquiring a small-molecule drug database as a second data set, wherein the second data set comprises the molecular SMILES formulas of a plurality of small-molecule drugs serving as second compounds;
and extracting features of the plurality of second compounds to obtain the feature database, wherein the feature database comprises a plurality of molecular features.
3. The training method of claim 2, wherein extracting features of the plurality of second compounds to obtain the feature database comprises:
pre-training a preset Transformer model with the plurality of second compounds to obtain the feature database.
4. The training method of claim 3, wherein the Transformer model comprises a first molecular attention converter, a first addition-normalization layer, a second molecular attention converter, a second addition-normalization layer, a feed-forward network layer, a third addition-normalization layer, a linear connection layer, and a normalization layer, connected in sequence.
5. The training method of claim 4, wherein the first molecular attention converter and the second molecular attention converter each include an atomic characterization layer that calculates a modified attention representation based on the following attention formula:

$$\mathrm{Attention}(Q_i, K, V_i) = \left( \lambda_a \, \rho\!\left( \frac{Q_i K^{\top}}{\sqrt{d_k}} \right) + \lambda_d \, g(D) + \lambda_g \, A \right) V_i$$

wherein λ = (λ_a, λ_d, λ_g) is a set of hyperparameters; Q_i is the i-th matrix obtained by multiplying the input data of the Transformer model with the corresponding weight matrix; K and V_i are likewise matrices obtained by multiplying the input data of the Transformer model with their corresponding weight matrices; ρ denotes the softmax function; D denotes the interatomic distance matrix; A denotes the domain matrix; the function g denotes a softmax-based normalization of the distance matrix; d_k is the matrix dimension of Q_i and K; and Attention, the modified attention representation, is the output of the atomic characterization layer.
6. A method for predicting a property of a compound, comprising:
obtaining a structural formula of a compound to be predicted;
inputting the structural formula into a compound property prediction model obtained by the training method of the compound property prediction model according to any one of claims 1-5, so as to obtain the property of the compound to be predicted.
7. The prediction method according to claim 6, characterized by further comprising, before inputting the structural formula into a compound property prediction model obtained by the training method of the compound property prediction model according to any one of claims 1 to 5:
obtaining the predicted property of the compound to be predicted;
inputting the structural formula into a compound property prediction model obtained by the training method of the compound property prediction model according to any one of claims 1 to 5 comprises:
searching a prediction model corresponding to the predicted property in a compound property prediction model obtained by using the training method of the compound property prediction model of any one of claims 1-5;
and inputting the structural formula into the searched prediction model.
8. A training device for a compound property prediction model, comprising:
a first acquisition module for acquiring a first data set comprising a plurality of sets of training data, each set of training data comprising a first compound and a property of the first compound;
the processing module, configured to encode the plurality of first compounds in the training data set according to a preset feature database to obtain feature representations of the plurality of first compounds in the training data set;
and the training module, configured to train a preset neural network model with the encoded training data set to obtain a compound property prediction model, wherein the encoded training data set comprises a plurality of sets of training data, each set comprising a feature representation of a first compound and the property of the first compound.
9. A device for predicting properties of a compound, comprising:
the second acquisition module is used for acquiring the structural formula of the compound to be predicted;
the prediction module is configured to input the structural formula into a compound property prediction model obtained by using the training method of the compound property prediction model according to any one of claims 1 to 5, so as to obtain the property of the compound to be predicted.
10. An electronic device, comprising:
the device comprises a memory and a processor, wherein the memory and the processor are in communication connection, the memory stores computer instructions, and the processor executes the computer instructions to execute the training method of the compound property prediction model according to any one of claims 1-5 or the prediction method of the compound property according to any one of claims 6-7.
CN202310877635.9A 2023-07-18 2023-07-18 Training method for compound property prediction model and prediction method for compound property Active CN116612835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310877635.9A CN116612835B (en) 2023-07-18 2023-07-18 Training method for compound property prediction model and prediction method for compound property

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310877635.9A CN116612835B (en) 2023-07-18 2023-07-18 Training method for compound property prediction model and prediction method for compound property

Publications (2)

Publication Number Publication Date
CN116612835A (en) 2023-08-18
CN116612835B (en) 2023-10-10

Family

ID=87676742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310877635.9A Active CN116612835B (en) 2023-07-18 2023-07-18 Training method for compound property prediction model and prediction method for compound property

Country Status (1)

Country Link
CN (1) CN116612835B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200365270A1 (en) * 2019-05-15 2020-11-19 International Business Machines Corporation Drug efficacy prediction for treatment of genetic disease
CN114141317A (en) * 2021-12-07 2022-03-04 北京百度网讯科技有限公司 Compound property prediction model training method, device, equipment and storage medium
US20220188652A1 (en) * 2020-12-16 2022-06-16 Ro5 Inc. System and method for de novo drug discovery
CN115472239A (en) * 2022-08-15 2022-12-13 腾讯科技(深圳)有限公司 Activity prediction model training method, compound activity prediction method and device
CN115713965A (en) * 2022-10-28 2023-02-24 兰州大学 Computing method for predicting compound-protein affinity based on GECo model
CN115762659A (en) * 2022-10-19 2023-03-07 清华大学 Molecular pre-training representation method and system for fusing SMILES sequence and molecular diagram
CN116013428A (en) * 2023-02-10 2023-04-25 中南大学 Drug target general prediction method, device and medium based on self-supervision learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Łukasz Maziarka et al., "Molecule Attention Transformer", arXiv:2002.08264v1, pages 1-16 *
万晓喆 (Wan Xiaozhe), "Research on compound-protein interactions based on graph neural networks", China Doctoral Dissertations Full-text Database, Medicine & Health Sciences / Information Science & Technology, pages 1-100 *

Also Published As

Publication number Publication date
CN116612835B (en) 2023-10-10

Similar Documents

Publication Publication Date Title
Zubatyuk et al. Accurate and transferable multitask prediction of chemical properties with an atoms-in-molecules neural network
Pham et al. Deepcare: A deep dynamic memory model for predictive medicine
CN111191002B (en) Neural code searching method and device based on hierarchical embedding
US11030275B2 (en) Modelling ordinary differential equations using a variational auto encoder
Karim et al. Toxicity prediction by multimodal deep learning
Wiegrebe et al. Deep learning for survival analysis: a review
CN114464270A (en) Universal method for designing medicines aiming at different target proteins
Lin et al. PanGu Drug Model: learn a molecule like a human
Melko et al. Language models for quantum simulation
Tandale et al. Recurrent and convolutional neural networks in structural dynamics: a modified attention steered encoder–decoder architecture versus LSTM versus GRU versus TCN topologies to predict the response of shock wave-loaded plates
CN116612835B (en) Training method for compound property prediction model and prediction method for compound property
CN112380326A (en) Question answer extraction method based on multilayer perception and electronic device
CN116453617A (en) Multi-target optimization molecule generation method and system combining active learning
CN115206421B (en) Drug repositioning method, and repositioning model training method and device
Saini et al. A novel model based on sequential adaptive memory for English–Hindi translation
Rorabaugh et al. High frequency accuracy and loss data of random neural networks trained on image datasets
CN115952266A (en) Question generation method and device, computer equipment and storage medium
WO2023107207A1 (en) Automated notebook completion using sequence-to-sequence transformer
CN114464267A (en) Method and device for model training and product prediction
CN113496119B (en) Method, electronic device and computer readable medium for extracting metadata in table
Cho et al. Post-stroke discharge disposition prediction using deep learning
Phan et al. Deep learning based biomedical NER framework
Phan et al. Biomedical named entity recognition based on hybrid multistage CNN-RNN learner
Rao Implementation of Long Short-Term Memory Neural Networks in High-Level Synthesis Targeting FPGAs
Anireh et al. HTM-MAT: An online prediction software toolbox based on cortical machine learning algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231101

Address after: Embedded R&D Building 220, Advanced Technology Research Institute, University of Science and Technology of China, No. 5089 Wangjiang West Road, High tech Zone, Hefei City, Anhui Province, 230000

Patentee after: Hefei Micro Era Digital Technology Co.,Ltd.

Address before: 218, Embedded R&D Building, Advanced Technology Research Institute, University of Science and Technology of China, No. 5089, Wangjiang West Road, High tech Zone, Hefei City, Anhui Province, 230000

Patentee before: Micro Era (Hefei) Quantum Technology Co.,Ltd.
