CN116110504B

CN116110504B - Molecular property prediction method and system based on semi-supervised variation self-encoder

Info

Publication number: CN116110504B
Application number: CN202310384467.XA
Authority: CN
Inventors: 李中伟; 傅燕嵩; 却立勇
Original assignee: Yantai Guogong Intelligent Technology Co ltd
Current assignee: Yantai Guogong Intelligent Technology Co ltd
Priority date: 2023-04-12
Filing date: 2023-04-12
Publication date: 2023-06-23
Anticipated expiration: 2043-04-12
Also published as: CN116110504A

Abstract

The invention discloses a molecular property prediction method and a system based on a semi-supervised variation self-encoder, which belong to the technical field of artificial intelligence, and the technical problem to be solved by the invention is how to train a VAE model by using a small amount of labeled sample data, and the molecular property prediction precision is improved, and the technical scheme is as follows: generating a label-free molecular dataset; constructing a molecular property prediction model based on a variation self-encoder; constructing two predictor network models with the same network structure, wherein the predictor network models take continuous hidden molecular characterization vectors z_mean of unlabeled molecular samples output by the encoder network models as input, respectively train the two predictor network models with the same network structure through the labeled molecular samples, and take the average value of output results of the two predictor network models with the same network structure as a final prediction result during prediction; the design variation is derived from the encoder loss function.

Description

Molecular property prediction method and system based on semi-supervised variation self-encoder

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a molecular property prediction method and a molecular property prediction system based on a semi-supervised variation self-encoder.

Background

VAE is a probabilistic model (Probabilistic Model) based on variational inference (Variational Inference, variational Bayesian methods) that belongs to the generative model (and of course also the unsupervised model). The common VAE model is used for molecular property prediction and requires a large number of labeled molecules to train, however, the catalyst molecular property data needs to be obtained through experiments, the cost is extremely high, the labeled sample data size is limited, and the common VAE model is difficult to accurately predict the molecular property.

Therefore, how to train a VAE model by using a small amount of labeled sample data and improve the accuracy of molecular property prediction are technical problems to be solved urgently.

Disclosure of Invention

The technical task of the invention is to provide a molecular property prediction method and a system based on a semi-supervised variation self-encoder, so as to solve the problem of how to train a VAE model by using a small amount of labeled sample data and improve the molecular property prediction precision.

The technical task of the invention is realized in the following way, namely a molecular property prediction method based on a semi-supervised variation self-encoder, which comprises the following steps:

generating a label-free molecular dataset;

constructing a molecular property prediction model based on a variation self-encoder; the method comprises the following steps:

inputting the unlabeled molecular samples in the unlabeled molecular data set into an encoder network model in the form of 120 multiplied by 19 vectors, and obtaining continuous hidden molecular characterization vectors z_mean and middle corresponding to the unlabeled molecular samples through the encoder network model;

the continuous hidden molecule characterization vector z_mean and the middle of the unlabeled molecule sample are subjected to a variable sampling layer to obtain a 196-dimensional vector z_samp and a 196 multiplied by 2-dimensional vector z_mean_log_var, and then the vector z_samp is input into a decoder network model for processing;

constructing two predictor network models with the same network structure, wherein the predictor network models take continuous hidden molecular characterization vectors z_mean of unlabeled molecular samples output by the encoder network models as input, respectively train the two predictor network models with the same network structure through the labeled molecular samples, and take the average value of output results of the two predictor network models with the same network structure as a final prediction result during prediction;

the design variation is derived from the encoder loss function.

Preferably, the generation of the unlabeled molecular dataset is specifically as follows:

collecting and obtaining a labeled molecular data set consisting of fields formed by SMILES character string coding modes and molecular property fields; wherein molecular properties include activity, selectivity, and solids content;

obtaining a molecular fragment library according to the tagged molecular data set: generating molecular fragments by using a BRICSDecompose function in open source chemical information software RDkit;

and (3) obtaining the unlabeled molecular data set consisting of fields formed by the SMILES character string coding mode by a molecular fragment library splicing mode.

More preferably, the label-free molecular data set formed by the fields formed by the SMILES character string coding mode is obtained by a molecular fragment library splicing mode specifically as follows:

the method comprises the steps of performing pairwise splicing on fragments in a molecular fragment library by using a displacesubstructurts function to obtain a large number of molecular functional groups;

and respectively combining the molecular functional groups with the target approximate molecular structure at a preset position through a replaysubstructures function to form complete molecules, so as to obtain a label-free molecular data set.

Preferably, the encoder network model comprises an input layer, three one-dimensional convolution layers, four BatchNorm layers, a fully connected hidden layer, and an output layer;

the dimension of the input layer of the encoder network model is 120 multiplied by 19, and 120 is the maximum number of characters of the designated molecular SMILES; 19 is the number of non-repeated characters in all molecular SMILES character strings in the tagged molecular dataset; the output layer of the encoder network model has two heads, vectors z_mean and middle, respectively, which are vectors of dimension 196.

Preferably, the decoder network model includes an input layer, a fully connected hidden layer, a BatchNorm layer, three GRU layers, and an output layer;

wherein the input layer of the decoder network model has 196 neurons for receiving the vector z_samp; the output layer of the decoder network model has 19 neurons and the activation function is softmax.

Preferably, the predictor network model comprises an input layer, three fully connected hidden layers, two BatchNorm layers, three Dropout layers and an output layer;

the input layer of the predictor network model has 196 neurons for receiving the vector z_mean output by the encoder network model; the output layer of the predictor network model has 1 neuron, and the activation function is linear.

Preferably, the variation self-encoder loss function includes a cross entropy loss of the decoder network model, a KL divergence loss of the variation self-encoder, an MSE loss of two predictor network models having the same network structure, and a weight quadrature loss of two predictor network models having the same network structure;

wherein KL divergence loss is used to train the hidden feature distribution of the variation self-encoder; the KL divergence loss function is specifically as follows:

wherein (1)>

,/>

； p ₁ Representing a hidden feature distribution of the variation self-encoder;p ₂ representing a target distribution of the variation from the encoder; />

Is the standard deviation; />

Is the mean value;N ₁ 、N ₂ all represent normal distributions;

the MSE loss of two predictor network models with the same network structure is the sum of the MSE loss between the output values of the predictor network models with the same network structure and the MSE loss between the output values of the labeled molecular samples of the respective predictor network models with the same network structure and the corresponding actual values; the formula is as follows:

MSE _{total (S)} =MSE（y ₁ ，y ₂ ）+MSE（y _true ，y ₁ ）+ MSE（y _true ，y ₂ ）；

Wherein y is ₁ An output value representing a labeled molecular sample of one of the predictor network models; y is ₂ An output value representing a labeled molecular sample of another predictor network model; y is _true Representing corresponding actual values of labeled molecular samples of two predictor network models having the same network structure;

weight orthogonal loss for two predictor network models with identical network structure

The method comprises the following steps:

wherein C is the dimension of the output layer weight, < ->

And->

The j-th dimension weights of the two predictor output layers are respectively obtained; t denotes the vector transpose.

A molecular property prediction system based on a semi-supervised variation self-encoder, which comprises a generator, an encoder, a decoder, a constructor and a designer;

the generator is used for generating a label-free molecular data set;

the encoder is used for obtaining continuous hidden molecule characterization vectors z_mean and middle corresponding to the unlabeled molecule samples in a vector form of 120 multiplied by 19 by utilizing the unlabeled molecule samples in the unlabeled molecule data set;

the decoder is used for obtaining a 196-dimensional vector z_samp and a 196 x 2-dimensional vector z_mean_log_var by utilizing continuous hidden molecule representation vectors z_mean and middle of unlabeled molecular samples through a variable sampling layer, and then processing the vector z_samp;

the constructor is used for constructing two predictor network models with the same network structure, the predictor network models take continuous hidden molecular characterization vectors z_mean of unlabeled molecular samples output by the encoder as input, the two predictor network models with the same network structure are respectively trained through the labeled molecular samples, and when in prediction, the average value of output results of the two predictor network models with the same network structure is taken as a final prediction result;

the designer is configured to design a variance self-encoder loss function.

An electronic device, comprising: a memory and at least one processor;

wherein the memory has a computer program stored thereon;

the at least one processor executes the computer program stored by the memory, causing the at least one processor to perform the semi-supervised variation self encoder based molecular property prediction method as described above.

A computer readable storage medium having stored therein a computer program executable by a processor to implement a semi-supervised variation self encoder based molecular property prediction method as described above.

The molecular property prediction method and system based on the semi-supervised variation self-encoder have the following advantages: according to the invention, the VAE model is trained by using a large amount of unlabeled sample data and a small amount of labeled sample data by using a semi-supervision idea, so that the molecular property prediction accuracy is improved and the application field of the molecular property prediction model is enlarged.

Drawings

The invention is further described below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of a molecular property prediction method based on a semi-supervised variation self-encoder.

Detailed Description

The molecular property prediction method and system based on the semi-supervised variation self-encoder of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments of the present invention.

Example 1:

as shown in fig. 1, the embodiment provides a molecular property prediction method based on a semi-supervised variation self-encoder, which specifically comprises the following steps:

s1, generating a label-free molecular data set;

s2, constructing a molecular property prediction model based on a variation self-encoder; the method comprises the following steps:

s201, inputting unlabeled molecular samples in an unlabeled molecular data set into an encoder network model in a vector form of 120 multiplied by 19, and obtaining continuous hidden molecular characterization vectors z_mean and middle corresponding to the unlabeled molecular samples through the encoder network model;

s202, carrying out variable sampling on a continuous hidden molecular characterization vector z_mean and a sample of an unlabeled molecular sample to obtain a 196-dimensional vector z_samp and a 196×2-dimensional vector z_mean_log_var, and inputting the vector z_samp into a decoder network model for processing;

s203, constructing two predictor network models with the same network structure, wherein the predictor network models take continuous hidden molecular characterization vectors z_mean of unlabeled molecular samples output by the encoder network models as input, respectively train the two predictor network models with the same network structure through the labeled molecular samples, and take the average value of output results of the two predictor network models with the same network structure as a final prediction result during prediction;

s204, designing a variation self-encoder loss function;

the generation of the unlabeled molecular data set in step S1 of this embodiment is specifically as follows:

s101, collecting and obtaining a labeled molecular data set consisting of a field formed by a SMILES character string coding mode and a molecular property field; wherein molecular properties include activity, selectivity, and solids content;

s102, obtaining a molecular fragment library according to the tagged molecular data set: generating molecular fragments by using a BRICSDecompose function in open source chemical information software RDkit;

s103, obtaining a label-free molecular data set composed of fields formed by a SMILES character string coding mode in a molecular fragment library splicing mode;

in step S103 of this embodiment, the label-free molecular data set composed of the fields formed by the SMILES string encoding method is obtained by means of the molecular fragment library splicing method specifically as follows:

s10301, splicing fragments in the molecular fragment library in pairs by using a displacesubstructures function to obtain a large number of molecular functional groups;

s10302, respectively combining the molecular functional groups and the target approximate molecular structure at preset positions to form complete molecules through a displacesubstructures function, and obtaining a label-free molecular data set.

The encoder network model in step S201 of this embodiment includes an input layer, three one-dimensional convolution layers, four batch norm layers, a fully-connected hidden layer, and an output layer;

The decoder network model in step S202 of the present embodiment includes an input layer, a full connection hidden layer, a batch norm layer, three GRU layers, and an output layer;

The predictor network model in step S203 of the present embodiment includes an input layer, three fully connected hidden layers, two Batchnorm layers, three Dropout layers, and an output layer;

The variation self-encoder loss function in step S204 of this embodiment includes cross entropy loss of the decoder network model, KL divergence loss of the variation self-encoder, MSE loss of two predictor network models with the same network structure, and weight quadrature loss of two predictor network models with the same network structure;

wherein (1)>

,/>

Is the standard deviation; />

Is the mean value;N ₁ 、N ₂ all represent normal distributions;

The method comprises the following steps:

wherein C is the dimension of the output layer weight, < ->

And->

The labeled molecular dataset collected in this example had 200 samples from which 20% of the samples were randomly selected as the test set, and the predicted effect of the semi-supervised VAE versus the normal VAE on the three molecular properties of the catalyst was compared, as shown in the following table:

。

example 2:

the embodiment provides a molecular property prediction system based on a semi-supervised variation self-encoder, which comprises a generator, an encoder, a decoder, a constructor and a designer;

the generator is used for generating a label-free molecular data set;

the designer is configured to design a variance self-encoder loss function.

The working process of the generator in this embodiment is specifically as follows:

(1) Collecting and obtaining a labeled molecular data set consisting of a field formed by the SMILES character string coding mode and a molecular property field; wherein molecular properties include activity, selectivity, and solids content;

(2) Obtaining a molecular fragment library according to the tagged molecular data set: generating molecular fragments by using a BRICSDecompose function in open source chemical information software RDkit;

(3) Obtaining a label-free molecular data set composed of fields formed by SMILES character string coding modes in a molecular fragment library splicing mode; the method comprises the following steps:

(1) the fragments in the molecular fragment library are spliced pairwise by using a displacesubstructurts function to obtain a large number of molecular functional groups;

(2) combining the molecular functional groups with the target approximate molecular structure at preset positions to form complete molecules through a displacesubstructures function, so as to obtain a label-free molecular data set.

The encoder in this embodiment includes an input layer, three one-dimensional convolutional layers, four BatchNorm layers, a fully-connected hidden layer, and an output layer;

The decoder in this embodiment includes an input layer, a fully connected hidden layer, a BatchNorm layer, three GRU layers, and an output layer;

The predictor network model in this embodiment includes an input layer, three fully connected hidden layers, two BatchNorm layers, three Dropout layers, and an output layer;

The variation self-encoder loss function in this embodiment includes cross entropy loss of the decoder network model, KL divergence loss of the variation self-encoder, MSE loss of two predictor network models with the same network structure, and weight quadrature loss of two predictor network models with the same network structure;

wherein (1)>

,/>

；p ₁ Representing a hidden feature distribution of the variation self-encoder;p ₂ representing a target distribution of the variation from the encoder; />

Is the standard deviation; />

Is all that isA value;N ₁ 、N ₂ all represent normal distributions;

The method comprises the following steps:

wherein C is the dimension of the output layer weight, < ->

And->

Example 3:

the embodiment also provides an electronic device, including: a memory and a processor;

wherein the memory stores computer-executable instructions;

the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the semi-supervised variation self encoder-based molecular property prediction method of any embodiment of the present invention.

The processor may be a Central Processing Unit (CPU), but may also be other general purpose processors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), off-the-shelf programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory may be used to store computer programs and/or modules, and the processor implements various functions of the electronic device by running or executing the computer programs and/or modules stored in the memory, and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the terminal, etc. The memory may also include high-speed random access memory, but may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, memory card only (SMC), secure Digital (SD) card, flash memory card, at least one disk storage period, flash memory device, or other volatile solid state memory device.

Example 4:

the present embodiment also provides a computer readable storage medium having stored therein a plurality of instructions, the instructions being loaded by a processor, to cause the processor to perform the method for predicting molecular properties based on a semi-supervised variation self encoder in any of the embodiments of the present invention. Specifically, a system or apparatus provided with a storage medium on which a software program code realizing the functions of any of the above embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus may be caused to read out and execute the program code stored in the storage medium.

In this case, the program code itself read from the storage medium may realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code form part of the present invention.

Examples of storage media for providing program code include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RYM, DVD-RWs, DVD+RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer by a communication network.

Further, it should be apparent that the functions of any of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform part or all of the actual operations based on the instructions of the program code.

Further, it is understood that the program code read out by the storage medium is written into a memory provided in an expansion board inserted into a computer or into a memory provided in an expansion unit connected to the computer, and then a CPU or the like mounted on the expansion board or the expansion unit is caused to perform part and all of actual operations based on instructions of the program code, thereby realizing the functions of any of the above embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A molecular property prediction method based on a semi-supervised variation self-encoder is characterized by comprising the following steps:

generating a label-free molecular dataset; the method comprises the following steps:

obtaining a label-free molecular data set consisting of fields formed by SMILES character string coding modes in a molecular fragment library splicing mode;

designing a variation self-encoder loss function;

the variation self-encoder loss function comprises cross entropy loss of a decoder network model, KL divergence loss of the variation self-encoder, MSE loss of two predictor network models with the same network structure and weight orthogonal loss of the two predictor network models with the same network structure;

wherein (1)>

,/>

Is the standard deviation; />

Is the mean value;N ₁ 、N ₂ all represent normal distributions;

The method comprises the following steps:

wherein C is the dimension of the output layer weight, < ->

And->

2. The molecular property prediction method based on the semi-supervised variation self-encoder according to claim 1, wherein the unlabeled molecular data set consisting of fields formed by the SMILES string coding method is obtained by a molecular fragment library splicing method specifically as follows:

the method comprises the steps of performing pairwise splicing on fragments in a molecular fragment library by using a displacesubstructurts function to obtain molecular functional groups;

3. The method for predicting molecular properties based on a semi-supervised variation self encoder as recited in claim 1, wherein the encoder network model includes an input layer, three one-dimensional convolution layers, four BatchNorm layers, a fully connected hidden layer, and an output layer;

4. The method of claim 1, wherein the decoder network model comprises an input layer, a fully connected hidden layer, a BatchNorm layer, three GRU layers, and an output layer;

5. The method for predicting molecular properties based on a semi-supervised variation self encoder of claim 1, wherein the predictor network model comprises an input layer, three fully connected hidden layers, two BatchNorm layers, three Dropout layers, and an output layer;

6. A molecular property prediction system based on a semi-supervised variation self-encoder, which is characterized by comprising a generator, an encoder, a decoder, a constructor and a designer;

the generator is used for generating a label-free molecular data set; wherein generating a label-free molecular dataset is specifically as follows:

the designer is used for designing a variation self-encoder loss function;

wherein (1)>

,/>

Is the standard deviation；/>

Is the mean value;N ₁ 、N ₂ all represent normal distributions;

The method comprises the following steps:

wherein C is the dimension of the output layer weight, < ->

And->

7. An electronic device, comprising: a memory and at least one processor;

wherein the memory has a computer program stored thereon;

the at least one processor executing the computer program stored by the memory, causes the at least one processor to perform the semi-supervised variation self encoder based molecular property prediction method of any of claims 1 to 5.

8. A computer readable storage medium having stored therein a computer program executable by a processor to implement the semi-supervised variation self encoder based molecular property prediction method of any of claims 1 to 5.