CN116110504B - Molecular property prediction method and system based on semi-supervised variational autoencoder - Google Patents

Molecular property prediction method and system based on semi-supervised variational autoencoder

Info

Publication number
CN116110504B
CN116110504B (application CN202310384467.XA)
Authority
CN
China
Prior art keywords
molecular
encoder
predictor
loss
network models
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310384467.XA
Other languages
Chinese (zh)
Other versions
CN116110504A (en)
Inventor
Li Zhongwei (李中伟)
Fu Yansong (傅燕嵩)
Que Liyong (却立勇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yantai Guogong Intelligent Technology Co ltd
Original Assignee
Yantai Guogong Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yantai Guogong Intelligent Technology Co ltd filed Critical Yantai Guogong Intelligent Technology Co ltd
Priority to CN202310384467.XA priority Critical patent/CN116110504B/en
Publication of CN116110504A publication Critical patent/CN116110504A/en
Application granted granted Critical
Publication of CN116110504B publication Critical patent/CN116110504B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00 Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a molecular property prediction method and system based on a semi-supervised variational autoencoder, belonging to the technical field of artificial intelligence. The technical problem to be solved is how to train a VAE model with only a small amount of labeled sample data while improving molecular property prediction accuracy. The technical scheme is as follows: generate an unlabeled molecular dataset; construct a molecular property prediction model based on a variational autoencoder; construct two predictor network models with the same network structure, where each predictor takes the continuous hidden molecular characterization vector z_mean of the unlabeled molecular samples output by the encoder network model as input; train the two predictor network models separately on the labeled molecular samples, and at prediction time take the mean of their output results as the final prediction result; and design the variational autoencoder loss function.

Description

Molecular property prediction method and system based on semi-supervised variational autoencoder
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a molecular property prediction method and a molecular property prediction system based on a semi-supervised variational autoencoder.
Background
A variational autoencoder (VAE) is a probabilistic model based on variational inference (variational Bayesian methods) and belongs to the family of generative (and hence unsupervised) models. When an ordinary VAE model is used for molecular property prediction, it requires a large number of labeled molecules for training. However, catalyst molecular property data must be obtained experimentally at very high cost, so the amount of labeled sample data is limited, and an ordinary VAE model therefore struggles to predict molecular properties accurately.
Therefore, how to train a VAE model using only a small amount of labeled sample data while improving molecular property prediction accuracy is a technical problem in urgent need of a solution.
Disclosure of Invention
The technical task of the invention is to provide a molecular property prediction method and system based on a semi-supervised variational autoencoder, so as to solve the problem of how to train a VAE model with a small amount of labeled sample data while improving molecular property prediction accuracy.
This technical task is achieved as follows. A molecular property prediction method based on a semi-supervised variational autoencoder comprises the following steps:
generating an unlabeled molecular dataset;
constructing a molecular property prediction model based on a variational autoencoder; specifically:
inputting the unlabeled molecular samples in the unlabeled molecular dataset into an encoder network model as 120 × 19 vectors, and obtaining, through the encoder network model, the continuous hidden molecular characterization vectors z_mean and middle corresponding to the unlabeled molecular samples;
passing the continuous hidden molecular characterization vectors z_mean and middle of the unlabeled molecular samples through a variable sampling layer to obtain a 196-dimensional vector z_samp and a 196 × 2-dimensional vector z_mean_log_var, and then inputting the vector z_samp into a decoder network model for processing (see the sampling-layer sketch after this list);
constructing two predictor network models with the same network structure, where each predictor network model takes the continuous hidden molecular characterization vector z_mean of the unlabeled molecular samples output by the encoder network model as input; the two predictor network models are trained separately on the labeled molecular samples, and at prediction time the mean of their output results is taken as the final prediction result;
designing the variational autoencoder loss function.
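To make the variable sampling step concrete, a minimal Keras-style sketch follows. It assumes that the middle vector plays the role of a log-variance in a standard reparameterization step, which the text above does not state explicitly, so the exact arithmetic should be read as an assumption rather than as the patented implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class VariableSampling(layers.Layer):
    """Sketch of the variable sampling layer (assumes middle acts as a log-variance)."""
    def call(self, inputs):
        z_mean, middle = inputs
        # Reparameterization: z = mu + sigma * epsilon, with epsilon ~ N(0, 1)
        eps = tf.random.normal(tf.shape(z_mean))
        z_samp = z_mean + tf.exp(0.5 * middle) * eps            # 196-dimensional vector
        z_mean_log_var = tf.stack([z_mean, middle], axis=-1)    # 196 x 2 vector
        return z_samp, z_mean_log_var
```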
Preferably, the unlabeled molecular dataset is generated as follows:
collecting a labeled molecular dataset consisting of fields in SMILES string encoding together with molecular property fields, where the molecular properties include activity, selectivity, and solids content;
obtaining a molecular fragment library from the labeled molecular dataset, with molecular fragments generated using the BRICSDecompose function of the open-source cheminformatics toolkit RDKit;
obtaining the unlabeled molecular dataset, consisting of fields in SMILES string encoding, by splicing fragments from the molecular fragment library.
More preferably, obtaining the unlabeled molecular dataset consisting of fields in SMILES string encoding by splicing the molecular fragment library proceeds as follows (see the sketch after this list):
splicing fragments in the molecular fragment library pairwise using the ReplaceSubstructs function to obtain a large number of molecular functional groups;
combining each molecular functional group with the target approximate molecular structure at a preset position using the ReplaceSubstructs function to form a complete molecule, thereby obtaining the unlabeled molecular dataset.
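As an illustration of this pipeline, a brief RDKit sketch follows. The input SMILES are hypothetical stand-ins for the collected dataset, and BRICS.BRICSBuild is used here as a convenient stand-in for the patent's pairwise ReplaceSubstructs splicing at preset positions, which is not reproduced.

```python
from rdkit import Chem
from rdkit.Chem import BRICS

# Hypothetical labeled molecules standing in for the collected dataset.
labeled_smiles = ["CCOC(=O)c1ccccc1", "CC(=O)Nc1ccc(O)cc1"]

# Step 1: build the molecular fragment library via BRICS decomposition.
fragment_library = set()
for smi in labeled_smiles:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        fragment_library.update(BRICS.BRICSDecompose(mol))

# Step 2: recombine fragments into complete, unlabeled molecules.
frag_mols = [Chem.MolFromSmiles(f) for f in fragment_library]
unlabeled_smiles = []
for product in BRICS.BRICSBuild(frag_mols):
    product.UpdatePropertyCache(strict=False)   # BRICSBuild outputs are not sanitized
    unlabeled_smiles.append(Chem.MolToSmiles(product))
    if len(unlabeled_smiles) >= 1000:           # cap the enumeration
        break
print(f"generated {len(unlabeled_smiles)} unlabeled SMILES")
```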
Preferably, the encoder network model comprises an input layer, three one-dimensional convolution layers, four BatchNorm layers, a fully connected hidden layer, and an output layer;
the input layer of the encoder network model has dimensions 120 × 19, where 120 is the maximum number of characters allowed in a molecular SMILES string and 19 is the number of distinct characters across all molecular SMILES strings in the labeled molecular dataset; the output layer of the encoder network model has two heads, the vectors z_mean and middle, each of dimension 196.
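A minimal Keras sketch of this encoder is given below. The patent fixes only the layer counts, the 120 × 19 input, and the two 196-dimensional output heads; the filter counts, kernel sizes, hidden width, and activations here are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder():
    x_in = layers.Input(shape=(120, 19))            # 120 x 19 one-hot SMILES matrix
    h = x_in
    for filters in (9, 9, 10):                      # assumed filter counts
        h = layers.Conv1D(filters, kernel_size=9, activation="tanh")(h)
        h = layers.BatchNormalization()(h)          # three of the four BatchNorm layers
    h = layers.Flatten()(h)
    h = layers.Dense(435, activation="tanh")(h)     # fully connected hidden layer, assumed width
    h = layers.BatchNormalization()(h)              # fourth BatchNorm layer
    z_mean = layers.Dense(196, name="z_mean")(h)    # output head 1
    middle = layers.Dense(196, name="middle")(h)    # output head 2
    return tf.keras.Model(x_in, [z_mean, middle], name="encoder")
```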
Preferably, the decoder network model includes an input layer, a fully connected hidden layer, a BatchNorm layer, three GRU layers, and an output layer;
wherein the input layer of the decoder network model has 196 neurons for receiving the vector z_samp; the output layer of the decoder network model has 19 neurons and the activation function is softmax.
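Correspondingly, a hedged decoder sketch follows; only the 196-neuron input, the fully connected hidden layer, the BatchNorm layer, the three GRU layers, and the 19-way softmax output come from the text, while the GRU width and activations are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_decoder():
    z_in = layers.Input(shape=(196,))                   # receives the vector z_samp
    h = layers.Dense(196, activation="tanh")(z_in)      # fully connected hidden layer
    h = layers.BatchNormalization()(h)
    h = layers.RepeatVector(120)(h)                     # one timestep per SMILES character
    for _ in range(3):                                  # three GRU layers
        h = layers.GRU(488, return_sequences=True)(h)   # assumed GRU width
    x_out = layers.TimeDistributed(
        layers.Dense(19, activation="softmax"))(h)      # 19-way character distribution
    return tf.keras.Model(z_in, x_out, name="decoder")
```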
Preferably, the predictor network model comprises an input layer, three fully connected hidden layers, two BatchNorm layers, three Dropout layers and an output layer;
the input layer of the predictor network model has 196 neurons for receiving the vector z_mean output by the encoder network model; the output layer of the predictor network model has 1 neuron, and the activation function is linear.
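A matching predictor sketch is shown below; the hidden widths, dropout rate, and the placement of the two BatchNorm layers among the three hidden layers are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_predictor():
    z_in = layers.Input(shape=(196,))                   # receives the vector z_mean
    h = z_in
    # Assumed widths; BatchNorm after the first two hidden layers only (two in total).
    for units, with_bn in ((128, True), (64, True), (32, False)):
        h = layers.Dense(units, activation="relu")(h)   # fully connected hidden layer
        if with_bn:
            h = layers.BatchNormalization()(h)
        h = layers.Dropout(0.2)(h)                      # assumed dropout rate
    y_out = layers.Dense(1, activation="linear")(h)     # single-property output
    return tf.keras.Model(z_in, y_out, name="predictor")
```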
Preferably, the variational autoencoder loss function includes the cross entropy loss of the decoder network model, the KL divergence loss of the variational autoencoder, the MSE loss of the two predictor network models with the same network structure, and the weight orthogonality loss of the two predictor network models with the same network structure;
wherein the KL divergence loss is used to train the hidden feature distribution of the variational autoencoder; the KL divergence loss function is:

$$L_{KL} = \mathrm{KL}(p_1 \,\|\, p_2) = -\frac{1}{2}\sum_{j=1}^{196}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

where $p_1 = N_1(\mu, \sigma^2)$ represents the hidden feature distribution of the variational autoencoder, $p_2 = N_2(0, 1)$ represents the target distribution of the variational autoencoder, $\sigma$ is the standard deviation, $\mu$ is the mean, and $N_1$ and $N_2$ both denote normal distributions;
the MSE loss of the two predictor network models with the same network structure is the sum of the MSE loss between the two models' output values and the MSE losses between each model's output values on the labeled molecular samples and the corresponding actual values; the formula is as follows:

$$\mathrm{MSE}_{total} = \mathrm{MSE}(y_1, y_2) + \mathrm{MSE}(y_{true}, y_1) + \mathrm{MSE}(y_{true}, y_2)$$

where $y_1$ denotes the output values of one predictor network model on the labeled molecular samples, $y_2$ denotes the output values of the other predictor network model, and $y_{true}$ denotes the corresponding actual values of the labeled molecular samples;
the weight orthogonality loss $L_{orth}$ of the two predictor network models with the same network structure is:

$$L_{orth} = \left|\sum_{j=1}^{C} w_{1,j}^{\mathsf{T}}\, w_{2,j}\right|$$

where $C$ is the dimension of the output-layer weights, $w_{1,j}$ and $w_{2,j}$ are the $j$-th dimension weights of the two predictor output layers, respectively, and $\mathsf{T}$ denotes the vector transpose.
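Under the definitions above, the four loss terms can be sketched in code as follows. How the terms are weighted against each other is not specified in the text and is left open here; the orthogonality penalty is written as the magnitude of the output-layer weight inner product, matching the reconstruction above.

```python
import tensorflow as tf

def kl_loss(z_mean, z_log_var):
    # KL(N1(mu, sigma^2) || N2(0, 1)), summed over the 196 latent dimensions.
    return -0.5 * tf.reduce_sum(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)

def mse_total(y1, y2, y_true):
    # MSE_total = MSE(y1, y2) + MSE(y_true, y1) + MSE(y_true, y2)
    mse = tf.keras.losses.MeanSquaredError()
    return mse(y1, y2) + mse(y_true, y1) + mse(y_true, y2)

def orthogonal_loss(w1, w2):
    # Magnitude of the inner product of the two output-layer weight vectors,
    # i.e. |sum_j w1_j * w2_j| over the C weight dimensions.
    return tf.abs(tf.reduce_sum(w1 * w2))

def reconstruction_loss(x_true, x_pred):
    # Character-wise cross entropy of the decoder's softmax output.
    cce = tf.keras.losses.CategoricalCrossentropy()
    return cce(x_true, x_pred)
```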
A molecular property prediction system based on a semi-supervised variational autoencoder, comprising a generator, an encoder, a decoder, a constructor, and a designer;
the generator is used for generating an unlabeled molecular dataset;
the encoder is used for obtaining the continuous hidden molecular characterization vectors z_mean and middle corresponding to the unlabeled molecular samples from the unlabeled molecular dataset, input in 120 × 19 vector form;
the decoder is used for passing the continuous hidden molecular characterization vectors z_mean and middle of the unlabeled molecular samples through a variable sampling layer to obtain a 196-dimensional vector z_samp and a 196 × 2-dimensional vector z_mean_log_var, and then processing the vector z_samp;
the constructor is used for constructing two predictor network models with the same network structure, where each predictor network model takes the continuous hidden molecular characterization vector z_mean of the unlabeled molecular samples output by the encoder as input; the two predictor network models are trained separately on the labeled molecular samples, and at prediction time the mean of their output results is taken as the final prediction result;
the designer is used for designing the variational autoencoder loss function.
An electronic device, comprising: a memory and at least one processor;
wherein the memory has a computer program stored thereon;
the at least one processor executes the computer program stored in the memory, causing the at least one processor to perform the molecular property prediction method based on a semi-supervised variational autoencoder described above.
A computer-readable storage medium having stored therein a computer program executable by a processor to implement the molecular property prediction method based on a semi-supervised variational autoencoder described above.
The molecular property prediction method and system based on a semi-supervised variational autoencoder have the following advantages: the invention applies the semi-supervised idea to train the VAE model with a large amount of unlabeled sample data and a small amount of labeled sample data, thereby improving molecular property prediction accuracy and broadening the application field of the molecular property prediction model.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of the molecular property prediction method based on a semi-supervised variational autoencoder.
Detailed Description
The molecular property prediction method and system based on a semi-supervised variational autoencoder of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1:
As shown in FIG. 1, this embodiment provides a molecular property prediction method based on a semi-supervised variational autoencoder, which specifically comprises the following steps:
S1, generating an unlabeled molecular dataset;
S2, constructing a molecular property prediction model based on a variational autoencoder; specifically:
S201, inputting the unlabeled molecular samples in the unlabeled molecular dataset into an encoder network model as 120 × 19 vectors, and obtaining, through the encoder network model, the continuous hidden molecular characterization vectors z_mean and middle corresponding to the unlabeled molecular samples;
S202, passing the continuous hidden molecular characterization vectors z_mean and middle of the unlabeled molecular samples through the variable sampling layer to obtain a 196-dimensional vector z_samp and a 196 × 2-dimensional vector z_mean_log_var, and inputting the vector z_samp into a decoder network model for processing;
S203, constructing two predictor network models with the same network structure, where each predictor network model takes the continuous hidden molecular characterization vector z_mean of the unlabeled molecular samples output by the encoder network model as input; the two predictor network models are trained separately on the labeled molecular samples, and at prediction time the mean of their output results is taken as the final prediction result (an inference sketch follows below);
S204, designing the variational autoencoder loss function.
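A minimal inference-time sketch of the two-predictor averaging in step S203 follows, reusing the hypothetical build_encoder and build_predictor sketches given earlier.

```python
import numpy as np

encoder = build_encoder()                          # sketches defined earlier
predictor_a, predictor_b = build_predictor(), build_predictor()

# Hypothetical batch of 8 one-hot encoded SMILES strings (8 x 120 x 19).
x_batch = np.zeros((8, 120, 19), dtype="float32")

z_mean, _middle = encoder(x_batch)                 # continuous hidden characterization
y_final = 0.5 * (predictor_a(z_mean) + predictor_b(z_mean))   # averaged prediction
```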
The unlabeled molecular dataset in step S1 of this embodiment is generated as follows:
S101, collecting a labeled molecular dataset consisting of fields in SMILES string encoding together with molecular property fields, where the molecular properties include activity, selectivity, and solids content;
S102, obtaining a molecular fragment library from the labeled molecular dataset, with molecular fragments generated using the BRICSDecompose function of the open-source cheminformatics toolkit RDKit;
S103, obtaining the unlabeled molecular dataset consisting of fields in SMILES string encoding by splicing fragments from the molecular fragment library.
In step S103 of this embodiment, the unlabeled molecular dataset consisting of fields in SMILES string encoding is obtained by splicing the molecular fragment library as follows:
S10301, splicing fragments in the molecular fragment library pairwise using the ReplaceSubstructs function to obtain a large number of molecular functional groups;
S10302, combining each molecular functional group with the target approximate molecular structure at a preset position using the ReplaceSubstructs function to form a complete molecule, thereby obtaining the unlabeled molecular dataset.
The encoder network model in step S201 of this embodiment includes an input layer, three one-dimensional convolution layers, four BatchNorm layers, a fully connected hidden layer, and an output layer;
the input layer of the encoder network model has dimensions 120 × 19, where 120 is the maximum number of characters allowed in a molecular SMILES string and 19 is the number of distinct characters across all molecular SMILES strings in the labeled molecular dataset; the output layer of the encoder network model has two heads, the vectors z_mean and middle, each of dimension 196.
The decoder network model in step S202 of this embodiment includes an input layer, a fully connected hidden layer, a BatchNorm layer, three GRU layers, and an output layer;
wherein the input layer of the decoder network model has 196 neurons for receiving the vector z_samp; the output layer of the decoder network model has 19 neurons and the activation function is softmax.
The predictor network model in step S203 of this embodiment includes an input layer, three fully connected hidden layers, two BatchNorm layers, three Dropout layers, and an output layer;
the input layer of the predictor network model has 196 neurons for receiving the vector z_mean output by the encoder network model; the output layer of the predictor network model has 1 neuron, and the activation function is linear.
The variational autoencoder loss function in step S204 of this embodiment includes the cross entropy loss of the decoder network model, the KL divergence loss of the variational autoencoder, the MSE loss of the two predictor network models with the same network structure, and the weight orthogonality loss of the two predictor network models with the same network structure;
wherein the KL divergence loss is used to train the hidden feature distribution of the variational autoencoder; the KL divergence loss function is:

$$L_{KL} = \mathrm{KL}(p_1 \,\|\, p_2) = -\frac{1}{2}\sum_{j=1}^{196}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

where $p_1 = N_1(\mu, \sigma^2)$ represents the hidden feature distribution of the variational autoencoder, $p_2 = N_2(0, 1)$ represents the target distribution of the variational autoencoder, $\sigma$ is the standard deviation, $\mu$ is the mean, and $N_1$ and $N_2$ both denote normal distributions;
the MSE loss of the two predictor network models with the same network structure is the sum of the MSE loss between the two models' output values and the MSE losses between each model's output values on the labeled molecular samples and the corresponding actual values; the formula is as follows:

$$\mathrm{MSE}_{total} = \mathrm{MSE}(y_1, y_2) + \mathrm{MSE}(y_{true}, y_1) + \mathrm{MSE}(y_{true}, y_2)$$

where $y_1$ denotes the output values of one predictor network model on the labeled molecular samples, $y_2$ denotes the output values of the other predictor network model, and $y_{true}$ denotes the corresponding actual values of the labeled molecular samples;
the weight orthogonality loss $L_{orth}$ of the two predictor network models with the same network structure is:

$$L_{orth} = \left|\sum_{j=1}^{C} w_{1,j}^{\mathsf{T}}\, w_{2,j}\right|$$

where $C$ is the dimension of the output-layer weights, $w_{1,j}$ and $w_{2,j}$ are the $j$-th dimension weights of the two predictor output layers, respectively, and $\mathsf{T}$ denotes the vector transpose.
The labeled molecular dataset collected in this example contains 200 samples, from which 20% were randomly selected as the test set, and the prediction performance of the semi-supervised VAE and the ordinary VAE was compared on the three catalyst molecular properties, as shown in the following table:

[Table: comparison of the prediction performance of the semi-supervised VAE and the ordinary VAE on the three catalyst molecular properties; the table image is not reproduced here.]
example 2:
This embodiment provides a molecular property prediction system based on a semi-supervised variational autoencoder, comprising a generator, an encoder, a decoder, a constructor, and a designer;
the generator is used for generating an unlabeled molecular dataset;
the encoder is used for obtaining the continuous hidden molecular characterization vectors z_mean and middle corresponding to the unlabeled molecular samples from the unlabeled molecular dataset, input in 120 × 19 vector form;
the decoder is used for passing the continuous hidden molecular characterization vectors z_mean and middle of the unlabeled molecular samples through the variable sampling layer to obtain a 196-dimensional vector z_samp and a 196 × 2-dimensional vector z_mean_log_var, and then processing the vector z_samp;
the constructor is used for constructing two predictor network models with the same network structure, where each predictor network model takes the continuous hidden molecular characterization vector z_mean of the unlabeled molecular samples output by the encoder as input; the two predictor network models are trained separately on the labeled molecular samples, and at prediction time the mean of their output results is taken as the final prediction result;
the designer is used for designing the variational autoencoder loss function.
The working process of the generator in this embodiment is specifically as follows:
(1) collecting a labeled molecular dataset consisting of fields in SMILES string encoding together with molecular property fields, where the molecular properties include activity, selectivity, and solids content;
(2) obtaining a molecular fragment library from the labeled molecular dataset, with molecular fragments generated using the BRICSDecompose function of the open-source cheminformatics toolkit RDKit;
(3) obtaining the unlabeled molecular dataset consisting of fields in SMILES string encoding by splicing the molecular fragment library, specifically:
(a) splicing fragments in the molecular fragment library pairwise using the ReplaceSubstructs function to obtain a large number of molecular functional groups;
(b) combining each molecular functional group with the target approximate molecular structure at a preset position using the ReplaceSubstructs function to form a complete molecule, thereby obtaining the unlabeled molecular dataset.
The encoder in this embodiment includes an input layer, three one-dimensional convolution layers, four BatchNorm layers, a fully connected hidden layer, and an output layer;
the input layer of the encoder network model has dimensions 120 × 19, where 120 is the maximum number of characters allowed in a molecular SMILES string and 19 is the number of distinct characters across all molecular SMILES strings in the labeled molecular dataset; the output layer of the encoder network model has two heads, the vectors z_mean and middle, each of dimension 196.
The decoder in this embodiment includes an input layer, a fully connected hidden layer, a BatchNorm layer, three GRU layers, and an output layer;
wherein the input layer of the decoder network model has 196 neurons for receiving the vector z_samp; the output layer of the decoder network model has 19 neurons and the activation function is softmax.
The predictor network model in this embodiment includes an input layer, three fully connected hidden layers, two BatchNorm layers, three Dropout layers, and an output layer;
the input layer of the predictor network model has 196 neurons for receiving the vector z_mean output by the encoder network model; the output layer of the predictor network model has 1 neuron, and the activation function is linear.
The variational autoencoder loss function in this embodiment includes the cross entropy loss of the decoder network model, the KL divergence loss of the variational autoencoder, the MSE loss of the two predictor network models with the same network structure, and the weight orthogonality loss of the two predictor network models with the same network structure;
wherein the KL divergence loss is used to train the hidden feature distribution of the variational autoencoder; the KL divergence loss function is:

$$L_{KL} = \mathrm{KL}(p_1 \,\|\, p_2) = -\frac{1}{2}\sum_{j=1}^{196}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

where $p_1 = N_1(\mu, \sigma^2)$ represents the hidden feature distribution of the variational autoencoder, $p_2 = N_2(0, 1)$ represents the target distribution of the variational autoencoder, $\sigma$ is the standard deviation, $\mu$ is the mean, and $N_1$ and $N_2$ both denote normal distributions;
the MSE loss of the two predictor network models with the same network structure is the sum of the MSE loss between the two models' output values and the MSE losses between each model's output values on the labeled molecular samples and the corresponding actual values; the formula is as follows:

$$\mathrm{MSE}_{total} = \mathrm{MSE}(y_1, y_2) + \mathrm{MSE}(y_{true}, y_1) + \mathrm{MSE}(y_{true}, y_2)$$

where $y_1$ denotes the output values of one predictor network model on the labeled molecular samples, $y_2$ denotes the output values of the other predictor network model, and $y_{true}$ denotes the corresponding actual values of the labeled molecular samples;
the weight orthogonality loss $L_{orth}$ of the two predictor network models with the same network structure is:

$$L_{orth} = \left|\sum_{j=1}^{C} w_{1,j}^{\mathsf{T}}\, w_{2,j}\right|$$

where $C$ is the dimension of the output-layer weights, $w_{1,j}$ and $w_{2,j}$ are the $j$-th dimension weights of the two predictor output layers, respectively, and $\mathsf{T}$ denotes the vector transpose.
Example 3:
This embodiment also provides an electronic device, comprising: a memory and a processor;
wherein the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the molecular property prediction method based on a semi-supervised variational autoencoder of any embodiment of the present invention.
The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may be used to store computer programs and/or modules, and the processor implements the various functions of the electronic device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function, and the like, and the data storage area may store data created according to the use of the terminal, etc. The memory may also include high-speed random access memory, and may further include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
Example 4:
This embodiment also provides a computer-readable storage medium in which a plurality of instructions are stored; the instructions are loaded by a processor, causing the processor to perform the molecular property prediction method based on a semi-supervised variational autoencoder of any embodiment of the present invention. Specifically, a system or apparatus may be provided with a storage medium on which software program code realizing the functions of any of the above embodiments is stored, and the computer (or CPU or MPU) of the system or apparatus is caused to read out and execute the program code stored in the storage medium.
In this case, the program code itself read from the storage medium may realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code form part of the present invention.
Examples of storage media for providing program code include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tapes, non-volatile memory cards, and ROM. Alternatively, the program code may be downloaded from a server computer over a communication network.
Further, it should be apparent that the functions of any of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform part or all of the actual operations based on the instructions of the program code.
Further, it should be understood that the program code read out from the storage medium may be written into a memory provided on an expansion board inserted into the computer or into a memory provided in an expansion unit connected to the computer, after which a CPU or the like mounted on the expansion board or expansion unit performs part or all of the actual operations based on the instructions of the program code, thereby realizing the functions of any of the above embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit of the invention.

Claims (8)

1. A molecular property prediction method based on a semi-supervised variational autoencoder, characterized by comprising the following steps:
generating an unlabeled molecular dataset; specifically:
collecting a labeled molecular dataset consisting of fields in SMILES string encoding together with molecular property fields, wherein the molecular properties include activity, selectivity, and solids content;
obtaining a molecular fragment library from the labeled molecular dataset, wherein molecular fragments are generated using the BRICSDecompose function of the open-source cheminformatics toolkit RDKit;
obtaining the unlabeled molecular dataset consisting of fields in SMILES string encoding by splicing fragments from the molecular fragment library;
constructing a molecular property prediction model based on a variational autoencoder; specifically:
inputting the unlabeled molecular samples in the unlabeled molecular dataset into an encoder network model as 120 × 19 vectors, and obtaining, through the encoder network model, the continuous hidden molecular characterization vectors z_mean and middle corresponding to the unlabeled molecular samples;
passing the continuous hidden molecular characterization vectors z_mean and middle of the unlabeled molecular samples through a variable sampling layer to obtain a 196-dimensional vector z_samp and a 196 × 2-dimensional vector z_mean_log_var, and then inputting the vector z_samp into a decoder network model for processing;
constructing two predictor network models with the same network structure, wherein each predictor network model takes the continuous hidden molecular characterization vector z_mean of the unlabeled molecular samples output by the encoder network model as input; the two predictor network models are trained separately on the labeled molecular samples, and at prediction time the mean of their output results is taken as the final prediction result;
designing a variational autoencoder loss function;
the variational autoencoder loss function includes the cross entropy loss of the decoder network model, the KL divergence loss of the variational autoencoder, the MSE loss of the two predictor network models with the same network structure, and the weight orthogonality loss of the two predictor network models with the same network structure;
wherein the KL divergence loss is used to train the hidden feature distribution of the variational autoencoder; the KL divergence loss function is:

$$L_{KL} = \mathrm{KL}(p_1 \,\|\, p_2) = -\frac{1}{2}\sum_{j=1}^{196}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

where $p_1 = N_1(\mu, \sigma^2)$ represents the hidden feature distribution of the variational autoencoder, $p_2 = N_2(0, 1)$ represents the target distribution of the variational autoencoder, $\sigma$ is the standard deviation, $\mu$ is the mean, and $N_1$ and $N_2$ both denote normal distributions;
the MSE loss of the two predictor network models with the same network structure is the sum of the MSE loss between the two models' output values and the MSE losses between each model's output values on the labeled molecular samples and the corresponding actual values; the formula is as follows:

$$\mathrm{MSE}_{total} = \mathrm{MSE}(y_1, y_2) + \mathrm{MSE}(y_{true}, y_1) + \mathrm{MSE}(y_{true}, y_2)$$

where $y_1$ denotes the output values of one predictor network model on the labeled molecular samples, $y_2$ denotes the output values of the other predictor network model, and $y_{true}$ denotes the corresponding actual values of the labeled molecular samples;
the weight orthogonality loss $L_{orth}$ of the two predictor network models with the same network structure is:

$$L_{orth} = \left|\sum_{j=1}^{C} w_{1,j}^{\mathsf{T}}\, w_{2,j}\right|$$

where $C$ is the dimension of the output-layer weights, $w_{1,j}$ and $w_{2,j}$ are the $j$-th dimension weights of the two predictor output layers, respectively, and $\mathsf{T}$ denotes the vector transpose.
2. The molecular property prediction method based on a semi-supervised variational autoencoder according to claim 1, wherein the unlabeled molecular dataset consisting of fields in SMILES string encoding is obtained by splicing the molecular fragment library as follows:
splicing fragments in the molecular fragment library pairwise using the ReplaceSubstructs function to obtain molecular functional groups;
combining each molecular functional group with the target approximate molecular structure at a preset position using the ReplaceSubstructs function to form a complete molecule, thereby obtaining the unlabeled molecular dataset.
3. The molecular property prediction method based on a semi-supervised variational autoencoder according to claim 1, wherein the encoder network model includes an input layer, three one-dimensional convolution layers, four BatchNorm layers, a fully connected hidden layer, and an output layer;
the input layer of the encoder network model has dimensions 120 × 19, where 120 is the maximum number of characters allowed in a molecular SMILES string and 19 is the number of distinct characters across all molecular SMILES strings in the labeled molecular dataset; the output layer of the encoder network model has two heads, the vectors z_mean and middle, each of dimension 196.
4. The molecular property prediction method based on a semi-supervised variational autoencoder according to claim 1, wherein the decoder network model comprises an input layer, a fully connected hidden layer, a BatchNorm layer, three GRU layers, and an output layer;
wherein the input layer of the decoder network model has 196 neurons for receiving the vector z_samp; the output layer of the decoder network model has 19 neurons and the activation function is softmax.
5. The molecular property prediction method based on a semi-supervised variational autoencoder according to claim 1, wherein the predictor network model comprises an input layer, three fully connected hidden layers, two BatchNorm layers, three Dropout layers, and an output layer;
the input layer of the predictor network model has 196 neurons for receiving the vector z_mean output by the encoder network model; the output layer of the predictor network model has 1 neuron, and the activation function is linear.
6. A molecular property prediction system based on a semi-supervised variational autoencoder, characterized by comprising a generator, an encoder, a decoder, a constructor, and a designer;
the generator is used for generating an unlabeled molecular dataset, specifically as follows:
collecting a labeled molecular dataset consisting of fields in SMILES string encoding together with molecular property fields, wherein the molecular properties include activity, selectivity, and solids content;
obtaining a molecular fragment library from the labeled molecular dataset, wherein molecular fragments are generated using the BRICSDecompose function of the open-source cheminformatics toolkit RDKit;
obtaining the unlabeled molecular dataset consisting of fields in SMILES string encoding by splicing fragments from the molecular fragment library;
the encoder is used for obtaining the continuous hidden molecular characterization vectors z_mean and middle corresponding to the unlabeled molecular samples from the unlabeled molecular dataset, input in 120 × 19 vector form;
the decoder is used for passing the continuous hidden molecular characterization vectors z_mean and middle of the unlabeled molecular samples through a variable sampling layer to obtain a 196-dimensional vector z_samp and a 196 × 2-dimensional vector z_mean_log_var, and then processing the vector z_samp;
the constructor is used for constructing two predictor network models with the same network structure, wherein each predictor network model takes the continuous hidden molecular characterization vector z_mean of the unlabeled molecular samples output by the encoder as input; the two predictor network models are trained separately on the labeled molecular samples, and at prediction time the mean of their output results is taken as the final prediction result;
the designer is used for designing a variational autoencoder loss function;
the variational autoencoder loss function includes the cross entropy loss of the decoder network model, the KL divergence loss of the variational autoencoder, the MSE loss of the two predictor network models with the same network structure, and the weight orthogonality loss of the two predictor network models with the same network structure;
wherein the KL divergence loss is used to train the hidden feature distribution of the variational autoencoder; the KL divergence loss function is:

$$L_{KL} = \mathrm{KL}(p_1 \,\|\, p_2) = -\frac{1}{2}\sum_{j=1}^{196}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

where $p_1 = N_1(\mu, \sigma^2)$ represents the hidden feature distribution of the variational autoencoder, $p_2 = N_2(0, 1)$ represents the target distribution of the variational autoencoder, $\sigma$ is the standard deviation, $\mu$ is the mean, and $N_1$ and $N_2$ both denote normal distributions;
the MSE loss of the two predictor network models with the same network structure is the sum of the MSE loss between the two models' output values and the MSE losses between each model's output values on the labeled molecular samples and the corresponding actual values; the formula is as follows:

$$\mathrm{MSE}_{total} = \mathrm{MSE}(y_1, y_2) + \mathrm{MSE}(y_{true}, y_1) + \mathrm{MSE}(y_{true}, y_2)$$

where $y_1$ denotes the output values of one predictor network model on the labeled molecular samples, $y_2$ denotes the output values of the other predictor network model, and $y_{true}$ denotes the corresponding actual values of the labeled molecular samples;
the weight orthogonality loss $L_{orth}$ of the two predictor network models with the same network structure is:

$$L_{orth} = \left|\sum_{j=1}^{C} w_{1,j}^{\mathsf{T}}\, w_{2,j}\right|$$

where $C$ is the dimension of the output-layer weights, $w_{1,j}$ and $w_{2,j}$ are the $j$-th dimension weights of the two predictor output layers, respectively, and $\mathsf{T}$ denotes the vector transpose.
7. An electronic device, comprising: a memory and at least one processor;
wherein the memory has a computer program stored thereon;
the at least one processor executes the computer program stored in the memory, causing the at least one processor to perform the molecular property prediction method based on a semi-supervised variational autoencoder of any of claims 1 to 5.
8. A computer-readable storage medium having stored therein a computer program executable by a processor to implement the molecular property prediction method based on a semi-supervised variational autoencoder of any of claims 1 to 5.
CN202310384467.XA 2023-04-12 2023-04-12 Molecular property prediction method and system based on semi-supervised variational autoencoder Active CN116110504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310384467.XA CN116110504B (en) 2023-04-12 2023-04-12 Molecular property prediction method and system based on semi-supervised variational autoencoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310384467.XA CN116110504B (en) 2023-04-12 2023-04-12 Molecular property prediction method and system based on semi-supervised variational autoencoder

Publications (2)

Publication Number Publication Date
CN116110504A (en) 2023-05-12
CN116110504B (en) 2023-06-23

Family

ID=86260100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310384467.XA Active CN116110504B (en) 2023-04-12 2023-04-12 Molecular property prediction method and system based on semi-supervised variational autoencoder

Country Status (1)

Country Link
CN (1) CN116110504B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692767A (en) * 2022-03-31 2022-07-01 中国电信股份有限公司 Abnormality detection method and apparatus, computer-readable storage medium, and electronic device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220406415A1 (en) * 2019-09-26 2022-12-22 Terramera, Inc. Systems and methods for synergistic pesticide screening
CN112990385B (en) * 2021-05-17 2021-09-21 南京航空航天大学 Active crowdsourcing image learning method based on semi-supervised variational self-encoder
CN113327651A (en) * 2021-05-31 2021-08-31 东南大学 Molecular diagram generation method based on variational self-encoder and message transmission neural network
US20220415453A1 (en) * 2021-06-25 2022-12-29 Deepmind Technologies Limited Determining a distribution of atom coordinates of a macromolecule from images using auto-encoders

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692767A (en) * 2022-03-31 2022-07-01 中国电信股份有限公司 Abnormality detection method and apparatus, computer-readable storage medium, and electronic device

Also Published As

Publication number Publication date
CN116110504A (en) 2023-05-12

Similar Documents

Publication Publication Date Title
Blank et al. Quantum classifier with tailored quantum kernel
Udrescu et al. AI Feynman: A physics-inspired method for symbolic regression
CN110366734B (en) Optimizing neural network architecture
Rocchetto et al. Learning hard quantum distributions with variational autoencoders
Wang et al. A hybrid differential evolution approach to designing deep convolutional neural networks for image classification
Deng et al. Software defect prediction via LSTM
Landajuela et al. A unified framework for deep symbolic regression
Bartoldson et al. Compute-efficient deep learning: Algorithmic trends and opportunities
Gao et al. A novel gapg approach to automatic property generation for formal verification: The gan perspective
Sajadmanesh et al. Continuous-time relationship prediction in dynamic heterogeneous information networks
US20200364578A1 (en) Hamming distance based robust output encoding for improved generalization
Javeed et al. Discovering software developer's coding expertise through deep learning
CN117648950A (en) Training method and device for neural network model, electronic equipment and storage medium
Kolesov et al. On multilabel classification methods of incompletely labeled biomedical text data
Hillmich et al. Approximating decision diagrams for quantum circuit simulation
Fry et al. Optimizing quantum noise-induced reservoir computing for nonlinear and chaotic time series prediction
CN113110843B (en) Contract generation model training method, contract generation method and electronic equipment
Sannia et al. A hybrid classical-quantum approach to speed-up Q-learning
Srivastava et al. Generative and discriminative training of Boltzmann machine through quantum annealing
Thorbjarnarson et al. Optimal training of integer-valued neural networks with mixed integer programming
Symeonidis et al. A benchmark framework to evaluate energy disaggregation solutions
Schuld et al. Representing data on a quantum computer
CN116110504B (en) Molecular property prediction method and system based on semi-supervised variational autoencoder
Altares-López et al. AutoQML: Automatic generation and training of robust quantum-inspired classifiers by using evolutionary algorithms on grayscale images
Chatterjee et al. Class-biased sarcasm detection using BiLSTM variational autoencoder-based synthetic oversampling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant