CN110600085A - Organic matter physicochemical property prediction method based on Tree-LSTM - Google Patents
- Publication number
- CN110600085A (application CN201910500140.8A)
- Authority
- CN
- China
- Prior art keywords
- tree
- lstm
- organic
- molecular
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Abstract
A Tree-LSTM-based method for predicting the physicochemical properties of organic compounds comprises two parts: generating the prediction model and predicting physicochemical properties. Generating the prediction model comprises the following steps: 1) the molecular structure of the organic compound is normalized and encoded, and a tree-shaped data structure (the molecular feature descriptor) is generated; 2) a Tree-LSTM model is trained on the molecular feature descriptors and experimental physicochemical-property data of organic compounds to obtain a Tree-LSTM-based physicochemical property prediction model. Predicting physicochemical properties comprises: normalizing and encoding the molecular structure and inputting it into the prediction model to obtain the predicted physicochemical properties. The invention enables a computer to automatically extract the relationship between molecular structure and physicochemical properties, is well suited to learning the structural information of a wide range of organic molecules, and yields better prediction results.
Description
Technical Field
The invention relates to the field of chemistry (IPC class C07), and in particular to a method for predicting quantitative structure-property relationships of chemical substances based on artificial intelligence technology.
Background
Physicochemical properties are basic data closely tied to chemistry and chemical engineering. Critical properties, the normal boiling point, the heat of formation, the octanol-water partition coefficient and the like are central to chemical research and production practice, and scientifically sound predicted values for such properties can reduce experimental measurement work and save considerable manpower and material resources. Experimental determination of physicochemical properties is often difficult because of harsh measurement conditions or objective factors such as easy decomposition of the measured substance, so these properties are currently estimated mainly by the group contribution method and by topological-index methods based on multiple linear regression. However, both approaches require manual extraction of molecular structure features before prediction, which limits their range of application.
The Tree-LSTM recurrent neural network is an improvement on the LSTM (Long Short-Term Memory) recurrent neural network. It can learn dependency relationships more complex than sequential structure and can autonomously learn, from the input data, the contribution of a molecule's tree topology to the predicted quantity. In particular, it overcomes the inability of other neural networks to reproduce the atom connectivity within a molecule, and is therefore better suited to mining the implicit relationship between a molecular structure and its physicochemical properties. The existing group contribution method must decompose molecules into groups (molecular substructure fragments) and apply multivariate linear fitting to predict the physicochemical properties of organic compounds; different group contribution methods use different decomposition schemes, and for some molecules no suitable decomposition exists, so the prediction is biased or cannot be completed. Existing topological-index methods are limited by the complexity of computing the indices and by their inability to represent local molecular structure intuitively, so they cannot predict a broad range of physicochemical properties. To date, no method has predicted the physicochemical properties of organic compounds using a Tree-LSTM recurrent neural network alone.
Disclosure of Invention
The invention provides a Tree-LSTM-based method for predicting the physicochemical properties of organic compounds, addressing the narrow prediction range, limited substance coverage and low prediction accuracy of the prior art.
In order to solve the problems, the invention adopts the following technical scheme:
the method comprises two parts: step A, generating the prediction model; and step B, predicting physicochemical properties;
the step A comprises the following steps:
a1, acquiring experimental data of the physicochemical properties of the organic matters and molecular structure information of the organic matters, and capturing a large amount of data from various databases by using a web crawler technology;
a2 normalizing the structure of a single organic molecule (by a graph normalization algorithm), traversing each atom in the single organic molecule and generating a corresponding atom feature descriptor, ordering all the atom feature descriptors of the single organic molecule according to a lexicographic order, and taking the smallest atom feature descriptor as a molecule feature descriptor;
a3 generating molecular feature descriptors representing each representative molecular structure normalized graph and corresponding linear codes according to the obtained all organic molecular structures in the step A2;
a4 splitting all organic molecules into various chemical bonds, arranging character strings representing the chemical bonds according to each molecule, and generating word vectors for the character strings by adopting a word embedding algorithm;
a5 build a Tree-LSTM-based neural network model and load the physicochemical data obtained in A1 and the molecular structure data processed in A2-A4; the Tree-LSTM automatically adapts to the topology of the normalized molecular-structure graph. Manually adjust the hyper-parameters, train the model, and select the best-performing parameters during training to obtain the Tree-LSTM-based organic matter physicochemical prediction model;
the step B comprises the following steps:
b1 processing the molecular structure of organic matter without experimental data of certain physicochemical properties by A2-A4 steps, loading the generated feature descriptor and code into the physicochemical property prediction model obtained by A5, and inputting the molecular feature descriptor to predict the data of unknown physicochemical properties.
As a further refinement, said step a5 comprises the following:
a51: building a Tree-LSTM model under a Linux system or a Windows system;
a52: setting the input dimension of Tree-LSTM and the length of input data; a53: setting the data quantity proportion of the Tree-LSTM training set and the test set; a54: setting a Tree-LSTM model optimizer and a learning rate; a55: setting the width of hidden layer neuron; a56: setting the iteration times of the model; a57: and continuously adjusting parameters, checking the convergence degree of the model according to the model loss, and preferentially selecting high-convergence parameters to form a physical and chemical property prediction model based on Tree-LSTM.
Drawings
FIG. 1 is a flow chart of the process for predicting the physicochemical properties of organic substances according to the present invention;
FIG. 2 is a computational graph of the Tree-LSTM recurrent neural network when predicting the properties of acetaldoxime;
FIG. 3 is a graph of the prediction effect of the Tree-LSTM physicochemical property prediction model on the critical temperature of organic compounds, where × marks the predicted values and the straight line the actual values.
FIG. 4 illustrates molecular feature descriptor generation using acetaldoxime as an example.
FIG. 5 shows the coding rule of the molecular feature descriptor, illustrating the meaning of each position in the code.
Detailed Description
The invention will be described in detail with reference to the accompanying drawings and specific examples, it being understood that the examples described below are for the purpose of facilitating an understanding of the invention only and are not intended to limit the invention itself in any way.
The invention provides a method for predicting the physicochemical property of an organic matter based on Tree-LSTM, which comprises the following two steps as shown in figure 1: step A, generating a prediction model; b, predicting two parts of physical and chemical properties;
step A, generating a prediction model:
a1 obtains experimental data of the physicochemical properties of the organic matter and the molecular structure information of the organic matter, and captures a large amount of data from various databases by using a web crawler technology.
A11: the physicochemical properties of organic compounds mainly include critical properties, normal boiling point, transport properties, autoignition temperature, flash point, toxicity, octanol-water partition coefficient, biochemical activity, and the like.
The molecular structure information of A12 mainly takes SMILES expression, SMARTS expression, MOL file and SDF file as carriers.
A2 standardizes the structure of a single organic molecule, traverses each atom in the single organic molecule and generates corresponding atom feature descriptors, sorts all the atom feature descriptors of the single organic molecule according to a lexicographic order, and takes the smallest atom feature descriptor as a molecule feature descriptor and encodes the molecule feature descriptor.
A21: generate a canonical graph from the two-dimensional topological graph of the organic molecule using a graph canonicalization algorithm from graph theory (for example, the Nauty or Faulon algorithm), enabling isomorphism comparison of molecular graphs.
The A22 encoding method is as follows:
the first method, which directly uses the molecular feature descriptor outputted by the Faulon normalization algorithm as the coding of the organic matter, is exemplified in fig. 4.
The second method, encoding the molecular feature descriptors in a linear encoded format, is exemplified in table 1.
And A3 generating a molecular feature descriptor and a corresponding code of each molecule according to the obtained molecular structure information of all the organic matters in the step A2.
A4 splits all organic molecules into various chemical bonds, arranges character strings representing the chemical bonds according to each molecule, and generates word vectors by using a word embedding algorithm for the character strings.
A5 builds a Tree-LSTM-based neural network model, loads physical and chemical data obtained from A1 and molecular structure data processed by A2-A4, continuously adjusts parameters, and preferentially selects the parameters to obtain a Tree-LSTM-based organic matter physical and chemical prediction model.
The step B comprises the following steps:
b1 processing the molecular structure of the organic matter without some physicochemical property experimental data by adopting the steps of A2-A4, loading the generated characteristic descriptors and codes into a physicochemical property prediction model to obtain the data of unknown physicochemical property;
step a4 further includes the following:
a41: traversing each molecule in the database, traversing the connected chemical bond and atom with each atom in each molecule as a starting point, forming a character string like 'A-B', and recording to form original data. Description of the drawings: "A" represents the symbol of the element of the atom A, "B" represents the symbol of the element of the atom B, and "-" represents the type of chemical bond between the atom A and the atom B.
A42: splitting a character string in the form of 'A-B' in original data to form a sub-character string set of three combination modes: the combination is as follows: "A" and "-B", in combination two: "A-" and "B", in combination three: "A", "-", and "B".
A43: and (3) building a neural network based on a skip-gram algorithm under a Linux system or a Windows system, and obtaining an embedded vector representing each character string in the character string set obtained by A42.
As a further refinement, said step a5 comprises the following:
a51: building a Tree-LSTM model under a Linux system or a Windows system;
a52: the feature descriptors or linear encodings of each numerator are parsed into a tree-like data structure and a corresponding embedded vector obtained by a4 is matched for each node in the tree-like structure (for each atom in the numerator).
A52: setting the input dimension of Tree-LSTM and the length of input data; the input dimension in the present invention is 1 and the length is 50.
A53: setting the data quantity proportion of the Tree-LSTM training set and the test set; the ratio in the present invention is 4: 1.
A54: setting a Tree-LSTM model optimizer and a learning rate; the method adopts an Adam algorithm optimizer, and the learning rate is 0.001:
a55: setting the width of each hidden layer neuron;
a56: setting the iteration times of the model;
a57: and adjusting the number of the cryptomelanic ganglion points under the same iteration number, adjusting the iteration number under the same number of the cryptomelanic ganglion points, checking the convergence degree of the model according to the overall loss and the iteration loss of the model, and preferentially selecting a high-convergence parameter to form a physical and chemical property prediction model based on Tree-LSTM.
The structure of the Tree-LSTM neural network is shown in FIG. 2.
The Tree-LSTM has two mathematical models: the child-sum model and the child-independent model.
The core of the Tree-LSTM is the control of the cell state c_j, carried out by the forget gates f_jk, the input gate i_j and the output gate o_j. For the current node j, the forget gate f_jk controls how much of child node k's cell state c_k is kept in the current node's cell state c_j; the input gate i_j controls how much of the current node's instantaneous state enters the current cell state c_j; the current input-cell state u_j controls how much new node information is added; and the output gate o_j controls how much of the current cell state c_j is emitted as the node's hidden-layer output h_j. The equations of the child-sum model are:
h̃_j = Σ_{k∈C(j)} h_k (1)
f_jk = σ(W^(f) x_j + U^(f) h_k + b^(f)) (2)
i_j = σ(W^(i) x_j + U^(i) h̃_j + b^(i)) (3)
u_j = tanh(W^(u) x_j + U^(u) h̃_j + b^(u)) (4)
o_j = σ(W^(o) x_j + U^(o) h̃_j + b^(o)) (5)
c_j = i_j ⊙ u_j + Σ_{k∈C(j)} f_jk ⊙ c_k (6)
h_j = o_j ⊙ tanh(c_j) (7)
where C(j) denotes the set of child nodes of node j.
Here W^(f), W^(i), W^(o) are the weight matrices of the forget gate, input gate and output gate acting on x_j, U^(f), U^(i), U^(o) the corresponding matrices acting on the child hidden states, b^(f), b^(i), b^(o) the bias terms of the three gates, and σ is the sigmoid function. The equations of the child-independent model are:
i_j = σ(W^(i) x_j + Σ_{l=1}^{N} U_l^(i) h_jl + b^(i)) (10)
f_jl = σ(W^(f) x_j + Σ_{m=1}^{N} U_{lm}^(f) h_jm + b^(f)) (11)
u_j = tanh(W^(u) x_j + Σ_{l=1}^{N} U_l^(u) h_jl + b^(u)) (12)
o_j = σ(W^(o) x_j + Σ_{l=1}^{N} U_l^(o) h_jl + b^(o)) (13)
c_j = i_j ⊙ u_j + Σ_{l=1}^{N} f_jl ⊙ c_jl (14)
h_j = o_j ⊙ tanh(c_j) (15)
where N is the number of child positions and each child position l has its own parameter matrices U_l.
The difference between the two models lies in the treatment of the child hidden states h_jl: the child-independent model assigns a separate parameter matrix to each child's h_jl, while the child-sum model sums the children's h_jl and trains a single set of parameters on the sum.
The Tree-LSTM recurrent neural network structure is shown in Fig. 2. The inputs of the LSTM unit are: the cell states c_jl of the child nodes, the hidden-layer outputs h_jl of the child nodes, and the input value x_j of the current node. The outputs of the LSTM unit are: the current cell state c_j and the current hidden-layer output h_j.
The current input-cell state u_j is computed from the current node's input x_j and the hidden-layer outputs h_jl of the child nodes (in the child-sum model, from the sum of the children's h_jl), as given by equation (4) or (12).
Here W^(u) is the weight matrix of the input-cell state, b^(u) its bias term, and tanh the hyperbolic tangent function. The current cell state c_j is jointly determined by the forget gates f_jl (acting on the child cell states c_jl), the input gate i_j and the current input-cell state u_j, as given by equation (6) or (14), where ⊙ denotes element-wise multiplication. The hidden-layer output h_j of the current node is computed from the output gate o_j and the current cell state c_j, as given by equation (7) or (15).
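As an illustration, the child-sum cell described above can be sketched in NumPy. The weight shapes, the 0.1 initialization scale and the dimensions D=4, H=3 are arbitrary choices for the sketch, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 3  # input and hidden sizes (arbitrary for the sketch)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight/bias set per gate: W acts on x_j, U on child hidden states
W = {g: rng.standard_normal((H, D)) * 0.1 for g in "ifou"}
U = {g: rng.standard_normal((H, H)) * 0.1 for g in "ifou"}
b = {g: np.zeros(H) for g in "ifou"}

def child_sum_cell(x_j, child_h, child_c):
    """One child-sum Tree-LSTM step; child_h/child_c are lists (may be empty)."""
    h_tilde = np.sum(child_h, axis=0) if child_h else np.zeros(H)
    i = sigmoid(W["i"] @ x_j + U["i"] @ h_tilde + b["i"])
    o = sigmoid(W["o"] @ x_j + U["o"] @ h_tilde + b["o"])
    u = np.tanh(W["u"] @ x_j + U["u"] @ h_tilde + b["u"])
    # one forget gate per child, computed from that child's own h_k
    f = [sigmoid(W["f"] @ x_j + U["f"] @ h_k + b["f"]) for h_k in child_h]
    c = i * u + sum(fk * ck for fk, ck in zip(f, child_c))
    h = o * np.tanh(c)
    return h, c

# leaf nodes (no children), then a parent combining two leaves
h1, c1 = child_sum_cell(rng.standard_normal(D), [], [])
h2, c2 = child_sum_cell(rng.standard_normal(D), [], [])
h, c = child_sum_cell(rng.standard_normal(D), [h1, h2], [c1, c2])
print(h.shape, c.shape)  # (3,) (3,)
```

Because the cell is applied bottom-up along whichever tree it is given, the same parameters serve molecules of any topology, which is what makes the network "dynamic" in the sense used later in this document.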
The Tree-LSTM neural network output is determined by a single-layer or multi-layer neural network, for example, the single-layer neural network is used as an output layer, and the calculation formula is as follows:
p_i = w · h_j + b (16)
where the property p_i of the i-th compound is obtained from the Tree-LSTM output h_j of the root node of the tree structure represented by the compound's molecular feature descriptor, and w and b are trainable parameters.
In the invention, the mean squared error (MSE) or the mean absolute error (MAE) is used as the loss function (loss):
MSE = (1/N) Σ_{n=1}^{N} (x_exp,n − x_pred,n)² (17)
MAE = (1/N) Σ_{n=1}^{N} |x_exp,n − x_pred,n| (18)
where N is the number of samples, x_exp is the observed (experimental) value and x_pred is the predicted value.
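The two loss functions are straightforward to express in NumPy:

```python
import numpy as np

def mse(x_exp, x_pred):
    """Mean squared error: (1/N) * sum((x_exp - x_pred)**2)."""
    x_exp, x_pred = np.asarray(x_exp, float), np.asarray(x_pred, float)
    return float(np.mean((x_exp - x_pred) ** 2))

def mae(x_exp, x_pred):
    """Mean absolute error: (1/N) * sum(|x_exp - x_pred|)."""
    x_exp, x_pred = np.asarray(x_exp, float), np.asarray(x_pred, float)
    return float(np.mean(np.abs(x_exp - x_pred)))

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # 4/3 ≈ 1.333
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # 2/3 ≈ 0.667
```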
Examples of the experiments
The effect of the Tree-LSTM-based physicochemical property prediction method is illustrated below, taking the critical temperature of organic compounds as an example. This property serves as basic data for various thermodynamic and property-estimation models, so its prediction is both practical and representative.
Experimental critical-temperature data and the corresponding molecular structure information were obtained for 1759 organic substances in total; 1407 substances were used as the training set and 352 as the test set.
The construction of the molecular feature descriptor is illustrated using the substance acetaldoxime from the sample; see Fig. 4 for details. The molecular feature descriptor is a data structure that stores molecular structure information: an atom of the molecule is selected as the starting point, and the molecule is expanded from it as a tree. For acetaldoxime in this example, the root atom is the carbon atom labelled 0. Starting from this root atom C0, the tree is searched downward to a predetermined distance (height), and the atoms encountered on each path, together with the types of the chemical bonds attached to them, are recorded as the features of the molecule; traversing all atoms reachable from the root atom yields one atom feature descriptor. Taking different atoms as the root produces different atom feature descriptors; these are sorted in lexicographic order, and the smallest is taken as the molecular feature descriptor. Fig. 4 shows, for acetaldoxime: (A) the molecular structure; (B) the tree expansion of the molecular structure; and (C) the atom feature descriptors at height 0 and height 1. The children of an atom are written in nested brackets; when no bond type is given, a single bond between the atoms is implied in the atom feature descriptor; otherwise the bond is marked as follows: "=" is a double bond, "#" a triple bond, and ":" an aromatic bond.
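The tree-expansion procedure just described can be sketched for the heavy-atom skeleton of acetaldoxime (CH3-CH=N-OH). The bracket grammar below is an assumption modeled on the description, not the patent's exact format, and hydrogens are omitted:

```python
# Hedged sketch of descriptor generation (cf. Fig. 4): expand the molecule
# as a tree from a root atom, record atoms and bond types with nested
# brackets (single bonds unmarked, '=' double, '#' triple, ':' aromatic),
# and take the lexicographically smallest atom descriptor as the molecular
# feature descriptor.

BOND_MARK = {"-": "", "=": "=", "#": "#", ":": ":"}

def atom_label(atom):
    return atom.rstrip("0123456789")  # "C0" -> "C"

def descriptor(adj, atom, height, parent=None):
    """Atom feature descriptor rooted at `atom`, truncated at `height`."""
    if height < 0:
        return ""
    parts = []
    for nbr, bond in adj[atom]:
        if nbr == parent:
            continue
        sub = descriptor(adj, nbr, height - 1, parent=atom)
        if sub:
            parts.append("(" + BOND_MARK[bond] + sub + ")")
    return atom_label(atom) + "".join(sorted(parts))

# acetaldoxime heavy-atom skeleton: C0-C1=N2-O3
adj = {
    "C0": [("C1", "-")],
    "C1": [("C0", "-"), ("N2", "=")],
    "N2": [("C1", "="), ("O3", "-")],
    "O3": [("N2", "-")],
}
per_atom = {a: descriptor(adj, a, 2) for a in adj}
molecular = min(per_atom.values())  # lexicographically smallest
print(per_atom["C0"], "|", molecular)
```

Under this toy grammar, the descriptor rooted at C0 at height 2 is `C(C(=N))`, and the lexicographically smallest per-atom descriptor, taken as the molecular feature descriptor, is `C(=N(O))(C)`.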
To store molecular feature descriptors conveniently, the invention develops a linear encoding that represents the tree expansion of a molecule. The linear encodings of the molecular feature descriptors of acetaldoxime at several depths are shown in Table 1; the atoms in the string are separated by "|", and the meaning of each digit and letter within an atom's code is given in Fig. 5. The first atom is the root atom, whose current depth of 0 is denoted "S"; since it has no parent atom, its parent-atom field is encoded "S", and since it has no chemical bond to a parent, that field is likewise encoded "S".
All 1759 organic substances were converted into molecular feature descriptors and linearly encoded. Before input to the neural network, each substance is parsed into a tree structure and each node (atom) is associated with the embedded vector obtained in step A43. For each molecule in the sample, each atom corresponds to a node of the Tree-LSTM network, and the atom's embedded vector is that node's input vector. With 300 iterations, the number of output-layer nodes was adjusted repeatedly, and 128 output-layer nodes proved a good value in this example. The Tree-LSTM network structure is determined by each organic molecule: it is a dynamic neural network that adapts to the topology of different molecules. In this example, the learning rate was 0.008 for the first 300 training rounds and was then reduced to 0.00001 for 5000 further rounds. To prevent overfitting, the calculation is stopped early once the loss function value no longer decreases. The prediction results are given in Table 3; the closer the experimental and predicted values in the table, the better the prediction. Table 2 gives the statistical evaluation parameters of the Tree-LSTM network for training on and predicting the critical temperature of organic compounds. In Fig. 3, × marks the predicted values and the straight line the experimental values; for most data points, the Tree-LSTM gives a good prediction.
Table 1 example of linear encoding of molecular feature descriptors
TABLE 2 statistical parameters for organic matter critical temperature training and prediction
TABLE 3 partial prediction of critical temperature of organic substances
The present invention was compared with representative group contribution methods, the Joback and Constantinou-Gani (CG) methods, on the same list of substances; the results are shown in Table 4:
TABLE 4 comparison of the prediction ability of the present invention with classical group contribution method
The substance list used for the comparison in Table 4 contains 460 substances, of which only 352 can be predicted by the Joback method; on those 352 substances, the present invention outperforms the Joback method. The CG method likewise covers fewer substances than the present invention, and its accuracy is slightly lower. When the present invention is applied to the full substance list, it covers 452 of the 460 substances with acceptable accuracy. The superscript a denotes all predictable substances; the superscript b denotes substances with more than 3 carbon atoms.
Claims (2)
1. A Tree-LSTM-based organic matter physicochemical property prediction method, characterized in that the molecular graph of an organic compound is converted into a canonical graph convenient for computer recognition and learning, so that the computer can capture the structural features of the molecule and correlate these features with the physical or chemical properties of the organic compound, finally realizing the prediction of material properties, the process comprising: step A, generating a prediction model; and step B, predicting physicochemical properties;
the step A comprises the following steps:
a1, acquiring experimental data of the physicochemical properties of the organic matters and molecular structure information of the organic matters, and capturing a large amount of data from various databases by using a web crawler technology;
a2 normalizing the structure of a single organic molecule (by a graph normalization algorithm), traversing each atom in the single organic molecule and generating a corresponding atom feature descriptor, ordering all the atom feature descriptors of the single organic molecule according to a lexicographic order, and taking the smallest atom feature descriptor as a molecule feature descriptor;
a3 generating molecular feature descriptors representing each representative molecular structure normalized graph and corresponding linear codes according to the obtained all organic molecular structures in the step A2;
a4 splitting all organic molecules into various chemical bonds, arranging character strings representing the chemical bonds according to each molecule, and generating word vectors for the character strings by adopting a word embedding algorithm;
a5 builds a neural network model based on Tree-LSTM, and loads the physicochemical data obtained from A1 and the molecular structure data processed by A2-A4, and the Tree-LSTM automatically adapts to the topological shape of the molecular structure normalized graph. Manually adjusting various hyper-parameters and training a model, and preferentially selecting parameters in the training process to obtain a Tree-LSTM-based organic matter physicochemical prediction model;
the step B comprises the following steps:
b1 processing the molecular structure of organic matter without experimental data of certain physicochemical properties by A2-A4 steps, loading the generated feature descriptor and code into the physicochemical property prediction model obtained by A5, and inputting the molecular feature descriptor to predict the data of unknown physicochemical properties.
2. The method for predicting the physicochemical property of organic substances based on Tree-LSTM according to claim 1, wherein the step A5 comprises the following steps:
a51: building a Tree-LSTM-based neural network under a Linux system or a Windows system; a52: setting the input dimension of Tree-LSTM and the length of input data; a53: setting the data quantity proportion of the Tree-LSTM training set and the test set; a54: setting a Tree-LSTM model optimizer and a learning rate; a55: setting the width of hidden layer neuron; a56: setting the iteration times of the model; a57: and continuously adjusting parameters, checking the convergence degree of the model according to the model loss, and preferentially selecting high-convergence parameters to form a physical and chemical property prediction model based on Tree-LSTM.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910500140.8A CN110600085B (en) | 2019-06-01 | 2019-06-01 | Tree-LSTM-based organic matter physicochemical property prediction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110600085A true CN110600085A (en) | 2019-12-20 |
CN110600085B CN110600085B (en) | 2024-04-09 |
Family
ID=68852617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910500140.8A Active CN110600085B (en) | 2019-06-01 | 2019-06-01 | Tree-LSTM-based organic matter physicochemical property prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110600085B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150017694A1 (en) * | 2008-11-06 | 2015-01-15 | Kiverdi, Inc. | Engineered CO2-Fixing Chemotrophic Microorganisms Producing Carbon-Based Products and Methods of Using the Same |
US20180137389A1 (en) * | 2016-11-16 | 2018-05-17 | Facebook, Inc. | Deep Multi-Scale Video Prediction |
CN108108836A (en) * | 2017-12-15 | 2018-06-01 | 清华大学 | A kind of ozone concentration distribution forecasting method and system based on space-time deep learning |
US20180211156A1 (en) * | 2017-01-26 | 2018-07-26 | The Climate Corporation | Crop yield estimation using agronomic neural network |
CN109033738A (en) * | 2018-07-09 | 2018-12-18 | 湖南大学 | A kind of pharmaceutical activity prediction technique based on deep learning |
CN109476721A (en) * | 2016-04-04 | 2019-03-15 | 英蒂分子公司 | CD8- specificity capturing agent, composition and use and preparation method |
US20190114320A1 (en) * | 2017-10-17 | 2019-04-18 | Tata Consultancy Services Limited | System and method for quality evaluation of collaborative text inputs |
Non-Patent Citations (1)
Title |
---|
Qin Qifeng et al., "Research on the Application of Deep Neural Networks in Chemistry", Jiangxi Chemical Industry (《江西化工》) * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111524557A (en) * | 2020-04-24 | 2020-08-11 | 腾讯科技(深圳)有限公司 | Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence |
CN111524557B (en) * | 2020-04-24 | 2024-04-05 | 腾讯科技(深圳)有限公司 | Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence |
CN111710375A (en) * | 2020-05-13 | 2020-09-25 | 中国科学院计算机网络信息中心 | Molecular property prediction method and system |
CN111710375B (en) * | 2020-05-13 | 2023-07-04 | 中国科学院计算机网络信息中心 | Molecular property prediction method and system |
CN111899807A (en) * | 2020-06-12 | 2020-11-06 | 中国石油天然气股份有限公司 | Molecular structure generation method, system, equipment and storage medium |
CN111899814A (en) * | 2020-06-12 | 2020-11-06 | 中国石油天然气股份有限公司 | Method, equipment and storage medium for calculating physical properties of single molecule and mixture |
CN115171807A (en) * | 2022-09-07 | 2022-10-11 | 合肥机数量子科技有限公司 | Molecular coding model training method, molecular coding method and molecular coding system |
CN115171807B (en) * | 2022-09-07 | 2022-12-06 | 合肥机数量子科技有限公司 | Molecular coding model training method, molecular coding method and molecular coding system |
Also Published As
Publication number | Publication date |
---|---|
CN110600085B (en) | 2024-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110600085B (en) | Tree-LSTM-based organic matter physicochemical property prediction method | |
Zhang et al. | An end-to-end deep learning architecture for graph classification | |
Peel et al. | Detecting change points in the large-scale structure of evolving networks | |
CN113299354B (en) | Small molecule representation learning method based on Transformer and enhanced interactive MPNN neural network | |
Burnaev et al. | Efficient design of experiments for sensitivity analysis based on polynomial chaos expansions | |
CN113722877A (en) | Method for online prediction of temperature field distribution change during lithium battery discharge | |
Chandra et al. | A multivariate time series clustering approach for crime trends prediction | |
Li et al. | Four Methods to Estimate Minimum Miscibility Pressure of CO2‐Oil Based on Machine Learning | |
CN114861928B (en) | Quantum measurement method and device and computing equipment | |
Wang et al. | Time-weighted kernel-sparse-representation-based real-time nonlinear multimode process monitoring | |
WO1997025676A1 (en) | Time-series signal predicting apparatus | |
CN115759461A (en) | Internet of things-oriented multivariate time sequence prediction method and system | |
Tuli et al. | FlexiBERT: Are current transformer architectures too homogeneous and rigid? | |
Li et al. | Deep reliability learning with latent adaptation for design optimization under uncertainty | |
CN116894180B (en) | Product manufacturing quality prediction method based on different composition attention network | |
CN113674807A (en) | Molecular screening method based on deep learning technology qualitative and quantitative model | |
CN116302088B (en) | Code clone detection method, storage medium and equipment | |
CN107220483B (en) | Earth temperature mode prediction method | |
Hellström et al. | High-dimensional neural network potentials for atomistic simulations | |
Li et al. | Using modified lasso regression to learn large undirected graphs in a probabilistic framework | |
McWilliams et al. | A PRESS statistic for two-block partial least squares regression | |
Ihme et al. | On the optimization of artificial neural networks for application to the approximation of chemical systems | |
CN116884536B (en) | Automatic optimization method and system for production formula of industrial waste residue bricks | |
CN111563623B (en) | Typical scene extraction method and system for wind power system planning | |
CN113779884B (en) | Detection method for service life of recovered chip |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||