CN110600085A - Organic matter physicochemical property prediction method based on Tree-LSTM - Google Patents
- Publication number
- CN110600085A (application CN201910500140.8A)
- Authority
- CN
- China
- Prior art keywords
- tree
- lstm
- organic
- molecular
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Abstract
A Tree-LSTM-based method for predicting the physicochemical properties of organic compounds comprises two parts: generating the prediction model and predicting physicochemical properties. Generating the prediction model comprises the following steps: 1) the molecular structure of the organic compound is normalized and encoded, and a tree-shaped data structure (the molecular feature descriptor) is generated; 2) a Tree-LSTM model is trained on the molecular feature descriptors and experimental physicochemical-property data of organic compounds to obtain a Tree-LSTM-based physicochemical property prediction model. Predicting physicochemical properties comprises: normalizing and encoding the molecular structure and inputting it into the prediction model to obtain the predicted physicochemical properties. The invention enables a computer to automatically extract the relationship between molecular structure and physicochemical properties, is well suited to learning the structural information of a wide range of organic molecules, and yields better prediction results.
Description
Technical Field
The invention relates to the field of chemistry (IPC class C07), and in particular to a method for predicting quantitative structure-property relationships of chemical substances based on artificial intelligence technology.
Background
Physicochemical properties are basic data closely tied to chemistry and chemical engineering. Critical properties, the normal boiling point, the heat of formation, the octanol-water partition coefficient and the like are central to chemical research and production practice, and scientifically sound predicted values for such properties can reduce experimental measurement work and save considerable manpower and material resources. Experimental determination of physicochemical properties is often difficult because of harsh measurement conditions or objective factors such as easy decomposition of the measured substance, so these properties are currently estimated mainly by the group contribution method and by topological-index methods based on multiple linear regression. However, both approaches require manual extraction of molecular structure features before prediction, which limits their range of application.
The Tree-LSTM recurrent neural network is an improvement on the LSTM (Long Short-Term Memory) recurrent neural network. It can learn dependency relationships more complex than sequential structure and can autonomously learn, from the input data, the contribution of a molecule's tree topology to the predicted quantity. In particular, it overcomes the inability of other neural networks to reproduce the atom connectivity within a molecule, and is therefore better suited to mining the implicit relationship between a molecular structure and its physicochemical properties. The existing group contribution method must decompose molecules into groups (molecular substructure fragments) and apply multivariate linear fitting to predict the physicochemical properties of organic compounds; different group contribution methods use different decomposition schemes, and for some molecules no suitable decomposition exists, so the prediction is biased or cannot be completed. Existing topological-index methods are limited by the complexity of computing the indices and by their inability to represent local molecular structure intuitively, so they cannot predict a broad range of physicochemical properties. To date, no method has predicted the physicochemical properties of organic compounds using a Tree-LSTM recurrent neural network alone.
Disclosure of Invention
The invention provides a Tree-LSTM-based method for predicting the physicochemical properties of organic compounds, addressing the narrow prediction range, limited substance coverage and low prediction accuracy of the prior art.
In order to solve the problems, the invention adopts the following technical scheme:
the method comprises two parts: step A, generating the prediction model; and step B, predicting physicochemical properties;
the step A comprises the following steps:
a1, acquiring experimental data of the physicochemical properties of the organic matters and molecular structure information of the organic matters, and capturing a large amount of data from various databases by using a web crawler technology;
a2 normalizing the structure of a single organic molecule (by a graph normalization algorithm), traversing each atom in the single organic molecule and generating a corresponding atom feature descriptor, ordering all the atom feature descriptors of the single organic molecule according to a lexicographic order, and taking the smallest atom feature descriptor as a molecule feature descriptor;
a3 generating molecular feature descriptors representing each representative molecular structure normalized graph and corresponding linear codes according to the obtained all organic molecular structures in the step A2;
a4 splitting all organic molecules into various chemical bonds, arranging character strings representing the chemical bonds according to each molecule, and generating word vectors for the character strings by adopting a word embedding algorithm;
a5 build a Tree-LSTM-based neural network model and load the physicochemical data obtained in A1 and the molecular structure data processed in A2-A4; the Tree-LSTM automatically adapts to the topology of the normalized molecular-structure graph. Manually adjust the hyper-parameters, train the model, and select the best-performing parameters during training to obtain the Tree-LSTM-based organic matter physicochemical prediction model;
the step B comprises the following steps:
b1 processing the molecular structure of organic matter without experimental data of certain physicochemical properties by A2-A4 steps, loading the generated feature descriptor and code into the physicochemical property prediction model obtained by A5, and inputting the molecular feature descriptor to predict the data of unknown physicochemical properties.
As a further refinement, said step a5 comprises the following:
a51: building a Tree-LSTM model under a Linux system or a Windows system;
a52: setting the input dimension of Tree-LSTM and the length of input data; a53: setting the data quantity proportion of the Tree-LSTM training set and the test set; a54: setting a Tree-LSTM model optimizer and a learning rate; a55: setting the width of hidden layer neuron; a56: setting the iteration times of the model; a57: and continuously adjusting parameters, checking the convergence degree of the model according to the model loss, and preferentially selecting high-convergence parameters to form a physical and chemical property prediction model based on Tree-LSTM.
Drawings
FIG. 1 is a flow chart of the process for predicting the physicochemical properties of organic substances according to the present invention;
FIG. 2 is a computational graph of the Tree-LSTM recurrent neural network when predicting the properties of acetaldoxime;
FIG. 3 is a graph of the prediction effect of the Tree-LSTM physicochemical property prediction model on the critical temperature of organic compounds, where × marks the predicted values and the straight line the actual values.
FIG. 4 illustrates molecular feature descriptor generation using acetaldoxime as an example.
FIG. 5 shows the coding rule of the molecular feature descriptor, illustrating the meaning of each position in the code.
Detailed Description
The invention will be described in detail with reference to the accompanying drawings and specific examples, it being understood that the examples described below are for the purpose of facilitating an understanding of the invention only and are not intended to limit the invention itself in any way.
The invention provides a method for predicting the physicochemical property of an organic matter based on Tree-LSTM, which comprises the following two steps as shown in figure 1: step A, generating a prediction model; b, predicting two parts of physical and chemical properties;
step A, generating a prediction model:
a1 obtains experimental data of the physicochemical properties of the organic matter and the molecular structure information of the organic matter, and captures a large amount of data from various databases by using a web crawler technology.
A11: the physicochemical properties of organic compounds mainly include critical properties, normal boiling point, transport properties, autoignition temperature, flash point, toxicity, octanol-water partition coefficient, biochemical activity, and the like.
The molecular structure information of A12 mainly takes SMILES expression, SMARTS expression, MOL file and SDF file as carriers.
A2 standardizes the structure of a single organic molecule, traverses each atom in the single organic molecule and generates corresponding atom feature descriptors, sorts all the atom feature descriptors of the single organic molecule according to a lexicographic order, and takes the smallest atom feature descriptor as a molecule feature descriptor and encodes the molecule feature descriptor.
A21: generate a canonical graph from the two-dimensional topological graph of the organic molecule using a graph canonicalization algorithm from graph theory (for example, the Nauty or Faulon algorithm), enabling isomorphism comparison of molecular graphs.
The A22 encoding method is as follows:
the first method, which directly uses the molecular feature descriptor outputted by the Faulon normalization algorithm as the coding of the organic matter, is exemplified in fig. 4.
The second method, encoding the molecular feature descriptors in a linear encoded format, is exemplified in table 1.
And A3 generating a molecular feature descriptor and a corresponding code of each molecule according to the obtained molecular structure information of all the organic matters in the step A2.
A4 splits all organic molecules into various chemical bonds, arranges character strings representing the chemical bonds according to each molecule, and generates word vectors by using a word embedding algorithm for the character strings.
A5 builds a Tree-LSTM-based neural network model, loads physical and chemical data obtained from A1 and molecular structure data processed by A2-A4, continuously adjusts parameters, and preferentially selects the parameters to obtain a Tree-LSTM-based organic matter physical and chemical prediction model.
The step B comprises the following steps:
b1 processing the molecular structure of the organic matter without some physicochemical property experimental data by adopting the steps of A2-A4, loading the generated characteristic descriptors and codes into a physicochemical property prediction model to obtain the data of unknown physicochemical property;
step a4 further includes the following:
a41: traversing each molecule in the database, traversing the connected chemical bond and atom with each atom in each molecule as a starting point, forming a character string like 'A-B', and recording to form original data. Description of the drawings: "A" represents the symbol of the element of the atom A, "B" represents the symbol of the element of the atom B, and "-" represents the type of chemical bond between the atom A and the atom B.
A42: splitting a character string in the form of 'A-B' in original data to form a sub-character string set of three combination modes: the combination is as follows: "A" and "-B", in combination two: "A-" and "B", in combination three: "A", "-", and "B".
A43: and (3) building a neural network based on a skip-gram algorithm under a Linux system or a Windows system, and obtaining an embedded vector representing each character string in the character string set obtained by A42.
As a further refinement, said step a5 comprises the following:
a51: building a Tree-LSTM model under a Linux system or a Windows system;
a52: the feature descriptors or linear encodings of each numerator are parsed into a tree-like data structure and a corresponding embedded vector obtained by a4 is matched for each node in the tree-like structure (for each atom in the numerator).
A52: setting the input dimension of Tree-LSTM and the length of input data; the input dimension in the present invention is 1 and the length is 50.
A53: setting the data quantity proportion of the Tree-LSTM training set and the test set; the ratio in the present invention is 4: 1.
A54: setting a Tree-LSTM model optimizer and a learning rate; the method adopts an Adam algorithm optimizer, and the learning rate is 0.001:
a55: setting the width of each hidden layer neuron;
a56: setting the iteration times of the model;
a57: and adjusting the number of the cryptomelanic ganglion points under the same iteration number, adjusting the iteration number under the same number of the cryptomelanic ganglion points, checking the convergence degree of the model according to the overall loss and the iteration loss of the model, and preferentially selecting a high-convergence parameter to form a physical and chemical property prediction model based on Tree-LSTM.
The structure of the Tree-LSTM neural network is shown in FIG. 2.
The Tree-LSTM has two mathematical models: the child-sum model and the child-independent model.
The core of the Tree-LSTM is the control of the cell state c_j, carried out by the forget gates f_jk, the input gate i_j and the output gate o_j. For the current node j, the forget gate f_jk controls how much of child node k's cell state c_k is kept in the current node's cell state c_j; the input gate i_j controls how much of the current node's instantaneous state enters the current cell state c_j; the current input-cell state u_j controls how much new node information is added; and the output gate o_j controls how much of the current cell state c_j is emitted as the node's hidden-layer output h_j. The equations of the child-sum model are:
h̃_j = Σ_{k∈C(j)} h_k (1)
f_jk = σ(W^(f) x_j + U^(f) h_k + b^(f)) (2)
i_j = σ(W^(i) x_j + U^(i) h̃_j + b^(i)) (3)
u_j = tanh(W^(u) x_j + U^(u) h̃_j + b^(u)) (4)
o_j = σ(W^(o) x_j + U^(o) h̃_j + b^(o)) (5)
c_j = i_j ⊙ u_j + Σ_{k∈C(j)} f_jk ⊙ c_k (6)
h_j = o_j ⊙ tanh(c_j) (7)
where C(j) denotes the set of child nodes of node j.
Here W^(f), W^(i), W^(o) are the weight matrices of the forget gate, input gate and output gate acting on x_j, U^(f), U^(i), U^(o) the corresponding matrices acting on the child hidden states, b^(f), b^(i), b^(o) the bias terms of the three gates, and σ is the sigmoid function. The equations of the child-independent model are:
i_j = σ(W^(i) x_j + Σ_{l=1}^{N} U_l^(i) h_jl + b^(i)) (10)
f_jl = σ(W^(f) x_j + Σ_{m=1}^{N} U_{lm}^(f) h_jm + b^(f)) (11)
u_j = tanh(W^(u) x_j + Σ_{l=1}^{N} U_l^(u) h_jl + b^(u)) (12)
o_j = σ(W^(o) x_j + Σ_{l=1}^{N} U_l^(o) h_jl + b^(o)) (13)
c_j = i_j ⊙ u_j + Σ_{l=1}^{N} f_jl ⊙ c_jl (14)
h_j = o_j ⊙ tanh(c_j) (15)
where N is the number of child positions and each child position l has its own parameter matrices U_l.
The difference between the two models lies in the treatment of the child hidden states h_jl: the child-independent model assigns a separate parameter matrix to each child's h_jl, while the child-sum model sums the children's h_jl and trains a single set of parameters on the sum.
The Tree-LSTM recurrent neural network structure is shown in Fig. 2. The inputs of the LSTM unit are: the cell states c_jl of the child nodes, the hidden-layer outputs h_jl of the child nodes, and the input value x_j of the current node. The outputs of the LSTM unit are: the current cell state c_j and the current hidden-layer output h_j.
The current input-cell state u_j is computed from the current node's input x_j and the hidden-layer outputs h_jl of the child nodes (in the child-sum model, from the sum of the children's h_jl), as given by equation (4) or (12).
Here W^(u) is the weight matrix of the input-cell state, b^(u) its bias term, and tanh the hyperbolic tangent function. The current cell state c_j is jointly determined by the forget gates f_jl (acting on the child cell states c_jl), the input gate i_j and the current input-cell state u_j, as given by equation (6) or (14), where ⊙ denotes element-wise multiplication. The hidden-layer output h_j of the current node is computed from the output gate o_j and the current cell state c_j, as given by equation (7) or (15).
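As an illustration, the child-sum cell described above can be sketched in NumPy. The weight shapes, the 0.1 initialization scale and the dimensions D=4, H=3 are arbitrary choices for the sketch, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 3  # input and hidden sizes (arbitrary for the sketch)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight/bias set per gate: W acts on x_j, U on child hidden states
W = {g: rng.standard_normal((H, D)) * 0.1 for g in "ifou"}
U = {g: rng.standard_normal((H, H)) * 0.1 for g in "ifou"}
b = {g: np.zeros(H) for g in "ifou"}

def child_sum_cell(x_j, child_h, child_c):
    """One child-sum Tree-LSTM step; child_h/child_c are lists (may be empty)."""
    h_tilde = np.sum(child_h, axis=0) if child_h else np.zeros(H)
    i = sigmoid(W["i"] @ x_j + U["i"] @ h_tilde + b["i"])
    o = sigmoid(W["o"] @ x_j + U["o"] @ h_tilde + b["o"])
    u = np.tanh(W["u"] @ x_j + U["u"] @ h_tilde + b["u"])
    # one forget gate per child, computed from that child's own h_k
    f = [sigmoid(W["f"] @ x_j + U["f"] @ h_k + b["f"]) for h_k in child_h]
    c = i * u + sum(fk * ck for fk, ck in zip(f, child_c))
    h = o * np.tanh(c)
    return h, c

# leaf nodes (no children), then a parent combining two leaves
h1, c1 = child_sum_cell(rng.standard_normal(D), [], [])
h2, c2 = child_sum_cell(rng.standard_normal(D), [], [])
h, c = child_sum_cell(rng.standard_normal(D), [h1, h2], [c1, c2])
print(h.shape, c.shape)  # (3,) (3,)
```

Because the cell is applied bottom-up along whichever tree it is given, the same parameters serve molecules of any topology, which is what makes the network "dynamic" in the sense used later in this document.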
The Tree-LSTM neural network output is determined by a single-layer or multi-layer neural network, for example, the single-layer neural network is used as an output layer, and the calculation formula is as follows:
p_i = w · h_j + b (16)
where the property p_i of the i-th compound is obtained from the Tree-LSTM output h_j of the root node of the tree structure represented by the compound's molecular feature descriptor, and w and b are trainable parameters.
In the invention, the mean squared error (MSE) or the mean absolute error (MAE) is used as the loss function (loss):
MSE = (1/N) Σ_{n=1}^{N} (x_exp,n − x_pred,n)² (17)
MAE = (1/N) Σ_{n=1}^{N} |x_exp,n − x_pred,n| (18)
where N is the number of samples, x_exp is the observed (experimental) value and x_pred is the predicted value.
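The two loss functions are straightforward to express in NumPy:

```python
import numpy as np

def mse(x_exp, x_pred):
    """Mean squared error: (1/N) * sum((x_exp - x_pred)**2)."""
    x_exp, x_pred = np.asarray(x_exp, float), np.asarray(x_pred, float)
    return float(np.mean((x_exp - x_pred) ** 2))

def mae(x_exp, x_pred):
    """Mean absolute error: (1/N) * sum(|x_exp - x_pred|)."""
    x_exp, x_pred = np.asarray(x_exp, float), np.asarray(x_pred, float)
    return float(np.mean(np.abs(x_exp - x_pred)))

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # 4/3 ≈ 1.333
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # 2/3 ≈ 0.667
```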
Examples of the experiments
The effect of the Tree-LSTM-based physicochemical property prediction method is illustrated below, taking the critical temperature of organic compounds as an example. This property serves as basic data for various thermodynamic and property-estimation models, so its prediction is both practical and representative.
Experimental critical-temperature data and the corresponding molecular structure information were obtained for 1759 organic substances in total; 1407 substances were used as the training set and 352 as the test set.
The construction of the molecular feature descriptor is illustrated using the substance acetaldoxime from the sample; see Fig. 4 for details. The molecular feature descriptor is a data structure that stores molecular structure information: an atom of the molecule is selected as the starting point, and the molecule is expanded from it as a tree. For acetaldoxime in this example, the root atom is the carbon atom labelled 0. Starting from this root atom C0, the tree is searched downward to a predetermined distance (height), and the atoms encountered on each path, together with the types of the chemical bonds attached to them, are recorded as the features of the molecule; traversing all atoms reachable from the root atom yields one atom feature descriptor. Taking different atoms as the root produces different atom feature descriptors; these are sorted in lexicographic order, and the smallest is taken as the molecular feature descriptor. Fig. 4 shows, for acetaldoxime: (A) the molecular structure; (B) the tree expansion of the molecular structure; and (C) the atom feature descriptors at height 0 and height 1. The children of an atom are written in nested brackets; when no bond type is given, a single bond between the atoms is implied in the atom feature descriptor; otherwise the bond is marked as follows: "=" is a double bond, "#" a triple bond, and ":" an aromatic bond.
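The tree-expansion procedure just described can be sketched for the heavy-atom skeleton of acetaldoxime (CH3-CH=N-OH). The bracket grammar below is an assumption modeled on the description, not the patent's exact format, and hydrogens are omitted:

```python
# Hedged sketch of descriptor generation (cf. Fig. 4): expand the molecule
# as a tree from a root atom, record atoms and bond types with nested
# brackets (single bonds unmarked, '=' double, '#' triple, ':' aromatic),
# and take the lexicographically smallest atom descriptor as the molecular
# feature descriptor.

BOND_MARK = {"-": "", "=": "=", "#": "#", ":": ":"}

def atom_label(atom):
    return atom.rstrip("0123456789")  # "C0" -> "C"

def descriptor(adj, atom, height, parent=None):
    """Atom feature descriptor rooted at `atom`, truncated at `height`."""
    if height < 0:
        return ""
    parts = []
    for nbr, bond in adj[atom]:
        if nbr == parent:
            continue
        sub = descriptor(adj, nbr, height - 1, parent=atom)
        if sub:
            parts.append("(" + BOND_MARK[bond] + sub + ")")
    return atom_label(atom) + "".join(sorted(parts))

# acetaldoxime heavy-atom skeleton: C0-C1=N2-O3
adj = {
    "C0": [("C1", "-")],
    "C1": [("C0", "-"), ("N2", "=")],
    "N2": [("C1", "="), ("O3", "-")],
    "O3": [("N2", "-")],
}
per_atom = {a: descriptor(adj, a, 2) for a in adj}
molecular = min(per_atom.values())  # lexicographically smallest
print(per_atom["C0"], "|", molecular)
```

Under this toy grammar, the descriptor rooted at C0 at height 2 is `C(C(=N))`, and the lexicographically smallest per-atom descriptor, taken as the molecular feature descriptor, is `C(=N(O))(C)`.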
To store molecular feature descriptors conveniently, the invention develops a linear encoding that represents the tree expansion of a molecule. The linear encodings of the molecular feature descriptors of acetaldoxime at several depths are shown in Table 1; the atoms in the string are separated by "|", and the meaning of each digit and letter within an atom's code is given in Fig. 5. The first atom is the root atom, whose current depth of 0 is denoted "S"; since it has no parent atom, its parent-atom field is encoded "S", and since it has no chemical bond to a parent, that field is likewise encoded "S".
All 1759 organic substances were converted into molecular feature descriptors and linearly encoded. Before input to the neural network, each substance is parsed into a tree structure and each node (atom) is associated with the embedded vector obtained in step A43. For each molecule in the sample, each atom corresponds to a node of the Tree-LSTM network, and the atom's embedded vector is that node's input vector. With 300 iterations, the number of output-layer nodes was adjusted repeatedly, and 128 output-layer nodes proved a good value in this example. The Tree-LSTM network structure is determined by each organic molecule: it is a dynamic neural network that adapts to the topology of different molecules. In this example, the learning rate was 0.008 for the first 300 training rounds and was then reduced to 0.00001 for 5000 further rounds. To prevent overfitting, the calculation is stopped early once the loss function value no longer decreases. The prediction results are given in Table 3; the closer the experimental and predicted values in the table, the better the prediction. Table 2 gives the statistical evaluation parameters of the Tree-LSTM network for training on and predicting the critical temperature of organic compounds. In Fig. 3, × marks the predicted values and the straight line the experimental values; for most data points, the Tree-LSTM gives a good prediction.
Table 1 example of linear encoding of molecular feature descriptors
TABLE 2 statistical parameters for organic matter critical temperature training and prediction
TABLE 3 partial prediction of critical temperature of organic substances
The present invention was compared with representative group contribution methods, the Joback and Constantinou-Gani (CG) methods, on the same list of substances; the results are shown in Table 4:
TABLE 4 comparison of the prediction ability of the present invention with classical group contribution method
The substance list used for the comparison in Table 4 contains 460 substances, of which only 352 can be predicted by the Joback method; on those 352 substances, the present invention outperforms the Joback method. The CG method likewise covers fewer substances than the present invention, and its accuracy is slightly lower. When the present invention is applied to the full substance list, it covers 452 of the 460 substances with acceptable accuracy. The superscript a denotes all predictable substances; the superscript b denotes substances with more than 3 carbon atoms.
Claims (2)
1. A Tree-LSTM-based organic matter physicochemical property prediction method, characterized in that the molecular graph of an organic compound is converted into a canonical graph convenient for computer recognition and learning, so that the computer can capture the structural features of the molecule and correlate these features with the physical or chemical properties of the organic compound, finally realizing the prediction of material properties, the process comprising: step A, generating a prediction model; and step B, predicting physicochemical properties;
the step A comprises the following steps:
a1, acquiring experimental data of the physicochemical properties of the organic matters and molecular structure information of the organic matters, and capturing a large amount of data from various databases by using a web crawler technology;
a2 normalizing the structure of a single organic molecule (by a graph normalization algorithm), traversing each atom in the single organic molecule and generating a corresponding atom feature descriptor, ordering all the atom feature descriptors of the single organic molecule according to a lexicographic order, and taking the smallest atom feature descriptor as a molecule feature descriptor;
a3 generating molecular feature descriptors representing each representative molecular structure normalized graph and corresponding linear codes according to the obtained all organic molecular structures in the step A2;
a4 splitting all organic molecules into various chemical bonds, arranging character strings representing the chemical bonds according to each molecule, and generating word vectors for the character strings by adopting a word embedding algorithm;
a5 builds a neural network model based on Tree-LSTM, and loads the physicochemical data obtained from A1 and the molecular structure data processed by A2-A4, and the Tree-LSTM automatically adapts to the topological shape of the molecular structure normalized graph. Manually adjusting various hyper-parameters and training a model, and preferentially selecting parameters in the training process to obtain a Tree-LSTM-based organic matter physicochemical prediction model;
the step B comprises the following steps:
b1 processing the molecular structure of organic matter without experimental data of certain physicochemical properties by A2-A4 steps, loading the generated feature descriptor and code into the physicochemical property prediction model obtained by A5, and inputting the molecular feature descriptor to predict the data of unknown physicochemical properties.
2. The method for predicting the physicochemical property of organic substances based on Tree-LSTM according to claim 1, wherein the step A5 comprises the following steps:
a51: building a Tree-LSTM-based neural network under a Linux system or a Windows system; a52: setting the input dimension of Tree-LSTM and the length of input data; a53: setting the data quantity proportion of the Tree-LSTM training set and the test set; a54: setting a Tree-LSTM model optimizer and a learning rate; a55: setting the width of hidden layer neuron; a56: setting the iteration times of the model; a57: and continuously adjusting parameters, checking the convergence degree of the model according to the model loss, and preferentially selecting high-convergence parameters to form a physical and chemical property prediction model based on Tree-LSTM.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910500140.8A CN110600085B (en) | 2019-06-01 | 2019-06-01 | Tree-LSTM-based organic matter physicochemical property prediction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110600085A true CN110600085A (en) | 2019-12-20 |
CN110600085B CN110600085B (en) | 2024-04-09 |
Family
ID=68852617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910500140.8A Active CN110600085B (en) | 2019-06-01 | 2019-06-01 | Tree-LSTM-based organic matter physicochemical property prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110600085B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150017694A1 (en) * | 2008-11-06 | 2015-01-15 | Kiverdi, Inc. | Engineered CO2-Fixing Chemotrophic Microorganisms Producing Carbon-Based Products and Methods of Using the Same |
US20180137389A1 (en) * | 2016-11-16 | 2018-05-17 | Facebook, Inc. | Deep Multi-Scale Video Prediction |
CN108108836A (en) * | 2017-12-15 | 2018-06-01 | 清华大学 | A kind of ozone concentration distribution forecasting method and system based on space-time deep learning |
US20180211156A1 (en) * | 2017-01-26 | 2018-07-26 | The Climate Corporation | Crop yield estimation using agronomic neural network |
CN109033738A (en) * | 2018-07-09 | 2018-12-18 | 湖南大学 | A kind of pharmaceutical activity prediction technique based on deep learning |
CN109476721A (en) * | 2016-04-04 | 2019-03-15 | 英蒂分子公司 | CD8- specificity capturing agent, composition and use and preparation method |
US20190114320A1 (en) * | 2017-10-17 | 2019-04-18 | Tata Consultancy Services Limited | System and method for quality evaluation of collaborative text inputs |
Non-Patent Citations (1)
Title |
---|
Qin Qifeng et al., "Research on the Application of Deep Neural Networks in Chemistry", Jiangxi Chemical Industry (《江西化工》) * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111524557A (en) * | 2020-04-24 | 2020-08-11 | 腾讯科技(深圳)有限公司 | Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence |
CN111524557B (en) * | 2020-04-24 | 2024-04-05 | 腾讯科技(深圳)有限公司 | Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence |
CN111710375A (en) * | 2020-05-13 | 2020-09-25 | 中国科学院计算机网络信息中心 | Molecular property prediction method and system |
CN111710375B (en) * | 2020-05-13 | 2023-07-04 | 中国科学院计算机网络信息中心 | Molecular property prediction method and system |
CN111899807A (en) * | 2020-06-12 | 2020-11-06 | 中国石油天然气股份有限公司 | Molecular structure generation method, system, equipment and storage medium |
CN111899814A (en) * | 2020-06-12 | 2020-11-06 | 中国石油天然气股份有限公司 | Method, equipment and storage medium for calculating physical properties of single molecule and mixture |
CN115171807A (en) * | 2022-09-07 | 2022-10-11 | 合肥机数量子科技有限公司 | Molecular coding model training method, molecular coding method and molecular coding system |
CN115171807B (en) * | 2022-09-07 | 2022-12-06 | 合肥机数量子科技有限公司 | Molecular coding model training method, molecular coding method and molecular coding system |
Also Published As
Publication number | Publication date |
---|---|
CN110600085B (en) | 2024-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110600085B (en) | Tree-LSTM-based organic matter physicochemical property prediction method | |
Zhang et al. | An end-to-end deep learning architecture for graph classification | |
Peel et al. | Detecting change points in the large-scale structure of evolving networks | |
CN113299354B (en) | Small molecule representation learning method based on Transformer and enhanced interactive MPNN neural network | |
Burnaev et al. | Efficient design of experiments for sensitivity analysis based on polynomial chaos expansions | |
CN113722877A (en) | Method for online prediction of temperature field distribution change during lithium battery discharge | |
Chandra et al. | A multivariate time series clustering approach for crime trends prediction | |
Li et al. | Four Methods to Estimate Minimum Miscibility Pressure of CO2‐Oil Based on Machine Learning | |
CN114861928B (en) | Quantum measurement method and device and computing equipment | |
Wang et al. | Time-weighted kernel-sparse-representation-based real-time nonlinear multimode process monitoring | |
WO1997025676A1 (en) | Time-series signal predicting apparatus | |
CN115759461A (en) | Internet of things-oriented multivariate time sequence prediction method and system | |
Tuli et al. | FlexiBERT: Are current transformer architectures too homogeneous and rigid? | |
Li et al. | Deep reliability learning with latent adaptation for design optimization under uncertainty | |
CN116894180B (en) | Product manufacturing quality prediction method based on different composition attention network | |
CN113674807A (en) | Molecular screening method based on deep learning technology qualitative and quantitative model | |
CN116302088B (en) | Code clone detection method, storage medium and equipment | |
CN107220483B (en) | Earth temperature mode prediction method | |
Hellström et al. | High-dimensional neural network potentials for atomistic simulations | |
Li et al. | Using modified lasso regression to learn large undirected graphs in a probabilistic framework | |
McWilliams et al. | A PRESS statistic for two-block partial least squares regression | |
Ihme et al. | On the optimization of artificial neural networks for application to the approximation of chemical systems | |
CN116884536B (en) | Automatic optimization method and system for production formula of industrial waste residue bricks | |
CN111563623B (en) | Typical scene extraction method and system for wind power system planning | |
CN113779884B (en) | Detection method for service life of recovered chip |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||