WO2023133678A1 - Method for predicting chemical reaction - Google Patents


Info

Publication number
WO2023133678A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
data set
data
confidence
prediction
Prior art date
Application number
PCT/CN2022/071283
Other languages
French (fr)
Chinese (zh)
Inventor
陈德铭
马汝建
陈志刚
李革
Original Assignee
上海药明康德新药开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海药明康德新药开发有限公司
Priority to PCT/CN2022/071283
Publication of WO2023133678A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/10 Analysis or design of chemical reactions, syntheses or processes
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures

Definitions

  • the present application relates to the field of computer technology, in particular to a method and device for predicting chemical reaction products.
  • the present invention discloses a method for predicting chemical reaction products, the method comprising:
  • Step 1: Obtain one or more machine models that can generate reaction predictions and output a confidence for each prediction; for the given chemical reactions in the original data set, calculate the confidence of the predicted product under each model, aggregate these into an overall confidence across all models, and select the reaction data whose confidence is below a threshold to obtain the first data set D1, where the threshold is any number from 0.3 to 0.9, preferably 0.4 to 0.8, more preferably 0.5 to 0.7, such as about 0.5, 0.6, or 0.7.
  • Step 2: Provide a second data set D2; for each chemical reaction W in D2, calculate its similarity sim(w,v) to a chemical reaction V in D1, select from D2 the similar reactions with sim(w,v) greater than or equal to a threshold as supplementary data, and take their union to obtain the third data set D3, where the threshold is any value from 0.1 to 1, preferably 0.3 to 0.8, more preferably 0.5 to 0.8, such as about 0.6, 0.7, or 0.8.
  • Step 3: Merge the D3 data into the original data set, or use the D3 data alone, and retrain the model.
  • the machine translation converter (Transformer) is selected as the original training model.
  • the machine model can be replaced by other models based on deep neural networks.
  • t represents the tth model snapshot
  • K is the number of collected model snapshots
  • X represents the reactant of the chemical reaction
  • Y represents the product of the reaction
  • p represents the probability that the model outputs Y when X and θt are known
  • Y max is the predicted product of the model; arg max selects the Y i with the highest probability.
  • the overall confidence in step 1 can be characterized as the mean value mean(confidence(X, θt)), or the maximum value max(confidence(X, θt)), or another statistical operation readily available to those skilled in the art.
  • step 2 the amount of D3 data is less than the original training data D0, and for each reaction in D1, the number of reactions supplemented by D3 can be kept within one hundred, preferably |D3| ≤ |D0| and |D3| ≤ 50 × |D1|.
  • step 3 in the original training data D0, randomly sample R times the amount of data of D3, merge with D3 to generate a new data set, and retrain the re-initialized machine model parameters; R can be selected from the range [0.5, max(1, |D0|/|D3|)].
  • step 3 D3 is used to generate a new data set, and the re-initialized machine model parameters are retrained.
  • Fig. 1 shows a schematic flow diagram of a method for predicting compound reaction products.
  • Fig. 2 shows the number of similar neighboring reactions for 7 incorrectly predicted reactions.
  • Fig. 3 shows the number of similar neighboring reactions for 12 correctly predicted reactions.
  • the present invention discloses a method for predicting chemical reaction products, said method comprising:
  • Step 1 Based on the original training model, predict the reaction products of different reactions and calculate the "under-learned" reactions whose reliability is lower than the threshold, screen these data and form the first data set D1.
  • Step 2 Screen similar reactions to the "under-learned" chemical reactions as the third data set D3.
  • Step 3 Merge the D3 data into the original data set and retrain the model.
  • step 1 first obtain one or more machine models that can generate reaction predictions and output a confidence for each prediction, then calculate the confidence of the predicted product under each model for the given chemical reactions in the original data set, aggregate these into an overall confidence across all models, and finally select the reaction data whose confidence is below a threshold, for example 0.5, to obtain the first data set D1.
  • confidence = p(Y|X, θt)
  • the overall confidence can be characterized as the mean value mean(confidence(X, θt)), or the maximum value max(confidence(X, θt)), or another statistical operation readily available to those skilled in the art.
  • the machine model can be replaced by other deep neural networks, and the same Softmax is used to calculate the output layer, but the symbol form of the output element is changed.
  • step 2 first provide the second data set D2, then calculate the similarity sim(w,v) between each chemical reaction W in D2 and a chemical reaction V in D1, and finally select from D2 the similar reactions with sim(w,v) greater than or equal to the threshold as supplementary data and take their union to obtain the third data set D3, where the threshold is any value between 0.1 and 1.
  • the amount of D3 data is less than the original training data D0, and for each reaction in D1, the number of reactions added by D3 can be kept within one hundred, preferably |D3| ≤ |D0| and |D3| ≤ 50 × |D1|.
  • sim(w,v) can be implemented as the normalized reciprocal of its Euclidean distance (+1 to avoid the divisor being 0), or the normalized similarity that can be grasped by those skilled in the art:
  • step 3 merge the D3 data into the original data set, and retrain the model.
  • randomly sample R times the data volume of D3 from the original training data D0, merge with D3 to generate a new data set, and then retrain the re-initialized machine model parameters; R can be selected from the range [0.5, max(1, |D0|/|D3|)].
  • D3 is used for fine-tuning, that is, the model θt is trained for a further F ≥ 1 iterations on the D3 data, and the model parameters continue to be updated.
  • D0 consists of 400,000 training reactions from the public United States Patent and Trademark Office (USPTO) data set, and the machine-translation Transformer (Philippe Schwaller et al. Molecular transformer) is selected.
  • the machine model can be replaced by other deep-neural-network-based models (Coley, Connor W., et al. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical Science 10.2 (2019): 370-377; John Bradshaw, Matt J. Kusner, Brooks Paige, Marwin H.S. Segler, José Miguel Hernández-Lobato, A Generative Model For Electron Paths, https://arxiv.org/abs/1805.10970), in which case only the symbolic form of the output elements changes.
  • the total number, θt corresponds to the model of the t-th epoch iteration.
  • an epoch may also be set as a certain number of iteration intervals, such as every 1000 iterations as an epoch.
  • X represents the reactant of the chemical reaction
  • Y represents the product of the reaction.
  • the confidence is confidence = p(Y|X, θt)
  • i indexes the i-th highest-scoring output prediction obtained by the model through beam search, i ≤ 10.
  • the specific calculation method of Confidence can be the X part of the reaction data.
  • all possible M element symbols in the output product can be obtained.
  • the reliability threshold range may be 0.3 to 0.9, preferably 0.4 to 0.8, more preferably 0.5-0.7, such as 0.5, 0.6 or 0.7.
  • D2 comes from the supplementary database of USPTO, USPTO Stereo has about 1 million reactions.
  • sim(w,v) can be implemented as the normalized reciprocal of its Euclidean distance (+1 to avoid the divisor being 0), or the normalized similarity that can be mastered by those skilled in the art:
  • th2 is set from the range [0.1,1].
  • D3 fine-tuning (fine-tuning) learning
  • Example 2 This scheme is used in .
  • the accuracy rate is improved the most, and the prediction/verification response is the category with a confidence Confidence>0.9, or the category with a stricter Confidence>0.99, which has the largest number of improvements as the final model.
  • the reaction dataset D1 for the under-learning analysis test is about 1381 reactions extracted from basic organic chemistry books by in-house chemists.
  • the candidate supplementary reaction data set D2 comes from the portion of the USPTO data set that does not overlap with D0, 400,000 reactions in total; it is worth noting that commercial reaction-data services hold a large number of reactions in their back ends, but such services only support querying small amounts of data and do not allow bulk retrieval. For example, Reaxys (www.reaxys.com) contains more than 55 million reactions, but only about ten reactions are shown per page of query results.
  • the baseline model trained only on D0 was compared with the model obtained by this inventive method, in which, for each reaction in D1, reactions with similarity ≥ 0.6 were selected from the D2 data set of 400,000 candidate supplementary reactions.
  • the resulting D3 data set of about 14,000 reactions was used for fine-tuning as described in the example to obtain a confidence-improved model.
  • the Top-k accuracy is the proportion of reactions for which one of the k highest-confidence products predicted by the model exactly matches the true product.
  • the Top-1 accuracy is the proportion of all reactions for which the most likely product predicted by the model exactly matches the true product.
  • the raw accuracy rate is 0, of which 33 have Confidence>0.9, and 16 have Confidence>0.8.
  • the experiment supplemented the data indiscriminately, that is, similar reactions were not selected on the basis of the prediction Confidence threshold.
  • after supplementation, the Top-1 accuracy on this part of the test reactions is 14%; that is, supplementation that is not based on the Confidence threshold and similarity yields only a limited improvement in accuracy.

Abstract

Disclosed is a method for predicting chemical reaction products, comprising: predicting reaction products of different reactions on the basis of an original training model trained using an original data set D0, calculating reactions of which the confidence is lower than a threshold value, and screening data to form a first data set D1; providing a second data set D2, and screening reactions similar to the chemical reactions in the first data set D1 as a third data set D3; and merging data of D3 into the original data set or independently using D3, and re-performing model training. The method of the present invention can improve the relation between the confidence and the true accuracy of prediction, so that high-confidence prediction has high accuracy, and the accuracy of reaction prediction is finally improved; moreover, the method also has the advantages of being small in data volume and short in time.

Description

A method for predicting chemical reactions
Technical field
The present application relates to the field of computer technology, and in particular to a method and device for predicting chemical reaction products.
Background
In medicinal chemistry, the organic synthesis of new molecules requires predicting and assessing chemical reactions that are proposed by organic chemists or generated virtually by computer algorithms, so as to avoid the losses and waste caused by failed experiments.
The prediction accuracy of existing reaction prediction models depends heavily on the training data, and model performance may be limited by incomplete reaction data. Simply adding more reaction data and retraining the model does not effectively address the key reactions that matter in a specific application field. Take cyclization reactions, which are important in organic synthesis design: more reaction data is not always better, and indiscriminate supplementation does not effectively improve predictions for this category and can even degrade them.
Summary of the invention
In view of this, and to address the technical problem that current reaction prediction models have limited prediction accuracy, a method and device for predicting chemical reaction products are needed.
In one aspect, the present invention discloses a method for predicting chemical reaction products, the method comprising:
Step 1: Obtain one or more machine models that can generate reaction predictions and output a confidence for each prediction. For the given chemical reactions in the original data set, calculate the confidence of the predicted product under each model, aggregate these into an overall confidence across all models, and select the reaction data whose confidence is below a threshold to obtain the first data set D1. The threshold is any number from 0.3 to 0.9, preferably 0.4 to 0.8, more preferably 0.5 to 0.7, such as about 0.5, 0.6, or 0.7.
Step 2: Provide a second data set D2. For each chemical reaction W in D2, calculate its similarity sim(w,v) to a chemical reaction V in D1, select from D2 the similar reactions with sim(w,v) greater than or equal to a threshold as supplementary data, and take their union to obtain the third data set D3. The threshold is any value from 0.1 to 1, preferably 0.3 to 0.8, more preferably 0.5 to 0.8, such as about 0.6, 0.7, or 0.8.
Step 3: Merge the D3 data into the original data set, or use the D3 data alone, and retrain the model.
In one embodiment, in step 1, the K ≥ 1 machine models capable of generating reaction predictions and outputting prediction confidences are characterized by model parameters θt, t = 1, 2, …, K, where t denotes the t-th model snapshot and K is the number of collected model snapshots.
When the original training data D0 is available, the machine-translation Transformer is selected as the original training model. In other embodiments, the machine model can be replaced by other deep-neural-network-based models.
In one embodiment, in step 1, when the product information is known, confidence = p(Y|X, θt); when the reaction product is unknown, Y_max = arg max_i (confidence = p(Y_i|X, θt)) gives (X, Y_max), where i indexes the i-th output prediction the model can provide, preferably i ≤ 10. Here t denotes the t-th model snapshot, K is the number of collected model snapshots, X denotes the reactants of the chemical reaction, Y denotes the product of the reaction, p is the probability that the model outputs Y given X and θt, Y_max is the model's predicted product, and arg max selects the Y_i with the highest probability.
In one embodiment, in step 1, the overall confidence can be characterized as the mean value mean(confidence(X, θt)), or the maximum value max(confidence(X, θt)), or another statistical operation readily available to those skilled in the art.
In one embodiment, in step 2, for any reaction W ∈ D1, its similarity sim(w,v) to any reaction V ∈ D2 under the model parameters θt is calculated, where sim(w,v) = sim(w = encoding(W), v = encoding(V)), and w and v are the encodings that model θt produces for the input reactions W and V, respectively.
In one embodiment, in step 2, D3 contains less data than the original training data D0, and the number of reactions that D3 adds for each reaction in D1 can be kept within one hundred; preferably |D3| ≤ |D0| and |D3| ≤ 50 × |D1|.
In one embodiment, in step 3, a random sample R times the size of D3 is drawn from the original training data D0 and merged with D3 to produce a new data set, on which re-initialized machine model parameters are trained from scratch; R can be chosen from the range [0.5, max(1, |D0|/|D3|)].
In one embodiment, in step 3, D3 is used to produce a new data set, on which re-initialized machine model parameters are trained.
Brief description of the drawings
Fig. 1 is a schematic flow diagram of the method for predicting the reaction products of compounds.
Fig. 2 shows the number of similar neighboring reactions for 7 incorrectly predicted reactions.
Fig. 3 shows the number of similar neighboring reactions for 12 correctly predicted reactions.
Detailed description
The present invention is described in detail below on the basis of examples and with reference to the drawings. The above and other aspects of the invention will be apparent from the following detailed description. The scope of the invention is not limited to the following examples.
As shown in Fig. 1, the present invention discloses a method for predicting chemical reaction products, the method comprising:
Step 1: Based on the original training model, predict the reaction products of different reactions, identify the "under-learned" reactions whose confidence is below the threshold, and collect them to form the first data set D1.
Step 2: Select reactions similar to the "under-learned" chemical reactions as the third data set D3.
Step 3: Merge the D3 data into the original data set and retrain the model.
In step 1, one or more machine models that can generate reaction predictions and output a confidence for each prediction are first obtained. The confidence of the predicted product under each model is then calculated for the given chemical reactions in the original data set, and the overall confidence across all models is aggregated. Finally, the reaction data whose confidence is below a threshold, for example 0.5, is selected to obtain the first data set D1.
In step 1, the K ≥ 1 machine models capable of generating reaction predictions and outputting prediction confidences are characterized by model parameters θt, t = 1, 2, …, K. When the product information is known, confidence = p(Y|X, θt); when the reaction product is unknown, Y_max = arg max_i (confidence = p(Y_i|X, θt)) gives (X, Y_max), where i indexes the i-th output prediction the model can provide, i ≤ 10. The overall confidence can be characterized as the mean value mean(confidence(X, θt)), or the maximum value max(confidence(X, θt)), or another statistical operation readily available to those skilled in the art.
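The aggregation and thresholding in step 1 can be sketched as follows. This is a minimal illustration only: the reaction identifiers and per-snapshot confidence values are hypothetical, and the function names `overall_confidence` and `select_under_learned` are introduced here for illustration and do not come from the application.

```python
from statistics import mean


def overall_confidence(per_snapshot_conf, how="mean"):
    """Aggregate confidence(X, θt) over the K snapshots t = 1..K."""
    if not per_snapshot_conf:
        raise ValueError("need at least one snapshot (K >= 1)")
    if how == "mean":
        return mean(per_snapshot_conf)
    if how == "max":
        return max(per_snapshot_conf)
    raise ValueError(f"unknown aggregation: {how}")


def select_under_learned(confidences, threshold=0.5, how="mean"):
    """Return the ids of reactions whose overall confidence is below the threshold (D1)."""
    return [
        rid
        for rid, confs in confidences.items()
        if overall_confidence(confs, how) < threshold
    ]


# Hypothetical per-snapshot confidences for three reactions, K = 3.
confidences = {
    "rxn_a": [0.92, 0.95, 0.97],
    "rxn_b": [0.20, 0.35, 0.65],
    "rxn_c": [0.10, 0.15, 0.30],
}

d1_mean = select_under_learned(confidences, threshold=0.5, how="mean")
d1_max = select_under_learned(confidences, threshold=0.5, how="max")
print(d1_mean, d1_max)
```

With these made-up numbers, the mean is stricter than the max: a reaction that only the last snapshot predicts confidently is still flagged under the mean but not under the max.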
The confidence can be calculated as follows. The X part of the reaction data is passed through the weights of each layer of the multi-layer neural network of the trained Transformer model. At the output layer, the model produces raw weights z_i (> 0), i = 1, 2, …, M, for all M possible element symbols of the output product. These are normalized into probabilities with the Softmax below, which gives the confidence of each symbol i, and the element-symbol sequence with the highest probability is output as the prediction Y.
$$\mathrm{confidence}_i = \frac{\exp(z_i)}{\sum_{j=1}^{M} \exp(z_j)}, \quad i = 1, 2, \ldots, M$$
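The Softmax normalization of the raw output weights can be sketched as below. The raw weights are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability choice assumed here, not something stated in the application; it does not change the resulting probabilities.

```python
import math


def softmax(z):
    """Normalize raw output weights z_1..z_M into probabilities that sum to 1."""
    m = max(z)  # shift by the maximum for numerical stability (assumption)
    exps = [math.exp(value - m) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw weights for M = 3 candidate element symbols at one output position.
raw = [2.0, 1.0, 0.1]
probs = softmax(raw)

# Index of the most probable symbol at this position.
best = max(range(len(probs)), key=probs.__getitem__)
print(probs, best)
```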
In other embodiments, the machine model can be replaced by other deep-neural-network-based models. All of them apply the same Softmax to the output layer; only the symbolic form of the output elements changes.
In step 2, a second data set D2 is first provided. Then, for each chemical reaction W in D2, its similarity sim(w,v) to a chemical reaction V in D1 is calculated. Finally, the similar reactions in D2 with sim(w,v) greater than or equal to a threshold are selected as supplementary data, and their union gives the third data set D3, where the threshold is any value from 0.1 to 1. For any reaction W ∈ D1, its similarity sim(w,v) to any reaction V ∈ D2 under the model parameters θt is calculated, where sim(w,v) = sim(w = encoding(W), v = encoding(V)), and w and v are the encodings that model θt produces for the input reactions W and V, respectively. D3 contains less data than the original training data D0, and the number of reactions that D3 adds for each reaction in D1 can be kept within one hundred; preferably |D3| ≤ |D0| and |D3| ≤ 50 × |D1|. In a specific embodiment, taking reaction W as an example, w = f(W, θt) = [w_1, w_2, …, w_n], where f(W, θt) is the vector representation obtained when reaction W is fed into model θt and propagated through the parameters of each layer, taken at the layer immediately before the output prediction elements, and n is a preset model parameter giving the length of the representation vector. Likewise, v = f(V, θt) = [v_1, v_2, …, v_n] can be obtained for every reaction V in D2. The length n can be chosen from 2^6 = 64 to 2^12 = 4096, preferably n = 256.
sim(w,v) can be implemented as the normalized reciprocal of the Euclidean distance between w and v (with 1 added to avoid a zero divisor), or as another normalized similarity available to those skilled in the art:
$$\mathrm{sim}(w, v) = \frac{1}{1 + \sqrt{\sum_{i=1}^{n} (w_i - v_i)^2}}$$
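One way to read "the normalized reciprocal of the Euclidean distance, with 1 added to avoid a zero divisor" is sim(w, v) = 1 / (1 + ||w − v||). The sketch below assumes that reading; the short vectors stand in for the length-n encodings and are hypothetical.

```python
import math


def sim(w, v):
    """Similarity in (0, 1]: 1 / (1 + Euclidean distance between the encodings)."""
    if len(w) != len(v):
        raise ValueError("encodings must have the same length n")
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, v)))
    return 1.0 / (1.0 + distance)


# Identical encodings give the maximum similarity of 1.
same = sim([1.0, 2.0], [1.0, 2.0])

# A distance of 5 (a 3-4-5 triangle) gives 1 / (1 + 5).
far = sim([0.0, 0.0], [3.0, 4.0])
print(same, far)
```

Under this form the similarity falls as the distance grows and never reaches 0, which is consistent with the lower threshold bound of 0.1 given in the text.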
In step 3, the D3 data is merged into the original data set and the model is retrained. First, a random sample R times the size of D3 is drawn from the original training data D0 and merged with D3 to produce a new data set; re-initialized machine model parameters are then trained on it from scratch. R can be chosen from the range [0.5, max(1, |D0|/|D3|)].
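The sampling-and-merging step can be sketched as follows. The data sets are placeholder lists of strings, the helper names are hypothetical, and sampling without replacement with a fixed seed is an assumption made for reproducibility; the application only says that D0 is sampled randomly.

```python
import random


def allowed_r_range(n_d0, n_d3):
    """Range [0.5, max(1, |D0|/|D3|)] from which R may be chosen."""
    if n_d3 == 0:
        raise ValueError("D3 must not be empty")
    return 0.5, max(1.0, n_d0 / n_d3)


def build_retraining_set(d0, d3, r, seed=0):
    """Sample R * |D3| reactions from D0 (without replacement) and merge them with D3."""
    low, high = allowed_r_range(len(d0), len(d3))
    if not low <= r <= high:
        raise ValueError(f"R must lie in [{low}, {high}]")
    n_sample = min(len(d0), round(r * len(d3)))
    rng = random.Random(seed)
    return rng.sample(d0, n_sample) + list(d3)


# Placeholder data: 100 original reactions and 10 supplementary reactions.
d0 = [f"d0_{i}" for i in range(100)]
d3 = [f"d3_{i}" for i in range(10)]

merged = build_retraining_set(d0, d3, r=2.0)
print(len(merged))
```

Choosing R at the upper end of the range uses all of D0, while R = 0.5 makes the supplementary reactions the majority of the new training set.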
Alternatively, D3 is used for fine-tuning: the model θt is trained for a further F ≥ 1 iterations on the D3 data, so that the model parameters continue to be updated.
Example 1: Method for predicting chemical reaction products
1. Based on the original training model trained with the original data set D0, predict the reaction products of different reactions, identify the reactions whose confidence is below the threshold, and collect them to form the first data set D1
In this example, the original training data D0 is available: D0 consists of 400,000 training reactions from the public United States Patent and Trademark Office (USPTO) data set. The machine-translation Transformer (Philippe Schwaller et al. Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction, 2019 Sep 25;5(9):1572-1583; Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017) is selected as the original training model. In other embodiments, the machine model can be replaced by other deep-neural-network-based models (Coley, Connor W., et al. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical Science 10.2 (2019): 370-377; John Bradshaw, Matt J. Kusner, Brooks Paige, Marwin H.S. Segler, José Miguel Hernández-Lobato, A Generative Model For Electron Paths, https://arxiv.org/abs/1805.10970), in which case only the symbolic form of the output elements changes. K ≥ 1 model snapshots are recorded during the training iterations on D0; each snapshot is characterized by its model parameters θt, t = 1, 2, …, K, where t denotes the t-th model snapshot and K is the number of snapshots collected.
The snapshots θt, t = 1, 2, …, K, are taken at different iterations of model training. One complete pass in which the model parameters are updated on every sample of the training data is called an epoch. K can be set to the total number of epochs, with θt corresponding to the model after the t-th epoch. When K = 1, the model trained to the last epoch is used. In other embodiments, an epoch can also be defined as a fixed number of iterations, for example every 1000 iterations.
Given chemical reaction data (X, Y) to be analyzed, X denotes the reactants of the chemical reaction and Y denotes its product. With the snapshot parameters θt, the confidence is confidence = p(Y|X, θt), where p is the probability that the model outputs Y given X and θt. If only X is given, the model obtains (X, Y_max) through Y_max = arg max_i (confidence = p(Y_i|X, θt)), where Y_max is the model's predicted product and arg max selects the candidate Y_i with the highest probability. Here i indexes the i-th highest-scoring output prediction obtained by the model through beam search, i ≤ 10.
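The arg max over beam-search candidates can be illustrated as below. The candidate products (written as SMILES strings) and their probabilities are invented for the example, and `predict_product` is a hypothetical helper rather than part of any model library.

```python
def predict_product(candidates, max_candidates=10):
    """Pick Y_max, the most probable of the top-i beam-search candidates (i <= 10).

    candidates: list of (product, probability) pairs returned by beam search.
    """
    if not candidates:
        raise ValueError("beam search returned no candidates")
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return ranked[:max_candidates][0]


# Hypothetical beam-search output for one set of reactants X.
beam = [("CC=O", 0.30), ("CCO", 0.62), ("CC(=O)O", 0.08)]

y_max, confidence = predict_product(beam)

# The prediction counts as "under-learned" only if its confidence is below th.
th = 0.5
is_under_learned = confidence < th
print(y_max, confidence, is_under_learned)
```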
The confidence can be calculated as follows. The X part of the reaction data is passed through the weights of each layer of the multi-layer neural network of the trained Transformer model. At the output layer, the model produces raw weights z_i (> 0), i = 1, 2, …, M, for all M possible element symbols of the output product. These are normalized into probabilities with the Softmax below, which gives the confidence of each symbol i, and the element-symbol sequence with the highest probability is output as the prediction Y.
$$\mathrm{confidence}_i = \frac{\exp(z_i)}{\sum_{j=1}^{M} \exp(z_j)}, \quad i = 1, 2, \ldots, M$$
From the set of chemical reaction data to be analyzed, the "under-learned" reaction data set D1 with confidence < th is selected, where th is the confidence threshold; in this example th = 0.5.
In other embodiments, the confidence threshold can range from 0.3 to 0.9, preferably 0.4 to 0.8, more preferably 0.5 to 0.7, for example 0.5, 0.6, or 0.7.
2.提供第二数据集D2,并筛选与第一数据集D1中的化学反应的相似反应,作为第三数据集D32. Provide the second data set D2, and screen for similar reactions to the chemical reactions in the first data set D1 as the third data set D3
A candidate supplementary reaction data set D2 = {(X', Y')} is provided for screening. D2 comes from a supplementary USPTO database, USPTO Stereo, which contains about 1 million reactions. For any reaction W ∈ D1, the invention computes its similarity to any reaction V ∈ D2 under the model parameters θ: sim(W, V) = sim(w = encoding(W), v = encoding(V)), where w and v are the encodings that the model θ produces for the input reactions W and V, respectively.
The similarity sim(w, v) is computed as follows. Taking reaction W as an example, its encoding vector is w = f(W, θ) = [w_1, w_2, …, w_n], where f(W, θ) is the vector representation obtained by feeding W into the model θ, computing through the parameters of each layer, and reading the layer immediately before the output of the predicted elements. Here n is a preset model parameter that gives the length of the representation vector. In the same way, an encoding vector v = f(V, θ) = [v_1, v_2, …, v_n] is obtained for every reaction V in D2. n may be chosen in the range from 2^6 = 64 to 2^12 = 4096; this embodiment uses n = 256.
sim(w, v) may be implemented as the normalized reciprocal of the Euclidean distance between w and v (adding 1 so that the divisor is never 0), or as any other normalized similarity known to those skilled in the art:
Figure PCTCN2022071283-appb-000004
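The formula image is not recoverable from this page, so the following sketch implements one reading of the description: sim(w, v) = 1 / (1 + ||w - v||), where ||w - v|| is the Euclidean distance. The exact normalization in the original figure may differ, and the function name `sim` simply mirrors the text.

```python
import math
from typing import Sequence


def sim(w: Sequence[float], v: Sequence[float]) -> float:
    """Reciprocal of (1 + Euclidean distance); the result lies in (0, 1]."""
    if len(w) != len(v):
        raise ValueError("encoding vectors must have the same length n")
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, v)))
    return 1.0 / (1.0 + distance)


identical = sim([0.0, 0.0], [0.0, 0.0])
apart = sim([0.0, 0.0], [3.0, 4.0])
```

Under this reading, identical encodings give a similarity of exactly 1 and the value decreases toward 0 as the encodings move apart, which is consistent with a threshold th2 chosen from [0.1, 1].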
A similarity threshold th2 is then set, chosen from the range [0.1, 1].
Reactions satisfying sim(encoding(W), encoding(V)) ≥ th2 are screened as similar supplementary data, where th2 ∈ [0.1, 1] is the similarity threshold. Collecting the results as a set gives the set D3 of similar supplementary reactions. In the examples, th2 = 0.6 is used, and samples of the supplementary results at 0.7 and 0.8 are also analyzed.
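The screening step can be sketched as a brute-force search over all (W, V) pairs. The similarity is assumed to be 1 / (1 + Euclidean distance), and `screen_similar`, the reaction identifiers, and the two-dimensional toy vectors are hypothetical (the embodiment uses n = 256).

```python
import math
from typing import Dict, List, Sequence


def sim(w: Sequence[float], v: Sequence[float]) -> float:
    """Assumed similarity: reciprocal of (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(w, v))


def screen_similar(
    d1_vectors: List[Sequence[float]],
    d2_vectors: Dict[str, Sequence[float]],
    th2: float = 0.6,
) -> List[str]:
    """Return identifiers of D2 reactions similar to at least one D1 reaction."""
    selected = set()  # a set, so a D2 reaction is kept once even if it matches twice
    for w in d1_vectors:
        for reaction_id, v in d2_vectors.items():
            if sim(w, v) >= th2:
                selected.add(reaction_id)
    return sorted(selected)


# Toy encodings for D1 and D2.
d1 = [[0.0, 0.0], [10.0, 10.0]]
d2 = {"r1": [0.1, 0.0], "r2": [5.0, 5.0], "r3": [10.0, 10.5]}
d3_ids = screen_similar(d1, d2)
```

With about 1381 reactions in D1 and 400,000 in D2, as in Example 2, this loop would evaluate roughly 5.5 × 10^8 pairs, so a practical implementation would likely use vectorized or approximate nearest-neighbor search instead.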
3. Merge the D3 data into the original data set, or use D3 alone, and retrain the model
In one experiment of this embodiment, the model θ is fine-tuned on D3, that is, the model θ is trained on the D3 data for F ≥ 1 further iterations while its parameters continue to be updated. This is the scheme used in Example 2.
In a variant of this embodiment, when both the model θ and its original training data (denoted D0) are available, reaction data are drawn from D0 and D3 in the ratio |D0|:|D3|, that is, D0 and D3 are merged directly into a new data set (option 1), and a re-initialized machine model is retrained on it.
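Assembling the retraining set can be sketched as follows. The default branch merges D0 and D3 directly, as in this variant. The optional `r` argument follows the claims, which describe randomly sampling from D0 an amount R times the size of D3 before merging. The function name, the fixed random seed, and the toy data are assumptions for illustration.

```python
import random
from typing import List, Optional, Sequence


def build_retrain_set(
    d0: Sequence, d3: Sequence, r: Optional[float] = None, seed: int = 0
) -> List:
    """Merge D0 (or a random sample of it) with D3 for retraining."""
    if r is None:
        sampled = list(d0)  # direct merge, ratio |D0|:|D3|
    else:
        k = min(len(d0), round(r * len(d3)))  # never ask for more than D0 holds
        sampled = random.Random(seed).sample(list(d0), k)
    return sampled + list(d3)


d0 = list(range(100))          # stand-in for the original training reactions
d3 = ["s1", "s2"]              # stand-in for the screened similar reactions
merged_all = build_retrain_set(d0, d3)
merged_sampled = build_retrain_set(d0, d3, r=2.0)
```

The re-initialized model would then be trained on the returned list; the training loop itself is outside the scope of this sketch.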
Optionally, this variant may treat the models obtained by fine-tuning and by retraining as N = 2 candidate models and evaluate them on the test reaction data set D1 (or on another, separately provided test reaction data set). The final model is the candidate with the largest improvement in accuracy and the largest increase in the number of predicted/validated reactions in the Confidence > 0.9 category, or in the stricter Confidence > 0.99 category.
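The selection between candidate models can be sketched as below. The text does not say how to decide when the two criteria disagree; ranking by accuracy first and by the high-confidence count second is an assumption made here. Because both candidates are compared against the same baseline, the largest improvement coincides with the largest absolute value. All names and the toy results are hypothetical.

```python
from typing import Dict, List, Tuple

Prediction = Tuple[bool, float]  # (top-1 prediction correct?, confidence)


def score(predictions: List[Prediction], high: float = 0.9) -> Tuple[float, int]:
    """Return (top-1 accuracy, number of predictions with confidence > high)."""
    accuracy = sum(correct for correct, _ in predictions) / len(predictions)
    n_high = sum(1 for _, confidence in predictions if confidence > high)
    return accuracy, n_high


def choose_final_model(results: Dict[str, List[Prediction]], high: float = 0.9) -> str:
    """Pick the candidate with the best (accuracy, high-confidence count)."""
    return max(results, key=lambda name: score(results[name], high))


# Toy evaluation results for the N = 2 candidate models.
results = {
    "fine_tuned": [(True, 0.95), (True, 0.60), (False, 0.30)],
    "retrained": [(True, 0.95), (False, 0.60), (False, 0.30)],
}
final_model = choose_final_model(results)
```

Passing `high=0.99` applies the stricter confidence category mentioned in the text.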
Example 2: Evaluation of accuracy
The experimental results of the embodiment are shown below. The procedure followed the steps of Example 1. The machine learning model θ was a Transformer with encoding vector dimension n = 256, trained for 500,000 iterations, each processing a mini-batch of 4096 tokens. The training reaction data were 400,000 reactions from the public United States Patent and Trademark Office (USPTO) data set, and θ was taken as the model output by the final iteration. These 400,000 USPTO training reactions are denoted D0.
The reaction data set D1 used for the under-learning analysis consisted of about 1381 reactions extracted by in-house chemists from basic organic chemistry textbooks. The candidate supplementary reaction data set D2 came from USPTO data that do not overlap with D0, 400,000 reactions in total. It is worth noting that commercial reaction-data services hold large numbers of reactions in their back ends but support only small-scale queries, so data cannot be obtained in bulk. For example, the Reaxys page at www.reaxys.com states that the service contains more than 55 million reactions, but a single page of query results shows only about ten reactions.
The validation experiment compared two models: a baseline model trained only on D0, and a confidence-improved model obtained with the method of the invention. For the latter, each reaction in D1 was used to screen the D2 data set of 400,000 candidate supplementary reactions, yielding a D3 data set of about 14,000 reactions with similarity ≥ 0.6, and the baseline model was then fine-tuned on D3 as described in the embodiment.
Top-k accuracy is the proportion of reactions for which one of the k different candidate products with the highest model confidence exactly matches the true product. Top-1 accuracy is therefore the proportion of all reactions for which the model's most likely predicted product exactly matches the true product.
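The metric can be sketched in a few lines of Python. Exact matching is done here by plain string equality; in practice the products would presumably be compared in a canonical form (for example canonical SMILES), which is an assumption and is not shown. The function name and toy data are hypothetical.

```python
from typing import List


def top_k_accuracy(ranked_predictions: List[List[str]], truths: List[str], k: int) -> float:
    """Fraction of reactions whose true product is among the first k candidates.

    Each inner list must be sorted by decreasing model confidence.
    """
    hits = sum(
        1 for candidates, truth in zip(ranked_predictions, truths)
        if truth in candidates[:k]
    )
    return hits / len(truths)


# Toy ranked predictions for three reactions.
ranked = [["P1", "P2"], ["Q2", "Q1"], ["R3", "R2"]]
truths = ["P1", "Q1", "R1"]
top1 = top_k_accuracy(ranked, truths, k=1)
top2 = top_k_accuracy(ranked, truths, k=2)
```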
As shown in Table 1, the results demonstrate that, with the screening method of the invention, supplementing only about ten similar reactions for each reaction in D1 significantly improves the predictions of the original model. Both the overall Top-1 accuracy and the coverage of high-confidence correct predictions improve markedly: Top-1 accuracy increases by 22.6%, the coverage at Confidence > 0.9 increases by 20.86%, and predictions in this confidence range reach a Top-1 accuracy of 93.9%.
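The two high-confidence figures can be illustrated with a small sketch. The text does not define "coverage" precisely; here it is read as the fraction of test reactions whose top-1 confidence exceeds the threshold, with accuracy then measured inside that band. This reading, the function name, and the toy data are assumptions.

```python
from typing import List, Tuple

Prediction = Tuple[bool, float]  # (top-1 prediction correct?, confidence)


def high_confidence_stats(
    predictions: List[Prediction], threshold: float = 0.9
) -> Tuple[float, float]:
    """Return (coverage, top-1 accuracy within the band confidence > threshold)."""
    band = [correct for correct, confidence in predictions if confidence > threshold]
    coverage = len(band) / len(predictions)
    accuracy = sum(band) / len(band) if band else 0.0
    return coverage, accuracy


toy = [(True, 0.95), (True, 0.92), (False, 0.91), (False, 0.40)]
coverage, band_accuracy = high_confidence_stats(toy)
```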
Table 1
Figure PCTCN2022071283-appb-000005
In a further experiment, 200 test reactions with low confidence (Confidence < 0.5) were selected. Before the screening-based improvement of the invention was applied, the Top-1 accuracy of reaction prediction on this set was only 8.5%, confirming that these are "under-learned" reactions. After screening with the invention and fine-tuning the baseline model, the average Confidence on this set rose from 0.378 to 0.796 and the Top-1 accuracy rose to 60.5%, confirming that the method improves both the accuracy and the confidence of reaction prediction.
Table 2
200 test reactions screened with Confidence < 0.5    Baseline model    Confidence-improved model
Average Confidence                                   0.378             0.796
Top-1 accuracy                                       8.5%              60.5%
In contrast, for another 100 randomly drawn test reactions that were predicted incorrectly, the original accuracy was 0; of these, 33 had Confidence > 0.9 and 16 had Confidence > 0.8. In this experiment the data were supplemented indiscriminately, that is, similar reactions were added without regard to the Confidence threshold of the reaction predictions. After supplementation, the Top-1 accuracy on these test reactions was 14%, showing that supplementation that is not based on the Confidence threshold and similarity gives only a limited improvement in accuracy.
Table 3
Figure PCTCN2022071283-appb-000006
For the 7 reactions that still had Confidence > 0.9 ("high conf" in Figure 2) and were still predicted incorrectly after indiscriminate supplementation, analysis shows the cause to be a lack of similar training or supplementary reaction data: the number of similar neighbor reactions at sim_threshold ≥ 0.6, 0.7, or 0.8 is very small in every case (Figure 2). By contrast, the 12 reactions with Confidence > 0.9 that were predicted correctly have significantly more similar neighbor reactions (Figure 3). This result further illustrates why the invention combines confidence with the screening of similar reactions to improve reaction prediction.
Those skilled in the art will appreciate that, although specific embodiments of the invention have been described herein for purposes of illustration, various modifications may be made without departing from the spirit and scope of the invention. The specific embodiments and examples should therefore not be regarded as limiting the scope of the invention, which is limited only by the appended claims. All documents cited herein are incorporated by reference in their entirety.

Claims (15)

  1. A method for predicting chemical reaction products, comprising the following steps:
    Step 1: using an original trained model that was trained on an original data set D0, predicting the reaction products of different reactions, identifying the reactions whose confidence is below a threshold, and screening these data to form a first data set D1;
    Step 2: providing a second data set D2 and screening it for reactions similar to the chemical reactions in the first data set D1, to form a third data set D3; and
    Step 3: merging the D3 data into the original data set, or using D3 alone, and retraining the model.
  2. The method of claim 1, wherein step 1 comprises:
    obtaining one or more machine models capable of producing reaction predictions and outputting the confidence of those predictions;
    for given chemical reactions in the original data set, computing the confidence corresponding to the predicted product in each model, and aggregating the overall confidence across all models;
    screening the reaction data whose confidence is less than the threshold to obtain the first data set D1;
    wherein the threshold is any number from 0.3 to 0.9, preferably 0.4 to 0.8, more preferably 0.5 to 0.7, such as about 0.5.
  3. The method of claim 2, wherein, when the product information is known, confidence = p(Y|X, θt); when the reaction product information is unknown, Y_max = arg max_i (confidence = p(Y_i|X, θt)) gives (X, Y_max); where i indexes the i-th output prediction that the model can provide, t denotes the t-th model snapshot, t = 1, 2, …, K, K is the number of model snapshots collected, X denotes the reactants of the chemical reaction, Y denotes the product of the reaction, p is the probability that the model outputs Y given X and θt, Y_max is the product predicted by the model, and arg max selects the Y_i with the highest probability.
  4. The method of claim 1, wherein step 2 comprises:
    providing the second data set D2;
    for a chemical reaction W in D2, computing its similarity sim(w, v) to a chemical reaction V in D1;
    screening D2 for similar supplementary reaction data whose sim(w, v) is greater than or equal to a threshold, and collecting them as a set to obtain the third data set D3;
    wherein the threshold is any number from 0.1 to 1, preferably 0.3 to 0.8, more preferably 0.5 to 0.8, such as about 0.6, 0.7, or 0.8.
  5. The method of claim 4, wherein sim(w, v) = sim(w = encoding(W), v = encoding(V)), where w and v are the encodings that the model θt produces for the input reactions W and V, respectively.
  6. The method of claim 5, wherein
    Figure PCTCN2022071283-appb-100001
    w = f(W, θt) = [w_1, w_2, …, w_n], where f(W, θt) is the vector representation obtained by feeding reaction W into the model θt, computing through the parameters of each layer, and reading the layer immediately before the output of the predicted elements, and n is a preset model parameter giving the length of the representation vector; v = f(V, θt) = [v_1, v_2, …, v_n] is obtained for reaction V; preferably, n is any number in the range from 2^6 = 64 to 2^12 = 4096.
  7. The method of claim 1, wherein step 3 comprises:
    randomly sampling from the original training data D0 an amount of data R times the size of D3, merging it with D3 to produce a new data set, and then retraining with re-initialized machine model parameters, wherein R is preferably any number from 0.5 to max(1, |D0|/|D3|); or
    fine-tuning on D3, that is, training the model θt on the D3 data for F ≥ 1 further iterations while continuing to update the model parameters.
  8. An apparatus for predicting chemical reaction products, the apparatus comprising:
    a first prediction module, configured to use an original trained model to predict the reaction products of different reactions, identify the reactions whose confidence is below a threshold, and screen these data to form a first data set D1;
    a second prediction module, configured to provide a second data set D2 and screen it for reactions similar to the chemical reactions in the first data set D1, to form a third data set D3; and
    a third prediction module, configured to merge the D3 data into the original data set, or use D3 alone, and retrain the model.
  9. The apparatus of claim 8, wherein the first prediction module is specifically configured to:
    obtain one or more machine models capable of producing reaction predictions and outputting the confidence of those predictions;
    for given chemical reactions in the original data set, compute the confidence corresponding to the predicted product in each model, and aggregate the overall confidence across all models;
    screen the reaction data whose confidence is less than the threshold to obtain the first data set D1;
    wherein the threshold is any number from 0.3 to 0.9, preferably 0.4 to 0.8, more preferably 0.5 to 0.7, such as about 0.5.
  10. The apparatus of claim 9, wherein, when the product information is known, confidence = p(Y|X, θt); when the reaction product information is unknown, Y_max = arg max_i (confidence = p(Y_i|X, θt)) gives (X, Y_max); where i indexes the i-th output prediction that the model can provide, t denotes the t-th model snapshot, t = 1, 2, …, K, K is the number of model snapshots collected, X denotes the reactants of the chemical reaction, Y denotes the product of the reaction, p is the probability that the model outputs Y given X and θt, Y_max is the product predicted by the model, and arg max selects the Y_i with the highest probability.
  11. The apparatus of claim 8, wherein the second prediction module is specifically configured to:
    provide the second data set D2;
    for a chemical reaction W in D1, compute its similarity sim(w, v) to a chemical reaction V in D2;
    screen D2 for similar supplementary reaction data whose sim(w, v) is greater than or equal to a threshold, and collect them as a set to obtain the third data set D3;
    wherein the threshold is any number from 0.1 to 1, preferably 0.3 to 0.8, more preferably 0.5 to 0.8, such as about 0.6, 0.7, or 0.8.
  12. The apparatus of claim 11, wherein sim(w, v) = sim(w = encoding(W), v = encoding(V)), where w and v are the encodings that the model θt produces for the input reactions W and V, respectively; preferably
    Figure PCTCN2022071283-appb-100002
    w = f(W, θt) = [w_1, w_2, …, w_n], where f(W, θt) is the vector representation obtained by feeding reaction W into the model θt, computing through the parameters of each layer, and reading the layer immediately before the output of the predicted elements, and n is a preset model parameter giving the length of the representation vector; v = f(V, θt) = [v_1, v_2, …, v_n] is obtained for reaction V; preferably, n is any number in the range from 2^6 = 64 to 2^12 = 4096.
  13. The apparatus of claim 8, wherein the third prediction module is specifically configured to:
    randomly sample from the original training data D0 an amount of data R times the size of D3, merge it with D3 to produce a new data set, and then retrain with re-initialized machine model parameters, wherein R is preferably any number from 0.5 to max(1, |D0|/|D3|); or
    fine-tune on D3, that is, train the model θt on the D3 data for F ≥ 1 further iterations while continuing to update the model parameters.
  14. A device comprising a processor and a memory, wherein the memory is configured to store a computer program and the processor is configured to execute, according to the computer program, the method for predicting compound reaction products of any one of claims 1-7.
  15. A computer-readable storage medium configured to store a computer program, wherein the computer program is used to execute the method for predicting compound reaction products of any one of claims 1-7.
PCT/CN2022/071283 2022-01-11 2022-01-11 Method for predicting chemical reaction WO2023133678A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/071283 WO2023133678A1 (en) 2022-01-11 2022-01-11 Method for predicting chemical reaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/071283 WO2023133678A1 (en) 2022-01-11 2022-01-11 Method for predicting chemical reaction

Publications (1)

Publication Number Publication Date
WO2023133678A1 true WO2023133678A1 (en) 2023-07-20

Family

ID=87279889

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071283 WO2023133678A1 (en) 2022-01-11 2022-01-11 Method for predicting chemical reaction

Country Status (1)

Country Link
WO (1) WO2023133678A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6485131A (en) * 1987-09-28 1989-03-30 Kimito Funatsu Apparatus for predicting chemical reaction
CN110021373A (en) * 2017-09-19 2019-07-16 上海交通大学 A kind of legitimacy prediction technique of chemical reaction
WO2019156872A1 (en) * 2018-01-30 2019-08-15 Peter Madrid Computational generation of chemical synthesis routes and methods
CN113160902A (en) * 2021-04-09 2021-07-23 大连理工大学 Method for predicting enantioselectivity of chemical reaction product
CN113838536A (en) * 2021-09-13 2021-12-24 烟台国工智能科技有限公司 Translation model construction method, product prediction model construction method and prediction method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22919344

Country of ref document: EP

Kind code of ref document: A1