CN117393075A - Model training method and task execution method based on molecular energy information - Google Patents
- Publication number
- CN117393075A CN202311702625.8A
- Authority
- CN
- China
- Prior art keywords
- molecular
- information
- compound molecule
- graph
- dimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/27—Regression, e.g. linear or logistic regression
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
Abstract
Description
Technical field
This specification relates to the field of computer technology, and in particular to a model training method and a task execution method based on molecular energy information.
Background
With the development of science and technology, molecular research has become an important part of social and scientific progress. Constructing the structure-energy relationship of compound molecules provides useful technical support for research on downstream tasks related to changes in molecular configuration (molecular regulation, molecular design, and so on), which makes efficient and accurate prediction of the energy of compound molecules a new direction of exploration in molecular research.
However, compound molecules are currently usually characterized with simple shallow neural networks. The molecular energies predicted by such models have low accuracy, which makes them difficult to use in downstream molecular research tasks.
Therefore, how to accurately determine the energy information of compound molecules is an urgent problem to be solved.
Summary of the invention
This specification provides a model training method and a task execution method based on molecular energy information, to partially solve the above problems in the prior art.
This specification adopts the following technical solutions:
This specification provides a model training method, including:
obtaining characterization data of a specified compound molecule, the characterization data being used to characterize the position information and attribute information of each atom in the specified compound molecule;
processing the characterization data to determine three-dimensional molecular graph information of the specified compound molecule;
inputting the three-dimensional molecular graph information into a prediction model to be trained, so that the prediction model, based on the equivariance between the position information and the embedded features corresponding to the position information, and the invariance between the attribute information and the embedded features corresponding to the attribute information, determines the three-dimensional molecular graph features corresponding to the specified compound molecule from the three-dimensional molecular graph information;
predicting the molecular energy information corresponding to the specified compound molecule according to the three-dimensional molecular graph features;
training the prediction model with the optimization objective of minimizing the deviation between the predicted molecular energy information and the actual molecular energy information corresponding to the specified compound molecule.
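As a rough illustration of the optimization objective above, the following sketch minimizes a mean squared error by gradient descent. The source only says that the deviation between predicted and actual energy is minimized; the squared-error loss, the one-feature linear model standing in for the prediction model, the toy data, and the names `mse` and `train_linear` are all assumptions made here for illustration.

```python
def mse(predicted, actual):
    """Mean squared deviation between predicted and actual energies."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def train_linear(features, energies, lr=0.1, steps=500):
    """Fit energy ~ w * feature + b by gradient descent on the MSE.

    The linear model is a hypothetical stand-in for the prediction model.
    """
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(steps):
        preds = [w * x + b for x in features]
        # Gradients of the MSE with respect to w and b.
        grad_w = sum(2 * (p - e) * x for p, e, x in zip(preds, energies, features)) / n
        grad_b = sum(2 * (p - e) for p, e in zip(preds, energies)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from energy = 2 * feature + 1 (not real molecular data).
features = [0.0, 1.0, 2.0, 3.0]
energies = [1.0, 3.0, 5.0, 7.0]
w, b = train_linear(features, energies)
final_loss = mse([w * x + b for x in features], energies)
```

In practice the same loop shape applies to any differentiable model; only the prediction and gradient computation change.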
Optionally, obtaining the characterization data of the specified compound molecule specifically includes:
selecting initial data of the specified compound molecule from a data set of molecular compounds;
determining the characterization data according to the initial data of the specified compound molecule.
Optionally, the position information includes: the coordinates of each atom in the specified compound molecule in a specified coordinate system;
the attribute information includes: the type of each atom in the specified compound molecule, the direction vector between any two atoms in the specified compound molecule, and the connection information between any two atoms in the specified compound molecule.
Optionally, a graph attention network is provided in the prediction model;
determining the three-dimensional molecular graph features corresponding to the specified compound molecule according to the three-dimensional molecular graph information specifically includes:
determining, through the graph attention network, the attention weights corresponding to the specified compound molecule;
determining the invariant features and equivariant features corresponding to the specified compound molecule according to the attention weights and the embedded features that the graph attention network determines from the three-dimensional molecular graph information;
determining the three-dimensional molecular graph features according to the invariant features and the equivariant features.
This specification provides a task execution method based on molecular energy information, including:
receiving a task execution request for an original compound, and obtaining characterization data of the original compound molecule;
processing the characterization data of the original compound molecule to determine the three-dimensional molecular graph information corresponding to the original compound molecule;
inputting the three-dimensional molecular graph information of the original compound molecule into a pre-trained prediction model, so that the prediction model determines the three-dimensional molecular graph features corresponding to the original compound molecule and, from those features, the molecular energy information corresponding to the original compound molecule, wherein the prediction model is trained by the above model training method;
executing the task according to the molecular energy information corresponding to the original compound molecule.
Optionally, the task includes a molecular regulation task or a molecular design task;
executing the task according to the molecular energy information corresponding to the original compound molecule specifically includes:
determining the correspondence between the molecular structure and the molecular energy of the original compound molecule according to the molecular energy information and the characterization data corresponding to the original compound molecule;
executing the molecular design task or the molecular regulation task based on the correspondence.
This specification provides a model training device, including:
an acquisition module, configured to obtain characterization data of a specified compound molecule, the characterization data being used to characterize the position information and attribute information of each atom in the specified compound molecule;
a determination module, configured to process the characterization data and determine the three-dimensional molecular graph information of the specified compound molecule;
an input module, configured to input the three-dimensional molecular graph information into a prediction model to be trained, so that the prediction model, based on the equivariance between the position information and the embedded features corresponding to the position information, and the invariance between the attribute information and the embedded features corresponding to the attribute information, determines the three-dimensional molecular graph features corresponding to the specified compound molecule from the three-dimensional molecular graph information;
a prediction module, configured to predict the molecular energy information corresponding to the specified compound molecule according to the three-dimensional molecular graph features;
a training module, configured to train the prediction model with the optimization objective of minimizing the deviation between the predicted molecular energy information and the actual molecular energy information corresponding to the specified compound molecule.
This specification provides a task execution device based on molecular energy information, including:
a receiving module, configured to receive a task execution request for an original compound and obtain characterization data of the original compound molecule;
a construction module, configured to process the characterization data of the original compound molecule and determine the three-dimensional molecular graph information corresponding to the original compound molecule;
a determination module, configured to input the three-dimensional molecular graph information of the original compound molecule into a pre-trained prediction model, so that the prediction model determines the three-dimensional molecular graph features corresponding to the original compound molecule and, from those features, the molecular energy information corresponding to the original compound molecule, wherein the prediction model is trained by the above model training method;
an execution module, configured to execute the task according to the molecular energy information corresponding to the original compound molecule.
This specification provides a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, implements the above model training method or the above task execution method based on molecular energy information.
This specification provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the above model training method or the above task execution method based on molecular energy information.
At least one of the above technical solutions adopted in this specification can achieve the following beneficial effects:
In the model training method provided in this specification, the characterization data of a specified compound molecule is obtained and its three-dimensional molecular graph information is constructed; the three-dimensional molecular graph information is input into the prediction model to determine the three-dimensional molecular graph features of the specified compound molecule; the molecular energy information of the specified compound molecule is predicted from those features; and the model is trained with the optimization objective of minimizing the deviation between the predicted molecular energy information and the actual molecular energy information of the specified compound molecule. As a result, when the energy of a whole molecule is later predicted from its structural features, the prediction model can quickly and accurately predict the overall energy of a compound molecule in any structure, improving the accuracy of molecular energy prediction.
As the above method shows, this solution constructs three-dimensional molecular graph information from the characterization data of a compound molecule, which fully characterizes the structure and attributes of each atom in the molecule. After the three-dimensional molecular graph is input into the prediction model, features are extracted based on the equivariance of the position information and the invariance of the attribute information, so that the features of the compound molecule are expressed accurately and its original characteristics in terms of attributes and structure are preserved. The energy information of the compound molecule is then predicted from the extracted features, and the model is trained on the prediction results. This improves the accuracy of the energy information predicted by the trained model and provides effective support for downstream molecular research tasks.
Description of the drawings
The drawings described here provide a further understanding of this specification and constitute a part of it. The illustrative embodiments of this specification and their descriptions are used to explain this specification and do not constitute an improper limitation of it. In the drawings:
Figure 1 is a schematic flow chart of a model training method provided in this specification;
Figure 2 is a schematic flow chart of a task execution method based on molecular energy information provided in this specification;
Figure 3 is a schematic architecture diagram of a system for exploring the molecular structure-energy relationship of compounds provided in this specification;
Figure 4 is a schematic diagram of a model training device provided in this specification;
Figure 5 is a schematic diagram of a task execution device based on molecular energy information provided in this specification;
Figure 6 is a schematic structural diagram of an electronic device corresponding to Figure 1 or Figure 2 provided in this specification.
Detailed description
To make the purpose, technical solutions, and advantages of this specification clearer, the technical solutions of this specification are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some of the embodiments of this specification, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments in this specification without creative effort fall within the scope of protection of this specification.
The technical solutions provided by the embodiments of this specification are described in detail below with reference to the drawings.
Figure 1 is a schematic flow chart of a model training method provided in this specification, including the following steps:
S101: Obtain characterization data of a specified compound molecule, the characterization data being used to characterize the position information and attribute information of each atom in the specified compound molecule.
S102: Process the characterization data to determine the three-dimensional molecular graph information of the specified compound molecule.
The model training method provided in this specification may be executed by a terminal device such as a desktop or laptop computer, or by a server. For ease of explanation, the following description uses a terminal device as the executing entity.
In this specification, the terminal device can obtain raw data on the structures of various compound molecules to build a data set for training the model. This data can be crawled from an external network or obtained through file entry.
After obtaining the data set, the terminal device needs to search it for data suitable for training the prediction model. Not every compound molecule in the data set is suitable as a training sample: some entries may lack a reliable molecular energy label, and some may be "dirty data".
The terminal device therefore searches the data set for specified compound molecules that are suitable as training samples and determines their initial data. In one implementation, an extensible generator of three-dimensional molecular graph structure data is created to clean, reconstruct, and optimize the compound molecule data, so as to find the specified compound molecules used as training samples and determine their three-dimensional molecular graph information.
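The sample selection described above might be sketched as a simple filter. The record layout (`atoms`, `coords`, `energy`), the function `is_usable`, and the sample values are hypothetical; the source does not specify how the cleaning is implemented or which checks it applies.

```python
import math


def is_usable(record):
    """Return True if a raw record can serve as a training sample.

    The field names are hypothetical; the source does not define a layout.
    """
    atoms = record.get("atoms")
    coords = record.get("coords")
    energy = record.get("energy")
    # Reject samples without a usable energy label.
    if not isinstance(energy, (int, float)) or not math.isfinite(energy):
        return False
    # Reject structurally inconsistent ("dirty") samples.
    if not atoms or not coords or len(atoms) != len(coords):
        return False
    return all(
        len(xyz) == 3 and all(math.isfinite(v) for v in xyz) for xyz in coords
    )


# Illustrative records only; the numbers are placeholders, not measured data.
raw_dataset = [
    {"atoms": ["O", "H", "H"],
     "coords": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
     "energy": -1.5},
    # One coordinate is missing, so this record is treated as dirty data.
    {"atoms": ["C", "O"], "coords": [(0.0, 0.0, 0.0)], "energy": -2.0},
    # No energy label.
    {"atoms": ["H", "H"], "coords": [(0.0, 0.0, 0.0), (0.7, 0.0, 0.0)], "energy": None},
]
training_samples = [r for r in raw_dataset if is_usable(r)]
```

Only the first record survives the filter; a real pipeline would likely add domain-specific checks on top of these.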
In this specification, after the terminal device determines the initial data of the specified compound molecule from the data set, it can further determine the characterization data of the specified compound molecule. The characterization data characterizes the position information and attribute information of each atom in the specified compound molecule. The position information includes the coordinates of each atom in the specified compound molecule in a specified coordinate system. The attribute information includes the type of each atom in the specified compound molecule, the direction vector between any two atoms in the molecule, and the connection information between any two atoms in the molecule. Other characterization data may also be included, such as the chemical formula, molecular weight, chemical bonds, structure, and physical properties; this specification does not specifically limit this.
After determining the characterization data of the specified compound molecule, the terminal device can reconstruct and optimize part of that data (including the attribute information and position information above) to obtain the graph data corresponding to the three-dimensional molecular graph information of the specified compound molecule.
Specifically, the coordinates of each atom in the specified compound molecule in the specified coordinate system can be determined with reference to international standards, and the type vector of each atom in the specified compound molecule is encoded as a one-hot vector.
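A one-hot encoding of atom types can be sketched as follows. The element vocabulary `ELEMENTS` and the function name `one_hot_atom_types` are assumptions for illustration; the source does not list which atom types are covered or in what order.

```python
def one_hot_atom_types(atom_types, vocabulary):
    """Encode each atom type as a one-hot vector over the given vocabulary."""
    index = {symbol: i for i, symbol in enumerate(vocabulary)}
    encoded = []
    for symbol in atom_types:
        row = [0] * len(vocabulary)
        row[index[symbol]] = 1  # raises KeyError for an unknown element
        encoded.append(row)
    return encoded


ELEMENTS = ["H", "C", "N", "O"]  # hypothetical vocabulary
water_types = one_hot_atom_types(["O", "H", "H"], ELEMENTS)
# water_types == [[0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0]]
```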
The direction vectors between any two atoms in the specified compound molecule and the connection information between any two atoms in the molecule can be represented as matrix data.
The direction vectors between pairs of atoms in the specified compound molecule can be represented as the directed vectors obtained by fully connecting all atoms in the molecule. The vector dimension can be 3×N, where N depends on n, the number of atoms in the molecule (the expression for N is missing from the source text). The terminal device can define different connection information for different atom pairs; for example, the connection between two atoms A-A can be represented by "0", and the connection between A-B by "1".
The distances between atoms in the specified compound molecule and the corresponding unit direction vectors can be determined by the following formula:
[Formula not reproduced in the source text.]
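Because the formula itself is not available here, the following sketch uses one common convention as an assumption: for atoms i and j with coordinates r_i and r_j, the distance is d_ij = ||r_j - r_i|| and the unit direction vector is (r_j - r_i) / d_ij. Enumerating unordered pairs (n(n-1)/2 of them) and coding a pair as 0 when both atoms have the same type and 1 otherwise are also assumptions, the latter generalizing the A-A / A-B example in the text. The name `pairwise_geometry` is hypothetical.

```python
import math
from itertools import combinations


def pairwise_geometry(coords, atom_types):
    """For every unordered atom pair (i, j) with i < j, return the distance,
    the unit direction vector from i to j, and a connection code."""
    edges = []
    for i, j in combinations(range(len(coords)), 2):
        delta = [b - a for a, b in zip(coords[i], coords[j])]
        dist = math.sqrt(sum(c * c for c in delta))
        unit = [c / dist for c in delta]  # assumes no two atoms coincide
        code = 0 if atom_types[i] == atom_types[j] else 1
        edges.append({"pair": (i, j), "distance": dist, "unit": unit, "code": code})
    return edges


# Toy coordinates and placeholder atom types "A" and "B".
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)]
types = ["A", "B", "A"]
edges = pairwise_geometry(coords, types)
```

For the first pair the displacement is (3, 4, 0), so the distance is 5 and the unit vector is (0.6, 0.8, 0).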
As the above shows, because this specification considers the attributes and positions of all atoms in the specified compound molecule when determining its three-dimensional molecular graph information, the resulting graph information characterizes the specified compound molecule comprehensively, which in turn supports the accuracy and soundness of the prediction model's results.
S103: Input the three-dimensional molecular graph into the prediction model to be trained, so that the prediction model, based on the equivariance between the position information and the embedded features corresponding to the position information, and the invariance between the attribute information and the embedded features corresponding to the attribute information, determines the three-dimensional molecular graph features corresponding to the specified compound molecule from the molecular graph information of the three-dimensional molecular graph.
S104: Predict the molecular energy information corresponding to the specified compound molecule according to the three-dimensional molecular graph features.
The terminal device can input the three-dimensional molecular graph of the specified compound molecule into the prediction model to be trained. Based on the equivariance between the position information and its embedded features, and the invariance between the attribute information and its embedded features, the prediction model determines from the molecular graph information the embedded features corresponding to the position information and attribute information of the specified compound molecule, and then determines the three-dimensional molecular graph features of the specified compound molecule from those embedded features.
Equivariance and invariance are defined as follows:
For a function $f: X \to Y$, where $x \in X$ and $f(x) \in Y$, let $G$ be an E(3) group acting on $X$ and $Y$. If $f(g(x)) = g(f(x))$ for all $x \in X$ and all $g \in G$, then $f$ is said to be equivariant with respect to $G$. If $f(g(x)) = f(x)$ for all $x \in X$ and all $g \in G$, then $f$ is said to be invariant with respect to $G$.
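These two definitions can be checked numerically on simple geometric quantities. In the sketch below, the interatomic distance serves as an example of an invariant function and the direction vector as an example of an equivariant one. The transformation is restricted to a rotation about the z axis followed by a translation, which is only a subset of E(3); the function names and sample points are assumptions for illustration.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z axis."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def transform(point, angle, shift):
    """One element g of E(3): rotate about z, then translate."""
    return tuple(r + t for r, t in zip(rotate_z(point, angle), shift))


def distance(p, q):
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


def direction(p, q):
    return tuple(b - a for a, b in zip(p, q))


p, q = (1.0, 0.0, 0.5), (0.0, 2.0, -1.0)
angle, shift = 0.7, (3.0, -1.0, 2.0)
gp, gq = transform(p, angle, shift), transform(q, angle, shift)

# Invariance: f(g(x)) == f(x) for the interatomic distance.
invariant_ok = math.isclose(distance(gp, gq), distance(p, q))

# Equivariance: f(g(x)) == g(f(x)) for the direction vector. Translations
# cancel in a difference, so g acts on the output by rotation only.
rotated = rotate_z(direction(p, q), angle)
equivariant_ok = all(
    math.isclose(a, b, abs_tol=1e-12) for a, b in zip(direction(gp, gq), rotated)
)
```

Both flags evaluate to true for this transformation, matching the definitions: the distance is unchanged, while the direction vector rotates along with the input.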
In this specification, the prediction model may include a graph attention network built on three-dimensional equivariance and a multi-layer perceptron. The process by which the prediction model predicts the energy information of the specified compound molecule from the input three-dimensional molecular graph information can be viewed as a graph embedding followed by feature regression.
Specifically, after the terminal device inputs the three-dimensional molecular graph information of the specified compound molecule into the prediction model to be trained, the graph attention network determines, from the three-dimensional molecular graph information, the embedded features corresponding to the atom types and coordinate information within the compound molecule and to the direction vectors and connection information between atoms. It then determines the invariant features and equivariant features of the compound molecule from these embedded features, and from them determines the three-dimensional molecular graph features of the specified compound molecule.
The terminal device may then input the three-dimensional molecular graph features into the multi-layer perceptron, which determines the energy information corresponding to the specified compound molecule.
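The regression half of this "graph embedding, then feature regression" pipeline can be sketched as follows. This is a minimal illustration, not the patent's network: the sum pooling, the tanh activation, the layer sizes, and the names `init_mlp` and `predict_energy` are all assumptions, and the weights are random rather than trained.

```python
import numpy as np

def init_mlp(rng, in_dim, hidden_dim):
    """Randomly initialised two-layer perceptron (untrained placeholder weights)."""
    return {
        "w1": rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(scale=0.1, size=(hidden_dim, 1)),
        "b2": np.zeros(1),
    }

def predict_energy(atom_features, params):
    """Pool per-atom invariant features and regress a single scalar energy."""
    graph_feature = atom_features.sum(axis=0)   # sum pooling over atoms (assumed)
    hidden = np.tanh(graph_feature @ params["w1"] + params["b1"])
    return float((hidden @ params["w2"] + params["b2"])[0])

rng = np.random.default_rng(0)
atom_features = rng.normal(size=(4, 8))         # 4 atoms, 8 invariant channels each
params = init_mlp(rng, in_dim=8, hidden_dim=16)

energy = predict_energy(atom_features, params)
shuffled = predict_energy(atom_features[::-1], params)   # same atoms, reversed order
```

Because the pooling is a sum, reordering the atoms leaves the predicted energy unchanged up to floating-point rounding, which is the behaviour one would want from a molecular energy model.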
The embedded features of the molecular graph of the specified compound (that is, the embedded features corresponding to the atomic coordinates and types, and to the direction vectors and connection information between each pair of atoms) may be determined as follows:
Here, a and b denote different atoms, f is the embedded feature value, C is the Clebsch-Gordan coefficient, Y is the real spherical harmonic function, and R is a learnable function of distance. This embedding makes the embedded features SE(3)-equivariant with respect to the atomic coordinates and SE(3)-invariant with respect to the atom types, the direction vectors between any two atoms, and the connection information.
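The following is a simplified sketch of such an embedding, not the patent's formula. It assumes only degree 0 and degree 1: the degree-0 real spherical harmonic is a constant and the degree-1 harmonics are proportional to the components of the unit direction vector, so the unit vector is used directly. The Clebsch-Gordan coupling is omitted because the inputs here are scalars, and a fixed Gaussian basis stands in for the learnable radial function R. The names `radial` and `embed` are hypothetical.

```python
import numpy as np

def radial(dist, centers, width=0.5):
    """Gaussian radial basis; a fixed stand-in for the learnable function R."""
    return np.exp(-((dist[..., None] - centers) ** 2) / (2 * width ** 2))

def embed(coords, centers):
    """Return per-atom degree-0 (invariant) and degree-1 (equivariant) features."""
    n = len(coords)
    rel = coords[:, None, :] - coords[None, :, :]    # r_ab for every atom pair
    dist = np.linalg.norm(rel, axis=-1)
    mask = ~np.eye(n, dtype=bool)                    # drop self-pairs a == b
    unit = np.zeros_like(rel)
    unit[mask] = rel[mask] / dist[mask][:, None]     # degree-1 harmonics up to a constant
    rbf = radial(dist, centers) * mask[..., None]    # shape (n, n, k)
    scalars = rbf.sum(axis=1)                        # shape (n, k), rotation invariant
    vectors = np.einsum("abk,abi->aki", rbf, unit)   # shape (n, k, 3), rotates with input
    return scalars, vectors

rng = np.random.default_rng(0)
coords = rng.normal(size=(4, 3))
centers = np.linspace(0.5, 3.0, 4)
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])                      # 90-degree rotation about z

s0, v0 = embed(coords, centers)
s1, v1 = embed(coords @ R.T + 1.0, centers)          # rotate and translate the molecule
```

Rotating and translating the input leaves the scalar channels unchanged and rotates the vector channels by the same matrix, which is the invariant/equivariant split the text describes.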
Of course, in practical applications the embedded features may also be determined in other feasible ways, which this specification does not enumerate.
After the embedded features are obtained, they may be further processed through an attention mechanism to obtain more accurate three-dimensional molecular graph features. Specifically, the graph attention network determines the attention weights for the specified compound molecule; then, from the attention weights and the determined embedded features, it determines the invariant features and equivariant features of the specified compound molecule, and from these it determines the three-dimensional molecular graph features of the specified compound molecule.
A specific example of determining the three-dimensional molecular graph features of the specified compound molecule through the attention mechanism is described below.
Through the three-dimensional molecular graph attention mechanism, the embedded forms of the query (Q) and key (K) are obtained as follows:
Here, the first term denotes the Q feature of atom a in the specified compound molecule, and the second denotes the K feature for the atom pair a←b; the remaining symbols are abbreviated forms of longer expressions. The attention function is then expressed as:
This attention function is added to the molecular graph embedding formula to introduce the attention mechanism.
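As a rough illustration of how attention weights can be combined with per-pair embedded features, the sketch below uses a standard scaled dot-product softmax over the neighbours b of each atom a. This is an assumed, generic form rather than the patent's attention function; the names `attention_weights` and `aggregate` and all array shapes are hypothetical.

```python
import numpy as np

def attention_weights(queries, keys):
    """Softmax over neighbours b of the score <Q_a, K_ab> for each atom a.

    queries: (n, d) per-atom query features
    keys:    (n, n, d) per-pair key features; keys[a, b] belongs to the pair a<-b
    """
    n, d = queries.shape
    scores = np.einsum("ad,abd->ab", queries, keys) / np.sqrt(d)
    scores = np.where(np.eye(n, dtype=bool), -np.inf, scores)   # no self-attention
    scores = scores - scores.max(axis=1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def aggregate(weights, messages):
    """Attention-weighted sum of per-pair messages (n, n, c) into (n, c)."""
    return np.einsum("ab,abc->ac", weights, messages)

rng = np.random.default_rng(0)
n, d = 4, 6
queries = rng.normal(size=(n, d))
keys = rng.normal(size=(n, n, d))
messages = rng.normal(size=(n, n, 3))    # e.g. one 3-vector message per atom pair

weights = attention_weights(queries, keys)
updated = aggregate(weights, messages)
```

If the scores are computed only from invariant quantities, the weights are themselves invariant, so weighting equivariant messages with them keeps the aggregated features equivariant.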
Furthermore, the terminal device may predict the molecular energy information corresponding to the specified compound molecule through the pooling function and the multi-layer perceptron of the prediction network.
The prediction model is then trained with the optimization objective of minimizing the deviation between the predicted molecular energy information and the actual molecular energy information of the specified compound molecule. This process can be expressed as:
$$\arg\min \sum \Delta E, \qquad \Delta E = \lvert E_p - E_t \rvert$$
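Reading E_p as the predicted energy and E_t as the actual energy (an interpretation of the subscripts based on the surrounding text), the objective is a sum of absolute deviations. A minimal sketch follows; the function names are hypothetical and the numbers are made up for illustration.

```python
import numpy as np

def energy_loss(predicted, actual):
    """Sum of absolute deviations |E_p - E_t| over a batch of molecules."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.abs(predicted - actual).sum())

def energy_loss_subgradient(predicted, actual):
    """Subgradient of the loss with respect to each predicted energy."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sign(predicted - actual)

# Three molecules with illustrative predicted and actual energies.
loss = energy_loss([-1.0, 2.5, 0.0], [-1.5, 2.0, 0.0])
grad = energy_loss_subgradient([-1.0, 2.5, 0.0], [-1.5, 2.0, 0.0])
```

The absolute value is not differentiable at zero, so a gradient-based optimizer works with a subgradient there; the sign function returns 0 at that point.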
In this way, when molecular structures are subsequently predicted, the prediction model can predict molecular energy quickly and accurately, which improves the efficiency and accuracy of exploring molecular structure-energy relationships.
In this specification, the three-dimensional molecular graph features of the specified compound molecule may be determined through the graph attention mechanism. As mentioned above, predicting energy from the three-dimensional molecular graph features of the specified compound molecule can be regarded as the multi-layer perceptron using the invariant features and equivariant features of each atom in the specified compound molecule to predict new atoms or new molecular structures that can be connected to the specified compound molecule. Therefore, once the three-dimensional molecular graph features are determined, the updated invariant features and equivariant features of each atom in the specified compound molecule can also be obtained through the perceptron connections and used for energy prediction.
It should be noted that, in this specification, the prediction model may be trained jointly. That is, after the coordinate embedded features of the compound molecule are determined, the equivariant graph attention neural network and the multi-layer perceptron are adjusted together, with minimizing the loss value as the optimization objective. The loss value can be determined from the deviation between the predicted target energy information and the true energy information of the specified compound molecule.
After the prediction model is trained, it can be used to predict molecular structure information and thereby recommend molecular structure information. The specific process is shown in the figure below.
Figure 2 is a schematic flowchart of a task execution method based on molecular energy information provided in this specification, which includes the following steps:
S201: Receive a task execution request for an original compound, and obtain characterization data of the original compound molecule.
S202: Process the characterization data of the original compound molecule to determine the three-dimensional molecular graph information corresponding to the original compound molecule.
Once the prediction model meets the training target (for example, reaching a preset number of training iterations or converging to a preset range), the terminal device may deploy it to perform downstream tasks.
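The training target mentioned above can be expressed as a simple stopping test. The sketch below is one possible reading, with assumptions labeled: "converging to a preset range" is interpreted as the recent loss values varying by less than a tolerance, and the function name, the default thresholds, and the window size are all hypothetical.

```python
def training_finished(loss_history, max_steps=1000, tolerance=1e-4, window=5):
    """Return True once a preset step count is reached or the loss has settled.

    loss_history: list of loss values, one per completed training step.
    """
    if len(loss_history) >= max_steps:          # preset number of training steps
        return True
    if len(loss_history) < window + 1:          # not enough history to judge
        return False
    recent = loss_history[-(window + 1):]
    return max(recent) - min(recent) < tolerance   # loss stays within a preset range

still_improving = training_finished([1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2])
settled = training_finished([1.0, 0.5] + [0.1] * 6)
hit_step_limit = training_finished([1.0, 0.9, 0.8], max_steps=3)
```

A training loop would call this after each step and deploy the model once it returns True.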
In this specification, after receiving a task execution request for the original compound, the terminal device may obtain the molecular structure prediction instruction entered by the user and, according to that instruction, obtain the characterization data of the original specified compound molecule.
The characterization data is then reconstructed, optimized, and otherwise processed to obtain the three-dimensional molecular graph information of the specified compound molecule. This determination is essentially the same as the determination of the three-dimensional molecular graph information during model training and is not repeated here. The terminal device mentioned here may be a desktop computer, a laptop computer, or a similar device.
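To make the idea of three-dimensional molecular graph information concrete, the sketch below assembles atom types, coordinates, pairwise connections, and unit direction vectors into one structure. It is an illustration under assumptions: connectivity is inferred from a distance cutoff, the function name `build_molecular_graph` and the dictionary keys are hypothetical, and the water geometry uses approximate values.

```python
import numpy as np

def build_molecular_graph(atom_types, coords, cutoff=1.2):
    """Assemble 3D molecular graph information from atom types and coordinates.

    Connectivity is inferred from a distance cutoff (an illustrative assumption);
    real characterization data may carry explicit connection information.
    """
    coords = np.asarray(coords, dtype=float)
    rel = coords[None, :, :] - coords[:, None, :]          # rel[a, b] = x_b - x_a
    dist = np.linalg.norm(rel, axis=-1)
    connected = (dist < cutoff) & ~np.eye(len(coords), dtype=bool)
    senders, receivers = np.nonzero(connected)
    directions = rel[senders, receivers] / dist[senders, receivers][:, None]
    return {
        "atom_types": list(atom_types),                     # attribute information
        "coords": coords,                                   # position information
        "edges": np.stack([senders, receivers], axis=1),    # connection information
        "directions": directions,                           # unit direction vectors
    }

# Water with approximate coordinates in angstroms (illustrative values).
graph = build_molecular_graph(
    ["O", "H", "H"],
    [[0.000, 0.000, 0.000], [0.757, 0.586, 0.000], [-0.757, 0.586, 0.000]],
)
```

With the 1.2 angstrom cutoff, only the two O-H pairs are connected (each stored in both directions), while the H-H pair at about 1.51 angstroms is not.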
S203: Input the three-dimensional molecular graph information of the original compound molecule into the pre-trained prediction model, so that the prediction model determines the three-dimensional molecular graph features corresponding to the original compound molecule and, from those features, determines the molecular energy information corresponding to the original compound molecule, where the prediction model is trained by the model training method described above.
The terminal device may input the three-dimensional molecular graph information of the original specified compound molecule into the prediction model deployed on the terminal device, and the prediction model outputs the energy information of the molecule with a preset function that is formed in combination with the original specified compound molecule.
It should be pointed out that, because the prediction model has already been trained in a supervised manner during the model training process described above, a reinforcement learning model is no longer needed in practical applications; that is, the final molecular energy information output by the prediction model requires no further screening.
S204: Execute the task according to the molecular energy information corresponding to the original compound molecule.
After the molecular energy information of the specified compound molecule is determined, the terminal device may determine the correspondence between the molecular structure and the molecular energy of the original compound molecule from the molecular energy information and the characterization data of the original compound molecule, and then perform the molecular design task or the molecular regulation task based on that correspondence.
Of course, the terminal device may also recommend the correspondence between the molecular structure and the molecular energy of the original compound molecule to the user as information to be recommended.
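One simple way to represent a structure-energy correspondence and use it downstream is sketched below. Everything here is an assumption for illustration: the record type, the structure identifiers, the energy values, and in particular the choice to rank candidates by lowest predicted energy, which the text does not prescribe.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StructureEnergyRecord:
    structure_id: str    # hypothetical identifier for one molecular structure
    energy: float        # energy predicted by the model, in arbitrary units

def build_correspondence(structure_ids, energies):
    """Pair each structure with its predicted energy."""
    return [StructureEnergyRecord(s, float(e)) for s, e in zip(structure_ids, energies)]

def recommend(records, top_k=2):
    """Return the top_k records ranked by ascending energy (assumed criterion)."""
    return sorted(records, key=lambda record: record.energy)[:top_k]

records = build_correspondence(
    ["conformer_a", "conformer_b", "conformer_c"],   # made-up identifiers
    [-3.2, -4.1, -2.7],                              # made-up energies
)
suggested = recommend(records)
```

A molecular design or regulation task could then consume `suggested` directly, or the full `records` list could be shown to the user as the recommended correspondence.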
Furthermore, this specification also provides a system for exploring the molecular structure-energy relationship of compounds, as shown in Figure 3.
Figure 3 is a schematic architecture diagram of a system for exploring the molecular structure-energy relationship of compounds provided in this specification.
As can be seen from Figure 3, the system mainly consists of the following parts:
A storage subsystem, used to store the above data sets, and to store the molecular energy information predicted by the prediction model in practical applications together with the given structure information.
A control subsystem, used to predict the energy information of the compound molecule from the three-dimensional molecular graph information of the original specified compound molecule input into the subsystem.
The control subsystem contains three units: a molecular feature extraction unit, a molecular energy prediction training unit, and a molecular energy prediction output unit. These are used, respectively, to obtain the three-dimensional molecular graph features, to obtain the training result information of the molecular structure prediction model, and to obtain molecular energy information from newly input molecular structure information.
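The division into a storage subsystem and a control subsystem with three units can be sketched as plain Python classes. This is a structural illustration only: the class names, method names, and the toy feature extractor and trainer are hypothetical stand-ins, not the components described in the specification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence

@dataclass
class StorageSubsystem:
    """Keeps predicted energies keyed by a structure identifier."""
    records: Dict[str, float] = field(default_factory=dict)

    def save(self, structure_id: str, energy: float) -> None:
        self.records[structure_id] = energy

@dataclass
class ControlSubsystem:
    extract_features: Callable[[Sequence[float]], List[float]]    # feature extraction unit
    train: Callable[[List[List[float]], List[float]],
                    Callable[[List[float]], float]]               # prediction training unit
    storage: StorageSubsystem
    model: Optional[Callable[[List[float]], float]] = None

    def fit(self, structures, energies) -> None:
        features = [self.extract_features(s) for s in structures]
        self.model = self.train(features, energies)

    def predict(self, structure_id, structure) -> float:          # prediction output unit
        energy = self.model(self.extract_features(structure))
        self.storage.save(structure_id, energy)
        return energy

def mean_trainer(features, energies):
    """Toy trainer: the returned model always predicts the mean training energy."""
    mean_energy = sum(energies) / len(energies)
    return lambda feature: mean_energy

control = ControlSubsystem(lambda s: [sum(s)], mean_trainer, StorageSubsystem())
control.fit([[0.0, 1.0], [1.0, 2.0]], [1.0, 3.0])
predicted = control.predict("new_molecule", [0.5, 0.5])
```

Swapping the toy callables for a real feature extractor and trainer would not change the surrounding structure, which is the point of separating the three units.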
As can be seen from the above method, compared with existing solutions, the prediction model in this application can greatly improve the efficiency and accuracy of relationship construction. With the continuous development of fields such as artificial intelligence and machine learning and their steady penetration into molecular science and technology, molecular property prediction and design have become more complex than with the earlier fully connected networks. Molecular structure-energy relationship prediction models based on artificial intelligence or machine learning depend on how molecular features are represented; with a molecular graph representation, the atoms and chemical bonds within a molecule are treated as nodes and edges that describe the structural characteristics of the molecule. By combining the deep network model methods described above, developers can effectively construct the structure-energy relationship of compound molecules, providing powerful technical methods and means for downstream tasks such as studying energy-related dynamic properties of molecules and designing molecules with specified functions according to user requirements.
However, the approaches currently in use can neither characterize molecules comprehensively nor efficiently predict the molecular structure-energy relationship information of a specified compound.
Therefore, in determining the three-dimensional molecular graph information of the specified compound molecule, the model training method provided in this specification comprehensively draws on the characterization data of the various molecular structures of the specified compound molecule, so that the resulting three-dimensional molecular graph information can fully characterize the molecular structure of the specified compound molecule.
Moreover, because the three-dimensional molecular graph features of the specified compound molecule are determined from its invariant features and equivariant features, the structural characteristics of the specified compound molecule can be fully represented, which ensures that the prediction model can subsequently predict energy information accurately and reasonably from the three-dimensional molecular graph features of the specified compound molecule.
The above describes one or more method implementations of this specification. Based on the same idea, this specification also provides schematic diagrams of a corresponding model training device and a task execution device based on molecular energy information, as shown in Figures 4 and 5.
Figure 4 is a schematic diagram of a model training device provided in this specification, including:
an acquisition module 401, configured to obtain characterization data of a specified compound molecule, the characterization data being used to characterize the position information and attribute information of each atom in the specified compound molecule;
a determination module 402, configured to process the characterization data and determine the three-dimensional molecular graph information of the specified compound molecule;
an input module 403, configured to input the three-dimensional molecular graph information into a prediction model to be trained, so that the prediction model determines the three-dimensional molecular graph features corresponding to the specified compound molecule from the three-dimensional molecular graph information, based on the equivariance between the position information and its corresponding embedded features and the invariance between the attribute information and its corresponding embedded features;
a prediction module 404, configured to predict the molecular energy information corresponding to the specified compound molecule according to the three-dimensional molecular graph features;
a training module 405, configured to train the prediction model with the optimization objective of minimizing the deviation between the predicted molecular energy information and the actual molecular energy information corresponding to the specified compound molecule.
Optionally, the acquisition module 401 is specifically configured to select initial data of the specified compound molecule from a data set of molecular compounds, and to determine the characterization data from the initial data of the specified compound molecule.
Optionally, the position information includes: the coordinates of each atom in the specified compound molecule in a specified coordinate system;
the attribute information includes: the type of each atom in the specified compound molecule, the direction vector between any two atoms in the specified compound molecule, and the connection information between any two atoms in the specified compound molecule.
Optionally, the prediction model includes a graph attention network;
the input module 403 is specifically configured to determine, through the graph attention network, the attention weights corresponding to the specified compound molecule; to determine the invariant features and equivariant features corresponding to the specified compound molecule from the attention weights and the embedded features that the graph attention network determines from the three-dimensional molecular graph information; and to determine the three-dimensional molecular graph features from the invariant features and the equivariant features.
Figure 5 is a schematic diagram of a task execution device based on molecular energy information provided in this specification, including:
a receiving module 501, configured to receive a task execution request for an original compound and obtain characterization data of the original compound molecule;
a construction module 502, configured to process the characterization data of the original compound molecule and determine the three-dimensional molecular graph information corresponding to the original compound molecule;
a determination module 503, configured to input the three-dimensional molecular graph information of the original compound molecule into a pre-trained prediction model, so that the prediction model determines the three-dimensional molecular graph features corresponding to the original compound molecule and, from those features, determines the molecular energy information corresponding to the original compound molecule, where the prediction model is trained by the model training method described above;
an execution module 504, configured to execute a task according to the molecular energy information corresponding to the original compound molecule.
Optionally, the task includes: a molecular regulation task or a molecular design task;
the execution module 504 is specifically configured to determine the correspondence between the molecular structure and the molecular energy of the original compound molecule from the molecular energy information and the characterization data corresponding to the original compound molecule, and to perform the molecular design task or the molecular regulation task based on that correspondence.
This specification also provides a computer-readable storage medium storing a computer program, where the computer program can be used to execute the model training method provided in Figure 1 or the task execution method based on molecular energy information provided in Figure 2.
This specification also provides a schematic structural diagram, shown in Figure 6, of an electronic device corresponding to Figure 1 or Figure 2. As shown in Figure 6, at the hardware level the electronic device includes a processor, an internal bus, a network interface, memory, and non-volatile storage, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into memory and runs it, to implement the model training method described in Figure 1 or the molecular structure information recommendation method described in Figure 2.
The systems, devices, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product with a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described in terms of functionally separate units. Of course, when implementing this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art will understand that embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This specification is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this specification. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data.
还需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "comprises," "comprises," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements not only includes those elements, but also includes Other elements are not expressly listed or are inherent to the process, method, article or equipment. Without further limitation, an element defined by the statement "comprises a..." does not exclude the presence of additional identical elements in a process, method, article, or device that includes the stated element.
本领域技术人员应明白,本说明书的实施例可提供为方法、系统或计算机程序产品。因此,本说明书可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。而且,本说明书可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present specification may be provided as methods, systems, or computer program products. Thus, the present description may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
本说明书可以在由计算机执行的计算机可执行指令的一般上下文中描述,例如程序模块。一般地,程序模块包括执行特定任务或实现特定抽象数据类型的例程、程序、对象、组件、数据结构等等。也可以在分布式计算环境中实践本说明书,在这些分布式计算环境中,由通过通信网络而被连接的远程处理设备来执行任务。在分布式计算环境中,程序模块可以位于包括存储设备在内的本地和远程计算机存储介质中。This specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types. The present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.
The above are merely embodiments of this specification and are not intended to limit it. Those skilled in the art may make various modifications and changes to this specification. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of this specification shall fall within the scope of its claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311702625.8A CN117393075A (en) | 2023-12-12 | 2023-12-12 | Model training method and task execution method based on molecular energy information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311702625.8A CN117393075A (en) | 2023-12-12 | 2023-12-12 | Model training method and task execution method based on molecular energy information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117393075A (en) | 2024-01-12 |
Family
ID=89463540
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311702625.8A Pending CN117393075A (en) | 2023-12-12 | 2023-12-12 | Model training method and task execution method based on molecular energy information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117393075A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116524998A (en) * | 2023-06-15 | 2023-08-01 | 之江实验室 | Model training method and molecular property information prediction method and device |
CN116597892A (en) * | 2023-05-15 | 2023-08-15 | 之江实验室 | Model training method and molecular structure information recommending method and device |
CN116798536A (en) * | 2023-06-12 | 2023-09-22 | 杭州碳硅智慧科技发展有限公司 | A training method and device for a molecular multi-conformation prediction model |
CN116959605A (en) * | 2023-01-12 | 2023-10-27 | 腾讯科技(深圳)有限公司 | Molecular property prediction method, training method and device of molecular property prediction model |
2023-12-12: CN application CN202311702625.8A, published as CN117393075A (status: Pending)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116959605A (en) * | 2023-01-12 | 2023-10-27 | 腾讯科技(深圳)有限公司 | Molecular property prediction method, training method and device of molecular property prediction model |
CN116597892A (en) * | 2023-05-15 | 2023-08-15 | 之江实验室 | Model training method and molecular structure information recommending method and device |
CN116798536A (en) * | 2023-06-12 | 2023-09-22 | 杭州碳硅智慧科技发展有限公司 | A training method and device for a molecular multi-conformation prediction model |
CN116524998A (en) * | 2023-06-15 | 2023-08-01 | 之江实验室 | Model training method and molecular property information prediction method and device |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
JP7009433B2 (en) | Methods and devices for neural network generation | |
CN109241412B (en) | A recommendation method, system and electronic device based on network representation learning | |
CN107944629B (en) | A recommendation method and device based on heterogeneous information network representation | |
Chang et al. | A graph-based QoS prediction approach for web service recommendation | |
Guo et al. | A parallel attractor finding algorithm based on Boolean satisfiability for genetic regulatory networks | |
JP5881048B2 (en) | Information processing system and information processing method | |
JP2013511084A (en) | Clustering method and system | |
El Mohadab et al. | Predicting rank for scientific research papers using supervised learning | |
CN112906865B (en) | Neural network architecture search method, device, electronic device and storage medium | |
EP4099333A2 (en) | Method and apparatus for training compound property pediction model, storage medium and computer program product | |
Hao et al. | The implementation of a deep recurrent neural network language model on a Xilinx FPGA | |
CN109241442B (en) | Item recommendation method, readable storage medium and terminal based on predicted value filling | |
Xu et al. | Towards machine-learning-driven effective mashup recommendations from big data in mobile networks and the Internet-of-Things | |
Singh et al. | Analyzing embedding models for embedding vectors in vector databases | |
CN116186295B (en) | Attention-based knowledge map link prediction method, device, equipment and medium | |
CN116541433A (en) | A Diverse API Recommendation Method Based on Association Graph | |
CN115274003A (en) | Molecular characterization model training method, molecular structure prediction method and device | |
CN118467793B (en) | Heterogeneous graph-oriented graph matching method, device and medium | |
CN111914083A (en) | Statement processing method, device and storage medium | |
Zhang et al. | Personalized quality centric service recommendation | |
CN113611354A (en) | Protein torsion angle prediction method based on lightweight deep convolutional network | |
Nguyen et al. | Online learning-based clustering approach for news recommendation systems | |
CN118710754A (en) | Text-to-image generation method, device, equipment and storage medium based on diffusion probability model | |
CN117393075A (en) | Model training method and task execution method based on molecular energy information | |
CN118069814A (en) | Text processing method, device, electronic equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20240112 |