WO2023216834A1 - Method, Apparatus, Device, and Medium for Managing Molecular Predictions - Google Patents

Method, Apparatus, Device, and Medium for Managing Molecular Predictions

Info

Publication number
WO2023216834A1
WO2023216834A1 · PCT/CN2023/089548 · CN2023089548W
Authority
WO
WIPO (PCT)
Prior art keywords
molecular
model
sample
prediction
training
Prior art date
Application number
PCT/CN2023/089548
Other languages
English (en)
French (fr)
Inventor
高翔
高伟豪
肖文之
王智睿
项亮
王崇
Original Assignee
北京字节跳动网络技术有限公司
脸萌有限公司
Priority date
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司 and 脸萌有限公司
Publication of WO2023216834A1

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C — COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 — Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 — Machine learning, data mining or chemometrics

Definitions

  • Exemplary implementations of the present disclosure relate generally to the field of computing, and in particular to methods, apparatus, devices, and computer-readable storage media for managing molecular predictions.
  • With the development of machine learning technology, machine learning has been widely used in various technical fields. Molecular research is an important task in materials science, energy applications, biotechnology, pharmaceutical research, and other fields. Machine learning is now widely applied in such fields and can predict the properties of unknown molecules based on the properties of known molecules.
  • However, machine learning technology relies on a large amount of training data.
  • The collection of training data sets requires extensive experiments and consumes substantial manpower, material resources, and time. How to improve the accuracy of a prediction model when training data is insufficient has therefore become a difficult and popular topic in the field of molecular research.
  • A method for managing molecular predictions is provided.
  • The upstream model is obtained from a part of the network layers in a pre-trained model that describes the correlation between molecular structure and molecular energy.
  • The downstream model is determined based on the molecular prediction target, and the output layer of the downstream model is determined based on the molecular prediction target.
  • A molecular prediction model is generated based on the upstream model and the downstream model.
  • The molecular prediction model describes the correlation between a molecular structure and a molecular prediction target associated with the molecular structure.
  • An apparatus for managing molecular predictions includes: an acquisition module configured to acquire an upstream model from a part of the network layers in a pre-trained model, where the pre-trained model describes the correlation between molecular structure and molecular energy; a determination module configured to determine a downstream model based on a molecular prediction target, where an output layer of the downstream model is determined based on the molecular prediction target; and a generation module configured to generate a molecular prediction model based on the upstream model and the downstream model, where the molecular prediction model describes the correlation between a molecular structure and a molecular prediction target associated with the molecular structure.
  • In a third aspect of the present disclosure, an electronic device includes: at least one processing unit; and at least one memory, the at least one memory being coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method according to the first aspect of the present disclosure.
  • A computer-readable storage medium has a computer program stored thereon.
  • The computer program, when executed by a processor, causes the processor to implement the method according to the first aspect of the present disclosure.
  • Figure 1 illustrates a block diagram of an example environment in which implementations of the present disclosure can be implemented;
  • Figure 2 illustrates a block diagram of a process for managing molecular predictions in accordance with some implementations of the present disclosure;
  • Figure 3 illustrates a block diagram of a process for generating a molecular prediction model based on a pre-trained model in accordance with some implementations of the present disclosure;
  • Figure 4 illustrates a block diagram of a process for obtaining a pre-trained model in accordance with some implementations of the present disclosure;
  • Figure 5 illustrates a block diagram of a loss function for a pre-trained model in accordance with some implementations of the present disclosure;
  • Figure 6 illustrates a block diagram of a process for obtaining a molecular prediction model in accordance with some implementations of the present disclosure;
  • Figure 7 illustrates a block diagram of a loss function for a molecular prediction model in accordance with some implementations of the present disclosure;
  • Figure 8 illustrates a flowchart of a method for managing molecular predictions in accordance with some implementations of the present disclosure;
  • Figure 9 illustrates a block diagram of an apparatus for managing molecular predictions in accordance with some implementations of the present disclosure; and
  • Figure 10 illustrates a block diagram of a device capable of implementing various implementations of the present disclosure.
  • The term "including" and similar expressions should be understood as open-ended inclusion, i.e., "including but not limited to".
  • the term “based on” should be understood to mean “based at least in part on.”
  • the term “one implementation” or “the implementation” shall be understood to mean “at least one implementation”.
  • the term “some implementations” should be understood to mean “at least some implementations”.
  • Other explicit and implicit definitions may be included below.
  • the term “model” may represent an association between various data. For example, the above-mentioned correlation relationships can be obtained based on various technical solutions that are currently known and/or will be developed in the future.
  • For example, in response to receiving an active request from the user, a prompt message is sent to the user to clearly remind the user that the requested operation will require the acquisition and use of the user's personal information. The user can thus autonomously choose, based on the prompt information, whether to provide personal information to software or hardware such as electronic devices, applications, servers, or storage media that perform the operations of the technical solution of the present disclosure.
  • the method of sending prompt information to the user can be, for example, a pop-up window, and the prompt information can be presented in the form of text in the pop-up window.
  • The pop-up window can also host a selection control for the user to choose "agree" or "disagree" to provide personal information to the electronic device.
  • Figure 1 illustrates a block diagram of an example environment 100 in which implementations of the present disclosure can be implemented.
  • In the environment 100 of Figure 1, it is desirable to train and use a model (i.e., a predictive model 130) that is configured to predict molecular properties (e.g., molecular force fields, or molecular properties such as solubility and stability) of a given molecular structure.
  • The environment 100 includes a model training system 150 and a model application system 152.
  • the upper part of Figure 1 shows the process of the model training phase, and the lower part shows the process of the model application phase.
  • Before training, the parameter values of the prediction model 130 may have initial values, or may have pre-trained parameter values obtained through a pre-training process.
  • Through the training process, the parameter values of the prediction model 130 may be updated and adjusted.
  • After training is completed, the prediction model 130' can be obtained.
  • At this point, the parameter values of the prediction model 130' have been updated, and based on the updated parameter values, the prediction model 130' can be used to implement prediction tasks during the model application phase.
  • In the model training phase, the predictive model 130 may be trained using the model training system 150 based on the training data set 110 including the plurality of training data 112.
  • Each training data 112 may take the form of a two-tuple including a molecular structure 120 and molecular properties 122.
  • In different training data 112, the molecular properties 122 may include molecular force fields, molecular properties (e.g., solubility, stability, etc.), and/or other properties.
  • At this point, the prediction model 130 may be trained using the training data 112, which includes the molecular structure 120 and the molecular properties 122.
  • the training process can be performed iteratively using large amounts of training data.
  • the predictive model 130 can determine the molecular properties associated with different molecular structures.
  • In the model application stage, the model application system 152 can be used to invoke the prediction model 130' (which at this point has the trained parameter values). For example, input data 140 (including a target molecular structure 142) may be received, and a prediction result 144 for the molecular properties of the target molecular structure 142 may be output.
  • the model training system 150 and the model application system 152 may include any computing system with computing capabilities, such as various computing devices/systems, terminal devices, servers, etc.
  • The terminal device may be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile phone, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, media computer, or multimedia tablet, or any combination of the foregoing, including accessories and peripherals for these devices or any combination thereof.
  • Servers include, but are not limited to, mainframes, edge computing nodes, computing devices in cloud environments, and so on.
  • The model training system 150 and the model application system 152 may be integrated in the same system or device. Implementations of the present disclosure are not limited in this regard. Exemplary implementations of model training and model application will be described below with continued reference to the accompanying drawings.
  • The molecular properties 122 in the training data 112 should be consistent with the prediction target (i.e., the target that the prediction model 130 is expected to output).
  • In other words, when it is desired to predict a molecular force field, the molecular properties 122 in the training data 112 should be measured data of the molecular force field; the prediction model 130 can then receive a molecular structure and output the predicted value of the corresponding molecular force field. When it is desired to predict a molecular property (e.g., solubility), the molecular properties 122 in the training data 112 should be measured solubility data; the prediction model 130 can then receive a molecular structure and output a corresponding solubility prediction.
  • The first stage is a pre-training process, which focuses on the basic physical properties (for example, molecular energy) provided by a specific molecular structure; a pre-trained model can be obtained first.
  • The second stage focuses on fine-tuning, that is, on the correlation between the basic physical properties of the molecule and other prediction targets.
  • During fine-tuning, the pre-trained model can be adjusted to obtain a prediction model with higher accuracy.
  • a pre-trained model can be generated based on a large amount of known public data in the pre-training stage. Afterwards, a molecular prediction model that achieves a specific prediction goal is established based on the pre-trained model, and a small amount of dedicated training data that achieves the specific prediction goal is used to fine-tune the molecular prediction model. In this way, the accuracy of molecular prediction models can be improved when dedicated training data is limited.
  • Figure 2 illustrates a block diagram 200 of a process for managing molecular predictions in accordance with some implementations of the present disclosure.
  • a pre-trained model 240 can be determined first, and the pre-trained model 240 can describe the correlation between molecular structure and molecular energy.
  • the pre-trained model 240 may include multiple network layers, and the pre-trained model 240 may be utilized to generate a molecule prediction model 210 for a specific molecule prediction target 250.
  • the molecule prediction model 210 may include an upstream model 220 and a downstream model 230, and a part of the network layers 242 may be selected from a plurality of network layers of the pre-trained model 240 to form the upstream model 220.
  • Molecular structure is built on spectroscopic data and describes the three-dimensional arrangement of atoms in the molecule. It will be understood that molecular structure is the intrinsic basis of the molecule and determines its other properties to a large extent. Molecules with a specific molecular structure will have similar properties, and these properties are often determined by the energy of the molecule. According to an exemplary implementation of the present disclosure, since molecular structure and molecular energy are the basis for other molecule-related characteristics, it is proposed to use a pre-trained model 240 (describing the correlation between molecular structure and molecular energy) to construct a molecular prediction model 210 that achieves a specific prediction target.
  • The multiple network layers of the pre-trained model 240 have accumulated rich knowledge about the intrinsic factors of the molecule, and some of these network layers can be used directly to build the molecular prediction model 210. In this way, the number of training samples needed to train the molecular prediction model 210 from scratch can be greatly reduced while the accuracy of the molecular prediction model 210 is maintained. It will be appreciated that since numerous molecular data sets are currently publicly available, these data sets can be utilized to generate the pre-trained model 240.
  • Further, the downstream model 230 may be determined based on the specific molecular prediction target 250, and the output layer of the downstream model 230 is determined based on the molecular prediction target 250.
  • Molecular prediction target 250 represents the target for which the output of molecular prediction model 210 is desired.
  • the molecular prediction model 210 may be generated based on the upstream model 220 and the downstream model 230 to describe the association between the molecular structure and the molecular prediction target 250 associated with the molecular structure.
  • the molecular prediction target 250 may represent a target of desired output, such as a molecular force field, molecular properties, or other targets.
  • Utilizing exemplary implementations of the present disclosure, the amount of dedicated training data required to train the molecular prediction model 210 can be reduced on the one hand; on the other hand, the pre-trained model 240 can be shared among different prediction targets (e.g., molecular force fields, molecular properties, etc.), thereby improving the efficiency of generating the molecular prediction model 210.
  • Figure 3 illustrates a block diagram 300 of a process for generating a molecular prediction model 210 based on a pre-trained model 240, in accordance with some implementations of the present disclosure.
  • The pre-trained model 240 can describe the correlation between the molecular structure 310 and the molecular energy 314.
  • The pre-trained model 240 may include N network layers; specifically, the 1st layer serves as an input layer for receiving the input molecular structure 310, and the Nth layer serves as the output layer 312 that outputs the molecular energy 314.
  • The upstream model 220 may be determined from a set of network layers, among the plurality of network layers in the pre-trained model 240, other than the output layer 312.
  • the first N-1 network layers in the pre-trained model 240 can be directly used as the upstream model 220 of the molecule prediction model 210.
  • a downstream model 230 may be generated based on the molecular prediction target 250 .
  • the molecule prediction model 210 can directly utilize the multifaceted knowledge about molecules obtained in layers 1 to N and then apply it to perform prediction tasks associated with a specific molecule prediction target 250 .
  • the molecule prediction model 210 can receive the molecular structure 320 and output a target value 322 corresponding to the molecule prediction target 250.
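  • As an illustration of composing the two parts, the following minimal sketch (in PyTorch) slices the first N-1 layers of a pre-trained model into the upstream model 220 and attaches a freshly initialized head as the downstream model 230. The stack of linear layers is a hypothetical stand-in for the backbone; the patent does not fix a concrete architecture:

```python
import torch
import torch.nn as nn

N_LAYERS, HIDDEN = 6, 128

# Pre-trained model 240: layers 1..N, where layer N is the output layer 312
# that maps hidden features to a scalar molecular energy 314.
pretrained = nn.Sequential(
    *[nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.SiLU()) for _ in range(N_LAYERS - 1)],
    nn.Linear(HIDDEN, 1),  # layer N: output layer 312
)

# Upstream model 220: layers 1..N-1 of the pre-trained model, reused as-is.
upstream = nn.Sequential(*list(pretrained.children())[:-1])

# Downstream model 230: a freshly initialized output layer whose output
# dimension is chosen by the molecular prediction target 250.
downstream = nn.Linear(HIDDEN, 1)

# Molecular prediction model 210: upstream and downstream connected in series.
prediction_model = nn.Sequential(upstream, downstream)

features = torch.randn(4, HIDDEN)           # placeholder molecular features
target_values = prediction_model(features)  # target value 322
```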
  • The backbone model used to implement the pre-trained model 240 may be selected based on the molecular prediction target 250.
  • For example, when the molecular prediction target 250 is to predict a molecular force field, the pre-trained model 240 can be implemented based on a Geometric Message Passing Neural Network (GemNet) model.
  • When the molecular prediction target 250 is to predict molecular properties, the pre-trained model 240 can be implemented based on an E(n)-Equivariant Graph Neural Network (EGNN) model.
  • Alternatively and/or additionally, any of the following models may also be selected: a Symmetric Gradient Domain Machine Learning (sGDML) model, a NequIP model, a GemNet-T model, and so on.
  • Alternatively, other numbers of network layers may be selected from the pre-trained model 240; for example, the 1st through (N-2)th network layers may be selected, or fewer network layers may be selected. Although the number of selected network layers is then smaller, the selected network layers still include extensive knowledge about molecules. In this case, the number of training samples required to train the molecular prediction model 210 can still be reduced.
  • FIG. 4 illustrates a block diagram 400 of a process for obtaining a pre-trained model 240 in accordance with some implementations of the present disclosure.
  • The pre-trained model 240 can be trained using the pre-training data 420 in the pre-training data set 410, so that the loss function 430 associated with the pre-trained model 240 meets a predetermined condition; the pre-training data 420 can include a sample molecular structure 422 and a sample molecular energy 424.
  • The PubChemQC PM6 data set is a public data set that includes hundreds of millions of molecular structures and their corresponding electronic properties.
  • The Quantum Machine 9 (QM9) data set provides information on the geometric structures, energies, and electronic and thermodynamic properties of molecules.
  • The pre-training data set 410 may include a plurality of training data 420, and the training data 420 may include sample molecular structures 422 and sample molecular energies 424.
  • the PubChemQC PM6 data set includes a large number of molecular structures and their corresponding electronic properties. For example, this data set includes approximately 86 million optimized 3D molecular structures and their associated molecular energies. These molecular structures and molecular energies can be used as training data.
  • Specifically, the backbone model of the pre-trained model 240 can be selected, and the loss function 430 of the pre-trained model 240 can be constructed.
  • The loss function 430 can represent the difference between the true value and the predicted value of the sample data, so that the pre-training process can iteratively optimize the pre-trained model 240 in a direction that gradually reduces this difference.
  • various publicly available data sets can be directly used as the pre-training data set 410.
  • these publicly available data sets include huge amounts of sample data, making it possible to obtain basic knowledge of molecular structures and molecular energies without the need to prepare specialized training data.
  • the sample data in these data sets have been studied for a long time and have been proven to be accurate or relatively accurate data.
  • a more accurate pre-training model 240 can be obtained.
  • Since the molecular prediction model 210 that achieves the specific molecular prediction target 250 includes a part of the pre-trained model 240, this in turn helps ensure that the subsequently generated molecular prediction model 210 is also reliable.
  • the loss function 430 may include various aspects.
  • FIG. 5 shows a block diagram 500 of the loss function 430 for the pre-trained model 240 according to some implementations of the present disclosure.
  • The loss function 430 may include an energy loss 510, where the energy loss 510 represents the difference between the sample molecular energy 424 and the predicted value of the sample molecular energy 424 obtained based on the sample molecular structure 422.
  • Specifically, the energy loss 510 may be determined based on Formula 1 below:

        L_energy = d(E, Ê),  Ê = Z(R)        (Formula 1)

  • In Formula 1, L_energy denotes the energy loss 510, R denotes the molecular structure, E denotes the molecular energy of the molecule having the molecular structure R, Z denotes the pre-trained model 240, Ê denotes the predicted value of the molecular energy E obtained based on the molecular structure R and the pre-trained model 240, and d denotes the difference between E and Ê.
  • different formats may be used to describe molecular structures.
  • For example, the molecular structure can be represented in SMILES or other formats; as another example, the molecular structure in the form of atomic coordinates can be obtained through tools such as RDKit; as yet another example, the molecular structure can be represented in the form of a molecular graph.
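  • As an illustration, an estimated molecular structure in atomic-coordinate form can be obtained from a SMILES string with RDKit roughly as in the sketch below; the aspirin SMILES and the MMFF relaxation step are illustrative choices, not prescribed by the text:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"        # aspirin, used only as an example
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)                       # add explicit hydrogens
AllChem.EmbedMolecule(mol, randomSeed=0)    # generate approximate 3D coordinates
AllChem.MMFFOptimizeMolecule(mol)           # quick force-field relaxation
coords = mol.GetConformer().GetPositions()  # (num_atoms, 3) array of atomic coordinates
print(coords.shape)
```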
  • Formula 1 expresses the pre-training target in a quantitative manner.
  • In this way, based on each pre-training data 420 in the pre-training data set 410, the parameters of each network layer of the pre-trained model 240 can be adjusted in a manner that minimizes the energy loss 510, so that the pre-trained model 240 can accurately describe the correlation between the molecular structure 310 and the molecular energy 314.
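  • As a minimal sketch, the energy loss of Formula 1 might be computed as follows, taking the distance d to be the mean absolute error; the concrete choice of d is left open by the text:

```python
import torch
import torch.nn.functional as F

def energy_loss(model: torch.nn.Module,
                structure: torch.Tensor,
                energy_label: torch.Tensor) -> torch.Tensor:
    predicted = model(structure)               # Ê = Z(R)
    return F.l1_loss(predicted, energy_label)  # d(E, Ê)
```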
  • Alternatively and/or additionally, the loss function 430 may include an estimated energy loss 520 that represents the difference between the sample molecular energy 424 and a predicted value of the sample molecular energy 424 obtained based on the sample molecular structure 422, where the sample molecular structure is estimated.
  • Specifically, the estimated energy loss 520 may be determined based on Formula 2 below:

        L_noisy = d(E, Ê_noisy),  Ê_noisy = Z(R_noisy)        (Formula 2)

  • In Formula 2, L_noisy denotes the estimated energy loss 520, R_noisy denotes the estimated molecular structure, E denotes the molecular energy of the molecule having the molecular structure R_noisy, Z denotes the pre-trained model 240, Ê_noisy denotes the predicted value of the molecular energy E obtained based on the estimated molecular structure R_noisy and the pre-trained model 240, and d denotes the difference between E and Ê_noisy.
  • The estimated molecular structure can be determined from SMILES using tools such as RDKit.
  • Formula 2 expresses the pre-training target in a quantitative manner. Here, the representation of the estimated molecular structure R_noisy is consistent with the input molecular structure of the downstream task, which can improve the accuracy of the prediction results.
  • Alternatively and/or additionally, the loss function 430 may include a force loss 530, which represents the difference between a predetermined gradient (e.g., 0) and the gradient, with respect to the sample molecular structure 422, of the predicted value of the sample molecular energy 424 obtained based on the sample molecular structure 422. It will be appreciated that the PubChemQC PM6 data set was created for the purpose of optimizing the geometry of the molecules, so that the molecular energy is minimized.
  • Molecular force represents the gradient of the energy with respect to the atomic coordinates. Since the molecule is relatively stable in this optimized state, the gradient should have a value close to 0.
  • Data augmentation can therefore be implemented based on the pre-training data 420 in the pre-training data set 410: the potential force exerted on the atoms is the gradient of the energy, which is equivalent to a supervised learning loss that assumes the force label is 0. That is, the force loss 530 may be determined based on Formula 3 below:

        L_force = d(∇_R Z(R), F),  F = 0        (Formula 3)

    where L_force denotes the force loss 530, ∇_R Z(R) denotes the gradient, with respect to the molecular structure, of the predicted molecular energy obtained based on the molecular structure R and the pre-trained model Z, F denotes the predetermined gradient (F = 0), and d denotes the difference between the computed gradient and the predetermined gradient.
  • data augmentation can be performed on the pre-trained data set 410 to include more knowledge about molecular forces in the pre-trained model 240 . In this way, the accuracy of the pre-trained model 240 can be improved, thereby providing more accurate prediction results when the molecular prediction target 250 involves a molecular force field.
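  • The force loss of Formula 3 might be sketched as below: the gradient of the predicted energy with respect to the atomic coordinates is penalized against a zero force label. The scalar-energy model interface is an assumption here:

```python
import torch

def force_loss(model: torch.nn.Module, coords: torch.Tensor) -> torch.Tensor:
    coords = coords.clone().requires_grad_(True)
    energy = model(coords).sum()  # predicted molecular energy Z(R)
    # ∇_R Z(R): gradient of the predicted energy w.r.t. atomic coordinates
    grad = torch.autograd.grad(energy, coords, create_graph=True)[0]
    # d(∇_R Z(R), F) with F = 0: supervised loss with a zero force label
    return grad.pow(2).mean()
```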
  • The loss function 430 may be determined based on any one of Formulas 1 to 3. Further, two or more of Formulas 1 to 3 may be considered jointly; for example, the loss function 430 for pre-training may be determined based on any one of the following Formulas 4 to 7:

        L = L_energy + α · L_force                       (Formula 4)
        L = L_energy + β · L_noisy                       (Formula 5)
        L = L_noisy + α · L_force                        (Formula 6)
        L = L_energy + α · L_force + β · L_noisy         (Formula 7)

    where the meaning of each symbol is as described above, and α and β denote predetermined values in [0, 1].
  • The loss function 430 may be determined based on the specific prediction target. For example, when it is desired to predict a molecular force field, Formula 3, 4, 6, or 7 can be used; when the downstream data involve estimated molecular structures, Formula 2, 5, 6, or 7 can be used, and so on.
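  • A combined pre-training loss in the spirit of Formulas 4 to 7 might then be assembled as in the sketch below, reusing the energy_loss and force_loss sketches above; the exact combination and weights are reconstructions rather than the patent's formula images:

```python
def pretraining_loss(model, coords, noisy_coords, energy_label,
                     alpha: float = 0.5, beta: float = 0.5):
    loss = energy_loss(model, coords, energy_label)                      # Formula 1 term
    loss = loss + beta * energy_loss(model, noisy_coords, energy_label)  # Formula 2 term
    loss = loss + alpha * force_loss(model, coords)                      # Formula 3 term
    return loss
```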
  • a predetermined stopping condition may be specified, so that when the pre-training model 240 meets the stopping condition, the pre-training process is stopped.
  • Utilizing exemplary implementations of the present disclosure, the complex pre-training process can be converted into simple mathematical operations implemented based on Formulas 1 to 7. In this way, a higher-accuracy pre-trained model 240 can be obtained using the public pre-training data set 410, without preparing dedicated training data.
  • The 1st through (N-1)th network layers in the pre-trained model 240 can be directly used as the upstream model 220 of the molecular prediction model 210.
  • Further, a downstream model 230 of the molecular prediction model 210 may be determined based on the molecular prediction target 250.
  • the downstream model 230 may include one or more network layers.
  • the molecular prediction target 250 may include a molecular force field and/or a molecular property.
  • a single network layer can be used to implement the downstream model 230, that is, the downstream model 230 only includes a single output layer.
  • the downstream model 230 may also include two or more network layers. At this time, the last network layer among the plurality of network layers in the downstream model 230 is the output layer of the downstream model 230 .
  • the upstream model 220 and the downstream model 230 may be connected to obtain the final molecular prediction model 210.
  • various parameters in the upstream model 220 are directly obtained from the pre-trained model 240, and the parameters of the downstream model 230 can be set to any initial values and/or values obtained through other means.
  • random initial values may be used.
  • Downstream tasks may require the final output layer to produce outputs of different dimensions than in pre-training. Even when the dimensions are the same, randomly initializing the parameters of the output layer often yields a higher-accuracy molecular prediction model 210, because fine-tuning then provides a less biased loss gradient.
  • The molecular prediction model 210 can then be used as an overall prediction model and trained using a dedicated data set associated with the molecular prediction target 250.
  • At this point, a higher-accuracy molecular prediction model 210 can be obtained using a small amount of dedicated training data.
  • The training data 620 may include a sample molecular structure 622 and a sample target measurement 624 corresponding to the molecular prediction target 250.
  • For example, when the molecular prediction target 250 is a molecular force field, the sample target measurement 624 may be a measurement of the molecular force field; when the molecular prediction target 250 is solubility, the sample target measurement 624 may be a solubility measurement.
  • A training data set 610 corresponding to the molecular prediction target 250 may be obtained.
  • The training data set 610 may be a dedicated data set prepared for the molecular prediction target 250 (for example, through experiments).
  • The training data set 610 typically includes less training data (e.g., thousands of samples or fewer) than the pre-training data set 410, which includes large amounts of pre-training data (e.g., millions of samples or more). In this way, instead of collecting massive amounts of dedicated training data, a more accurate molecular prediction model 210 can be obtained using limited dedicated training data.
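  • A fine-tuning loop over such a small dedicated data set might look like the following sketch; the optimizer settings and the (structure, measurement) tensor format are illustrative assumptions:

```python
import torch

def fine_tune(prediction_model, dataset, epochs: int = 10, lr: float = 1e-4):
    optimizer = torch.optim.Adam(prediction_model.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()
    for _ in range(epochs):
        for structure, measurement in dataset:  # sample structure 622 and target 624
            optimizer.zero_grad()
            prediction = prediction_model(structure)
            loss = loss_fn(prediction, measurement)  # difference d(y, ŷ)
            loss.backward()
            optimizer.step()
    return prediction_model
```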
  • Further, a loss function 630 may be constructed for the molecular prediction model 210.
  • Figure 7 illustrates a block diagram 700 of a loss function 630 for the molecular prediction model 210 in accordance with some implementations of the present disclosure.
  • The loss function 630 of the molecular prediction model 210 may include a property loss 710, that is, the difference between the sample target measurement 624 and the predicted value of the sample target measurement 624 obtained based on the sample molecular structure 622.
  • Specifically, the property loss 710 can be determined based on Formula 8 below:

        L_property = d(y, ŷ),  ŷ = M(R)        (Formula 8)

  • In Formula 8, L_property denotes the property loss 710 of the molecular prediction model 210, y denotes the sample target measurement 624 in the training data 620 (corresponding to the molecular structure R), M denotes the molecular prediction model 210, ŷ denotes the predicted value obtained based on the molecular structure R and the molecular prediction model 210, and d denotes the difference between y and ŷ.
  • The loss function 630 can be determined by Formula 8, and fine-tuning can be performed in a direction that minimizes the loss function 630.
  • In this way, the complex process of fine-tuning the molecular prediction model 210 can be converted into a simple and efficient mathematical operation.
  • When the molecular prediction target 250 is a molecular force field, the loss function 630 of the molecular prediction model 210 may further include a force field loss 720.
  • The force field loss 720 includes the difference between a predetermined gradient and the gradient, with respect to the sample molecular structure 622, of the predicted value of the sample molecular energy obtained based on the sample molecular structure 622. Specifically, the force field loss 720 may be determined based on Formula 9 below:

        L_force-field = d(∇_R Ê, F)        (Formula 9)

  • In Formula 9, L_force-field denotes the force field loss 720 of the molecular prediction model 210, Ê denotes the predicted value of the sample molecular energy obtained based on the sample molecular structure R, the meaning of each other symbol is the same as described in the formulas above, and γ denotes a predetermined value in [0, 1] used to weight the force field loss 720 against the property loss 710.
  • The loss function can be determined by Formula 9, thereby converting the complex process of fine-tuning the molecular prediction model 210 into a simple and efficient mathematical operation.
  • the molecular prediction model 210 can be obtained in a more accurate and efficient manner.
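  • When the molecular prediction target 250 is a molecular force field, the fine-tuning loss combining the property loss 710 and the force field loss 720 might be sketched as follows; treating γ (gamma) as a simple mixing weight and using a zero gradient label are assumptions made for illustration:

```python
import torch

def fine_tune_loss(model, coords, measurement, gamma: float = 0.5):
    coords = coords.clone().requires_grad_(True)
    prediction = model(coords)
    property_loss = (prediction - measurement).abs().mean()  # Formula 8
    grad = torch.autograd.grad(prediction.sum(), coords, create_graph=True)[0]
    force_field_loss = grad.pow(2).mean()                    # Formula 9 term
    return property_loss + gamma * force_field_loss
```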
  • pre-trained models 240 can be obtained based on large amounts of data in known public datasets. Further, the molecular prediction model 210 can be further fine-tuned based on a smaller dedicated training data set that includes a limited amount of training data. In this way, an effective balance can be performed between training accuracy and the various overheads of preparing large amounts of dedicated training data, thereby obtaining a higher-accuracy molecular prediction model 210 at a smaller cost.
  • The training of the molecular prediction model 210 has been described above. The following describes how to use the molecular prediction model 210 to determine predicted values associated with the molecular prediction target 250.
  • The received input data may be processed using the trained molecular prediction model 210 with the trained parameter values. If a target molecular structure is received, a predicted value corresponding to the molecular prediction target may be determined based on the molecular prediction model 210.
  • a target molecular structure to be processed may be input to the molecular prediction model 210 .
  • the target molecular structure can be represented based on SMILES format or atomic coordinate form.
  • The molecular prediction model 210 can output the predicted value corresponding to the target molecular structure.
  • the predicted value may include a predicted value of the corresponding target.
  • the molecular prediction model 210 may output a predicted value of the molecular force field. In this way, the trained molecular prediction model 210 can have higher accuracy, thereby providing a basis for judgment for subsequent processing operations.
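  • The model application stage can then be sketched as follows, where featurize is a hypothetical helper (e.g., the RDKit conversion shown earlier) that turns a SMILES string into the model's input:

```python
import torch

@torch.no_grad()
def predict(prediction_model, featurize, target_smiles: str) -> torch.Tensor:
    structure = featurize(target_smiles)  # target molecular structure 142
    return prediction_model(structure)    # prediction result 144 for target 250

# Hypothetical usage: value = predict(prediction_model, featurize, "CCO")
```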
  • the prediction results using the molecular prediction model 210 achieve higher accuracy in both in-domain testing and out-of-domain testing.
  • Table 1 below shows in-domain test data.
  • the rows represent the backbone models on which the different prediction models are based, and the columns represent the error data on the predicted values of the molecular force fields derived based on the different prediction models.
  • Taking the data in row 2 ("Aspirin") as an example: the error of using the sGDML model to predict the molecular force field of aspirin is 33.0, the error using the NequIP model is 14.7, the error using the GemNet-T model is 12.6, and the error using the GemNet-T model improved according to the method of the present disclosure is 10.2. It can be seen that the relative improvement reaches 19.0%.
  • the other columns in Table 1 show relevant data for molecular force field predictions for other molecules.
  • the error of molecular force field prediction can be greatly reduced and provide higher accuracy.
  • the improved GemNet-T also achieved higher accuracy in out-of-domain testing.
  • the molecule prediction model 210 may output a predicted value of solubility.
  • the methods of the present disclosure can be utilized to improve EGNN models for use in predicting molecular properties.
  • the improved EGNN model achieves better prediction results.
  • Although solubility is used above as an example of a molecular property, the molecular properties here may include various properties of the molecule, such as solubility, stability, reactivity, polarity, phase, color, magnetism, biological activity, and so on.
  • In this way, an accurate and reliable molecular prediction model 210 can be obtained and utilized to predict molecular properties using only a small amount of dedicated training data.
  • Figure 8 illustrates a flow diagram of a method 800 for managing molecular predictions in accordance with some implementations of the present disclosure.
  • The upstream model is obtained from a part of the network layers in the pre-trained model, where the pre-trained model describes the correlation between molecular structure and molecular energy.
  • The downstream model is determined based on the molecular prediction target, and the output layer of the downstream model is determined based on the molecular prediction target.
  • A molecular prediction model is generated based on the upstream model and the downstream model, where the molecular prediction model describes the correlation between the molecular structure and the molecular prediction target associated with the molecular structure.
  • Obtaining the upstream model includes: obtaining a pre-trained model, where the pre-trained model includes a plurality of network layers; and selecting the upstream model from a set of network layers, among the plurality of network layers, other than the output layer of the pre-trained model.
  • Obtaining the pre-trained model includes: training the pre-trained model using pre-training data in the pre-training data set so that the loss function associated with the pre-trained model satisfies a predetermined condition, where the pre-training data include a sample molecular structure and a sample molecular energy.
  • The loss function includes at least any one of the following: an energy loss, which represents the difference between the sample molecular energy and the predicted value of the sample molecular energy obtained based on the sample molecular structure; an estimated energy loss, which represents the difference between the sample molecular energy and the predicted value of the sample molecular energy obtained based on the sample molecular structure, where the sample molecular structure is estimated; and a force loss, which represents the difference between a predetermined gradient and the gradient, with respect to the sample molecular structure, of the predicted value of the sample molecular energy obtained based on the sample molecular structure.
  • the molecular prediction target includes at least any one of the following: molecular properties and molecular force fields, and the pre-trained model is selected based on the molecular prediction target.
  • the downstream model includes at least one downstream network layer, and the last downstream network layer of the at least one downstream network layer is an output layer of the downstream model.
  • Generating a molecular prediction model based on the upstream model and the downstream model includes: connecting the upstream model and the downstream model to form the molecular prediction model; and training the molecular prediction model using training data in the training data set so that the loss function of the molecular prediction model meets a predetermined condition, where the training data include a sample molecular structure and a sample target measurement corresponding to the molecular prediction target.
  • the loss function of the molecular prediction model includes a difference between a sample target measurement value and a predicted value of the sample target measurement value obtained based on the sample molecular structure.
  • In response to determining that the molecular prediction target is a molecular force field, the loss function of the molecular prediction model further includes: the difference between a predetermined gradient and the gradient, with respect to the sample molecular structure, of the predicted value of the sample molecular energy obtained based on the sample molecular structure.
  • the method 800 further includes: in response to receiving the target molecular structure, determining a predicted value corresponding to the molecular prediction target based on the molecular prediction model.
  • FIG. 9 shows a block diagram of an apparatus 900 for managing molecular predictions in accordance with some implementations of the present disclosure.
  • The apparatus 900 includes: an acquisition module 910 configured to acquire an upstream model from a part of the network layers in a pre-trained model, where the pre-trained model describes the correlation between molecular structure and molecular energy; a determination module 920 configured to determine a downstream model based on a molecular prediction target, where the output layer of the downstream model is determined based on the molecular prediction target; and a generation module 930 configured to generate a molecular prediction model based on the upstream model and the downstream model, where the molecular prediction model describes the correlation between a molecular structure and a molecular prediction target associated with the molecular structure.
  • The acquisition module 910 includes: a pre-acquisition module configured to acquire a pre-trained model, where the pre-trained model includes multiple network layers; and a selection module configured to select the upstream model from a set of network layers, among the multiple network layers, other than the output layer of the pre-trained model.
  • the pre-acquisition module includes: a pre-training module configured to train a pre-training model using pre-training data in the pre-training data set, such that a loss function associated with the pre-training model satisfies Predetermined conditions, pre-training data include sample molecular structure and sample molecular energy.
  • The loss function includes at least any one of the following: an energy loss, which represents the difference between the sample molecular energy and the predicted value of the sample molecular energy obtained based on the sample molecular structure; an estimated energy loss, which represents the difference between the sample molecular energy and the predicted value of the sample molecular energy obtained based on the sample molecular structure, where the sample molecular structure is estimated; and a force loss, which represents the difference between a predetermined gradient and the gradient, with respect to the sample molecular structure, of the predicted value of the sample molecular energy obtained based on the sample molecular structure.
  • the molecular prediction target includes at least any one of the following: molecular properties and molecular force fields, and the pre-trained model is selected based on the molecular prediction target.
  • The downstream model includes at least one downstream network layer, and the last of the at least one downstream network layer is the output layer of the downstream model.
  • The generation module 930 includes: a connection module configured to connect the upstream model and the downstream model to form a molecular prediction model; and a training module configured to train the molecular prediction model using training data in the training data set, so that the loss function of the molecular prediction model satisfies a predetermined condition, where the training data include a sample molecular structure and a sample target measurement corresponding to the molecular prediction target.
  • the loss function of the molecular prediction model includes a difference between a sample target measurement value and a predicted value of the sample target measurement value obtained based on the sample molecular structure.
  • In response to determining that the molecular prediction target is a molecular force field, the loss function of the molecular prediction model further includes: the difference between a predetermined gradient and the gradient, with respect to the sample molecular structure, of the predicted value of the sample molecular energy obtained based on the sample molecular structure.
  • the apparatus 900 further includes: a prediction value determination module configured to, in response to receiving the target molecule structure, determine a prediction value corresponding to the molecule prediction target based on the molecule prediction model.
  • Figure 10 illustrates a block diagram of a device 1000 capable of implementing various implementations of the present disclosure. It should be understood that the computing device 1000 shown in Figure 10 is merely exemplary and should not constitute any limitation on the functionality and scope of the implementations described herein. The computing device 1000 shown in Figure 10 may be used to implement the method 800 shown in Figure 8.
  • computing device 1000 is in the form of a general purpose computing device.
  • The components of the computing device 1000 may include, but are not limited to, one or more processors or processing units 1010, a memory 1020, a storage device 1030, one or more communication units 1040, one or more input devices 1050, and one or more output devices 1060.
  • the processing unit 1010 may be a real or virtual processor and can perform various processes according to a program stored in the memory 1020 . In a multi-processor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capability of the computing device 1000 .
  • The computing device 1000 typically includes a plurality of computer storage media. Such media can be any available media that are accessible to the computing device 1000, including but not limited to volatile and non-volatile media, and removable and non-removable media.
  • The memory 1020 may be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof.
  • The storage device 1030 may be a removable or non-removable medium and may include machine-readable media, such as a flash drive, a magnetic disk, or any other medium capable of storing information and/or data (such as training data for training) that can be accessed within the computing device 1000.
  • Computing device 1000 may further include additional removable/non-removable, volatile/non-volatile storage media.
  • For example, a disk drive may be provided for reading from or writing to a removable, non-volatile disk (e.g., a "floppy disk"), and an optical disc drive may be provided for reading from or writing to a removable, non-volatile optical disc.
  • each drive may be connected to the bus (not shown) by one or more data media interfaces.
  • Memory 1020 may include a computer program product 1025 having one or more program modules configured to perform various methods or actions of various implementations of the disclosure.
  • The communication unit 1040 implements communication with other computing devices through communication media. Additionally, the functionality of the components of the computing device 1000 may be implemented as a single computing cluster or as multiple computing machines capable of communicating over a communication connection. Accordingly, the computing device 1000 may operate in a networked environment using logical connections to one or more other servers, networked personal computers (PCs), or another network node.
  • Input device 1050 may be one or more input devices, such as a mouse, a keyboard, a trackball, etc.
  • Output device 1060 may be one or more output devices, such as a display, speakers, printer, etc.
  • As needed, the computing device 1000 may also communicate via the communication unit 1040 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable a user to interact with the computing device 1000, or with any device (e.g., a network card, a modem, etc.) that enables the computing device 1000 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
  • a computer-readable storage medium is provided with computer-executable instructions stored thereon, wherein the computer-executable instructions are executed by a processor to implement the method described above.
  • a computer program product is also provided, the computer program product is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, and the computer-executable instructions are executed by a processor to implement the method described above.
  • a computer program product is provided, a computer program is stored thereon, and when the program is executed by a processor, the method described above is implemented.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, thereby producing a machine, such that these instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium. These instructions cause the computer, programmable data processing apparatus, and/or other equipment to work in a specific manner. Therefore, the computer-readable medium storing the instructions includes an article of manufacture that includes instructions implementing aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operating steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other equipment implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each box in the flowchart or block diagram may represent a module, segment, or portion of an instruction.
  • a module, program segment, or part of an instruction contains one or more executable instructions that are used to implement specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two consecutive blocks may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or can be implemented using a combination of special-purpose hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

According to implementations of the present disclosure, methods, apparatus, devices, and media for managing molecular predictions are provided. In one method, an upstream model is obtained from a part of the network layers in a pre-trained model, where the pre-trained model describes the correlation between molecular structure and molecular energy. A downstream model is determined based on a molecular prediction target, and the output layer of the downstream model is determined based on the molecular prediction target. A molecular prediction model is generated based on the upstream model and the downstream model, where the molecular prediction model describes the correlation between a molecular structure and a molecular prediction target associated with the molecular structure. Since the upstream model can include a large amount of molecule-related knowledge, the amount of training data required to train the molecular prediction model generated based on the upstream model and the downstream model can be reduced.

Description

Method, Apparatus, Device, and Medium for Managing Molecular Predictions
This application claims priority to Chinese invention patent application No. 202210524875.6, filed on May 13, 2022 and entitled "Method, Apparatus, Device, and Medium for Managing Molecular Predictions".
Technical Field
Exemplary implementations of the present disclosure relate generally to the field of computing, and in particular to methods, apparatus, devices, and computer-readable storage media for managing molecular predictions.
Background
With the development of machine learning technology, machine learning has been widely used in various technical fields. Molecular research is an important task in materials science, energy applications, biotechnology, pharmaceutical research, and other fields. Machine learning is now widely applied in such fields and can predict the properties of unknown molecules based on the properties of known molecules. However, machine learning relies on a large amount of training data, while the collection of training data sets requires extensive experiments and consumes substantial manpower, material resources, and time. How to improve the accuracy of a prediction model when training data is insufficient has therefore become a difficult and popular topic in the field of molecular research.
Summary
According to exemplary implementations of the present disclosure, a solution for managing molecular predictions is provided.
In a first aspect of the present disclosure, a method for managing molecular predictions is provided. In the method, an upstream model is obtained from a part of the network layers in a pre-trained model, where the pre-trained model describes the correlation between molecular structure and molecular energy. A downstream model is determined based on a molecular prediction target, and the output layer of the downstream model is determined based on the molecular prediction target. A molecular prediction model is generated based on the upstream model and the downstream model, where the molecular prediction model describes the correlation between a molecular structure and a molecular prediction target associated with the molecular structure.
In a second aspect of the present disclosure, an apparatus for managing molecular predictions is provided. The apparatus includes: an acquisition module configured to acquire an upstream model from a part of the network layers in a pre-trained model, where the pre-trained model describes the correlation between molecular structure and molecular energy; a determination module configured to determine a downstream model based on a molecular prediction target, where the output layer of the downstream model is determined based on the molecular prediction target; and a generation module configured to generate a molecular prediction model based on the upstream model and the downstream model, where the molecular prediction model describes the correlation between a molecular structure and a molecular prediction target associated with the molecular structure.
In a third aspect of the present disclosure, an electronic device is provided. The electronic device includes: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, causes the processor to implement the method according to the first aspect of the present disclosure.
It should be understood that the content described in this Summary is not intended to identify key or essential features of the implementations of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Brief Description of the Drawings
The above and other features, advantages, and aspects of the implementations of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements, in which:
Figure 1 illustrates a block diagram of an example environment in which implementations of the present disclosure can be implemented;
Figure 2 illustrates a block diagram of a process for managing molecular predictions in accordance with some implementations of the present disclosure;
Figure 3 illustrates a block diagram of a process for generating a molecular prediction model based on a pre-trained model in accordance with some implementations of the present disclosure;
Figure 4 illustrates a block diagram of a process for obtaining a pre-trained model in accordance with some implementations of the present disclosure;
Figure 5 illustrates a block diagram of a loss function for a pre-trained model in accordance with some implementations of the present disclosure;
Figure 6 illustrates a block diagram of a process for obtaining a molecular prediction model in accordance with some implementations of the present disclosure;
Figure 7 illustrates a block diagram of a loss function for a molecular prediction model in accordance with some implementations of the present disclosure;
Figure 8 illustrates a flowchart of a method for managing molecular predictions in accordance with some implementations of the present disclosure;
Figure 9 illustrates a block diagram of an apparatus for managing molecular predictions in accordance with some implementations of the present disclosure; and
Figure 10 illustrates a block diagram of a device capable of implementing various implementations of the present disclosure.
Detailed Description
Implementations of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain implementations of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the implementations set forth here; rather, these implementations are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and implementations of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.
In the description of the implementations of the present disclosure, the term "including" and similar expressions should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one implementation" or "the implementation" should be understood as "at least one implementation". The term "some implementations" should be understood as "at least some implementations". Other explicit and implicit definitions may also be included below. As used herein, the term "model" may represent an association between various data; for example, such a correlation may be obtained based on various technical solutions that are currently known and/or will be developed in the future.
It can be understood that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) shall comply with the requirements of applicable laws, regulations, and relevant provisions.
It can be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user shall be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization shall be obtained.
For example, in response to receiving an active request from the user, a prompt message is sent to the user to clearly remind the user that the requested operation will require the acquisition and use of the user's personal information. The user can thus autonomously choose, based on the prompt information, whether to provide personal information to software or hardware such as electronic devices, applications, servers, or storage media that perform the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user in the form of, for example, a pop-up window, and the prompt information may be presented as text in the pop-up window. In addition, the pop-up window may also host a selection control for the user to choose "agree" or "disagree" to provide personal information to the electronic device.
It can be understood that the above processes of notification and obtaining user authorization are merely illustrative and do not limit the implementations of the present disclosure; other methods that satisfy relevant laws and regulations may also be applied to the implementations of the present disclosure.
Example Environment
Figure 1 illustrates a block diagram of an example environment 100 in which implementations of the present disclosure can be implemented. In the environment 100 of Figure 1, it is desirable to train and use a model (i.e., a prediction model 130) that is configured to predict the molecular properties (e.g., molecular force field, or molecular properties such as solubility and stability) of a molecule having a specific molecular structure. As shown in Figure 1, the environment 100 includes a model training system 150 and a model application system 152. The upper part of Figure 1 shows the process of the model training phase, and the lower part shows the process of the model application phase. Before training, the parameter values of the prediction model 130 may have initial values, or may have pre-trained parameter values obtained through a pre-training process. Through the training process, the parameter values of the prediction model 130 may be updated and adjusted. After training is completed, the prediction model 130' can be obtained. At this point, the parameter values of the prediction model 130' have been updated, and based on the updated parameter values, the prediction model 130' can be used to implement prediction tasks during the model application phase.
In the model training phase, the prediction model 130 may be trained by the model training system 150 based on a training data set 110 including a plurality of training data 112. Here, each training data 112 may take the form of a two-tuple and include a molecular structure 120 and molecular properties 122. In the context of the present disclosure, in different training data 112, the molecular properties 122 may include molecular force fields, molecular properties (such as solubility, stability, etc.), and/or other properties.
At this point, the prediction model 130 may be trained using training data 112 including the molecular structure 120 and the molecular properties 122. Specifically, the training process may be performed iteratively using a large amount of training data. After training is completed, the prediction model 130 can determine the molecular properties associated with different molecular structures. In the model application phase, the model application system 152 may be used to invoke the prediction model 130' (which at this point has the trained parameter values). For example, input data 140 (including a target molecular structure 142) may be received, and a prediction result 144 for the molecular properties of the target molecular structure 142 may be output.
In Figure 1, the model training system 150 and the model application system 152 may include any computing system with computing capabilities, such as various computing devices/systems, terminal devices, servers, and so on. The terminal device may be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile phone, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, media computer, or multimedia tablet, or any combination of the foregoing, including accessories and peripherals of these devices or any combination thereof. Servers include, but are not limited to, mainframes, edge computing nodes, computing devices in cloud environments, and so on.
It should be understood that the components and arrangement in the environment 100 shown in Figure 1 are merely examples; a computing system suitable for implementing the exemplary implementations described in the present disclosure may include one or more different components, other components, and/or different arrangements. For example, although shown as separate, the model training system 150 and the model application system 152 may be integrated in the same system or device. Implementations of the present disclosure are not limited in this regard. Exemplary implementations of model training and model application will be described below with continued reference to the accompanying drawings.
It will be understood that the molecular properties 122 in the training data 112 should be consistent with the prediction target (i.e., the target that the prediction model 130 is expected to output). In other words, when it is desired to predict a molecular force field, the molecular properties 122 in the training data 112 should be measured data of the molecular force field; the prediction model 130 can then receive a molecular structure and output the predicted value of the corresponding molecular force field. When it is desired to predict a molecular property (e.g., solubility), the molecular properties 122 in the training data 112 should be measured solubility data; the prediction model 130 can then receive a molecular structure and output a corresponding solubility prediction.
To ensure prediction accuracy, a large amount of training data has to be collected to train the prediction model 130. In most cases, however, only a small amount of training data exists, and obtaining more may require extensive experiments. Further, the field of molecular research involves millions (or even more) of commonly used molecular structures, which makes it necessary to design dedicated experiments for each molecular structure to obtain its molecular properties. Meanwhile, there are numerous prediction targets in molecular research, so training data has to be collected separately for each of these prediction targets.
Pre-training/fine-tuning solutions have already been proposed, which focus on self-supervised learning strategies. However, in molecule-related prediction models, the input (molecular structure) and the output (molecular properties) impose different intrinsic requirements on molecular modeling. Self-supervised learning tasks can only represent the molecular structure and lack the intermediate knowledge that connects the input and the output. Self-taught pre-training can fill this gap to some extent; however, due to the lack of large-scale labeled data, it may harm the performance of downstream tasks.
In addition, supervised pre-training solutions have been proposed, which can perform multi-task prediction for a large number of molecules based on molecular structure. However, such a solution may cause negative transfer for downstream tasks; that is, the prediction model obtained based on such a solution is not "truly relevant" to the downstream task, so the prediction accuracy is unsatisfactory. It is therefore desirable to obtain a more accurate prediction model using the limited training data available for a specific prediction target.
Architecture of the Molecular Prediction Model
To address the shortcomings of the above technical solutions, according to an exemplary implementation of the present disclosure, a two-stage training solution is proposed. Specifically, the first stage is a pre-training process, which focuses on the basic physical properties (e.g., molecular energy) provided by a specific molecular structure; a pre-trained model can be obtained first. The second stage focuses on fine-tuning, that is, on the correlation between the basic physical properties of the molecule and other prediction targets; at this point, the pre-trained model can be fine-tuned to obtain a prediction model with higher accuracy.
Utilizing exemplary implementations of the present disclosure, a pre-trained model can be generated based on a large amount of known public data in the pre-training stage. Afterwards, a molecular prediction model that achieves a specific prediction target is established based on the pre-trained model, and a small amount of dedicated training data for that specific prediction target is used to fine-tune the molecular prediction model. In this way, the accuracy of the molecular prediction model can be improved when dedicated training data is limited.
In the following, an overview of an exemplary implementation of the present disclosure is described with reference to Figure 2. Figure 2 illustrates a block diagram 200 of a process for managing molecular predictions in accordance with some implementations of the present disclosure. As shown in Figure 2, a pre-trained model 240 can be determined first, and the pre-trained model 240 can describe the correlation between molecular structure and molecular energy. The pre-trained model 240 may include multiple network layers, and the pre-trained model 240 may be utilized to generate a molecular prediction model 210 for a specific molecular prediction target 250. Here, the molecular prediction model 210 may include an upstream model 220 and a downstream model 230, and a part of the network layers 242 may be selected from the multiple network layers of the pre-trained model 240 to form the upstream model 220.
It will be understood that molecular structure is built on spectroscopic data and describes the three-dimensional arrangement of atoms in the molecule. Molecular structure is the intrinsic basis of the molecule and determines its other properties to a large extent. Molecules with a specific molecular structure will have similar properties, and these properties are often determined by the energy of the molecule. According to an exemplary implementation of the present disclosure, since molecular structure and molecular energy are the basis for other molecule-related characteristics, it is proposed to use a pre-trained model 240 (describing the correlation between molecular structure and molecular energy) to construct a molecular prediction model 210 that achieves a specific prediction target.
At this point, the multiple network layers of the pre-trained model 240 have accumulated rich knowledge about the intrinsic factors of the molecule, and some of these network layers can be used directly to build the molecular prediction model 210. In this way, the number of training samples needed to train the molecular prediction model 210 from scratch can be greatly reduced while the accuracy of the molecular prediction model 210 is maintained. It will be appreciated that since numerous molecular data sets are currently publicly available, these data sets can be utilized to generate the pre-trained model 240.
Further, the downstream model 230 may be determined based on the specific molecular prediction target 250, and the output layer of the downstream model 230 is determined based on the molecular prediction target 250. Here, the molecular prediction target 250 represents the target that the molecular prediction model 210 is expected to output. The molecular prediction model 210 may be generated based on the upstream model 220 and the downstream model 230, so as to describe the correlation between a molecular structure and the molecular prediction target 250 associated with the molecular structure. Here, the molecular prediction target 250 may represent a target of desired output, such as a molecular force field, a molecular property, or another target.
Utilizing exemplary implementations of the present disclosure, the amount of dedicated training data required to train the molecular prediction model 210 can be reduced on the one hand; on the other hand, the pre-trained model 240 can be shared among different prediction targets (e.g., molecular force fields, molecular properties, etc.), thereby improving the efficiency of generating the molecular prediction model 210.
模型训练过程
在下文中,将参见图3描述有关基于预训练模型240来构建分子预测模型210的更多细节。图3示出了根据本公开的一些实现方式的用于基于预训练模型240来生成分子预测模型210的过程的框图300。如图3所示,预训练模型240可以描述分子结构310和分子能量314之间的关联关系。预训练模型240可以包括N个网络层,具体地,第1层作为输入层,用于接收输入的分子结构310,并且第N层作为输出层312来输出分子能量314。
根据本公开的一个示例性实现方式,可以从预训练模型240中的多个网络层中的输出层312以外的一组网络层中,确定上游模型220。例如,可以直接将预训练模型240中的前N-1个网络层作为分子预测模型210的上游模型220。进一步,可以基于分子预测目标250来生成下游模型230。以此方式,分子预测模型210可以直接利用第1层至第N层中所获得的有关分子的多方面知识,进而将其应用于执行与特定分子预测目标250相关联的预测任务。如图所示,分子预测模型210可以接收分子结构320,并且输出与分子预测目标250相对应的目标值322。
在下文中,将详细描述有关获取预训练模型240的更多细节。根据本公开的一个示例性实现方式,可以根据分子预测目标250来选择 用于实现预训练模型240的骨干模型。例如,当分子预测目标250为预测分子力场时,可以基于几何消息传递神经网络(Geometric Message Passing Neural Network,缩写GemNet)模型来实现预训练模型240。当分子预测目标250为预测分子性质时,可以基于等变图神经网络(E(n)-Equivariant Graph Neural Network,缩写EGNN)模型来实现预训练模型240。备选地和/或附加地,还可以选择以下任一模型:对称梯度域机器学习(Symmetric Gradient Domain Machine Learning,缩写sGDML)模型、NequIP模型、GemNet-T模型,等等。
Alternatively and/or additionally, a different number of network layers may be selected from the pretrained model 240; for example, layers 1 through N-2 may be selected, or even fewer. Although fewer layers are selected in this case, the selected layers still contain multifaceted knowledge about molecules, so the number of training samples needed to train the molecular prediction model 210 can still be reduced.
The training process performed on the pretrained model 240 may be referred to as the pretraining process; more details are described below with reference to FIG. 4. FIG. 4 shows a block diagram 400 of a process for obtaining the pretrained model 240 according to some implementations of the present disclosure. As shown in FIG. 4, the pretrained model 240 may be trained with pretraining data 420 in a pretraining dataset 410 such that a loss function 430 associated with the pretrained model 240 satisfies a predetermined condition. The pretraining data 420 may include a sample molecular structure 422 and a sample molecular energy 424.
It will be understood that molecular energy has been the subject of long-standing and extensive research, and a large number of public datasets are now available. For example, the PubChemQC PM6 dataset is a public dataset containing hundreds of millions of molecular structures and their corresponding electronic properties. As another example, the Quantum Machine 9 (QM9) dataset provides molecular geometries together with energetic, electronic, and thermodynamic properties. These public datasets (or parts of them) can be used as training data to obtain the pretrained model 240. In other words, after the training process, the specific configurations of layers 1 through N of the pretrained model 240 are obtained.
As shown in FIG. 4, the pretraining dataset 410 may include a plurality of training data 420, and the training data 420 may include a sample molecular structure 422 and a sample molecular energy 424. In the following, the PubChemQC PM6 dataset alone is used as a concrete example of the pretraining dataset 410 to describe how the pretraining process is performed. The PubChemQC PM6 dataset includes a large number of molecular structures and their corresponding electronic properties; for example, it includes approximately 86 million optimized 3D molecular structures and their associated molecular energies. These molecular structures and molecular energies can serve as training data. Specifically, a backbone model for the pretrained model 240 may be selected, and a loss function 430 for the pretrained model 240 may be constructed. The loss function 430 may represent the difference between the ground-truth values of the sample data and the predicted values, so that the pretraining process can iteratively optimize the pretrained model 240 in the direction that gradually reduces this difference.
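The iterative pretraining loop can be sketched as below. The loop, batch size, learning rate, and the L1 distance are assumptions; the random tensors in the usage example merely stand in for featurised molecular structures:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def pretrain(model: torch.nn.Module, dataset, epochs: int = 1,
             lr: float = 1e-4) -> torch.nn.Module:
    """Minimal pretraining loop: iteratively shrink the difference d between
    ground-truth sample energies and predicted energies (Formula 1 below)."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    distance = torch.nn.L1Loss()  # one concrete choice of the difference d
    for _ in range(epochs):
        for features, energy in loader:
            optimiser.zero_grad()
            predicted = model(features).squeeze(-1)  # predicted energy
            loss = distance(predicted, energy)       # energy loss 510
            loss.backward()
            optimiser.step()
    return model

# Toy usage: random vectors stand in for featurised molecular structures.
backbone = torch.nn.Sequential(torch.nn.Linear(128, 64),
                               torch.nn.SiLU(), torch.nn.Linear(64, 1))
pretrained = pretrain(backbone, TensorDataset(torch.randn(256, 128),
                                              torch.randn(256)))
```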
With the exemplary implementations of the present disclosure, various publicly available datasets can be used directly as the pretraining dataset 410. On the one hand, these datasets include enormous numbers of samples, so basic knowledge of molecular structure and molecular energy can be obtained without preparing dedicated training data. On the other hand, the samples in these datasets have been studied for a long time and proven to be accurate or reasonably accurate, so performing the pretraining process on them yields a reasonably accurate pretrained model 240. Further, since the molecular prediction model 210 for the specific molecular prediction target 250 includes part of the pretrained model 240, this in turn ensures that the subsequently generated molecular prediction model 210 is also reliable.
According to an exemplary implementation of the present disclosure, the loss function 430 may include multiple components. FIG. 5 shows a block diagram 500 of the loss function 430 for the pretrained model 240 according to some implementations of the present disclosure. As shown in FIG. 5, the loss function 430 may include an energy loss 510, which represents the difference between the sample molecular energy 424 and a predicted value of the sample molecular energy 424 obtained based on the sample molecular structure 422. Specifically, the energy loss 510 may be determined based on Formula 1 below:
$\mathcal{L}_{\text{energy}} = d\big(E, \hat{E}(R, Z)\big)$  (Formula 1)

In Formula 1, $\mathcal{L}_{\text{energy}}$ denotes the energy loss 510, R denotes a molecular structure, E denotes the molecular energy of the molecule having structure R, Z denotes the pretrained model 240, $\hat{E}(R, Z)$ denotes the predicted value of the molecular energy E obtained based on the molecular structure R and the pretrained model 240, and d denotes the difference between E and $\hat{E}(R, Z)$. According to an exemplary implementation of the present disclosure, molecular structures may be described in different formats. For example, a molecular structure may be represented in SMILES or another format; as another example, a molecular structure in the form of atomic coordinates may further be obtained through tools such as RDKIT; as yet another example, a molecular structure may be represented as a molecular graph.
With the exemplary implementations of the present disclosure, Formula 1 expresses the pretraining objective in a quantitative manner. In this way, the parameters of each network layer of the pretrained model 240 can be adjusted toward minimizing the energy loss 510 based on each pretraining data 420 in the pretraining dataset 410, so that the pretrained model 240 can accurately describe the relationship between the molecular structure 310 and the molecular energy 314.
It will be understood that the training datasets of downstream prediction tasks usually provide molecular structures only in SMILES format and do not provide precise atomic coordinates. In this case, the loss function 430 may include an estimated energy loss 520, which represents the difference between the sample molecular energy 424 and a predicted value of the sample molecular energy 424 obtained based on the sample molecular structure 422, where the sample molecular structure is estimated. Specifically, the estimated energy loss 520 may be determined based on Formula 2 below:
$\mathcal{L}_{\text{noisy}} = d\big(E, \hat{E}(R_{\text{noisy}}, Z)\big)$  (Formula 2)

In Formula 2, $\mathcal{L}_{\text{noisy}}$ denotes the estimated energy loss 520, $R_{\text{noisy}}$ denotes the estimated molecular structure, E denotes the molecular energy of the molecule having structure $R_{\text{noisy}}$, Z denotes the pretrained model 240, $\hat{E}(R_{\text{noisy}}, Z)$ denotes the predicted value of the molecular energy E obtained based on the estimated molecular structure $R_{\text{noisy}}$ and the pretrained model 240, and d denotes the difference between E and $\hat{E}(R_{\text{noisy}}, Z)$. The estimated molecular structure may be determined from SMILES using tools such as RDKIT. With the exemplary implementations of the present disclosure, Formula 2 expresses the pretraining objective quantitatively. Since the representation of the estimated structure $R_{\text{noisy}}$ matches the molecular structures that downstream tasks take as input, the accuracy of prediction results can be improved.
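One way to derive such an estimated structure with RDKit is sketched below; the embedding settings (random seed, MMFF relaxation) are assumptions, not values from the disclosure:

```python
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

def smiles_to_estimated_coords(smiles: str) -> np.ndarray:
    """Build an estimated ("noisy") 3D structure R_noisy from a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)                      # add explicit hydrogens
    AllChem.EmbedMolecule(mol, randomSeed=0)   # approximate 3D conformer
    AllChem.MMFFOptimizeMolecule(mol)          # cheap force-field relaxation
    return mol.GetConformer().GetPositions()   # (num_atoms, 3) coordinates

coords = smiles_to_estimated_coords("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```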
Alternatively and/or additionally, data augmentation may further be provided during pretraining; that is, an additional loss function may be determined based on the data already in the pretraining dataset 410. Specifically, the loss function 430 may include a force loss 530, which represents the difference between a predetermined gradient (e.g., 0) and the gradient, with respect to the sample molecular structure 422, of the predicted value of the sample molecular energy 424 obtained based on the sample molecular structure 422. It will be understood that the PubChemQC PM6 dataset was built for the purpose of optimizing molecular geometries, so its molecular energies are minimized. Molecular force is the gradient of energy with respect to atomic coordinates; since the molecules here are relatively stable, the gradient should have a value close to 0. Data augmentation can thus be implemented based on the pretraining data 420 in the pretraining dataset 410: the latent force exerted on each atom is the gradient of the energy. This is equivalent to a supervised learning loss that assumes a force label of 0. That is, the force loss 530 may be determined based on Formula 3 below:
$\mathcal{L}_{\text{force}} = d\big(\nabla_R \hat{E}(R, Z), F\big)$  (Formula 3)

In Formula 3, $\mathcal{L}_{\text{force}}$ denotes the force loss 530, $\nabla_R \hat{E}(R, Z)$ denotes the gradient, with respect to the molecular structure, of the predicted molecular energy obtained based on the molecular structure R and the pretrained model Z, F denotes the predetermined gradient (F = 0), and d denotes the difference between the computed gradient and the predetermined gradient F = 0. With the exemplary implementations of the present disclosure, data augmentation can be performed on the pretraining dataset 410 so that more knowledge about molecular forces is included in the pretrained model 240. In this way, the accuracy of the pretrained model 240 can be improved, providing more accurate prediction results when the molecular prediction target 250 involves a molecular force field.
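Because the force term is just the gradient of the predicted energy, it can be computed with automatic differentiation. A sketch under the assumption of a coordinate-based energy model (the toy two-layer model and squared-error distance are placeholders):

```python
import torch

def force_loss(energy_model: torch.nn.Module,
               coords: torch.Tensor) -> torch.Tensor:
    """Force loss 530 (Formula 3): for optimised geometries, the gradient of
    the predicted energy w.r.t. atomic coordinates should be close to F = 0."""
    coords = coords.clone().requires_grad_(True)
    energy = energy_model(coords).sum()               # scalar predicted energy
    (grad,) = torch.autograd.grad(energy, coords, create_graph=True)
    return grad.pow(2).mean()                         # distance d to F = 0

# Toy usage: a stand-in energy model over (num_atoms, 3) coordinates.
model = torch.nn.Sequential(torch.nn.Linear(3, 16), torch.nn.SiLU(),
                            torch.nn.Linear(16, 1))
loss = force_loss(model, torch.randn(21, 3))  # aspirin has 21 atoms
```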
According to an exemplary implementation of the present disclosure, the loss function 430 may be determined based on any one of Formulas 1 to 3. Further, two or more of Formulas 1 to 3 may be considered jointly; for example, the loss function 430 for pretraining may be determined based on any one of Formulas 4 to 7 below:

$\mathcal{L} = \mathcal{L}_{\text{energy}} + \beta\,\mathcal{L}_{\text{force}}$  (Formula 4)

$\mathcal{L} = \mathcal{L}_{\text{energy}} + \alpha\,\mathcal{L}_{\text{noisy}}$  (Formula 5)

$\mathcal{L} = \alpha\,\mathcal{L}_{\text{noisy}} + \beta\,\mathcal{L}_{\text{force}}$  (Formula 6)

$\mathcal{L} = \mathcal{L}_{\text{energy}} + \alpha\,\mathcal{L}_{\text{noisy}} + \beta\,\mathcal{L}_{\text{force}}$  (Formula 7)

In Formulas 4 to 7, the symbols have the same meanings as in the formulas above, and α and β each denote a predetermined value in [0, 1]. According to an exemplary implementation of the present disclosure, the loss function 430 may be determined based on the specific prediction target. For example, when a molecular force field is to be predicted, Formula 3, 4, 6, or 7 may be used; when the downstream data involves estimated molecular structures, Formula 2, 5, 6, or 7 may be used; and so on.
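A minimal sketch of dispatching among these combinations is given below; the dispatch rule and the 0.5 weights are assumptions, not values from the disclosure:

```python
def pretraining_loss(l_energy, l_noisy, l_force, alpha: float = 0.5,
                     beta: float = 0.5, target: str = "force_field"):
    """Select a combined pretraining loss in the spirit of Formulas 1-7."""
    if target == "force_field":       # force term included (Formulas 3/4/6/7)
        return l_energy + beta * l_force
    if target == "noisy_downstream":  # noisy term included (Formulas 2/5/6/7)
        return l_energy + alpha * l_noisy
    return l_energy                   # Formula 1 alone
```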
According to an exemplary implementation of the present disclosure, a predetermined stopping condition may be specified so that the pretraining process stops when the pretrained model 240 satisfies that condition. With the exemplary implementations of the present disclosure, the complex pretraining process can be converted into simple mathematical operations implemented based on Formulas 1 to 7. In this way, a pretrained model 240 of high accuracy can be obtained using the public pretraining dataset 410, without preparing dedicated training data.
The specific pretraining process has been described above. After the pretrained model 240 has been obtained, layers 1 through N-1 of the pretrained model 240 may be used directly as the upstream model 220 of the molecular prediction model 210. Further, the downstream model 230 of the molecular prediction model 210 may be determined based on the molecular prediction target 250. Specifically, the downstream model 230 may include one or more network layers. According to an exemplary implementation of the present disclosure, the molecular prediction target 250 may include a molecular force field and/or a molecular property. In this case, the downstream model 230 may be implemented with a single network layer; that is, the downstream model 230 includes only a single output layer. Alternatively and/or additionally, the downstream model 230 may include two or more network layers, in which case the last of the network layers in the downstream model 230 is its output layer.
According to an exemplary implementation of the present disclosure, the upstream model 220 and the downstream model 230 may be connected to obtain the final molecular prediction model 210. It will be understood that the parameters of the upstream model 220 are taken directly from the pretrained model 240, while the parameters of the downstream model 230 may be set to arbitrary initial values and/or values obtained in other ways. According to an exemplary implementation of the present disclosure, random initial values may be used. A downstream task may require a final output layer whose output dimensionality differs from that of pretraining; and even when the dimensionality is the same, randomly initializing the parameters of the output layer usually yields a molecular prediction model 210 of higher accuracy, because fewer biased loss gradients are provided during fine-tuning.
The molecular prediction model 210 may then be treated as a whole and trained with a dedicated dataset associated with the molecular prediction target 250. With the exemplary implementations of the present disclosure, since the upstream model 220 already includes various knowledge about molecules, a molecular prediction model 210 of high accuracy can be obtained using only a small amount of dedicated training data.
Further, more details of training the molecular prediction model 210 are described with reference to FIG. 6. As shown in FIG. 6, the molecular prediction model 210 may be trained with training data 620 in a training dataset 610 such that a loss function 630 associated with the molecular prediction model 210 satisfies a predetermined condition. Here, the training data 620 may include a sample molecular structure 622 and a sample target measurement 624 corresponding to the molecular prediction target 250. Specifically, if the molecular prediction target 250 is a molecular force field, the sample target measurement 624 may be a measured value of the molecular force field; if the molecular prediction target 250 is solubility, the sample target measurement 624 may be a measured value of solubility.
According to an exemplary implementation of the present disclosure, a training dataset 610 corresponding to the molecular prediction target 250 may be obtained; this training dataset 610 may be a dedicated dataset prepared for the molecular prediction target 250 (e.g., through experiments). Compared with the pretraining dataset 410, which includes a large amount of pretraining data (e.g., millions of samples or more), the training dataset 610 usually includes less training data (e.g., thousands of samples or fewer). In this way, a molecular prediction model 210 of higher accuracy can be obtained with limited dedicated training data, without collecting massive amounts of it.
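The fine-tuning loop over the small dedicated dataset can be sketched as below; the epoch count, learning rate, and MSE distance are assumptions for illustration:

```python
import torch

def finetune(prediction_model: torch.nn.Module, loader, epochs: int = 10,
             lr: float = 1e-5) -> torch.nn.Module:
    """Fine-tuning sketch over the dedicated dataset 610: each batch yields
    (featurised structure, target measurement) pairs; few steps are needed
    since the upstream layers already encode molecular knowledge."""
    optimiser = torch.optim.Adam(prediction_model.parameters(), lr=lr)
    distance = torch.nn.MSELoss()
    for _ in range(epochs):
        for structure, target in loader:
            optimiser.zero_grad()
            predicted = prediction_model(structure).squeeze(-1)
            loss = distance(predicted, target)   # loss function 630
            loss.backward()
            optimiser.step()
    return prediction_model
```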
According to an exemplary implementation of the present disclosure, a loss function 630 may be constructed for the molecular prediction model 210. FIG. 7 shows a block diagram 700 of the loss function 630 for the molecular prediction model 210 according to some implementations of the present disclosure. As shown in FIG. 7, the loss function 630 of the molecular prediction model 210 may include a property loss 710, that is, the difference between the sample target measurement 624 and a predicted value of the sample target measurement 624 obtained based on the sample molecular structure 622.
When a molecular property is to be predicted, the property loss 710 may be determined based on Formula 8 below:

$\mathcal{L}_{\text{prop}} = d\big(y, \hat{y}(R, Z')\big)$  (Formula 8)

In Formula 8, $\mathcal{L}_{\text{prop}}$ denotes the property loss 710 of the molecular prediction model 210, y denotes the sample target measurement 624 in the training data 620 (corresponding to molecular structure R), $\hat{y}(R, Z')$ denotes the predicted value obtained based on the molecular structure R and the molecular prediction model 210 (denoted Z'), and d denotes the difference between y and $\hat{y}(R, Z')$. In this way, the loss function 630 can be determined through Formula 8, and fine-tuning can proceed in the direction that minimizes the loss function 630. The complex process of fine-tuning the molecular prediction model 210 is thus converted into simple and effective mathematical operations.
According to an exemplary implementation of the present disclosure, when a molecular force field is to be predicted, the loss function 630 of the molecular prediction model 210 may further include a force-field loss 720. The force-field loss 720 includes the difference between a predetermined gradient and the gradient, with respect to the sample molecular structure 622, of the predicted value of the sample molecular energy obtained based on the sample molecular structure 622. Specifically, the force-field loss 720 may be determined based on Formula 9 below:
$\mathcal{L}_{\text{ff}} = \gamma\, d\big(\nabla_R \hat{y}(R, Z'), F\big)$  (Formula 9)

In Formula 9, $\mathcal{L}_{\text{ff}}$ denotes the force-field loss 720 of the molecular prediction model 210, the other symbols have the same meanings as in the formulas above, and γ denotes a predetermined value in [0, 1]. In this way, the loss function can be determined through Formula 9, converting the complex process of fine-tuning the molecular prediction model 210 into simple and effective mathematical operations. With the exemplary implementations of the present disclosure, the molecular prediction model 210 can be obtained in a more accurate and effective manner.
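A combined Formula 8 + 9 fine-tuning loss can be sketched as below. Using measured reference forces as the predetermined gradient F, and γ = 0.5, are interpretive assumptions for the sketch:

```python
import torch

def force_field_finetune_loss(model: torch.nn.Module, coords: torch.Tensor,
                              target_energy: torch.Tensor,
                              target_forces: torch.Tensor,
                              gamma: float = 0.5) -> torch.Tensor:
    """Target loss on the prediction (Formula 8) plus a gamma-weighted loss
    on its gradient w.r.t. the structure (Formula 9)."""
    coords = coords.clone().requires_grad_(True)
    predicted_energy = model(coords).sum()
    (grad,) = torch.autograd.grad(predicted_energy, coords, create_graph=True)
    energy_term = (predicted_energy - target_energy).pow(2)         # Formula 8
    force_term = gamma * (grad - target_forces).pow(2).mean()       # Formula 9
    return energy_term + force_term
```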
The process for obtaining the molecular prediction model 210 has been described above with reference to the accompanying drawings. With the exemplary implementations of the present disclosure, the pretrained model 240 can be obtained based on a large amount of data from known public datasets. Further, the molecular prediction model 210 can be fine-tuned based on a smaller dedicated training dataset that includes a limited amount of training data. In this way, an effective balance can be struck between training accuracy and the various costs of preparing large amounts of dedicated training data, so that a molecular prediction model 210 of high accuracy is obtained at low cost.
Model application process
The training of the molecular prediction model 210 has been described above; the following describes how to use the molecular prediction model 210 to determine a predicted value associated with the molecular prediction target 250. According to an exemplary implementation of the present disclosure, after the model training stage has been completed, the trained molecular prediction model 210 with trained parameter values can be used to process received input data. If a target molecular structure is received, a predicted value corresponding to the molecular prediction target may be determined based on the molecular prediction model 210.
For example, a target molecular structure to be processed may be input to the molecular prediction model 210. The target molecular structure may be represented in SMILES format or as atomic coordinates. The molecular prediction model 210 then outputs the predicted value corresponding to that target molecular structure. Here, depending on the molecular prediction target 250, the predicted value may include the predicted value of the corresponding target. Specifically, when the molecular prediction model 210 is used to predict a molecular force field, the molecular prediction model 210 may output a predicted value of the molecular force field. In this way, the trained molecular prediction model 210 can achieve high accuracy and provide a basis for subsequent processing operations.
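The application step reduces to a forward pass with gradients disabled; a minimal sketch, assuming the target structure has already been featurised as a tensor:

```python
import torch

@torch.no_grad()
def predict(prediction_model: torch.nn.Module,
            target_structure: torch.Tensor) -> torch.Tensor:
    """Feed a featurised target molecular structure to the trained model
    and return the predicted value for the molecular prediction target."""
    prediction_model.eval()
    return prediction_model(target_structure)
```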
According to an exemplary implementation of the present disclosure, in the application environment of predicting molecular force fields, the prediction results of the molecular prediction model 210 achieved higher accuracy in both in-domain and out-of-domain tests. For example, Table 1 below shows in-domain test data.
Table 1. In-domain test data (molecular force-field prediction error)

  Molecule   sGDML   NequIP   GemNet-T   Improved GemNet-T
  Aspirin    33.0    14.7     12.6       10.2
  ...        ...     ...      ...        ...
In Table 1, the columns correspond to the backbone models on which different prediction models are based, and each row shows the error data of the molecular force-field predictions for a particular molecule. Specifically, the entries in row 2, "Aspirin", indicate: predicting the molecular force field of aspirin with the sGDML model gives a relative error of 33.0; with the NequIP model, 14.7; with the GemNet-T model, 12.6; and with GemNet-T improved according to the method of the present disclosure, 10.2. The relative improvement thus reaches 19.0%. Similarly, the other rows of Table 1 show the corresponding data for force-field predictions of other molecules. As can be seen from Table 1, the exemplary implementations of the present disclosure greatly reduce the error of molecular force-field prediction and provide higher accuracy. Further, the improved GemNet-T also achieved higher accuracy in out-of-domain tests.
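As a check on the reported figure, the relative improvement for aspirin follows directly from the two GemNet-T errors: $(12.6 - 10.2) / 12.6 \approx 0.190$, i.e., 19.0%.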
According to an exemplary implementation of the present disclosure, in the application environment of predicting molecular properties, the molecular prediction model 210 may output a predicted value of solubility. The method of the present disclosure can be used to improve the EGNN model for predicting molecular properties, and the improved EGNN model achieves better prediction performance. It will be understood that although solubility is used above as an example of a molecular property, molecular properties here may cover many aspects of a molecule, for example, solubility, stability, reactivity, polarity, phase, color, magnetism, biological activity, and so on. With the exemplary implementations of the present disclosure, an accurate and reliable molecular prediction model 210 can be obtained using only a small amount of dedicated training data, and the molecular prediction model 210 can be used to predict molecular properties.
Example process
FIG. 8 shows a flowchart of a method 800 for managing molecular predictions according to some implementations of the present disclosure. Specifically, at block 810, an upstream model is obtained from a portion of the network layers of a pretrained model, the pretrained model describing the relationship between molecular structure and molecular energy; at block 820, a downstream model is determined based on a molecular prediction target, the output layer of the downstream model being determined based on the molecular prediction target; and at block 830, a molecular prediction model is generated based on the upstream model and the downstream model, the molecular prediction model describing the relationship between a molecular structure and the molecular prediction target associated with the molecular structure.
According to an exemplary implementation of the present disclosure, obtaining the upstream model includes: obtaining the pretrained model, the pretrained model including a plurality of network layers; and selecting the upstream model from a group of network layers of the plurality of network layers other than the output layer of the pretrained model.
According to an exemplary implementation of the present disclosure, obtaining the pretrained model includes: training the pretrained model with pretraining data in a pretraining dataset such that a loss function associated with the pretrained model satisfies a predetermined condition, the pretraining data including a sample molecular structure and a sample molecular energy.
According to an exemplary implementation of the present disclosure, the loss function includes at least any one of: an energy loss, representing the difference between the sample molecular energy and a predicted value of the sample molecular energy obtained based on the sample molecular structure; an estimated energy loss, representing the difference between the sample molecular energy and a predicted value of the sample molecular energy obtained based on the sample molecular structure, the sample molecular structure being estimated; and a force loss, representing the difference between a predetermined gradient and the gradient, with respect to the sample molecular structure, of the predicted value of the sample molecular energy obtained based on the sample molecular structure.
According to an exemplary implementation of the present disclosure, the molecular prediction target includes at least any one of a molecular property and a molecular force field, and the pretrained model is selected based on the molecular prediction target.
According to an exemplary implementation of the present disclosure, the downstream model includes at least one downstream network layer, and the last of the at least one downstream network layer is the output layer of the downstream model.
According to an exemplary implementation of the present disclosure, generating the molecular prediction model based on the upstream model and the downstream model includes: connecting the upstream model and the downstream model to form the molecular prediction model; and training the molecular prediction model with training data in a training dataset such that the loss function of the molecular prediction model satisfies a predetermined condition, the training data including a sample molecular structure and a sample target measurement corresponding to the molecular prediction target.
According to an exemplary implementation of the present disclosure, the loss function of the molecular prediction model includes the difference between the sample target measurement and a predicted value of the sample target measurement obtained based on the sample molecular structure.
According to an exemplary implementation of the present disclosure, in response to determining that the molecular prediction target is a molecular force field, the loss function of the molecular prediction model further includes: the difference between a predetermined gradient and the gradient, with respect to the sample molecular structure, of the predicted value of the sample molecular energy obtained based on the sample molecular structure.
According to an exemplary implementation of the present disclosure, the method 800 further includes: in response to receiving a target molecular structure, determining, based on the molecular prediction model, a predicted value corresponding to the molecular prediction target.
Example apparatus and device
FIG. 9 shows a block diagram of an apparatus 900 for managing molecular predictions according to some implementations of the present disclosure. The apparatus 900 includes: an acquisition module 910 configured to obtain an upstream model from a portion of the network layers of a pretrained model, the pretrained model describing the relationship between molecular structure and molecular energy; a determination module 920 configured to determine a downstream model based on a molecular prediction target, the output layer of the downstream model being determined based on the molecular prediction target; and a generation module 930 configured to generate a molecular prediction model based on the upstream model and the downstream model, the molecular prediction model describing the relationship between a molecular structure and the molecular prediction target associated with the molecular structure.
According to an exemplary implementation of the present disclosure, the acquisition module 910 includes: a pre-acquisition module configured to obtain the pretrained model, the pretrained model including a plurality of network layers; and a selection module configured to select the upstream model from a group of network layers of the plurality of network layers other than the output layer of the pretrained model.
According to an exemplary implementation of the present disclosure, the pre-acquisition module includes: a pretraining module configured to train the pretrained model with pretraining data in a pretraining dataset such that a loss function associated with the pretrained model satisfies a predetermined condition, the pretraining data including a sample molecular structure and a sample molecular energy.
According to an exemplary implementation of the present disclosure, the loss function includes at least any one of: an energy loss, representing the difference between the sample molecular energy and a predicted value of the sample molecular energy obtained based on the sample molecular structure; an estimated energy loss, representing the difference between the sample molecular energy and a predicted value of the sample molecular energy obtained based on the sample molecular structure, the sample molecular structure being estimated; and a force loss, representing the difference between a predetermined gradient and the gradient, with respect to the sample molecular structure, of the predicted value of the sample molecular energy obtained based on the sample molecular structure.
According to an exemplary implementation of the present disclosure, the molecular prediction target includes at least any one of a molecular property and a molecular force field, and the pretrained model is selected based on the molecular prediction target.
According to an exemplary implementation of the present disclosure, the downstream model includes at least one downstream network layer, and the last of the at least one downstream network layer is the output layer of the downstream model.
According to an exemplary implementation of the present disclosure, the generation module 930 includes: a connection module configured to connect the upstream model and the downstream model to form the molecular prediction model; and a training module configured to train the molecular prediction model with training data in a training dataset such that the loss function of the molecular prediction model satisfies a predetermined condition, the training data including a sample molecular structure and a sample target measurement corresponding to the molecular prediction target.
According to an exemplary implementation of the present disclosure, the loss function of the molecular prediction model includes the difference between the sample target measurement and a predicted value of the sample target measurement obtained based on the sample molecular structure.
According to an exemplary implementation of the present disclosure, in response to determining that the molecular prediction target is a molecular force field, the loss function of the molecular prediction model further includes: the difference between a predetermined gradient and the gradient, with respect to the sample molecular structure, of the predicted value of the sample molecular energy obtained based on the sample molecular structure.
According to an exemplary implementation of the present disclosure, the apparatus 900 further includes: a predicted-value determination module configured to, in response to receiving a target molecular structure, determine, based on the molecular prediction model, a predicted value corresponding to the molecular prediction target.
FIG. 10 shows a block diagram of a device 1000 capable of implementing multiple implementations of the present disclosure. It should be understood that the computing device 1000 shown in FIG. 10 is merely exemplary and should not constitute any limitation on the functionality and scope of the implementations described herein. The computing device 1000 shown in FIG. 10 may be used to implement the method 800 shown in FIG. 8.
As shown in FIG. 10, the computing device 1000 is in the form of a general-purpose computing device. Components of the computing device 1000 may include, but are not limited to, one or more processors or processing units 1010, a memory 1020, a storage device 1030, one or more communication units 1040, one or more input devices 1050, and one or more output devices 1060. The processing unit 1010 may be a real or virtual processor and can perform various processing according to programs stored in the memory 1020. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of the computing device 1000.
The computing device 1000 typically includes multiple computer storage media. Such media may be any available media accessible to the computing device 1000, including, but not limited to, volatile and non-volatile media, and removable and non-removable media. The memory 1020 may be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 1030 may be removable or non-removable media and may include machine-readable media, such as flash drives, magnetic disks, or any other media that can be used to store information and/or data (e.g., training data for training) and that can be accessed within the computing device 1000.
The computing device 1000 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in FIG. 10, a disk drive for reading from or writing to a removable, non-volatile disk (e.g., a "floppy disk") and an optical drive for reading from or writing to a removable, non-volatile optical disc may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces. The memory 1020 may include a computer program product 1025 having one or more program modules configured to perform the various methods or actions of the various implementations of the present disclosure.
The communication unit 1040 enables communication with other computing devices through communication media. Additionally, the functions of the components of the computing device 1000 may be implemented in a single computing cluster or across multiple computing machines capable of communicating over communication connections. Thus, the computing device 1000 may operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another network node.
The input device 1050 may be one or more input devices, such as a mouse, keyboard, trackball, and the like. The output device 1060 may be one or more output devices, such as a display, speakers, printer, and the like. The computing device 1000 may also, as needed, communicate through the communication unit 1040 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable users to interact with the computing device 1000, or with any device (e.g., network card, modem, etc.) that enables the computing device 1000 to communicate with one or more other computing devices. Such communication may be performed via input/output (I/O) interfaces (not shown).
According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided having computer-executable instructions stored thereon, where the computer-executable instructions are executed by a processor to implement the methods described above. According to an exemplary implementation of the present disclosure, a computer program product is also provided that is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, the computer-executable instructions being executed by a processor to implement the methods described above. According to an exemplary implementation of the present disclosure, a computer program product is provided having a computer program stored thereon that, when executed by a processor, implements the methods described above.
Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses, devices, and computer program products implemented according to the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other devices, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other devices to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other devices implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to multiple implementations of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
The implementations of the present disclosure have been described above; the foregoing description is exemplary, not exhaustive, and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The terminology used herein was chosen to best explain the principles of the implementations, their practical applications, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims (20)

  1. A method for managing molecular predictions, comprising:
    obtaining an upstream model from a portion of the network layers of a pretrained model, the pretrained model describing a relationship between molecular structure and molecular energy;
    determining a downstream model based on a molecular prediction target, an output layer of the downstream model being determined based on the molecular prediction target; and
    generating a molecular prediction model based on the upstream model and the downstream model, the molecular prediction model describing a relationship between a molecular structure and a molecular prediction target associated with the molecular structure.
  2. The method of claim 1, wherein obtaining the upstream model comprises:
    obtaining the pretrained model, the pretrained model comprising a plurality of network layers; and
    selecting the upstream model from a group of network layers of the plurality of network layers other than an output layer of the pretrained model.
  3. The method of claim 1 or 2, wherein obtaining the pretrained model comprises: training the pretrained model with pretraining data in a pretraining dataset such that a loss function associated with the pretrained model satisfies a predetermined condition, the pretraining data comprising a sample molecular structure and a sample molecular energy.
  4. The method of claim 3, wherein the loss function comprises at least any one of:
    an energy loss, the energy loss representing a difference between the sample molecular energy and a predicted value of the sample molecular energy obtained based on the sample molecular structure;
    an estimated energy loss, the estimated energy loss representing a difference between the sample molecular energy and a predicted value of the sample molecular energy obtained based on the sample molecular structure, the sample molecular structure being estimated; and
    a force loss, the force loss representing a difference between a predetermined gradient and a gradient, with respect to the sample molecular structure, of the predicted value of the sample molecular energy obtained based on the sample molecular structure.
  5. The method of claim 1 or 2, wherein the molecular prediction target comprises at least any one of: a molecular property and a molecular force field, and the pretrained model is selected based on the molecular prediction target.
  6. The method of claim 5, wherein the downstream model comprises at least one downstream network layer, and a last downstream network layer of the at least one downstream network layer is the output layer of the downstream model.
  7. The method of claim 5, wherein generating the molecular prediction model based on the upstream model and the downstream model comprises:
    connecting the upstream model and the downstream model to form the molecular prediction model; and
    training the molecular prediction model with training data in a training dataset such that a loss function of the molecular prediction model satisfies a predetermined condition, the training data comprising a sample molecular structure and a sample target measurement corresponding to the molecular prediction target.
  8. The method of claim 7, wherein the loss function of the molecular prediction model comprises a difference between the sample target measurement and a predicted value of the sample target measurement obtained based on the sample molecular structure.
  9. The method of claim 8, wherein, in response to determining that the molecular prediction target is the molecular force field, the loss function of the molecular prediction model further comprises: a difference between a predetermined gradient and a gradient, with respect to the sample molecular structure, of a predicted value of the sample molecular energy obtained based on the sample molecular structure.
  10. The method of claim 1 or 2, further comprising: in response to receiving a target molecular structure, determining, based on the molecular prediction model, a predicted value corresponding to the molecular prediction target.
  11. An apparatus for managing molecular predictions, comprising:
    an acquisition module configured to obtain an upstream model from a portion of the network layers of a pretrained model, the pretrained model describing a relationship between molecular structure and molecular energy;
    a determination module configured to determine a downstream model based on a molecular prediction target, an output layer of the downstream model being determined based on the molecular prediction target; and
    a generation module configured to generate a molecular prediction model based on the upstream model and the downstream model, the molecular prediction model describing a relationship between a molecular structure and a molecular prediction target associated with the molecular structure.
  12. The apparatus of claim 11, wherein the acquisition module comprises:
    a pre-acquisition module configured to obtain the pretrained model, the pretrained model comprising a plurality of network layers; and
    a selection module configured to select the upstream model from a group of network layers of the plurality of network layers other than an output layer of the pretrained model.
  13. The apparatus of claim 11 or 12, wherein the pre-acquisition module comprises: a pretraining module configured to train the pretrained model with pretraining data in a pretraining dataset such that a loss function associated with the pretrained model satisfies a predetermined condition, the pretraining data comprising a sample molecular structure and a sample molecular energy.
  14. The apparatus of claim 13, wherein the loss function comprises at least any one of:
    an energy loss, the energy loss representing a difference between the sample molecular energy and a predicted value of the sample molecular energy obtained based on the sample molecular structure;
    an estimated energy loss, the estimated energy loss representing a difference between the sample molecular energy and a predicted value of the sample molecular energy obtained based on the sample molecular structure, the sample molecular structure being estimated; and
    a force loss, the force loss representing a difference between a predetermined gradient and a gradient, with respect to the sample molecular structure, of the predicted value of the sample molecular energy obtained based on the sample molecular structure.
  15. The apparatus of claim 11 or 12, wherein the molecular prediction target comprises at least any one of: a molecular property and a molecular force field, and the pretrained model is selected based on the molecular prediction target, wherein the downstream model comprises at least one downstream network layer, and a last downstream network layer of the at least one downstream network layer is the output layer of the downstream model.
  16. The apparatus of claim 15, wherein the generation module comprises:
    a connection module configured to connect the upstream model and the downstream model to form the molecular prediction model; and
    a training module configured to train the molecular prediction model with training data in a training dataset such that a loss function of the molecular prediction model satisfies a predetermined condition, the training data comprising a sample molecular structure and a sample target measurement corresponding to the molecular prediction target.
  17. The apparatus of claim 16, wherein the loss function of the molecular prediction model comprises a difference between the sample target measurement and a predicted value of the sample target measurement obtained based on the sample molecular structure,
    wherein, in response to determining that the molecular prediction target is the molecular force field, the loss function of the molecular prediction model further comprises: a difference between a predetermined gradient and a gradient, with respect to the sample molecular structure, of a predicted value of the sample molecular energy obtained based on the sample molecular structure.
  18. The apparatus of claim 11 or 12, further comprising: a predicted-value determination module configured to, in response to receiving a target molecular structure, determine, based on the molecular prediction model, a predicted value corresponding to the molecular prediction target.
  19. An electronic device, comprising:
    at least one processing unit; and
    at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method of any one of claims 1 to 10.
  20. A computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, causing the processor to implement the method of any one of claims 1 to 10.
PCT/CN2023/089548 2022-05-13 2023-04-20 Method, apparatus, device, and medium for managing molecular prediction WO2023216834A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210524875.6 2022-05-13
CN202210524875.6A CN114944204A (zh) 2022-05-13 2022-05-13 Method, apparatus, device, and medium for managing molecular prediction

Publications (1)

Publication Number Publication Date
WO2023216834A1 true WO2023216834A1 (zh) 2023-11-16

Family

ID=82907180

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/089548 WO2023216834A1 (zh) 2023-04-20 2022-05-13 Method, apparatus, device, and medium for managing molecular prediction

Country Status (2)

Country Link
CN (1) CN114944204A (zh)
WO (1) WO2023216834A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114944204A (zh) * 2022-05-13 2022-08-26 北京字节跳动网络技术有限公司 用于管理分子预测的方法、装置、设备和介质


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106575320A (zh) * 2014-05-05 2017-04-19 Atomwise Inc. Binding affinity prediction system and method
US20190272468A1 (en) * 2018-03-05 2019-09-05 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Spatial Graph Convolutions with Applications to Drug Discovery and Molecular Simulation
CN113255770A (zh) * 2021-05-26 2021-08-13 Beijing Baidu Netcom Science and Technology Co., Ltd. Compound attribute prediction model training method and compound attribute prediction method
CN113971992A (zh) * 2021-10-26 2022-01-25 University of Science and Technology of China Self-supervised pretraining method and system for graph networks for molecular attribute prediction
CN114944204A (zh) * 2022-05-13 2022-08-26 Beijing ByteDance Network Technology Co., Ltd. Method, apparatus, device, and medium for managing molecular prediction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Hands-On Machine Learning with Scikit-Learn and TensorFlow", 31 October 2020, O'REILLY MEDIA INC, CN, ISBN: 711553621X, article AURÉLIEN GÉRON: "Reusing Pretrained Layers", pages: 305 - 307, XP009550230 *

Also Published As

Publication number Publication date
CN114944204A (zh) 2022-08-26

Similar Documents

Publication Publication Date Title
Shao et al. Online multi-view clustering with incomplete views
JP7471736B2 (ja) Method and system for estimating the ground-state energy of a quantum system
JP2023134499A (ja) Robust training in the presence of label noise
Lee et al. Generalized leverage score sampling for neural networks
US20210150412A1 (en) Systems and methods for automated machine learning
US20140067342A1 (en) Particle tracking in biological systems
WO2023216834A1 (zh) Method, apparatus, device, and medium for managing molecular prediction
Sewell et al. Large-scale compute-intensive analysis via a combined in-situ and co-scheduling workflow approach
JP6381962B2 (ja) シミュレーションシステム及び方法と該システムを含むコンピュータシステム
Zhang et al. Autoassist: A framework to accelerate training of deep neural networks
Li et al. Data-augmented turbulence modeling by reconstructing Reynolds stress discrepancies for adverse-pressure-gradient flows
Chuang et al. Infoot: Information maximizing optimal transport
Geng et al. Scalable semi-supervised svm via triply stochastic gradients
Sun et al. A stagewise hyperparameter scheduler to improve generalization
RU2715024C1 (ru) Способ отладки обученной рекуррентной нейронной сети
Hornsby et al. Gaussian process regression models for the properties of micro-tearing modes in spherical tokamaks
WO2019211437A1 (en) Computational efficiency in symbolic sequence analytics using random sequence embeddings
JP2017538226A (ja) Scalable web data extraction
US20210110038A1 (en) Method and apparatus to identify hardware performance counter events for detecting and classifying malware or workload using artificial intelligence
CN111724487B (zh) Flow field data visualization method, apparatus, device, and storage medium
Amarloo et al. Progressive augmentation of turbulence models for flow separation by multi-case computational fluid dynamics driven surrogate optimization
Chadda et al. Engineering an intelligent essay scoring and feedback system: An experience report
Lee et al. Artificial Intelligence for Scientific Discovery at High-Performance Computing Scales
Xu et al. X2-Softmax: Margin Adaptive Loss Function for Face Recognition
Cuzzocrea Multidimensional Clustering over Big Data: Models, Issues, Analysis, Emerging Trends

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23802609

Country of ref document: EP

Kind code of ref document: A1