WO2023115343A1 - Data processing method and apparatus, model training method and free energy prediction method - Google Patents

Data processing method and apparatus, model training method and free energy prediction method

Info

Publication number
WO2023115343A1
WO2023115343A1 PCT/CN2021/140134 CN2021140134W WO2023115343A1 WO 2023115343 A1 WO2023115343 A1 WO 2023115343A1 CN 2021140134 W CN2021140134 W CN 2021140134W WO 2023115343 A1 WO2023115343 A1 WO 2023115343A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
sub
node
processing result
solvation
Prior art date
Application number
PCT/CN2021/140134
Other languages
French (fr)
Chinese (zh)
Inventor
付文博
曾群
Original Assignee
深圳晶泰科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳晶泰科技有限公司 filed Critical 深圳晶泰科技有限公司
Priority to PCT/CN2021/140134 priority Critical patent/WO2023115343A1/en
Publication of WO2023115343A1 publication Critical patent/WO2023115343A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model

Definitions

  • the present application relates to the technical field of computer simulation, in particular to a data processing method, device, model training method and free energy prediction method.
  • this application provides a data processing method, device, model training method and free energy prediction method, which can effectively improve the accuracy of the obtained molecular solvation free energy.
  • the first aspect of the present application provides a data processing method, including: obtaining data to be processed, the data to be processed including respective attribute information of a plurality of atoms in a target molecule; in response to the respective attribute information of the plurality of atoms, generating a node set and a node position set for the target molecule, wherein the nodes in the node set respectively represent atoms of specific atom types, and the node position set includes coordinate information of each node in the node set in a specific coordinate system; generating a node scalar feature N_s and a node vector feature N_v for the node set, and generating an edge scalar feature E_s and an edge vector feature E_v for the node set based on the coordinate information of each node in the node position set; and constructing a virtual molecular graph based on the node scalar feature N_s, node vector feature N_v, edge scalar feature E_s and edge vector feature E_v for the node set, to determine the molecular feature X of the target molecule.
  • the second aspect of the present application provides a method for training a solvation free energy prediction model, including: inputting the virtual molecular graph determined based on the above method into the solvation free energy prediction model, and adjusting the model parameters so that the loss function converges, to obtain a trained solvation free energy prediction model, wherein the virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.
  • the third aspect of the present application provides a method for determining the free energy of solvation, comprising: processing a virtual molecular graph with a trained solvation free energy prediction model to obtain the solvation free energy for the virtual molecular graph, wherein the virtual molecular graph is a graph generated based on the data to be processed, the data to be processed includes attribute information for multiple atoms in the target molecule, and the target molecule includes solute molecules and/or solvent molecules.
  • the fourth aspect of the present application provides a design method, including: determining the free energy of solvation according to the above-mentioned method; performing drug design or material design based on the free energy of solvation.
  • the fifth aspect of the present application provides a data processing device, including: a to-be-processed data obtaining module, used to obtain data to be processed, the data to be processed including attribute information for each of multiple atoms in the target molecule; a set generation module, used to generate a node set and a node position set for the target molecule in response to the respective attribute information of the multiple atoms, wherein the nodes in the node set respectively represent atoms of specific atom types, and the node position set includes coordinate information of each node in the node set in a specific coordinate system; a node and edge feature generation module, used to generate the node scalar feature N_s and node vector feature N_v for the node set, and to generate the edge scalar feature E_s and edge vector feature E_v for the node set based on the coordinate information of each node in the node position set; and a virtual molecular graph construction module, used to construct a virtual molecular graph to determine the molecular feature X of the target molecule.
  • the sixth aspect of the present application provides a device for training a solvation free energy prediction model, including: a model training module, used to input the virtual molecular graph determined based on the above method into the solvation free energy prediction model and to adjust the model parameters so that the loss function converges, obtaining a trained solvation free energy prediction model, wherein the virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.
  • the seventh aspect of the present application provides a device for determining the free energy of solvation, including: a free energy prediction module, used to process a virtual molecular graph using a trained solvation free energy prediction model to obtain the solvation free energy for the virtual molecular graph, wherein the virtual molecular graph is a graph generated based on the data to be processed, the data to be processed includes attribute information for a plurality of atoms in the target molecule, and the target molecule includes solute molecules and/or solvent molecules.
  • the eighth aspect of the present application provides a design device, the device includes: a solvation free energy determination module, used to determine the solvation free energy according to the above method; and a design module, used to perform drug design or material design based on the solvation free energy.
  • a ninth aspect of the present application provides an electronic device, including: a processor; and a memory, on which executable code is stored, and when the executable code is executed by the processor, the processor is made to execute the above method.
  • the tenth aspect of the present application also provides a computer-readable storage medium, on which executable codes are stored, and when the executable codes are executed by a processor of an electronic device, the processor is made to execute the above method.
  • the eleventh aspect of the present application further provides a computer program product, including executable codes, and the above method is implemented when the executable codes are executed by a processor.
  • the data processing method, device, model training method and free energy prediction method provided by the present application convert the data to be processed into a node set and a node position set for the target molecule, so that the node scalar feature N_s and node vector feature N_v for the node set can be generated, and the edge scalar feature E_s and edge vector feature E_v for the node set can be generated based on the coordinate information of each node in the node position set. Compared with the low-dimensional descriptors in the related art, these descriptors, which can represent the three-dimensional features of molecules, characterize the target molecule more completely and effectively improve the accuracy of the determined solvation free energy.
  • the solvent-solute interaction is described by the matrix product of the solute molecule feature vector and the solvent molecule feature vector, which describes the solvent-solute interaction in a more explicit form and effectively improves the accuracy of the determined solvation free energy.
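  • As a minimal sketch of describing the interaction through a matrix product, the snippet below multiplies a solute feature matrix by the transpose of a solvent feature matrix. The function name `interaction_map`, the N × F layout of both inputs and the toy numbers are assumptions made for illustration; the application does not spell out the exact shapes at this point.

```python
def interaction_map(X_solute, X_solvent):
    """Matrix product X_solute @ X_solvent^T.

    X_solute:  list of N_solute rows, each with F features (assumed layout)
    X_solvent: list of N_solvent rows, each with F features (assumed layout)
    Returns an N_solute x N_solvent matrix of pairwise feature products.
    """
    return [
        [sum(a * b for a, b in zip(u, v)) for v in X_solvent]
        for u in X_solute
    ]


# Toy features: two solute atoms, one solvent atom, F = 2.
X_solute = [[1.0, 2.0], [0.0, 1.0]]
X_solvent = [[3.0, 4.0]]
I = interaction_map(X_solute, X_solvent)
```

Each entry of the result pairs one solute atom with one solvent atom, which is one way a matrix product can expose pairwise solvent-solute terms instead of concatenating or summing the two feature sets.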
  • FIG. 1 schematically shows an exemplary system architecture to which a data processing method, device, model training method and prediction free energy method can be applied according to an embodiment of the present application;
  • Fig. 2 schematically shows a flow chart of a data processing method according to an embodiment of the present application
  • FIG. 3 schematically shows a flow chart of a method for determining molecular characteristics of a target molecule based on a virtual molecular map according to an embodiment of the present application
  • Fig. 4 schematically shows a logic diagram for updating node scalar features and node vector features based on a virtual molecular graph according to an embodiment of the present application
  • FIG. 5 schematically shows a flow chart of another data processing method according to an embodiment of the present application.
  • FIG. 6 schematically shows a flow chart of a method for training a solvation free energy prediction model according to an embodiment of the present application
  • FIG. 7 schematically shows a schematic structural diagram of an equivariant graph convolutional network according to an embodiment of the present application.
  • FIG. 8 schematically shows a schematic structural diagram of a fully connected network according to an embodiment of the present application.
  • Fig. 9 schematically shows a flowchart of a method for determining the free energy of solvation according to an embodiment of the present application
  • Fig. 10 schematically shows a correlation diagram between the solvation free energy predicted by the model and the real solvation free energy on the training set, for a model trained on a data set divided by solvent type, according to an embodiment of the present application;
  • Fig. 11 schematically shows a correlation diagram between the solvation free energy predicted by the model and the real solvation free energy on the test set, for a model trained on a data set divided by solvent type, according to an embodiment of the present application;
  • Fig. 12 schematically shows a correlation diagram between the solvation free energy predicted by the model and the real solvation free energy on the training set, for a model trained on a data set divided by solute type, according to an embodiment of the present application;
  • Fig. 13 schematically shows a correlation diagram between the solvation free energy predicted by the model and the real solvation free energy on the test set, for a model trained on a data set divided by solute type, according to an embodiment of the present application;
  • Fig. 14 schematically shows a flow chart of a design method according to an embodiment of the present application.
  • Fig. 15 schematically shows a block diagram of a data processing device according to an embodiment of the present application.
  • Fig. 16 schematically shows a block diagram of a device for training a solvation free energy prediction model according to an embodiment of the present application
  • Fig. 17 schematically shows a block diagram of a device for determining the free energy of solvation according to an embodiment of the present application
  • Fig. 18 schematically shows a block diagram of a design device according to an embodiment of the present application.
  • Fig. 19 schematically shows a block diagram of an electronic device according to an embodiment of the present application.
  • although the terms first, second, third and so on may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another.
  • first information may also be called second information, and similarly, second information may also be called first information.
  • a feature defined as “first” and “second” may explicitly or implicitly include one or more of these features.
  • “plurality” means two or more, unless otherwise specifically defined.
  • a molecular descriptor is a representation of a molecule as a data structure that a computer program can process.
  • a virtual molecular graph is a molecular descriptor that represents atoms as nodes and the relationships between atoms as edges; unlike ordinary molecular graphs, which establish edges based on the bonding information between atoms, virtual molecular graphs establish edges based on a cut-off radius.
  • cut-off radius: for a given atom in a molecule, if edges were established between that atom and all other atoms, the number of edges would be too large and the computation too expensive. Since atoms farther away have less influence on the atom, a cut-off radius is chosen, and the atom is only allowed to establish edges with atoms whose distance from it is smaller than the cut-off radius; interactions with atoms outside the cut-off radius are ignored.
  • a coordinate system is a reference for describing the position and attitude of an object, so it is also called a frame of reference or reference system.
  • the coordinate system may be a coordinate system created during simulation, such as Cartesian coordinates.
  • Coordinates are used to represent the absolute position of an object in a specific coordinate system. In mathematics, a coordinate is essentially an ordered tuple of numbers.
  • Solvation is a reaction process driven by the interaction between solute molecules and solvent molecules. It is a key step in drug research and development processes such as crystal nucleation, chemical reactions, drug metabolism, drug interactions and drug-receptor interactions.
  • the strength of solvation is usually characterized by the free energy of solvation, so it is of great significance to quickly and accurately predict the free energy of solvation in the field of drug development.
  • the solvation free energy prediction process in the related art can be realized through two paths.
  • the other is based on the existing experimental and calculation data, using machine learning methods to construct a structure-solvation free energy model.
  • This method can quickly predict the free energy of solvation, but in the process of practice, the applicant found that in some cases the accuracy of the predicted free energy of solvation is not enough to solve the problems related to solvation.
  • the reasons for the low accuracy of prediction results obtained with machine learning methods include the following two aspects. First, machine learning methods in the related art express molecules as low-dimensional descriptors such as SMILES, MACCS, Morgan and hybrid fingerprints, which cannot fully represent the three-dimensional characteristics of molecules. Second, when describing the solvent-solute interaction, the models established by these methods simply splice, arrange or sum the molecular features of the solvent and solute molecules, without explicitly describing the solvent-solute interaction in a physically meaningful framework.
  • a data processing method, device, model training method and prediction free energy method of the embodiments of the present application will be described in detail below with reference to FIGS. 1 to 19 .
  • FIG. 1 schematically shows an exemplary system architecture to which a data processing method, an apparatus, a model training method and a free energy prediction method can be applied according to an embodiment of the present application.
  • Figure 1 is only an example of the system architecture to which the embodiment of the present application can be applied, to help those skilled in the art understand the technical content of the present application, but it does not mean that the embodiment of the present application cannot be used in other device, system, environment or scenario.
  • a system architecture 100 may include terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
  • the network 104 is used as a medium for providing communication links between the terminal devices 101 , 102 , 103 and the server 105 .
  • Network 104 may include various connection types, such as wires, wireless communication links, or fiber optic cables, among others.
  • Users can use terminal devices 101, 102, 103 to interact with other terminal devices and the server 105 through the network 104 to receive or send information, such as sending model training requests and free energy prediction requests, and receiving model training results, solvation free energies, etc.
  • Terminal devices 101, 102, and 103 can be installed with various communication client applications, for example, drug development applications, material design applications, web browser applications, database applications, search applications, instant messaging tools, email clients, social platforms software and other applications.
  • Terminal devices 101, 102, and 103 include, but are not limited to, smart desktop computers, tablet computers, laptop computers, and other electronic devices that can support functions such as surfing the Internet, modeling, analysis and calculation, and design.
  • the server 105 can receive model training requests, solvation free energy requests, etc., adjust model parameters, store model topology, model parameters, predict solvation free energy, etc., and can also send solvation free energy to terminal devices 101, 102, 103.
  • the server 105 may be a background management server, a server cluster, and the like.
  • the numbers of terminal devices, networks and servers are only illustrative. According to implementation requirements, there can be any number of terminal devices, networks and servers.
  • Fig. 2 schematically shows a flowchart of a data processing method according to an embodiment of the present application.
  • this embodiment provides a method for data processing, the method includes operation S210 to operation S240, specifically as follows:
  • data to be processed is obtained, and the data to be processed includes property information for each of a plurality of atoms in the target molecule.
  • the data to be processed may be a character string.
  • Property information can be used to characterize properties of the target molecule and at least some of the atoms in the target molecule.
  • the attribute includes but not limited to: spatial position attribute, molecule type, atom type and so on.
  • the spatial position attribute can be coordinates in Cartesian coordinate system or polar coordinate system.
  • Molecular species may include solute molecules, solvent molecules.
  • the atomic species can be determined from the number of protons and/or neutrons in the atom. For example, protium, deuterium, and tritium can be considered to be the same atomic species or different atomic species.
  • the data to be processed may be three-dimensional conformations of solute molecules and solvent molecules represented by strings in x, y, z format.
  • a string can include the three-dimensional conformation of a molecule and the x, y, and z coordinates of each atom in the molecule.
  • the same coordinate system can be used in the calculation of molecular solvation free energy and in the process of drug design.
  • some or all of the geometric center coordinates of the solute molecule, the atomic coordinates in the solute molecule, the geometric center coordinates of the solvent molecule, and the atomic coordinates in the solvent molecule are coordinates in the same coordinate system. It should be understood that some or all of these coordinates may also be coordinates in different coordinate systems, provided that coordinates in each coordinate system can be converted to each other.
  • a node set and a node position set for the target molecule are generated, wherein the nodes in the node set respectively represent atoms of specific atom types, and the node position set includes the coordinate information of each node in the node set in a specific coordinate system.
  • node set $\{A_i\}$ and node position set $\{(x_i, y_i, z_i)\}$, where $i = 1, 2, \ldots, N$, $N$ represents the number of atoms contained in the molecule, and $A$ represents the type of atoms.
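  • The conversion from a coordinate string to the node set and node position set can be sketched as follows. The one-atom-per-line "element x y z" layout, the function name `parse_atoms` and the water geometry are assumptions chosen for illustration, not the application's own input format.

```python
def parse_atoms(text):
    """Split an 'element x y z' string into a node set and a node position set."""
    node_set, node_positions = [], []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that are not 'element x y z' (assumed layout)
        symbol, x, y, z = parts
        node_set.append(symbol)                                   # A_i
        node_positions.append((float(x), float(y), float(z)))     # (x_i, y_i, z_i)
    return node_set, node_positions


# Illustrative water-like geometry in angstroms (made-up values).
water = """O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469"""
nodes, positions = parse_atoms(water)
```

Here `nodes` plays the role of the node set and `positions` the role of the node position set, with both lists indexed by the same atom index i.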
  • node scalar features N_s and node vector features N_v for the node set are generated, and edge scalar features E_s and edge vector features E_v for the node set are generated based on coordinate information of each node in the node position set.
  • the target molecule includes N atoms, and multiple nodes in the node set each have F-dimensional features.
  • the dimension of the node scalar feature N_s is N × F × 1, the dimension of the node vector feature N_v is N × F × 3, the dimension of the edge scalar feature E_s is N × 1 × 1, and the dimension of the edge vector feature E_v is N × 3 × 1.
  • the feature dimension F can be set; for example, the value of F is not less than 64.
  • the value of F is an integer power of 2.
  • according to the node type (such as the atom type), the elements in the node set are embedded and encoded, so that the node set is expressed as an N × F × 1-dimensional matrix representing the node scalar feature N_s; at the same time, an N × F × 3-dimensional matrix is initialized to represent the node vector feature N_v.
  • embedding coding may refer to randomly expressing nodes as an F-dimensional vector, and this vector will be updated during subsequent model training.
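  • The embedding step can be sketched in plain Python as below. The helper name `init_node_features`, the Gaussian initialisation and the all-zero start for the node vector feature are assumptions; the application only states that each node type is randomly expressed as an F-dimensional vector and that an N × F × 3 matrix is initialised.

```python
import random


def init_node_features(node_set, F=64, seed=0):
    """Return (N_s, N_v) with shapes N x F x 1 and N x F x 3 as nested lists."""
    rng = random.Random(seed)
    table = {}  # one random F-dimensional vector per atom type
    for atom_type in node_set:
        if atom_type not in table:
            table[atom_type] = [rng.gauss(0.0, 1.0) for _ in range(F)]
    # Node scalar feature: look up the embedding and add a trailing axis of size 1.
    N_s = [[[value] for value in table[atom_type]] for atom_type in node_set]
    # Node vector feature: zero initialisation is an assumption of this sketch.
    N_v = [[[0.0, 0.0, 0.0] for _ in range(F)] for _ in node_set]
    return N_s, N_v


N_s, N_v = init_node_features(["O", "H", "H"], F=4)
```

Atoms of the same type share one embedding row, so the two hydrogen nodes start from identical scalar features; in a trainable model the table entries would be parameters updated during training.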
  • the set of edge position vectors is shown in formula (2).
  • the set of edge distances is expressed as an E × 1 × 1-dimensional matrix to represent the edge scalar feature E_s, and the set of edge position vectors is expressed as an E × 3 × 1-dimensional matrix to represent the edge vector feature E_v.
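  • A minimal sketch of building the two edge features from node coordinates follows. The function name `edge_features` and the direction of the position vector (from node i to node j) are assumptions, since formulas (1) and (2) are not reproduced in this text.

```python
import math


def edge_features(positions, edges):
    """Return (E_s, E_v) with shapes E x 1 x 1 and E x 3 x 1 as nested lists."""
    E_s, E_v = [], []
    for i, j in edges:
        # Position vector of the edge, taken as r_j - r_i (assumed convention).
        delta = [positions[j][k] - positions[i][k] for k in range(3)]
        distance = math.sqrt(sum(c * c for c in delta))
        E_s.append([[distance]])          # 1 x 1 entry per edge
        E_v.append([[c] for c in delta])  # 3 x 1 entry per edge
    return E_s, E_v


positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
E_s, E_v = edge_features(positions, [(0, 1)])
```

The distance is the norm of the same position vector, so the scalar and vector edge features stay consistent with each other for every edge.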
  • the adjacent node set $\{N_i\}$ that meets the requirement of the truncation radius (cut-off radius) can be selected from the above node set $\{A_i\}$, and the edge set for the adjacent node set $\{N_i\}$ is then determined.
  • the above method may further include the following operations.
  • the truncation radius r_cut is determined.
  • the truncation radius r_cut can be a preset value, and the preset value can be a value determined based on expert experience or simulation results, such as 3 angstroms, etc.
  • the template node set $N_i$ is the adjacent node set $\{N_i\}$.
  • the truncation radius r_cut can be set first, and for each element i in the node set, the adjacent node set of element i within the truncation radius is determined, as shown in formula (3).
  • edges are established between element i and all nodes j in the adjacent node set to form an edge set $\{(i, j)\}$, where $i = 1, 2, \ldots, N$, $j \in A_i$, and the number of edges is denoted as E.
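  • The cut-off based edge construction can be sketched as a brute-force neighbour search. The function name `build_edges` and the choice to store both directions (i, j) and (j, i) are assumptions of this sketch; the strict "smaller than" comparison follows the definition of the cut-off radius given earlier.

```python
import math


def build_edges(positions, r_cut):
    """Return the edge set {(i, j)} for all pairs closer than r_cut."""
    edges = []
    n = len(positions)
    for i in range(n):
        for j in range(n):
            # An atom never forms an edge with itself; atoms at or beyond
            # the cut-off radius are ignored.
            if i != j and math.dist(positions[i], positions[j]) < r_cut:
                edges.append((i, j))
    return edges


positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
edges = build_edges(positions, r_cut=3.0)
E = len(edges)  # number of edges
```

The double loop costs O(N²) distance evaluations, which is acceptable for small molecules; a cell list or k-d tree would be the usual replacement for larger systems.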
  • generating the edge scalar feature and edge vector feature for the node set based on the coordinate information of each node in the node position set includes:
  • when the adjacent node set $\{N_i\}$ is selected from the node set $\{A_i\}$, the dimensions of the edge scalar feature E_s and the edge vector feature E_v change correspondingly.
  • the target node set includes E nodes, and each of the E nodes has F-dimensional features.
  • the dimension of the node scalar feature N_s is N × F × 1, the dimension of the node vector feature N_v is N × F × 3, the dimension of the edge scalar feature E_s is E × 1 × 1, and the dimension of the edge vector feature E_v is E × 3 × 1.
  • the virtual molecular graph is composed of the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v.
  • the method for determining the molecular feature X can use a variety of related technologies, such as using a method similar to the method of extracting the molecular feature X based on a molecular map.
  • the virtual molecular map includes the scalar and vector information of atoms, as well as the scalar and vector information between atoms. It is a universal and accurate descriptor that can effectively improve the accuracy of the determined molecular solvation free energy.
  • the three-dimensional information of molecules is expressed as a virtual molecular graph.
  • different molecular conformations can also be strictly distinguished. Compared with two-dimensional descriptors, the virtual molecular graph describes molecules more accurately.
  • Fig. 3 schematically shows a flowchart of a method for determining molecular features of a target molecule based on a virtual molecular map according to an embodiment of the present application.
  • the process of determining the molecular characteristics of the target molecule based on the virtual molecular map may include operation S310 to operation S340.
  • the node scalar feature N_s and the node vector feature N_v are updated based on the virtual molecular graph, and the updated node scalar feature New_N_s and the updated node vector feature New_N_v are obtained.
  • the updated node scalar feature New_N_s and the updated node vector feature New_N_v are used as the current node scalar feature Now_N_s and the current node vector feature Now_N_v, respectively.
  • an updated virtual molecular graph is constructed using the current node scalar feature Now_N_s, the current node vector feature Now_N_v, the edge scalar feature E_s, and the edge vector feature E_v.
  • the updated node scalar feature New_N_s and the updated node vector feature New_N_v are updated based on the updated virtual molecular graph.
  • Operation S320 to operation S340 are repeatedly performed until the specified number of cycles num_conv is reached, and the updated node scalar feature New_N_s obtained when the specified number of cycles num_conv is reached is used as the molecular feature X.
  • the number of convolutional layers num_conv can be set.
  • For the input virtual molecular graph (of a solvent or solute molecule), E_s and E_v are kept unchanged, N_s and N_v are updated with New_N_s and New_N_v respectively, and this is iterated num_conv times; the last dimension of the New_N_s obtained in the num_conv-th iteration is then compressed, converting it into an N × F matrix that represents the molecular feature X.
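  • The iteration described above can be sketched as a small driver loop. The function name `molecular_feature` and the placeholder `toy_update` are hypothetical; `toy_update` merely stands in for the real node update so that the loop structure and the final N × F × 1 to N × F compression can be shown.

```python
def molecular_feature(N_s, N_v, E_s, E_v, update, num_conv):
    """Apply `update` num_conv times, then squeeze N_s from N x F x 1 to N x F."""
    for _ in range(num_conv):
        # Edge features are passed in unchanged on every iteration.
        N_s, N_v = update(N_s, N_v, E_s, E_v)
    # Compress the trailing dimension of size 1 to obtain the molecular feature X.
    return [[channel[0] for channel in node] for node in N_s]


def toy_update(N_s, N_v, E_s, E_v):
    """Placeholder update (hypothetical): add 1 to every scalar channel."""
    new_N_s = [[[channel[0] + 1.0] for channel in node] for node in N_s]
    return new_N_s, N_v


N_s = [[[1.0], [2.0]]]                         # N = 1, F = 2
N_v = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
X = molecular_feature(N_s, N_v, E_s=[], E_v=[], update=toy_update, num_conv=3)
```

Passing the update as a parameter keeps the loop independent of how a single convolution layer is implemented.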
  • Fig. 4 schematically shows a logic diagram for updating node scalar features and node vector features based on a virtual molecular graph according to an embodiment of the present application.
  • the update process can be composed of four basic operations: a matrix linear transformation operation (Linear), an activation operation (such as ReLU), a matrix corresponding multiplication operation (such as the Hadamard product), and a matrix sum operation (Sum).
  • these four basic operations have mature implementations in program frameworks such as PyTorch.
  • the matrix linear transformation operation transforms the input into a feature space, extracts useful information in the input and retains it.
  • the activation operation (such as ReLU) is a nonlinear mapping that endows the network with nonlinear expressiveness.
  • the matrix corresponding multiplication operation multiplies the corresponding elements of two matrices with the same dimensions (an element-wise product), which plays the role of feature scaling.
  • the matrix sum operation (Sum) adds the corresponding elements of two matrices with the same dimensions, which plays the role of feature fusion.
  • the inner product operation is the inner product of two vectors, which converts the vector information into a scalar.
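  • The basic operations listed above can be written out on plain Python lists as a minimal sketch. The function names and the flat-vector representation are assumptions for illustration; a framework such as PyTorch would apply the same operations to batched tensors.

```python
def linear(x, W, b):
    """Linear transformation y = W x + b for a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x):
    """Activation: negative entries are set to zero."""
    return [max(0.0, v) for v in x]


def hadamard(a, b):
    """Corresponding (element-wise) multiplication of two same-sized vectors."""
    return [p * q for p, q in zip(a, b)]


def add(a, b):
    """Element-wise sum of two same-sized vectors."""
    return [p + q for p, q in zip(a, b)]


def inner(u, v):
    """Inner product: turns two vectors into one scalar."""
    return sum(p * q for p, q in zip(u, v))


h = relu(linear([1.0, 2.0], [[1.0, 0.0], [1.0, 1.0]], [0.0, -5.0]))
scaled = hadamard([1.0, 2.0], [3.0, 4.0])
fused = add([1.0, 2.0], [3.0, 4.0])
s = inner([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```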
  • updating the node scalar feature N_s and the node vector feature N_v based on the virtual molecular graph above to obtain the updated node scalar feature New_N_s and the updated node vector feature New_N_v may include the following process.
  • the operation of obtaining the first sub-processing result Q1 realizes the extraction of useful information in N_s and nonlinear mapping to the feature space.
  • the operation of obtaining the second sub-processing result Q2 realizes extracting useful information in E_s and linearly mapping it to the feature space.
  • the operation of obtaining the third sub-processing result Q3 uses the scalar feature of the edge to scale the feature of the node, and integrates the information of the edge into the node, so that the feature of the node is more expressive.
  • the second matrix corresponding multiplication operation is performed to obtain the fourth sub-processing result Q4, and the third matrix corresponding multiplication operation is performed based on the third sub-processing result Q3 and the edge vector feature E_v to obtain the fifth sub-processing result Q5.
  • the operation of obtaining the fourth sub-processing result Q4 uses the vector feature of the node to scale the feature of the node, and integrates the vector information into the node to make the feature of the node more expressive.
  • the operation of obtaining the fifth sub-processing result Q5 uses the vector feature of the edge to scale the feature of the node, and integrates the vector information of the edge into the node, so that the feature of the node is more expressive.
  • the operation of obtaining the sixth sub-processing result Q6 realizes the fusion of vector features of nodes and edges.
  • the sixth sub-processing result Q6 is respectively subjected to the fourth linear operation and the fifth linear operation to obtain the seventh sub-processing result Q7 and the eighth sub-processing result Q8.
  • the seventh sub-processing result Q7 is a vector feature, which is used to interact with a scalar feature.
  • the eighth sub-processing result Q8 is a vector feature for interacting with the vector feature.
  • the seventh sub-processing result Q7 is sequentially subjected to the sixth linear operation, the second activation function, and the seventh linear operation to obtain the ninth sub-processing result Q9; the inner product operation (Inner) is performed based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain the tenth sub-processing result Q10; and the fifth matrix corresponding multiplication operation is performed on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N_v.
  • the operation of obtaining the ninth sub-processing result Q9 realizes updating scalar features with vector information.
  • the operation of obtaining the tenth sub-processing result Q10 realizes the conversion of vector information into scalar information.
  • the operation of obtaining the update node vector feature NewN v realizes updating the vector feature with scalar information.
  • the fourth matrix corresponding multiplication operation is performed on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain the eleventh sub-processing result Q11.
  • the operation of obtaining the eleventh sub-processing result Q11 realizes scaling the scalar feature by using the scalar information obtained by the vector inner product operation.
  • the second matrix addition is performed on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature NewN s .
  • the scalar and vector features of the edges interact with the scalar and vector features of the nodes to update the node features and output new scalar and vector features of the nodes.
  • the edge information is fused into the node information to form new features, which improves the ability of the features NewN s and NewN v to represent the structure, makes it easier for the model to extract information related to the free energy of solvation, and ultimately makes the prediction results more accurate.
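  • As a minimal sketch of the last steps described above (Q10, Q11, NewN s and NewN v), the following assumes that scalar features are stored as N × F × 1 arrays and vector features as N × F × 3 arrays. Q7, Q8 and Q9 are taken as given inputs, the inner product is taken between Q7 and Q8 over the spatial axis only (the role of Q3 is left out), and the function name `update_tail` is hypothetical. It illustrates the idea and is not the implementation of the application.

```python
import numpy as np

def update_tail(q7, q8, q9):
    """Sketch of the final update steps.

    q7, q8: vector features, shape (N, F, 3)
    q9:     scalar feature,  shape (N, F, 1)
    """
    # Q10: the inner product over the spatial axis converts vector information to a scalar.
    q10 = np.sum(q7 * q8, axis=-1, keepdims=True)  # (N, F, 1)
    # Q11: Hadamard product scales the scalar feature by the inner-product scalar.
    q11 = q9 * q10
    # NewN s: matrix addition of Q9 and Q11.
    new_ns = q9 + q11
    # NewN v: Hadamard product scales the vector feature Q8 by the scalar Q9
    # (broadcast over the x, y, z components).
    new_nv = q8 * q9
    return new_ns, new_nv

rng = np.random.default_rng(0)
N, F = 4, 8
q7 = rng.normal(size=(N, F, 3))
q8 = rng.normal(size=(N, F, 3))
q9 = rng.normal(size=(N, F, 1))
new_ns, new_nv = update_tail(q7, q8, q9)

# Rotate the vector inputs about the z axis and run the same update again.
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
ns_rot, nv_rot = update_tail(q7 @ rot.T, q8 @ rot.T, q9)
```

  • Under these assumptions the scalar output is unchanged by the rotation and the vector output rotates with the input, which is the behaviour expected of an equivariant update.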
  • the logic of updating node scalar feature New_N s and updating node vector feature New_N v based on updating virtual molecular graph is similar to the logic shown in FIG. 4 , and will not be described in detail here.
  • the molecule can be represented by a virtual molecular graph as a descriptor, and the virtual molecular graph includes relatively complete three-dimensional characteristics of the molecule, which helps to improve the accuracy of the determined free energy of solvation of the molecule.
  • vector features are also used in convolution, which makes extracting molecular features easier and more accurate than the method of using only scalar features in the related art.
  • the solute molecular characteristics for solute molecules and the solvent molecular characteristics for solvent molecules can each be determined in the above-mentioned manner of determining molecular characteristics. It should be noted that a solute molecule can have an interaction force with multiple adjacent solvent molecules; the force between a solute molecule and a single solvent molecule can be determined first, and then the molecular solvation free energy relative to multiple solvent molecules can be determined. In addition, it is also possible to directly determine the molecular solvation free energy of a solute molecule relative to multiple adjacent solvent molecules.
  • target molecules may be solute molecules and/or solvent molecules.
  • the data to be processed may include molecular attribute information, for example, the data to be processed is data of solute molecules, and/or data of solvent molecules.
  • the above method may also include the following operations: determining the solute molecular characteristics of the solute molecule, and the solvent molecular characteristics of at least one solvent molecule associated with the solute molecule, so that the free energy of solvation is determined based on the solute molecular characteristics and the solvent molecular characteristics of the at least one solvent molecule. It should be noted that the feature dimensions of the solute molecular feature and the solvent molecular feature may be the same.
  • Fig. 5 schematically shows a flowchart of another data processing method according to an embodiment of the present application.
  • the above method may further include operation S510 to operation S520.
  • the matrix product of the solvent molecular signature and the solute molecular signature is used as the solvation matrix between the solvent molecule and the solute molecule.
  • solvation characteristics are determined based on the solvation matrix.
  • the solvent-solute interaction is not explicitly described in the related art.
  • the solvent-solute interaction is explicitly described by the matrix product of the solute molecular characteristics and the solvent molecular characteristics, which helps to improve the accuracy of the solvation signature and, in turn, of the solvation free energy of the molecule.
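  • A minimal sketch of the solvation matrix, assuming the solvent feature is an M × F array (one row per solvent atom), the solute feature is an N × F array (one row per solute atom), and the product is taken as solvent times transposed solute so that entry (i, j) pairs solvent atom i with solute atom j. The orientation of the product and the function name `solvation_matrix` are assumptions made for illustration.

```python
import numpy as np

def solvation_matrix(x_m, x_n):
    """x_m: (M, F) solvent features; x_n: (N, F) solute features -> (M, N)."""
    if x_m.shape[1] != x_n.shape[1]:
        raise ValueError("solvent and solute features must share the feature dimension F")
    # Each entry is the dot product of one solvent-atom feature and one solute-atom feature.
    return x_m @ x_n.T

x_m = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # M = 3 atoms, F = 2
x_n = np.array([[2.0, 1.0], [0.5, 0.5]])              # N = 2 atoms, F = 2
s = solvation_matrix(x_m, x_n)
```

  • The shared feature dimension F is what makes the product well defined, which is consistent with the note that the solute and solvent feature dimensions may be the same.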
  • the above-mentioned determination of solvation characteristics based on the solvation matrix may include the following operations.
  • the solvent characteristics corresponding to the preset solute weights are calculated based on the solvation matrix, and the solute characteristics corresponding to the preset solvent weights are calculated based on the solvation matrix.
  • the solvent feature and the solute feature are respectively converted into one-dimensional row vectors including F elements.
  • X M is an M × F dimensional matrix, X N is an N × F dimensional matrix, and M and N are the numbers of atoms contained in the solvent molecule and the solute molecule, respectively.
  • X′ M and X′ N are each weighted and summed, so that X′ M and X′ N are each converted into a one-dimensional row vector containing F elements, and finally the two row vectors are spliced into a 2F-dimensional row vector I MN , which is the solvation signature.
  • in terms of element indices: X′ M (1, 2, 3, ..., F), X′ N (1, 2, 3, ..., F), and I MN (1, 2, 3, ..., F, 1, 2, 3, ..., F).
  • the array element weight may be determined based on an attention mechanism.
  • converting the solvent feature and the solute feature into a one-dimensional row vector including F elements may include the following operations.
  • the first array element weight of the array element corresponding to each atom of the solvent molecule in the solvent feature is determined, and the second array element weight of the array element corresponding to each atom of the solute molecule in the solute feature is determined.
  • using the attention mechanism, the attention coefficient of each atom in X′ M and X′ N is calculated, X′ M and X′ N are each weighted and summed according to the attention coefficients so that each is converted into a one-dimensional row vector containing F elements, and finally the two row vectors are concatenated into a 2F-dimensional row vector representing the solvation feature I MN .
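  • The attention-based aggregation can be sketched as follows. The text only states that the solvent feature and the solute feature are calculated from the solvation matrix; the specific forms used here (solvation matrix times the solute feature, and its transpose times the solvent feature), the softmax scoring with the vectors `w_m` and `w_n`, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_pool(x, w):
    """x: (A, F) per-atom features; w: (F,) hypothetical learned scoring vector."""
    alpha = softmax(x @ w)     # (A,), one attention coefficient per atom
    return alpha @ x, alpha    # (F,) weighted sum over atoms, plus the coefficients

def solvation_feature(x_m, x_n, w_m, w_n):
    s = x_m @ x_n.T            # (M, N) solvation matrix
    xp_m = s @ x_n             # (M, F) solvent feature (assumed form)
    xp_n = s.T @ x_m           # (N, F) solute feature (assumed form)
    row_m, _ = attention_pool(xp_m, w_m)
    row_n, _ = attention_pool(xp_n, w_n)
    return np.concatenate([row_m, row_n])   # (2F,) solvation feature

rng = np.random.default_rng(1)
M, N, F = 5, 7, 4
x_m = rng.normal(size=(M, F))
x_n = rng.normal(size=(N, F))
w_m = rng.normal(size=F)
w_n = rng.normal(size=F)
i_mn = solvation_feature(x_m, x_n, w_m, w_n)
_, alpha = attention_pool(x_m, w_m)
```

  • Because the coefficients are normalised over atoms, the pooled row vector has F elements regardless of how many atoms the molecule contains.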
  • the solvation free energy of the molecule can be obtained by performing weighted summation, offset and other processing on each element in the solvation feature I MN .
  • the molecular solvation free energy can be obtained.
  • the attention-based summation embodies the physical meaning of the solute weight and the solvent weight, and explicitly describes the effect of solvation, which improves the accuracy of prediction of the solvation free energy.
  • Another aspect of the present application also provides a method for training a solvation free energy prediction model.
  • the above-mentioned method for training the solvation free energy prediction model may include: inputting the virtual molecular graph determined based on the above method into the solvation free energy prediction model, and adjusting the model parameters so that the loss function converges, to obtain the trained solvation free energy prediction model, wherein the virtual molecular graph has corresponding solvation free energy label information, and the input of the loss function includes the predicted solvation free energy and the solvation free energy in the solvation free energy label information.
  • the solvation free energy prediction model may include at least one of the following networks.
  • the molecular encoding network is configured to convert each training data in the training data set including solute molecular data and/or solvent molecular data into virtual molecular graphs for solute molecular data and/or solvent molecular data, wherein the training Data have free energy of solvation label information.
  • Equivariant graph convolutional networks configured to convert virtual molecular graphs into solute molecular features and/or solvent molecular features.
  • a solvation network configured to convert solute molecular features and solvent molecular features into solvation features.
  • a fully connected network configured to convert solvation features into solvation free energies.
  • the above training method may include: inputting the training data into the molecular encoding network, and adjusting the model parameters (such as network parameters) to make the loss function converge, wherein the input of the loss function includes the predicted solvation free energy and the solvation free energy in the solvation free energy label information.
  • the solvation network includes a self-attention network configured to determine a first array element weight of the array element corresponding to each atom of the solvent molecule in the solvent feature, and to determine a second array element weight of the array element corresponding to each atom of the solute molecule in the solute feature, so that the array elements in the solvent feature corresponding to the atoms of the solvent molecule are fused according to the first array element weight, and the array elements in the solute feature corresponding to the atoms of the solute molecule are fused according to the second array element weight, wherein the solvent characteristics and solute characteristics are determined based on the solvation matrix, and the solvation matrix is determined based on the solute molecular characteristics and solvent molecular characteristics.
  • the above method may further include the following operations.
  • the training data set is divided into sub-training data sets of a specified number.
  • the specified number of parts can be determined based on expert experience or the accuracy of prediction of the molecule's free energy of solvation.
  • the specified number of copies can be 3, 5, 8, 10, 13, 18, 20, etc.
  • inputting the training data into the molecular encoding network includes: respectively inputting the training data in each sub-training data set into the molecular encoding network of a different solvation free energy prediction model, so as to train the different solvation free energy prediction models separately and obtain as many trained solvation free energy prediction models as the specified number.
  • Fig. 6 schematically shows a flowchart of a method for training a solvation free energy prediction model according to an embodiment of the present application.
  • the model consists of four parts: a molecular encoding network that converts molecular x,y,z strings into virtual molecular graphs.
  • An equivariant graph convolutional network that converts virtual molecular graphs to molecular features.
  • A solvation attention network that converts the features of solute molecules and solvent molecules into solvation features through matrix product and attention aggregation (the matrix product of solute molecules and solvent molecules is obtained at the atomic level of the molecule, and attention aggregation combines these atomic-level features into molecular-level features through the attention mechanism), and a fully connected network that converts the solvation features into the solvation free energy.
  • the loss function may be, for example, a mean square error loss function, an absolute difference loss function, or a Huber loss function.
  • the data set is equally divided into ten parts, and the model is trained by ten-fold cross-validation.
  • the preset value can be 0.0005, 0.001, 0.002, etc. (reaching it is treated as convergence), and ten equivariant graph network models are obtained. It should be noted that a 5-fold cross-validation method or a k-fold cross-validation method may also be used.
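  • A minimal sketch of dividing the data set into k equal parts for k-fold cross-validation, with k = 10 matching the ten-fold case above. The shuffling, the seed and the function name `k_fold_indices` are assumptions; the application does not state how the parts are drawn.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)      # shuffle once so folds are random
    folds = np.array_split(order, k)        # k parts of (nearly) equal size
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# One model would be trained per split, giving k trained models in total.
splits = list(k_fold_indices(100, k=10))
```

  • Changing `k` to 5 gives the 5-fold variant mentioned above.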
  • where G i,pred is the predicted value of the solvation free energy, G i,true is the real value of the solvation free energy, and n is the number of solvent-solute pairs used in training.
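  • The loss formula itself is not reproduced in the text above. As a sketch, the three loss functions named earlier can be written in terms of G i,pred , G i,true and n as follows, assuming the standard definitions (the mean square error being the mean of the squared differences over the n pairs). The `delta` threshold of the Huber loss is an illustrative default, not a value taken from the application.

```python
import numpy as np

def mse_loss(g_pred, g_true):
    """Mean square error: (1/n) * sum_i (G_i,pred - G_i,true)^2."""
    g_pred, g_true = np.asarray(g_pred, float), np.asarray(g_true, float)
    return np.mean((g_pred - g_true) ** 2)

def abs_loss(g_pred, g_true):
    """Absolute difference loss: (1/n) * sum_i |G_i,pred - G_i,true|."""
    g_pred, g_true = np.asarray(g_pred, float), np.asarray(g_true, float)
    return np.mean(np.abs(g_pred - g_true))

def huber_loss(g_pred, g_true, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    g_pred, g_true = np.asarray(g_pred, float), np.asarray(g_true, float)
    err = np.abs(g_pred - g_true)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quadratic, linear))

g_pred = [-3.0, -1.0, 2.0]   # illustrative predicted values
g_true = [-2.0, -1.0, 0.0]   # illustrative label values
```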
  • Fig. 7 schematically shows a schematic structural diagram of an equivariant graph convolutional network according to an embodiment of the present application.
  • the equivariant graph convolutional network includes num_conv convolutional layers, where num_conv is the specified number of cycles, and the output of the current convolutional layer is used as part of the input of the next, adjacent convolutional layer.
  • the input of the first convolutional layer (refer to the first convolutional layer) includes: node scalar feature N s , node vector feature N v , edge scalar feature E s and edge vector feature E v .
  • the output of the first convolutional layer includes: update node scalar feature New_N s and update node vector feature New_N v .
  • the input of the convolutional layer other than the first convolutional layer includes: update node scalar feature New_N s and update node vector feature New_N v , Edge scalar features E s and edge vector features E v .
  • the outputs of the convolutional layers other than the first convolutional layer include: updated node scalar feature New_N s and updated node vector feature New_N v .
  • Atomic features can be transformed into molecular features through equivariant graph convolutional networks.
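  • The stacking of layers described above can be sketched as a loop in which the node features are replaced by each layer's output while the edge features are passed unchanged to every layer. `run_conv_layers` and `toy_layer` are hypothetical names, and `toy_layer` is a placeholder update used only to make the data flow visible; it is not the convolution of the application.

```python
import numpy as np

def run_conv_layers(layers, n_s, n_v, e_s, e_v):
    """Apply len(layers) == num_conv layers in sequence."""
    for layer in layers:
        # Node features are replaced by the layer output;
        # edge features are reused as-is by every layer.
        n_s, n_v = layer(n_s, n_v, e_s, e_v)
    return n_s, n_v

def toy_layer(n_s, n_v, e_s, e_v):
    # Placeholder update, NOT the application's convolution.
    return n_s + 1.0, 2.0 * n_v

num_conv = 3
N, F, E = 4, 8, 6
n_s = np.zeros((N, F, 1))
n_v = np.ones((N, F, 3))
e_s = np.ones((E, 1, 1))
e_v = np.ones((E, 3, 1))
x, final_n_v = run_conv_layers([toy_layer] * num_conv, n_s, n_v, e_s, e_v)
```

  • The scalar output of the last layer plays the role of the molecular feature in this sketch.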
  • each convolutional layer can implement feature transformation as follows.
  • the equivariant graph convolutional network is composed of four basic operations, Linear, ReLU, Hadamard and Sum, where Linear is a matrix linear transformation operation, ReLU is an activation operation, Hadamard is a matrix corresponding multiplication operation, and Sum is a matrix addition operation.
  • the convolutional layer is configured to perform the following operations.
  • the first linear operation may be implemented by the first linear layer
  • the second linear operation may be implemented by the second linear layer.
  • the first linear layer and the second linear layer may be the same layer or different layers.
  • the first matrix corresponding multiplication operation is performed on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain the third sub-processing result Q3.
  • the first matrix addition operation is performed on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain the sixth sub-processing result Q6.
  • the sixth sub-processing result Q6 is respectively subjected to the fourth linear operation and the fifth linear operation to obtain the seventh sub-processing result Q7 and the eighth sub-processing result Q8.
  • the seventh sub-processing result Q7 is sequentially subjected to the sixth linear operation, the second activation function, and the seventh linear operation to obtain the ninth sub-processing result Q9, and based on the third sub-processing result Q3, the seventh sub-processing result Q7 and The eighth sub-processing result Q8 is subjected to the inner product operation Inner to obtain the tenth sub-processing result Q10.
  • the fourth matrix corresponding multiplication operation is performed on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain the eleventh sub-processing result Q11.
  • the second matrix addition is performed on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature NewN s .
  • the fifth matrix corresponding multiplication operation is performed on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature NewN v .
  • atoms of solute molecules or solvent molecules in the training data respectively have F-dimensional features.
  • Fig. 8 schematically shows a schematic structural diagram of a fully connected network according to an embodiment of the present application.
  • the fully connected network may include: a sequentially connected first linear layer (such as Linear), a first activation function layer (such as ReLU), a second linear layer, a second activation function layer, and a third linear layer, where the output dimension of the first linear layer and the second linear layer is F and the output dimension of the third linear layer is 1.
  • the input to the first linear layer is a 2F-dimensional row vector representing the solvation feature I MN .
  • the input solvation feature I MN can be converted into the molecular solvation free energy through the fully connected network.
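  • A minimal sketch of such a fully connected network, following the stated layer sizes (2F to F, F to F, F to 1). The random initialisation, the use of ReLU for the second activation function layer, and the names `init_fc` and `fc_forward` are assumptions for illustration.

```python
import numpy as np

def init_fc(F, seed=0):
    """Create (weight, bias) pairs for the three linear layers."""
    rng = np.random.default_rng(seed)
    dims = [(2 * F, F), (F, F), (F, 1)]
    return [(rng.normal(scale=0.1, size=d), np.zeros(d[1])) for d in dims]

def fc_forward(i_mn, params):
    """i_mn: (2F,) solvation feature -> (1,) predicted solvation free energy."""
    h = i_mn
    for idx, (w, b) in enumerate(params):
        h = h @ w + b                      # linear layer
        if idx < len(params) - 1:
            h = np.maximum(h, 0.0)         # ReLU after the first two linear layers only
    return h

F = 4
params = init_fc(F)
i_mn = np.ones(2 * F)
g = fc_forward(i_mn, params)
```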
  • Another aspect of the present application provides a method of determining the free energy of solvation.
  • the above-mentioned method for determining the free energy of solvation may include the following operation: using the solvation free energy prediction model trained according to the above-mentioned method to process the virtual molecular graph to obtain the free energy of solvation for the virtual molecular graph, wherein the virtual molecular graph is a graph generated based on the data to be processed, the data to be processed includes attribute information for multiple atoms in the target molecule, and the target molecule includes solute molecules and/or solvent molecules.
  • the solvation free energy prediction model may include at least one of the following networks.
  • the molecular encoding network is configured to convert each training data in the training data set including solute molecular data and/or solvent molecular data into virtual molecular graphs for solute molecular data and/or solvent molecular data, wherein the training Data have free energy of solvation label information.
  • Equivariant graph convolutional networks configured to convert virtual molecular graphs into solute molecular features and/or solvent molecular features.
  • a solvation network configured to convert solute molecular features and solvent molecular features into solvation features.
  • a fully connected network configured to convert solvation features into solvation free energies.
  • the above method may include the following operation: using the trained solvation free energy prediction model to process the data to be processed to obtain the solvation free energy for the data to be processed, wherein the data to be processed includes the respective attribute information of multiple atoms in the target molecule, and the target molecule includes solute molecules and/or solvent molecules.
  • Fig. 9 schematically shows a flowchart of a method for determining the free energy of solvation according to an embodiment of the present application.
  • network parameters can be input into the solvation free energy prediction model so that the solvent conformation and solute conformation can be processed by the trained neural network.
  • the solvent conformation can be expressed, e.g., as an xyz string for solvent molecules, and the solute conformation can be expressed, e.g., as an xyz string for solute molecules.
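  • As a sketch of how an xyz string could be turned into a node set and a node position set, the parser below assumes the common XYZ text layout (first line: atom count, second line: comment, then one `element x y z` line per atom). The application does not specify the exact layout, and the function name `parse_xyz` and the water coordinates are illustrative only.

```python
import numpy as np

def parse_xyz(text):
    """Return (element symbols, (N, 3) coordinate array) from an XYZ-style string."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])          # first line: number of atoms
    atoms, coords = [], []
    for line in lines[2:2 + n_atoms]:           # skip the comment line
        parts = line.split()
        atoms.append(parts[0])
        coords.append([float(v) for v in parts[1:4]])
    if len(atoms) != n_atoms:
        raise ValueError("atom count does not match the number of coordinate lines")
    return atoms, np.array(coords)

# Illustrative coordinates for a water molecule (angstroms).
water = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.467
H 0.000 -0.757 -0.467
"""
atoms, positions = parse_xyz(water)
```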
  • the above method may further include the following operations.
  • the virtual molecular graph or the data to be processed is input into each of a specified number of different trained solvation free energy prediction models to obtain the specified number of solvation free energies.
  • the specified number may be the number of trained solvation free energy prediction models.
  • the solvent molecules and solute molecules to be predicted are respectively input into ten models in the format of x, y, and z, and ten predicted values of solvation free energy are obtained, and the average of them is taken as the final prediction result.
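  • The averaging over the trained models can be sketched as follows. The ten stand-in models are simple callables that return fixed numbers; in practice each would be one of the trained solvation free energy prediction models. The name `ensemble_predict` and the stand-in values are assumptions.

```python
import numpy as np

def ensemble_predict(models, solvent, solute):
    """Run every model on the same solvent/solute pair and average the predictions."""
    preds = np.array([model(solvent, solute) for model in models])
    return preds.mean(), preds

# Ten stand-in models returning -5.0, -4.9, ..., -4.1 (arbitrary illustrative values).
models = [lambda solvent, solute, b=b: -5.0 + 0.1 * b for b in range(10)]
final, preds = ensemble_predict(models, "solvent xyz string", "solute xyz string")
```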
  • a total of 48,776 molecular conformations of 11,940 molecules are collected (for example, molecular conformations can be collected through online databases such as pubchem), and only a single molecular conformation is selected among them (selecting a single conformation molecule here is just for calculation convenience)
  • COSMOtherm is used to calculate the solvation free energy of the 48,776 conformations using 15 (just an example; more or fewer than 15 may be used) solvent molecules, including alkanes, diethyl ether, acetonitrile, dimethylformamide, dimethyl sulfoxide, and methyl tert-butyl ether.
  • 48776 conformations are stored in the data set in x, y, z format, and 731640 solvation free energy data corresponding to solute conformation and solvent conformation are stored in the data set as floating point numbers. Select 48776 systems using water as the solvent as the test set, and the other 682864 systems as the training set.
  • a total of 48,776 conformations of 11,940 molecules are collected, and water, tetrahydrofuran, chloroform, dichloromethane, dioxane, toluene, methanol, acetone, n-heptane, cyclohexane, diethyl ether, acetonitrile, dimethylformamide, dimethyl sulfoxide and methyl tert-butyl ether are used as solvents; COSMOtherm is used to calculate 731,640 items of solvation free energy data for the 48,776 conformations in the 15 solvents.
  • 48776 conformations are stored in the data set in x, y, z format, and 731640 solvation free energy data corresponding to solute conformation and solvent conformation are stored in the data set as floating point numbers. 41475 pieces of solute-solvent-solvation free energy data of 2765 conformation systems of 740 kinds of molecules were selected as the test set, and the other 690165 systems were used as the training set.
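  • The data set sizes quoted in the two examples above follow from the number of conformations and the number of solvents, as the following arithmetic check shows (variable names are illustrative).

```python
n_conformations = 48776
n_solvents = 15

# Every conformation is paired with every solvent.
total = n_conformations * n_solvents

# Split 1: all systems with water as the solvent form the test set.
test_water = n_conformations * 1
train_water = total - test_water

# Split 2: 2765 held-out conformations, each in all 15 solvents, form the test set.
test_heldout = 2765 * n_solvents
train_heldout = total - test_heldout
```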
  • to address the defects and insufficiencies of molecular solvation free energy prediction in related technologies, this embodiment proposes a neural network based on equivariant graphs to predict the solvation free energy.
  • this embodiment uses virtual molecular graphs as descriptors to represent molecules.
  • this embodiment uses a matrix product of solute molecular feature vectors and solvent molecular feature vectors to describe the solvent-solute interaction. Specifically, it consists of four steps: molecular encoding, equivariant graph convolution, feature interaction and free energy prediction.
  • the molecular encoding step represents solvent and solute molecules as virtual molecular graphs with feature encodings.
  • the equivariant graph convolution step transforms the virtual molecular graph into a feature representation in matrix form.
  • in the feature interaction step, the feature representations of the solvent and the solute are matrix multiplied to obtain the feature representation of solvation.
  • the free energy prediction step is based on the characteristic representation of solvation to predict the molecular solvation free energy through the fully connected neural network, which effectively improves the accuracy of the predicted molecular solvation free energy.
  • Another aspect of the present application also provides a design method.
  • Fig. 14 schematically shows a flowchart of a design method according to an embodiment of the present application.
  • the design method may include operation S1410 and operation S1420.
  • Another aspect of the present application also provides a data processing device.
  • Fig. 15 schematically shows a block diagram of a data processing device according to an embodiment of the present application.
  • the data processing device may include: a module for obtaining data to be processed 1510 , a set generation module 1520 , a node and edge feature generation module 1530 , and a virtual molecule construction module 1540 .
  • the to-be-processed data obtaining module 1510 is used to obtain the to-be-processed data, and the to-be-processed data includes property information for multiple atoms in the target molecule.
  • the set generation module 1520 is used to generate a node set and a node position set for the target molecule in response to the respective attribute information of a plurality of atoms, wherein the multiple nodes in the node set respectively represent atoms of a specific atom type, and the node position set includes the coordinate information of each node in the node set in a specific coordinate system.
  • the node and edge feature generation module 1530 is used to generate the node scalar feature N s and the node vector feature N v for the node set, and generate the edge scalar feature E s and the edge vector feature E v for the node set based on the coordinate information of each node in the node position set.
  • the virtual molecule construction module 1540 is used to construct a virtual molecular graph based on the node scalar feature N s , node vector feature N v , edge scalar feature E s and edge vector feature E v for the node set, so as to determine the molecular feature X of the target molecule based on the virtual molecular graph, which facilitates determining the free energy of solvation based at least on the molecular feature X of the target molecule.
  • the target molecule includes N atoms, and the plurality of nodes in the node set each have F-dimensional features.
  • the dimension of the node scalar feature N s is N × F × 1, the dimension of the node vector feature N v is N × F × 3, the dimension of the edge scalar feature E s is N × 1 × 1, and the dimension of the edge vector feature E v is N × 3 × 1.
  • the above apparatus 1500 may further include: a truncation radius determination module and a target node set determination module.
  • the cutoff radius determination module is configured to determine the cutoff radius r cut after generating a node set and a node position set for the target molecule in response to the respective attribute information of the plurality of atoms.
  • the target node set determining module is configured to determine the target nodes whose distance between nodes is less than or equal to the cutoff radius r cut from the node set to obtain the target node set N i .
  • the node and edge feature generation module 1530 is specifically configured to generate the edge scalar feature E s and the edge vector feature E v for the target node set N i based on the coordinate information for the target node in the node position set.
  • the set of target nodes includes E nodes, each of which has F-dimensional features.
  • the dimension of the node scalar feature N s is N × F × 1, the dimension of the node vector feature N v is N × F × 3, the dimension of the edge scalar feature E s is E × 1 × 1, and the dimension of the edge vector feature E v is E × 3 × 1.
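  • A minimal sketch of selecting node pairs within the cutoff radius r cut and building edge features with the E × 1 × 1 and E × 3 × 1 shapes given above. It assumes that E counts directed pairs of distinct nodes, that the edge scalar feature is the inter-node distance, and that the edge vector feature is the displacement between the two nodes; the application may encode edges differently, and `build_edges` is a hypothetical name.

```python
import numpy as np

def build_edges(positions, r_cut):
    """positions: (N, 3) node coordinates -> (src, dst, E_s, E_v)."""
    diff = positions[None, :, :] - positions[:, None, :]   # diff[i, j] = r_j - r_i
    dist = np.linalg.norm(diff, axis=-1)                   # (N, N) pairwise distances
    # Keep pairs within the cutoff radius, excluding each node paired with itself.
    mask = (dist <= r_cut) & ~np.eye(len(positions), dtype=bool)
    src, dst = np.nonzero(mask)
    e_s = dist[src, dst].reshape(-1, 1, 1)                 # (E, 1, 1) edge scalar feature
    e_v = diff[src, dst].reshape(-1, 3, 1)                 # (E, 3, 1) edge vector feature
    return src, dst, e_s, e_v

positions = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 3.0, 0.0]])
src, dst, e_s, e_v = build_edges(positions, r_cut=1.5)
```

  • With this cutoff only the first two nodes are connected, so the third node contributes no edges.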
  • the above apparatus 1500 further includes a feature update module and a loop module.
  • the feature updating module is configured to update the node scalar feature N s and the node vector feature N v based on the virtual molecular graph, and obtain the updated node scalar feature New_N s and the updated node vector feature New_N v .
  • the cycle module is configured to repeat the following units until the specified number of cycles num_conv is reached, and the updated node scalar feature New_N s obtained when the specified cycle number num_conv is reached is used as the molecular feature X.
  • the feature replacement unit is configured to use the updated node scalar feature New_N s and the updated node vector feature New_N v as the current node scalar feature Now_N s and the current node vector feature Now_N v respectively.
  • the feature calculation unit is configured to use the current node scalar feature Now_N s , the current node vector feature Now_N v , the edge scalar feature E s and the edge vector feature E v to construct and update the virtual molecular graph.
  • the feature updating unit is configured to update the updated node scalar feature New_N s and the updated node vector feature New_N v based on the updated virtual molecular graph.
  • the feature update module is specifically configured to perform the following operations.
  • the first matrix corresponding multiplication operation is performed on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain the third sub-processing result Q3.
  • the first matrix addition operation is performed on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain the sixth sub-processing result Q6.
  • the sixth sub-processing result Q6 is respectively subjected to the fourth linear operation and the fifth linear operation to obtain the seventh sub-processing result Q7 and the eighth sub-processing result Q8.
  • the seventh sub-processing result Q7 is sequentially subjected to the sixth linear operation, the second activation function, and the seventh linear operation to obtain the ninth sub-processing result Q9, and based on the third sub-processing result Q3, the seventh sub-processing result Q7 and The eighth sub-processing result Q8 is subjected to the inner product operation Inner to obtain the tenth sub-processing result Q10.
  • the fourth matrix corresponding multiplication operation is performed on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain the eleventh sub-processing result Q11.
  • the fifth matrix corresponding multiplication operation is performed on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature NewN v .
  • the second matrix addition is performed on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature NewN s .
  • target molecules are solute molecules and/or solvent molecules.
  • the above-mentioned apparatus 1500 further includes: a solute-solvent molecular characteristic determination module configured to determine a solute molecular characteristic of the solute molecule, and a solvent molecular characteristic of at least one solvent molecule associated with the solute molecule, so that the free energy of solvation is determined based on the solute molecular characteristic and the solvent molecular characteristic of the at least one solvent molecule associated with the solute molecule.
  • the above-mentioned apparatus 1500 further includes: a solvation matrix determination module and a solvation characteristic determination module.
  • the solvation matrix determination module is configured to, after determining the solute molecular signature of the solute molecule, and the solvent molecular signature of at least one solvent molecule associated with the solute molecule, use the matrix product of the solvent molecular signature and the solute molecular signature as the solvation matrix between the solvent molecule and the solute molecule.
  • the solvation signature determination module is configured to determine a solvation signature based on the solvation matrix.
  • the solvation characterization module includes: a solvent characterization unit, a solute characterization unit, and a solvation characterization unit.
  • the solvent characteristic determining unit is configured to calculate the solvent characteristic corresponding to the preset solute weight based on the solvation matrix, and calculate the solute characteristic corresponding to the preset solvent weight based on the solvation matrix.
  • the solute feature determination unit is configured to convert the solvent feature and the solute feature into a one-dimensional row vector including F elements, respectively.
  • the solvation signature determination unit is configured to concatenate row vectors to obtain solvation signatures.
  • the solute feature determination unit includes an array element weight determination subunit and a weighted summation subunit.
  • the array element weight determining subunit is configured to determine the first array element weight of the array element corresponding to the atom of the solvent molecule in the solvent feature, and determine the second array element weight of the array element corresponding to the atom of the solute molecule in the solute feature.
  • the weighted summation subunit is configured to perform weighted summation on the solvent features based on the first array element weights to obtain a one-dimensional first row vector including F elements, and to perform weighted summation on the solute features based on the second array element weights , to obtain a one-dimensional second row vector containing F elements.
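The pipeline described above (solvation matrix, per-atom solvent and solute features, weighted summation, concatenation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the exact formulas that derive the solvent and solute features from the solvation matrix are not given in this passage, so the products used for `h_solvent` and `h_solute` are assumed forms; the array element weights are passed in as plain lists, whereas the application obtains them from a self-attention network; all function names are hypothetical.

```python
def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def weighted_sum(rows, weights):
    """Collapse an (n_atoms x F) matrix into one row vector with F elements."""
    return [sum(w * r[k] for w, r in zip(weights, rows)) for k in range(len(rows[0]))]


def solvation_feature(x_solvent, x_solute, w_solvent, w_solute):
    # Solvation matrix: matrix product of the solvent and solute molecular
    # features, shape (n_solvent_atoms x n_solute_atoms).
    m = matmul(x_solvent, transpose(x_solute))
    # ASSUMPTION: per-atom solvent/solute features derived from the matrix.
    h_solvent = matmul(m, x_solute)              # (n_solvent_atoms x F)
    h_solute = matmul(transpose(m), x_solvent)   # (n_solute_atoms x F)
    # Weighted summation over atoms gives two row vectors with F elements each.
    first_row = weighted_sum(h_solvent, w_solvent)
    second_row = weighted_sum(h_solute, w_solute)
    # Concatenation gives the solvation feature with 2F elements.
    return first_row + second_row


# Toy data: 3 solvent atoms and 2 solute atoms, each with F = 2 features.
x_solvent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_solute = [[2.0, 1.0], [0.0, 3.0]]
feature = solvation_feature(x_solvent, x_solute, [0.25, 0.25, 0.5], [0.5, 0.5])
```

The result always has 2F elements regardless of how many atoms each molecule contains, which is what lets a fixed-size network consume it downstream.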
  • Another aspect of the present application also provides a device for training a solvation free energy prediction model.
  • Fig. 16 schematically shows a block diagram of an apparatus for training a solvation free energy prediction model according to an embodiment of the present application.
  • the above-mentioned apparatus 1600 includes: a model training module 1610 configured to input the virtual molecular graph determined by the above method into the solvation free energy prediction model and to adjust the model parameters until the loss function converges, so as to obtain the trained solvation free energy prediction model, where the virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.
  • the above-mentioned solvation free energy prediction model includes: an equivariant graph convolutional network configured to convert a virtual molecular graph into solute molecular features and/or solvent molecular features.
  • the equivariant graph convolutional network includes a specified number num_conv of convolutional layers, where the output of the current convolutional layer is used as part of the input of the next adjacent convolutional layer.
  • the input of the first convolutional layer includes: the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v.
  • the output of the first convolutional layer includes: the updated node scalar feature New_N_s and the updated node vector feature New_N_v.
  • the input of each convolutional layer other than the first convolutional layer includes: the updated node scalar feature New_N_s, the updated node vector feature New_N_v, the edge scalar feature E_s and the edge vector feature E_v.
  • the output of each convolutional layer other than the first convolutional layer includes: the updated node scalar feature New_N_s and the updated node vector feature New_N_v.
  • the convolutional layer is configured to perform the following operations.
  • the first matrix corresponding multiplication operation is performed on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain the third sub-processing result Q3.
  • the first matrix addition operation is performed on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain the sixth sub-processing result Q6.
  • the sixth sub-processing result Q6 is respectively subjected to the fourth linear operation and the fifth linear operation to obtain the seventh sub-processing result Q7 and the eighth sub-processing result Q8.
  • the seventh sub-processing result Q7 is sequentially subjected to the sixth linear operation, the second activation function, and the seventh linear operation to obtain the ninth sub-processing result Q9, and the inner product operation Inner is performed based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain the tenth sub-processing result Q10.
  • the fourth matrix corresponding multiplication operation is performed on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain the eleventh sub-processing result Q11.
  • the second matrix addition is performed on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s.
  • the fifth matrix corresponding multiplication operation is performed on the ninth sub-processing result Q9 and the eighth sub-processing result Q8 to obtain the updated node vector feature New_N_v.
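The way the convolutional layers are chained can be sketched as follows. This only illustrates the data flow stated above: each layer's updated node features feed the next layer, while the edge features are passed to every layer unchanged. `run_conv_stack` and `toy_layer` are hypothetical names, and `toy_layer` is a placeholder update, not the convolution built from the sub-processing results Q1 to Q11.

```python
def run_conv_stack(layers, n_s, n_v, e_s, e_v):
    """Apply the layers in order; node features are updated, edge features are reused."""
    for layer in layers:
        # Every layer returns (updated node scalars, updated node vectors).
        n_s, n_v = layer(n_s, n_v, e_s, e_v)
    return n_s, n_v


def toy_layer(n_s, n_v, e_s, e_v):
    """PLACEHOLDER layer: adds 1 to each scalar and doubles each vector.

    It accepts the edge features only to match the layer interface.
    """
    new_n_s = [s + 1.0 for s in n_s]
    new_n_v = [[2.0 * c for c in vec] for vec in n_v]
    return new_n_s, new_n_v


num_conv = 3  # specified number of convolutional layers
node_scalars = [0.0, 1.0]
node_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
edge_scalars = [1.0, 1.0]
edge_vectors = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]

out_s, out_v = run_conv_stack(
    [toy_layer] * num_conv, node_scalars, node_vectors, edge_scalars, edge_vectors
)
```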
  • the solvation free energy prediction model includes: a molecular encoding network.
  • the molecular encoding network is configured to convert each item of training data in a training data set, which includes solute molecular data and/or solvent molecular data, into a virtual molecular graph for the solute molecular data and/or the solvent molecular data, wherein the training data has solvation free energy label information, and each atom of the solute molecules or solvent molecules in the training data has an F-dimensional feature.
  • the solvation free energy prediction model includes: a solvation network.
  • the solvation network is configured to convert solute molecular features and solvent molecular features into solvation features.
  • the solvation network includes a self-attention network configured to determine a first array element weight for each array element of the solvent feature that corresponds to an atom of the solvent molecule, and a second array element weight for each array element of the solute feature that corresponds to an atom of the solute molecule, so that the array elements of the solvent feature are fused according to the first array element weights and the array elements of the solute feature are fused according to the second array element weights, wherein the solvent feature and the solute feature are determined based on the solvation matrix, and the solvation matrix is determined based on the solute molecular feature and the solvent molecular feature.
  • the solvation free energy prediction model includes: a fully connected network.
  • a fully connected network is configured to convert solvation features into solvation free energies.
  • the fully connected network includes a first linear layer, a first activation function layer, a second linear layer, a second activation function layer and a third linear layer connected in sequence, wherein the output dimension of the first linear layer and of the second linear layer is F, and the output dimension of the third linear layer is 1.
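The layer sequence above can be sketched in plain Python as follows. This is a sketch under assumptions: the passage does not name the activation function, so ReLU is assumed; the input is taken to have 2F elements because the solvation feature is described as two concatenated row vectors with F elements each; the random initialization and the names `linear`, `relu`, `init_linear` and `fully_connected` are illustrative only.

```python
import random


def linear(x, weight, bias):
    """Affine map: one output element per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """ASSUMED activation function; the passage does not name one."""
    return [max(0.0, v) for v in x]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero biases, for illustration only."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def fully_connected(x, params):
    """Linear -> activation -> linear -> activation -> linear, returning one scalar."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = relu(linear(x, w1, b1))   # first linear layer, output dimension F
    h = relu(linear(h, w2, b2))   # second linear layer, output dimension F
    return linear(h, w3, b3)[0]   # third linear layer, output dimension 1


F = 4
rng = random.Random(0)
params = [init_linear(2 * F, F, rng), init_linear(F, F, rng), init_linear(F, 1, rng)]
prediction = fully_connected([0.5] * (2 * F), params)
```

The untrained output is meaningless as a free energy; the sketch only shows how the dimensions narrow from 2F to F to a single value.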
  • the above-mentioned apparatus 1600 further includes: a training set splitting module and a model building module.
  • the training set splitting module is configured to split the training data set into a specified number of sub-training data sets.
  • the model building module is configured to build the specified number of solvation free energy prediction models.
  • the model training module 1610 is specifically configured to input the training data in each sub-training data set into the molecular encoding network of a different solvation free energy prediction model, so as to train the different solvation free energy prediction models separately and obtain the specified number of trained solvation free energy prediction models.
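The split-and-train scheme above can be sketched as follows. The passage does not say how the training data set is split, so a simple strided, non-overlapping split is assumed here. `train_placeholder_model` is a hypothetical stand-in that just memorizes the mean label of its subset; it takes the place of training a full solvation free energy prediction model and is not part of the application.

```python
def split_dataset(samples, num_splits):
    """Split the samples into num_splits disjoint subsets (ASSUMED strided split)."""
    if num_splits < 1 or num_splits > len(samples):
        raise ValueError("num_splits must be between 1 and the number of samples")
    return [samples[i::num_splits] for i in range(num_splits)]


def train_placeholder_model(subset):
    """STAND-IN for model training: returns a predictor of the subset's mean label."""
    mean_label = sum(label for _, label in subset) / len(subset)
    return lambda molecule: mean_label


# Toy training data: (molecule identifier, solvation free energy label).
samples = [("molecule_%d" % i, float(i)) for i in range(6)]

num_models = 3  # the specified number
subsets = split_dataset(samples, num_models)
models = [train_placeholder_model(subset) for subset in subsets]
```

Each model sees a different subset, so the resulting predictors differ and can later be combined.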
  • Another aspect of the present application also provides an apparatus for determining the free energy of solvation.
  • Fig. 17 schematically shows a block diagram of an apparatus for determining the free energy of solvation according to an embodiment of the present application.
  • the above-mentioned apparatus 1700 includes: a free energy prediction module 1710 configured to process the data to be processed with the trained solvation free energy prediction model to obtain the solvation free energy for the data to be processed, wherein the data to be processed includes attribute information of each of a plurality of atoms in a target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.
  • the solvation free energy prediction model includes at least one of the following networks: a molecular encoding network configured to convert each item of training data in a training data set, which includes solute molecular data and/or solvent molecular data, into a virtual molecular graph for the solute molecular data and/or the solvent molecular data, wherein the training data has solvation free energy label information; an equivariant graph convolutional network configured to convert the virtual molecular graph into solute molecular features and/or solvent molecular features; a solvation network configured to convert the solute molecular features and the solvent molecular features into solvation features; a fully connected network configured to convert the solvation features into the solvation free energy.
  • the above-mentioned apparatus 1700 further includes: a multi-model processing module and a weighting processing module.
  • the multi-model processing module is configured to input the data to be processed into a specified number of different trained solvation free energy prediction models to obtain the specified number of solvation free energies.
  • the weighting processing module is configured to take the weighted average of the specified number of solvation free energies as the solvation free energy corresponding to the data to be processed. It should be noted that the respective weights of the specified number of solvation free energies may be the same or different. For example, the weight of the solvation free energy obtained by the model with high prediction accuracy on the test data set can be higher than the weight of the solvation free energy obtained by other models.
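The weighted averaging described above can be sketched as follows. The function name `ensemble_free_energy` and the numeric values are illustrative assumptions; how the weights are chosen (for example, from accuracy on a test data set) is left to the caller.

```python
def ensemble_free_energy(predictions, weights=None):
    """Weighted average of the per-model solvation free energy predictions."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    if weights is None:
        # Equal weights: plain arithmetic mean.
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("one weight per prediction is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Illustrative outputs of three trained models for the same input.
predictions = [-4.0, -5.0, -6.0]

equal_weighted = ensemble_free_energy(predictions)
# The first model is trusted more, e.g. because it scored best on a test set.
accuracy_weighted = ensemble_free_energy(predictions, weights=[2.0, 1.0, 1.0])
```

Dividing by the sum of the weights means the caller does not need to normalize them beforehand.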
  • Another aspect of the present application also provides a design device.
  • Fig. 18 schematically shows a block diagram of a design device according to an embodiment of the present application.
  • the apparatus 1800 may include: a solvation free energy determination module 1810 and a design module 1820.
  • the solvation free energy determination module 1810 is configured to determine the solvation free energy according to the above method.
  • Design module 1820 is used for drug design or material design based on free energy of solvation.
  • Another aspect of the present application also provides an electronic device.
  • Fig. 19 schematically shows a block diagram of an electronic device implementing an embodiment of the present application.
  • an electronic device 1900 includes a memory 1910 and a processor 1920.
  • Processor 1920 can be a central processing unit (Central Processing Unit, CPU), and can also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory 1910 may include various types of storage units such as system memory, read only memory (ROM), and persistent storage. Wherein, the ROM can store static data or instructions required by the processor 1920 or other modules of the computer.
  • the persistent storage device may be a readable and writable storage device. Persistent storage may be a non-volatile storage device that does not lose stored instructions and data even if the computer is powered off.
  • the persistent storage device may be a mass storage device (such as a magnetic disk, an optical disk, or flash memory).
  • the persistent storage device may alternatively be a removable storage device (such as a floppy disk or an optical drive).
  • the system memory can be a readable and writable storage device, or a volatile readable and writable storage device such as dynamic random access memory.
  • System memory can store some or all of the instructions and data that the processor needs at runtime.
  • memory 1910 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic and/or optical disks may also be used.
  • memory 1910 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, a super density disc, a flash memory card (such as an SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, etc.
  • Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or by wire.
  • Executable code is stored in the memory 1910, and when the executable code is processed by the processor 1920, the processor 1920 may execute part or all of the methods mentioned above.
  • the method according to the present application can also be implemented as a computer program or computer program product, the computer program or computer program product including computer program code instructions for executing some or all of the steps in the above method of the present application.
  • the present application may also be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium, or a machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by a processor of an electronic device (or a server, etc.), the processor is made to perform part or all of the steps of the above-mentioned method according to the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data processing method and apparatus, a model training method, and a free energy prediction method. The data processing method comprises: obtaining data to be processed, the data to be processed comprising attribute information for each of multiple atoms in a target molecule; in response to the attribute information of each of the multiple atoms, generating a node set and a node position set for the target molecule; generating a node scalar feature Ns and a node vector feature Nv for the node set, and generating an edge scalar feature Es and an edge vector feature Ev for the node set on the basis of coordinate information of each node in the node position set; constructing a virtual molecular diagram on the basis of the node scalar feature Ns, the node vector feature Nv, the edge scalar feature Es and the edge vector feature Ev for the node set, so as to determine a molecular feature X of the target molecule on the basis of the virtual molecular diagram, and determine solvation free energy on the basis of the molecular feature X of the target molecule. According to the present application, the accuracy of determined solvation free energy can be improved.

Description

Data processing method and apparatus, model training method and free energy prediction method
Technical Field
The present application relates to the technical field of computer simulation, and in particular to a data processing method and apparatus, a model training method and a free energy prediction method.
Background
With the rapid development of computer technology and artificial intelligence technology, computer simulation is being applied in more and more scenarios, such as material design and drug design.
However, the applicant has found that the molecular solvation free energy obtained by the related art has low accuracy.
Summary of the Invention
To solve or partially solve the problems in the related art, the present application provides a data processing method and apparatus, a model training method and a free energy prediction method, which can effectively improve the accuracy of the obtained molecular solvation free energy.
A first aspect of the present application provides a data processing method, including: obtaining data to be processed, the data to be processed including attribute information of each of a plurality of atoms in a target molecule; in response to the attribute information of each of the plurality of atoms, generating a node set and a node position set for the target molecule, wherein the nodes in the node set each represent an atom of a specific atom type, and the node position set includes coordinate information of each node in the node set in a specific coordinate system; generating a node scalar feature N_s and a node vector feature N_v for the node set, and generating an edge scalar feature E_s and an edge vector feature E_v for the node set based on the coordinate information of each node in the node position set; and constructing a virtual molecular graph based on the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v for the node set, so as to determine a molecular feature X of the target molecule based on the virtual molecular graph, so that the solvation free energy can be determined based at least on the molecular feature X of the target molecule.
A second aspect of the present application provides a method for training a solvation free energy prediction model, including: inputting the virtual molecular graph determined by the above method into the solvation free energy prediction model, and adjusting the model parameters until the loss function converges, to obtain a trained solvation free energy prediction model, wherein the virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.
A third aspect of the present application provides a method for determining solvation free energy, including: processing a virtual molecular graph with a trained solvation free energy prediction model to obtain the solvation free energy for the virtual molecular graph, wherein the virtual molecular graph is a graph generated based on data to be processed, the data to be processed includes attribute information of each of a plurality of atoms in a target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.
A fourth aspect of the present application provides a design method, including: determining the solvation free energy according to the above method; and performing drug design or material design based on the solvation free energy.
A fifth aspect of the present application provides a data processing apparatus, including: a to-be-processed data obtaining module, configured to obtain data to be processed, the data to be processed including attribute information of each of a plurality of atoms in a target molecule; a set generation module, configured to generate, in response to the attribute information of each of the plurality of atoms, a node set and a node position set for the target molecule, wherein the nodes in the node set each represent an atom of a specific atom type, and the node position set includes coordinate information of each node in the node set in a specific coordinate system; a node and edge feature generation module, configured to generate a node scalar feature N_s and a node vector feature N_v for the node set, and to generate an edge scalar feature E_s and an edge vector feature E_v for the node set based on the coordinate information of each node in the node position set; and a virtual molecule construction module, configured to construct a virtual molecular graph based on the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v for the node set, so as to determine a molecular feature X of the target molecule based on the virtual molecular graph, so that the solvation free energy can be determined based at least on the molecular feature X of the target molecule.
A sixth aspect of the present application provides an apparatus for training a solvation free energy prediction model, including: a model training module, configured to input the virtual molecular graph determined by the above method into the solvation free energy prediction model, and to adjust the model parameters until the loss function converges, to obtain a trained solvation free energy prediction model, wherein the virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.
A seventh aspect of the present application provides an apparatus for determining solvation free energy, including: a free energy prediction module, configured to process a virtual molecular graph with a trained solvation free energy prediction model to obtain the solvation free energy for the virtual molecular graph, wherein the virtual molecular graph is a graph generated based on data to be processed, the data to be processed includes attribute information of each of a plurality of atoms in a target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.
An eighth aspect of the present application provides a design apparatus, including: a solvation free energy determination module, configured to determine the solvation free energy according to the above method; and a design module, configured to perform drug design or material design based on the solvation free energy.
A ninth aspect of the present application provides an electronic device, including: a processor; and a memory on which executable code is stored, wherein when the executable code is executed by the processor, the processor is made to execute the above method.
A tenth aspect of the present application further provides a computer-readable storage medium on which executable code is stored, wherein when the executable code is executed by a processor of an electronic device, the processor is made to execute the above method.
An eleventh aspect of the present application further provides a computer program product, including executable code, wherein the above method is implemented when the executable code is executed by a processor.
The data processing method and apparatus, model training method and free energy prediction method provided by the present application convert the data to be processed into a node set and a node position set for the target molecule, so that a node scalar feature N_s and a node vector feature N_v can be generated for the node set, and an edge scalar feature E_s and an edge vector feature E_v can be generated for the node set based on the coordinate information of each node in the node position set. Compared with the low-dimensional descriptors of the related art, these descriptors, which can represent the three-dimensional features of a molecule, characterize the target molecule more completely and effectively improve the accuracy of the determined solvation free energy.
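As an illustration of how edge features can be derived from node coordinates, consider the sketch below. It rests on assumptions that this passage does not confirm: the edge scalar feature is taken to be the distance between two nodes, the edge vector feature is taken to be the unit direction vector between them, and edges are created only between nodes closer than a hypothetical `cutoff`. The function name `build_edge_features` is likewise illustrative.

```python
import math


def build_edge_features(positions, cutoff=2.0):
    """Return directed edges with an assumed scalar (distance) and vector (unit direction)."""
    edges, edge_scalars, edge_vectors = [], [], []
    for i, p_i in enumerate(positions):
        for j, p_j in enumerate(positions):
            if i == j:
                continue
            diff = [b - a for a, b in zip(p_i, p_j)]
            dist = math.sqrt(sum(d * d for d in diff))
            # ASSUMPTION: connect only distinct nodes within the cutoff distance.
            if 0.0 < dist <= cutoff:
                edges.append((i, j))
                edge_scalars.append(dist)
                edge_vectors.append([d / dist for d in diff])
    return edges, edge_scalars, edge_vectors


# Toy node position set: three nodes in a Cartesian coordinate system.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
edges, e_s, e_v = build_edge_features(positions, cutoff=2.0)
```

Under these assumptions the scalar part is unchanged by rotating the molecule while the vector part rotates with it, which is the kind of three-dimensional information a purely topological descriptor cannot carry.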
In addition, the solvent-solute interaction is described by the matrix product of the solute molecular feature and the solvent molecular feature. Compared with the related art, which simply concatenates, arranges or sums the solvent molecular feature and the solute molecular feature, this describes the solvent-solute interaction explicitly and more faithfully, and effectively improves the accuracy of the determined solvation free energy.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present application.
Brief Description of the Drawings
Fig. 1 schematically shows an exemplary system architecture to which the data processing method and apparatus, model training method and free energy prediction method can be applied according to an embodiment of the present application;
Fig. 2 schematically shows a flowchart of a data processing method according to an embodiment of the present application;
Fig. 3 schematically shows a flowchart of a method for determining the molecular feature of a target molecule based on a virtual molecular graph according to an embodiment of the present application;
Fig. 4 schematically shows a logic diagram of updating the node scalar feature and the node vector feature based on a virtual molecular graph according to an embodiment of the present application;
Fig. 5 schematically shows a flowchart of another data processing method according to an embodiment of the present application;
Fig. 6 schematically shows a flowchart of a method for training a solvation free energy prediction model according to an embodiment of the present application;
Fig. 7 schematically shows the structure of an equivariant graph convolutional network according to an embodiment of the present application;
Fig. 8 schematically shows the structure of a fully connected network according to an embodiment of the present application;
Fig. 9 schematically shows a flowchart of a method for determining solvation free energy according to an embodiment of the present application;
Fig. 10 schematically shows a correlation plot, on the training set, between the true solvation free energy and the solvation free energy predicted by a model trained on a data set split by solvent type, according to an embodiment of the present application;
Fig. 11 schematically shows a correlation plot, on the test set, between the true solvation free energy and the solvation free energy predicted by a model trained on a data set split by solvent type, according to an embodiment of the present application;
Fig. 12 schematically shows a correlation plot, on the training set, between the true solvation free energy and the solvation free energy predicted by a model trained on a data set split by solute type, according to an embodiment of the present application;
Fig. 13 schematically shows a correlation plot, on the test set, between the true solvation free energy and the solvation free energy predicted by a model trained on a data set split by solute type, according to an embodiment of the present application;
Fig. 14 schematically shows a flowchart of a design method according to an embodiment of the present application;
Fig. 15 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 16 schematically shows a block diagram of an apparatus for training a solvation free energy prediction model according to an embodiment of the present application;
Fig. 17 schematically shows a block diagram of an apparatus for determining solvation free energy according to an embodiment of the present application;
Fig. 18 schematically shows a block diagram of a design apparatus according to an embodiment of the present application;
Fig. 19 schematically shows a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in more detail below with reference to the accompanying drawings. Although the drawings show embodiments of the present application, it should be understood that the present application may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present application will be thorough and complete and will fully convey its scope to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the present application. The terms "comprising", "including", and the like used herein indicate the presence of the stated features, steps, operations, and/or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein should be interpreted as having meanings consistent with the context of this specification and should not be interpreted in an idealized or overly rigid manner.
It should be understood that although the terms "first", "second", "third", and so on may be used in the present application to describe various kinds of information, the information should not be limited by these terms. The terms are used only to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information and, similarly, second information may also be called first information. Accordingly, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present application, "a plurality of" means two or more, unless otherwise expressly and specifically defined.
Before the technical solution of the present application is described, some technical terms of the art involved in the present application are explained.
A molecular descriptor is a representation of a molecule as a data structure that a computer program can process.
A virtual molecular graph is a molecular descriptor that represents atoms as nodes and the relationships between atoms as edges. Unlike an ordinary molecular graph, which establishes edges according to the bonding information between atoms, a virtual molecular graph establishes edges according to a cutoff radius.
Cutoff radius: for a given atom in a molecule, if edges were established between that atom and all other atoms, the number of edges would be large and the computational cost excessive. Because atoms farther from the given atom influence it less, a cutoff radius is chosen: the atom establishes edges only with atoms whose distance from it is less than the cutoff radius, and its interactions with atoms outside the cutoff radius are ignored.
A coordinate system (frame) is a reference for describing the position and orientation of an object, and is therefore also called a reference frame. For example, the coordinate system may be one created when running a simulation, such as a Cartesian coordinate system.
Coordinates represent the absolute position of an object in a specific coordinate system; mathematically, coordinates are essentially ordered tuples of numbers.
Solvation is a reaction process driven by the interaction between solute molecules and solvent molecules. It is a key step in processes of interest in drug research and development, such as crystal nucleation, chemical reactions, drug metabolism, drug-drug interactions, and drug-receptor interactions. The strength of solvation is usually characterized by the solvation free energy, so the ability to predict the solvation free energy quickly and accurately is extremely important in drug research and development. In the related art, solvation free energy can be predicted by two routes.
The first route uses an empirical force field fitted to experimental and computational data, and applies free energy perturbation or thermodynamic integration within molecular dynamics simulations. In practice, the applicant found that although the solvation free energy obtained in this way deviates little from experimental results, free energy perturbation and thermodynamic integration require long simulations and are computationally expensive, which makes fast prediction of the solvation free energy difficult to achieve.
The second route uses machine learning to build a structure-to-solvation-free-energy model from existing experimental and computational data. This approach can predict the solvation free energy quickly, but in practice the applicant found that, in some cases, the accuracy of the predicted solvation free energy is insufficient for solving solvation-related problems.
After extensive research and analysis, the applicant found two reasons for the low accuracy of predictions made with machine learning methods. First, machine learning methods in the related art represent molecules with low-dimensional descriptors such as SMILES, MACCS, Morgan, and mixed fingerprints, which cannot fully represent the three-dimensional features of a molecule. Second, when describing solvent-solute interactions, the models built by these methods simply concatenate, arrange, or sum the solvent molecular features and the solute molecular features, and do not describe the solvent-solute interaction explicitly within a physically meaningful framework.
To predict molecular solvation free energy both quickly and accurately, the embodiments of the present application start from the representation of molecules and/or the description of solvent-solute interactions, and design a descriptor capable of representing the three-dimensional features of a molecule and/or a machine learning model capable of explicitly describing solvent-solute interactions, so as to improve the accuracy of the predicted solvation free energy.
A data processing method, an apparatus, a model training method, and a free energy prediction method according to embodiments of the present application are described in detail below with reference to Figs. 1 to 19.
Fig. 1 schematically shows an exemplary system architecture to which the data processing method, apparatus, model training method, and free energy prediction method according to embodiments of the present application can be applied. It should be noted that Fig. 1 shows only an example of a system architecture to which the embodiments of the present application can be applied, to help those skilled in the art understand the technical content of the present application; it does not mean that the embodiments cannot be used in other devices, systems, environments, or scenarios.
Referring to Fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with other terminal devices and the server 105 through the network 104 in order to receive or send information, for example to send model training requests and free energy prediction requests and to receive model training results and solvation free energies. Various client applications may be installed on the terminal devices 101, 102, 103, for example drug development applications, material design applications, web browser applications, database applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, 103 include, but are not limited to, desktop computers, tablet computers, laptop computers, and other electronic devices that support functions such as Internet access, modeling, analysis and calculation, and design.
The server 105 may receive model training requests, solvation free energy requests, and the like; adjust model parameters; store the model topology and model parameters; predict solvation free energies; and send solvation free energies to the terminal devices 101, 102, 103. For example, the server 105 may be a back-end management server, a server cluster, or the like.
It should be noted that the numbers of terminal devices, networks, and servers are merely illustrative. Depending on implementation requirements, there may be any number of terminal devices, networks, and clouds.
Fig. 2 schematically shows a flowchart of a data processing method according to an embodiment of the present application.
As shown in Fig. 2, this embodiment provides a data processing method that includes operations S210 to S240, as follows.
In operation S210, data to be processed is obtained. The data to be processed includes the respective attribute information of a plurality of atoms in a target molecule.
In this embodiment, the data to be processed may be a character string. The attribute information may be used to characterize attributes of the target molecule and of at least some of the atoms in the target molecule. The attributes include, but are not limited to, a spatial position attribute, a molecule type, and an atom type. The spatial position attribute may be coordinates in a Cartesian coordinate system or a polar coordinate system. The molecule type may include solute molecule and solvent molecule. The atom type may be determined from the number of protons and/or neutrons in the atom. For example, protium, deuterium, and tritium may be regarded either as the same atom type or as different atom types.
Specifically, the data to be processed may be the three-dimensional conformations of solute molecules and solvent molecules expressed as character strings in x,y,z format. For example, a character string may include the three-dimensional conformation of a molecule and the x, y, and z coordinates of each atom in the molecule.
For example, reading the x,y,z character string of a molecule yields the atom type information and the atom position information of the atoms contained in the molecule.
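The reading step described above can be sketched as follows. This is only an illustration under an assumed input layout (an atom-count line, a comment line, then one `element x y z` line per atom); the application does not specify the exact string layout, and the function name `parse_xyz` and the water coordinates are hypothetical.

```python
def parse_xyz(text):
    """Parse an XYZ-style string into atom types and coordinates.

    Assumed layout (not specified in the application): line 1 holds the
    atom count, line 2 is a comment, and each following line is
    "element x y z".
    """
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    atom_types, positions = [], []
    for line in lines[2:2 + n_atoms]:
        element, x, y, z = line.split()[:4]
        atom_types.append(element)
        positions.append((float(x), float(y), float(z)))
    if len(atom_types) != n_atoms:
        raise ValueError("atom count does not match the header line")
    return atom_types, positions


# Illustrative input only; the coordinates are not taken from the application.
water = """3
water molecule, coordinates in angstroms
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""
atom_types, positions = parse_xyz(water)
```

Here `atom_types` plays the role of the atom type information and `positions` the role of the atom position information.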
In addition, the same coordinate system, in particular the same spatial coordinate system, may be used throughout the calculation of molecular solvation free energy and the drug design process. For example, some or all of the geometric center coordinates of the solute molecule, the atomic coordinates in the solute molecule, the geometric center coordinates of the solvent molecule, and the atomic coordinates in the solvent molecule are coordinates in the same coordinate system. It should be understood that some or all of these coordinates may also be expressed in different coordinate systems, provided that coordinates in the different coordinate systems can be converted into one another.
In operation S220, in response to the respective attribute information of the plurality of atoms, a node set and a node position set are generated for the target molecule. The nodes in the node set each represent an atom of a specific atom type, and the node position set includes the coordinate information of each node in the node set in a specific coordinate system.
In this embodiment, a node set {A_i} and a node position set {(x_i, y_i, z_i)} are generated correspondingly, where i = 1, 2, …, N, N is the number of atoms contained in the molecule, and A denotes the atom type.
In operation S230, a node scalar feature N_s and a node vector feature N_v are generated for the node set, and an edge scalar feature E_s and an edge vector feature E_v are generated for the node set based on the coordinate information of each node in the node position set.
In this embodiment, the target molecule includes N atoms, and each of the nodes in the node set has an F-dimensional feature.
Accordingly, the node scalar feature N_s has dimensions N×F×1, the node vector feature N_v has dimensions N×F×3, the edge scalar feature E_s has dimensions N×1×1, and the edge vector feature E_v has dimensions N×3×1.
Specifically, a feature dimension F is set, for example a value of F not less than 64. Optionally, F is an integer power of 2. The elements of the node set are embedded according to node type (for example, atom type), so that the node set is represented as an N×F×1 matrix that serves as the node scalar feature N_s. At the same time, an N×F×3 matrix of all zeros is initialized to serve as the node vector feature N_v. Here, embedding may mean representing each node randomly as an F-dimensional vector, which is then updated during subsequent model training.
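A minimal sketch of this initialization, assuming NumPy and a plain dictionary as the embedding table. The function name `init_node_features` and the fixed random seed are illustrative; in an actual model the embedding vectors would be trainable parameters rather than fixed random arrays.

```python
import numpy as np


def init_node_features(atom_types, feature_dim=64, seed=0):
    """Return (N_s, N_v) with shapes (N, F, 1) and (N, F, 3)."""
    rng = np.random.default_rng(seed)
    # One random F-dimensional vector per distinct atom type, so that
    # nodes of the same type start from the same embedding.
    table = {t: rng.standard_normal(feature_dim) for t in sorted(set(atom_types))}
    n_s = np.stack([table[t] for t in atom_types])[:, :, None]  # (N, F, 1)
    n_v = np.zeros((len(atom_types), feature_dim, 3))           # (N, F, 3), all zeros
    return n_s, n_v


n_s, n_v = init_node_features(["O", "H", "H"], feature_dim=64)
```

Keying the table by atom type reflects the statement that embedding is done according to node type: the two hydrogen nodes above receive identical initial scalar features.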
Next, for each element i in the node set {A_i}, edges may be established between element i and at least some of the remaining elements j, forming an edge set {(i, j)}, where i = 1, 2, …, N and j ≠ i; the number of edges is denoted E.
For each edge in the edge set, the distance and the position vector between the two nodes of that edge are calculated. The resulting edge distance set is shown in Formula (1).
$\{\, r_{s,ij} \mid r_{s,ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \,\}$  Formula (1)
The edge position vector set is shown in Formula (2).
$\{\, r_{v,ij} \mid r_{v,ij} = (x_i - x_j,\ y_i - y_j,\ z_i - z_j) \,\}$  Formula (2)
Further, the edge distance set is represented as an E×1×1 matrix that serves as the edge scalar feature E_s, and the edge position vector set is represented as an E×3×1 matrix that serves as the edge vector feature E_v.
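The edge construction without a cutoff can be sketched as below. The sketch assumes edges are directed pairs (i, j) over all i ≠ j and stores the features as nested lists with the stated E×1×1 and E×3×1 layouts; the function name `build_edge_features` and the sample coordinates are illustrative.

```python
import math


def build_edge_features(positions):
    """Return (edges, E_s, E_v) for every ordered pair (i, j) with i != j.

    E_s is laid out as E x 1 x 1 and E_v as E x 3 x 1 (nested lists).
    """
    edges, e_s, e_v = [], [], []
    for i, (xi, yi, zi) in enumerate(positions):
        for j, (xj, yj, zj) in enumerate(positions):
            if i == j:
                continue
            r_v = (xi - xj, yi - yj, zi - zj)         # position vector of the edge
            r_s = math.sqrt(sum(c * c for c in r_v))  # Euclidean distance of the edge
            edges.append((i, j))
            e_s.append([[r_s]])
            e_v.append([[c] for c in r_v])
    return edges, e_s, e_v


positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
edges, e_s, e_v = build_edge_features(positions)
```

With N = 3 atoms this produces E = N(N − 1) = 6 directed edges, which illustrates why the number of edges grows quickly when no cutoff is applied.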
In some embodiments, to reduce the computing resources consumed when determining the molecular solvation free energy, an adjacent node set {N_i} that satisfies a cutoff radius requirement may be selected from the node set {A_i}, and the edge set may then be determined for the adjacent node set {N_i}.
Specifically, the method may further include the following operations.
First, after the node set and the node position set for the target molecule are generated in response to the respective attribute information of the plurality of atoms, a cutoff radius r_cut is determined. The cutoff radius r_cut may be a preset value determined from expert experience or simulation results, for example 3 angstroms (Å) or another suitable value.
Then, target nodes whose distance from one another is less than or equal to the cutoff radius r_cut are determined from the node set, yielding a target node set N_i. The target node set N_i is the adjacent node set {N_i}.
Specifically, the cutoff radius r_cut may first be set. For each element i in the node set, the set N_i of adjacent nodes of element i within the cutoff radius is determined, as shown in Formula (3).
$N_i = \{\, j \mid j \neq i,\ \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \le r_{cut} \,\}$  Formula (3)
Edges are established between element i and all nodes j in its adjacent node set, forming the edge set {(i, j)}, where i = 1, 2, …, N and j ∈ N_i; the number of edges is denoted E.
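The cutoff-based selection can be sketched as follows, using a simple O(N²) scan and the "less than or equal to" comparison stated for the target nodes. The function names `neighbor_sets` and `edges_from_neighbors`, and the sample coordinates, are illustrative.

```python
import math


def neighbor_sets(positions, r_cut):
    """Return {i: [j, ...]} with j != i and distance(i, j) <= r_cut."""
    neighbors = {}
    for i, p_i in enumerate(positions):
        neighbors[i] = [
            j for j, p_j in enumerate(positions)
            if j != i and math.dist(p_i, p_j) <= r_cut
        ]
    return neighbors


def edges_from_neighbors(neighbors):
    """Flatten the adjacent node sets into the edge set {(i, j)}."""
    return [(i, j) for i, js in neighbors.items() for j in js]


positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
neighbors = neighbor_sets(positions, r_cut=3.0)
edges = edges_from_neighbors(neighbors)
```

In this toy input the third atom lies more than 3 Å from both others, so it has no adjacent nodes and contributes no edges; only the two directed edges between the first two atoms remain.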
Accordingly, generating the edge scalar feature and the edge vector feature for the node set based on the coordinate information of each node in the node position set includes:
generating the edge scalar feature E_s and the edge vector feature E_v for the target node set N_i based on the coordinate information of the target nodes in the node position set.
In some embodiments, because the adjacent node set {N_i} is selected from the node set {A_i}, the dimensions of the edge scalar feature E_s and the edge vector feature E_v change accordingly.
Specifically, the target node set includes E nodes, each of which has an F-dimensional feature. The node scalar feature N_s has dimensions N×F×1, the node vector feature N_v has dimensions N×F×3, the edge scalar feature E_s has dimensions E×1×1, and the edge vector feature E_v has dimensions E×3×1.
In operation S240, a virtual molecular graph is constructed based on the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s, and the edge vector feature E_v for the node set, so that a molecular feature X of the target molecule can be determined from the virtual molecular graph and the solvation free energy can be determined based at least on the molecular feature X of the target molecule.
In this way, the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s, and the edge vector feature E_v together form the virtual molecular graph. The molecular feature X may be determined using various related techniques, for example a method similar to those that extract a molecular feature X from a molecular graph.
The virtual molecular graph includes scalar and vector information about the atoms as well as scalar and vector information between atoms. It is a general and accurate descriptor that can effectively improve the accuracy of the determined molecular solvation free energy.
In this embodiment, the three-dimensional information of a molecule is represented as a virtual molecular graph. Besides preserving the three-dimensional information of the molecule, this representation can strictly distinguish different molecular conformations, and it describes the molecule more accurately than the two-dimensional descriptors used by machine learning methods in the related art.
Fig. 3 schematically shows a flowchart of a method for determining the molecular feature of a target molecule based on a virtual molecular graph according to an embodiment of the present application.
Referring to Fig. 3, the process of determining the molecular feature of the target molecule based on the virtual molecular graph may include operations S310 to S340.
In operation S310, the node scalar feature N_s and the node vector feature N_v are updated based on the virtual molecular graph to obtain an updated node scalar feature New_N_s and an updated node vector feature New_N_v.
In operation S320, the updated node scalar feature New_N_s and the updated node vector feature New_N_v are taken as a current node scalar feature Now_N_s and a current node vector feature Now_N_v, respectively.
In operation S330, an updated virtual molecular graph is constructed from the current node scalar feature Now_N_s, the current node vector feature Now_N_v, the edge scalar feature E_s, and the edge vector feature E_v.
In operation S340, the updated node scalar feature New_N_s and the updated node vector feature New_N_v are updated again based on the updated virtual molecular graph.
Operations S320 to S340 are repeated until a specified number of iterations num_conv is reached, and the updated node scalar feature New_N_s obtained at that point is taken as the molecular feature X.
Specifically, a number of convolution layers num_conv may be set. For an input virtual molecular graph (of a solvent or solute molecule), E_s and E_v are kept unchanged while N_s and N_v are replaced by New_N_s and New_N_v, respectively, and this is iterated num_conv times. The last dimension of the New_N_s obtained in the num_conv-th iteration is then squeezed out, giving an N×F matrix that serves as the molecular feature X.
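The iteration can be sketched as a short loop. The sketch assumes NumPy arrays and takes the update step as a function argument; `toy_update` is a hypothetical stand-in used only to make the loop runnable and is not the update block of Fig. 4.

```python
import numpy as np


def molecular_feature(n_s, n_v, e_s, e_v, update_fn, num_conv):
    """Apply update_fn num_conv times and return X with shape (N, F)."""
    for _ in range(num_conv):
        # E_s and E_v stay fixed; only the node features are replaced.
        n_s, n_v = update_fn(n_s, n_v, e_s, e_v)
    # Squeeze out the trailing singleton dimension: (N, F, 1) -> (N, F).
    return np.squeeze(n_s, axis=-1)


def toy_update(n_s, n_v, e_s, e_v):
    """Hypothetical placeholder update: adds 1 to every scalar feature."""
    return n_s + 1.0, n_v


N, F, E = 3, 4, 6
x = molecular_feature(
    np.zeros((N, F, 1)), np.zeros((N, F, 3)),
    np.ones((E, 1, 1)), np.ones((E, 3, 1)),
    toy_update, num_conv=3,
)
```

Passing the update step as a parameter keeps the loop independent of how a single update is implemented, which matches the statement that only N_s and N_v change between iterations.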
The updated node scalar feature New_N_s and the updated node vector feature New_N_v are described below by way of example.
Fig. 4 schematically shows a logic diagram for updating the node scalar feature and the node vector feature based on the virtual molecular graph according to an embodiment of the present application.
Referring to Fig. 4, the update process may be composed of four basic operations: a matrix linear transformation (Linear), an activation operation (such as ReLU), an element-wise matrix multiplication (such as the Hadamard product), and a matrix summation (Sum). Mature implementations of these four basic operations exist in programming frameworks such as PyTorch.
The matrix linear transformation (Linear) maps the input into a feature space, extracting and retaining the useful information in the input.
The activation operation (such as ReLU) is a nonlinear mapping that gives the network nonlinear expressive power.
The element-wise matrix multiplication multiplies the corresponding elements of two matrices of the same dimensions and serves to scale features.
The matrix summation (Sum) adds the corresponding elements of two matrices of the same dimensions and serves to fuse features.
The inner product operation (Inner) takes the inner product of two vectors, converting vector information into a scalar.
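The basic operations listed above can be illustrated on small arrays. This is a NumPy sketch rather than framework code; the helper names `linear` and `relu`, the sample matrices, and the random weight are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(x, weight, bias):
    """Linear transformation along the last axis: x @ W + b."""
    return x @ weight + bias


def relu(x):
    """ReLU activation: element-wise max(x, 0)."""
    return np.maximum(x, 0.0)


a = np.array([[1.0, -2.0], [3.0, 4.0]])
b = np.array([[0.5, 2.0], [-1.0, 0.25]])

hadamard = a * b                  # element-wise (Hadamard) product
summed = a + b                    # element-wise sum
activated = relu(a)               # negative entries become 0
projected = linear(a, rng.standard_normal((2, 3)), np.zeros(3))  # (2, 2) -> (2, 3)
inner = np.sum(a * b, axis=-1)    # row-wise inner product: vectors -> scalars
```

In PyTorch, the corresponding building blocks would typically be `torch.nn.Linear`, `torch.nn.ReLU`, the `*` and `+` operators on tensors, and a sum over the product of two tensors.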
Specifically, updating the node scalar feature N_s and the node vector feature N_v based on the virtual molecular graph to obtain the updated node scalar feature New_N_s and the updated node vector feature New_N_v may include the following process.
First, a first linear operation, a second activation function, and a second linear operation are applied in sequence to the node scalar feature N_s to obtain a first sub-processing result Q1, and a third linear operation is applied to the edge scalar feature E_s to obtain a second sub-processing result Q2. Obtaining Q1 extracts the useful information in N_s and maps it nonlinearly into the feature space. Obtaining Q2 extracts the useful information in E_s and maps it linearly into the feature space.
Then, a first element-wise matrix multiplication is applied to the first sub-processing result Q1 and the second sub-processing result Q2 to obtain a third sub-processing result Q3. This operation uses the scalar features of the edges to scale the node features, fusing edge information into the nodes and making the node features more expressive.
Next, a second element-wise matrix multiplication is applied to the third sub-processing result Q3 and the node vector feature N_v to obtain a fourth sub-processing result Q4, and a third element-wise matrix multiplication is applied to the third sub-processing result Q3 and the edge vector feature E_v to obtain a fifth sub-processing result Q5. Obtaining Q4 uses the vector features of the nodes to scale the node features, fusing vector information into the nodes and making the node features more expressive. Obtaining Q5 uses the vector features of the edges to scale the node features, fusing the vector information of the edges into the nodes and making the node features more expressive.
Then, a first matrix summation is applied to the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain a sixth sub-processing result Q6. This operation fuses the vector features of the nodes and the edges.
Next, the sixth sub-processing result Q6 is passed through a fourth linear operation and a fifth linear operation, respectively, to obtain a seventh sub-processing result Q7 and an eighth sub-processing result Q8. The seventh sub-processing result Q7 is a vector feature used to interact with scalar features. The eighth sub-processing result Q8 is a vector feature used to interact with vector features.
Then, the seventh sub-processing result Q7 is passed in sequence through a sixth linear operation, the second activation function, and a seventh linear operation to obtain a ninth sub-processing result Q9; an inner product operation (Inner) is performed based on the third sub-processing result Q3, the seventh sub-processing result Q7, and the eighth sub-processing result Q8 to obtain a tenth sub-processing result Q10; and a fifth element-wise matrix multiplication is applied to the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N_v. Obtaining Q9 updates scalar features with vector information. Obtaining Q10 converts vector information into scalar information. Obtaining New_N_v updates vector features with scalar information.
Next, a fourth element-wise matrix multiplication is applied to the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain an eleventh sub-processing result Q11. This operation scales the scalar features with the scalar information obtained from the vector inner product.
Then, a second matrix summation is applied to the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s.
本实施例中通过边的标量、矢量特征与节点的标量、矢量特征相互作用,对节点特征进行更新,输出新的节点的标量、矢量特征。具体地,将边信息融合进了节点信 息,形成了新的特征,提升了特征NewN s与NewN v对于结构的表示能力,使得该模型更容易提取到与溶剂化自由能相关的信息,最终使得预测结果更为准确。需要说明的是,基于更新虚拟分子图更新更新节点标量特征New_N s和更新节点矢量特征New_N v的逻辑与图4所示的逻辑相似,在此不再详述。 In this embodiment, the scalar and vector features of the edges interact with the scalar and vector features of the nodes to update the node features and output new scalar and vector features of the nodes. Specifically, the edge information is fused into the node information to form a new feature, which improves the representation ability of the features NewN s and NewN v for the structure, making it easier for the model to extract information related to the free energy of solvation, and finally makes The prediction results are more accurate. It should be noted that the logic of updating node scalar feature New_N s and updating node vector feature New_N v based on updating virtual molecular graph is similar to the logic shown in FIG. 4 , and will not be described in detail here.
通过如上所示的方式,可以以虚拟分子图作为描述符表示分子,该虚拟分子图包括较完整的分子三维特征,有助于提升确定的分子溶剂化自由能的准确度。Through the method shown above, the molecule can be represented by a virtual molecular graph as a descriptor, and the virtual molecular graph includes relatively complete three-dimensional characteristics of the molecule, which helps to improve the accuracy of the determined free energy of solvation of the molecule.
本实施例中,在卷积中除了采用标量特征之外,也使用了矢量特征,相比于相关技术中只使用标量特征的方法使得提取分子特征更容易,也更准确。In this embodiment, in addition to scalar features, vector features are also used in convolution, which makes extracting molecular features easier and more accurate than the method of using only scalar features in the related art.
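The roles the description assigns to the inner product (vector information into scalar information) and to the scalar-times-vector product (scalar information updating vector features) can be checked numerically. The following is a minimal NumPy sketch, not the layer of FIG. 4: the array shapes, the `random_rotation` helper, and all variable names are assumptions made for illustration.

```python
import numpy as np

def random_rotation(rng):
    """Return a random 3x3 rotation matrix (orthogonal, determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs; q stays orthogonal
    if np.linalg.det(q) < 0:         # turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
n_atoms, F = 5, 4                      # assumed sizes
a = rng.normal(size=(n_atoms, 3, F))   # stand-in for a vector feature such as Q7
b = rng.normal(size=(n_atoms, 3, F))   # stand-in for a vector feature such as Q8
s = rng.normal(size=(n_atoms, F))      # stand-in for a scalar feature such as Q9

R = random_rotation(rng)
rotate = lambda v: np.einsum("ij,njf->nif", R, v)   # rotate the spatial axis

# Inner product over the spatial axis: vectors in, rotation-invariant scalars out.
inner = lambda u, v: np.sum(u * v, axis=1)
invariant = np.allclose(inner(a, b), inner(rotate(a), rotate(b)))

# Scaling a vector feature channel-wise by scalars commutes with rotation (equivariance).
scale = lambda sc, v: sc[:, None, :] * v
equivariant = np.allclose(rotate(scale(s, b)), scale(s, rotate(b)))

print(invariant, equivariant)
```

Both checks hold for any rotation. This is why scalar features derived from inner products do not change when the molecule is rotated, while the updated vector features rotate together with it.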
In some embodiments, a solute molecular feature for the solute molecule and a solvent molecular feature for the solvent molecule can each be determined in the manner described above. It should be noted that one solute molecule may interact with multiple adjacent solvent molecules. The interaction between one solute molecule and one solvent molecule can be determined first, and the molecular solvation free energy relative to the multiple solvent molecules determined afterwards. Alternatively, the molecular solvation free energy of one solute molecule relative to multiple adjacent solvent molecules can be determined directly.
Specifically, the target molecule may be a solute molecule and/or a solvent molecule. The data to be processed may include molecular attribute information; for example, the data to be processed may be data of a solute molecule and/or data of a solvent molecule.
Correspondingly, the above method may further include the following operation: determining the solute molecular feature of the solute molecule and the solvent molecular feature of at least one solvent molecule associated with the solute molecule, so that the solvation free energy can be determined based on these two features. It should be noted that the solute molecular feature and the solvent molecular feature may have the same feature dimension.
FIG. 5 schematically shows a flowchart of another data processing method according to an embodiment of the present application.
Referring to FIG. 5, the above method may further include operations S510 to S520.
In operation S510, after the solute molecular feature of the solute molecule and the solvent molecular feature of at least one solvent molecule associated with the solute molecule have been determined, the matrix product of the solvent molecular feature and the solute molecular feature is taken as the solvation matrix between the solvent molecule and the solute molecule.
In operation S520, the solvation feature is determined based on the solvation matrix.
The related art does not describe the solvent-solute interaction explicitly. This embodiment describes it explicitly as the matrix product of the solute molecular feature and the solvent molecular feature, which helps improve the accuracy of the determined solvation feature and, in turn, of the determined molecular solvation free energy.
In some embodiments, determining the solvation feature based on the solvation matrix may include the following operations.
First, the solvent feature corresponding to preset solute weights and the solute feature corresponding to preset solvent weights are each calculated based on the solvation matrix.
Then, the solvent feature and the solute feature are each converted into a one-dimensional row vector of F elements.
Next, the two row vectors are concatenated to obtain the solvation feature.
For example, the solvent molecular feature X_M and the solute molecular feature X_N are first read. X_M is an M×F matrix and X_N is an N×F matrix, where M and N are the numbers of atoms in the solvent molecule and the solute molecule, respectively.
The matrix product of the solvent molecular feature and the solute molecular feature is then calculated, giving the M×N solvation matrix X_MN:
$$X_{MN} = X_M \cdot X_N^{\mathsf{T}}$$
The solvent feature under the solute weights, X'_M, and the solute feature under the solvent weights, X'_N, are then calculated:
$$X'_M = X_{MN} \cdot X_N, \qquad X'_N = X_{MN}^{\mathsf{T}} \cdot X_M$$
X'_M and X'_N are each reduced by a weighted sum over their array elements according to the array-element weights, which converts each of them into a one-dimensional row vector of F elements. Finally, the two row vectors are concatenated into a single 2F-dimensional row vector I_MN, which is the solvation feature. For example, if the row vector obtained from X'_M is (1, 2, 3, …, F) and the row vector obtained from X'_N is (1, 2, 3, …, F), then I_MN = (1, 2, 3, …, F, 1, 2, 3, …, F). The array-element weights may be determined based on an attention mechanism.
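The shapes involved in this example can be traced with a short NumPy sketch. It is an illustration under stated assumptions, not the applicant's implementation: the solvation matrix is taken to be X_M multiplied by the transpose of X_N (the form implied by the stated M×F and N×F dimensions), the solute feature under the solvent weights is taken to be the symmetric counterpart of X'_M, and uniform weights stand in for the array-element weights.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, F = 3, 5, 4                  # assumed atom counts and feature size
X_M = rng.normal(size=(M, F))      # solvent molecular feature, M x F
X_N = rng.normal(size=(N, F))      # solute molecular feature,  N x F

# Assumed form of the solvation matrix: the product of X_M and X_N whose
# shape (M x N) is compatible with X'_M = X_MN . X_N.
X_MN = X_M @ X_N.T                 # M x N

Xp_M = X_MN @ X_N                  # solvent feature under solute weights, M x F
Xp_N = X_MN.T @ X_M                # assumed symmetric counterpart, N x F

# Placeholder array-element weights (uniform); the text allows attention weights instead.
w_M = np.full(M, 1.0 / M)
w_N = np.full(N, 1.0 / N)

row_M = w_M @ Xp_M                 # F-element row vector
row_N = w_N @ Xp_N                 # F-element row vector
I_MN = np.concatenate([row_M, row_N])   # 2F-element solvation feature

print(X_MN.shape, Xp_M.shape, Xp_N.shape, I_MN.shape)
```

Whatever the atom counts M and N are, the resulting I_MN always has 2F elements, which is what lets a fixed-size fully connected network consume it.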
In some embodiments, converting the solvent feature and the solute feature each into a one-dimensional row vector of F elements may include the following operations.
First, a first array-element weight is determined for each array element of the solvent feature that corresponds to an atom of the solvent molecule, and a second array-element weight is determined for each array element of the solute feature that corresponds to an atom of the solute molecule.
Then, a weighted sum of the solvent feature is computed based on the first array-element weights to obtain a one-dimensional first row vector of F elements, and a weighted sum of the solute feature is computed based on the second array-element weights to obtain a one-dimensional second row vector of F elements.
For example, an attention mechanism is used to compute an attention coefficient for each atom in X'_M and X'_N; X'_M and X'_N are summed with these coefficients as weights, converting each into a one-dimensional row vector of F elements; and the two row vectors are finally concatenated into a 2F-dimensional row vector representing the solvation feature I_MN.
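One common way to realize such attention coefficients is a softmax over per-atom scores. The sketch below assumes exactly that; the description does not specify the scoring function, so the `score_vector` parameter and the dot-product scoring are hypothetical choices made for illustration only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention_pool(X, score_vector):
    """Pool an (atoms x F) matrix into an F-element row vector.

    score_vector is a hypothetical learned parameter of length F; each atom's
    attention coefficient is the softmax of its dot product with that vector.
    """
    coeffs = softmax(X @ score_vector)   # one coefficient per atom, sums to 1
    return coeffs @ X, coeffs

rng = np.random.default_rng(1)
F = 4
Xp_M = rng.normal(size=(3, F))     # stand-in for X'_M (3 solvent atoms)
Xp_N = rng.normal(size=(5, F))     # stand-in for X'_N (5 solute atoms)
a_M = rng.normal(size=F)           # hypothetical scoring parameters
a_N = rng.normal(size=F)

row_M, c_M = attention_pool(Xp_M, a_M)
row_N, c_N = attention_pool(Xp_N, a_N)
I_MN = np.concatenate([row_M, row_N])   # 2F-element solvation feature
print(I_MN.shape)
```

Because the coefficients are normalized per molecule, the pooled row vectors do not grow with the number of atoms.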
In some embodiments, the molecular solvation free energy can be obtained from the input solvation feature I_MN by applying operations such as weighted summation and bias addition to its elements. For example, processing the solvation feature I_MN with a fully connected network yields the molecular solvation free energy.
In this embodiment, the solvation attention network applies a matrix product to the solvent and solute features followed by attention-weighted summation. This reflects the physical meaning of the solute weights and solvent weights, describes solvation explicitly, and improves the accuracy of solvation free energy prediction.
Another aspect of the present application provides a method for training a solvation free energy prediction model.
In this embodiment, the method for training the solvation free energy prediction model may include: inputting a virtual molecular graph determined by the method described above into the solvation free energy prediction model, and adjusting the model parameters until the loss function converges to obtain a trained solvation free energy prediction model. The virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the label information.
In some embodiments, the solvation free energy prediction model may include at least one of the following networks.
A molecular encoding network, configured to convert each item of training data in a training data set that includes solute molecule data and/or solvent molecule data into a virtual molecular graph for the solute molecule data and/or the solvent molecule data, where the training data carries solvation free energy label information.
An equivariant graph convolutional network, configured to convert the virtual molecular graph into a solute molecular feature and/or a solvent molecular feature.
A solvation network, configured to convert the solute molecular feature and the solvent molecular feature into a solvation feature.
A fully connected network, configured to convert the solvation feature into a solvation free energy.
Correspondingly, the training method may include: inputting the training data into the molecular encoding network and adjusting the model parameters (such as network parameters) until the loss function converges, where the inputs of the loss function include the solvation free energy output by the fully connected network and the solvation free energy in the label information.
In some embodiments, the solvation network includes a self-attention network. The self-attention network is configured to determine the first array-element weights of the array elements of the solvent feature that correspond to atoms of the solvent molecule, and the second array-element weights of the array elements of the solute feature that correspond to atoms of the solute molecule, so that the array elements of the solvent feature are fused according to the first array-element weights and the array elements of the solute feature are fused according to the second array-element weights. The solvent feature and the solute feature are determined based on the solvation matrix, and the solvation matrix is determined based on the solute molecular feature and the solvent molecular feature. For details, refer to the relevant part of the data processing method, which is not repeated here.
In some embodiments, the above method may further include the following operations.
First, the training data set is divided into a specified number of sub-training data sets. The specified number may be determined, for example, based on expert experience or on the accuracy of the predicted molecular solvation free energy, and may be 3, 5, 8, 10, 13, 18, 20, and so on.
Then, as many solvation free energy prediction models are constructed as there are sub-training data sets. Multiple models can thus be trained, and either the model that predicts the solvation free energy most accurately is selected, or the mean of the outputs of the multiple models is taken as the final prediction.
Correspondingly, inputting the training data into the molecular encoding network includes: inputting the training data of each sub-training data set into the molecular encoding network of a different solvation free energy prediction model, so that the different models are trained separately and a number of trained solvation free energy prediction models equal to the specified number is obtained.
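The splitting step can be sketched in plain Python. The helper names are hypothetical. Pairing each model with one held-out subset follows the usual cross-validation convention, which is an assumption here because the text does not spell out the pairing.

```python
import random

def split_into_subsets(num_records, k, seed=0):
    """Shuffle record indices and split them into k subsets of near-equal size."""
    indices = list(range(num_records))
    random.Random(seed).shuffle(indices)
    # Subset i takes every k-th index starting at i, so sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def cross_validation_folds(subsets):
    """Yield (train_indices, validation_indices), holding out one subset per model."""
    for i, held_out in enumerate(subsets):
        train = [idx for j, s in enumerate(subsets) if j != i for idx in s]
        yield train, held_out

subsets = split_into_subsets(num_records=23, k=5)
folds = list(cross_validation_folds(subsets))
print([len(s) for s in subsets])
```

Each of the k models would then be trained on its own `train` indices and monitored on its own held-out subset.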
FIG. 6 schematically shows a flowchart of a method for training a solvation free energy prediction model according to an embodiment of the present application.
Referring to FIG. 6, a number of data records are collected, each giving a reference value of the solvation free energy of a solute in a solvent obtained by experimental measurement or theoretical calculation. For each record, the three-dimensional conformations of the solute molecule and the solvent molecule are stored in the data set as strings in x, y, z format (that is, each conformation is stored as the x, y, z coordinates of its atoms), and the corresponding solvation free energy is stored in the same data set as a floating-point number.
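A record of this kind might look as follows. This is a sketch only: the `Record` class and `parse_xyz` function are hypothetical names, the string layout is assumed to follow the common XYZ file convention (atom count, comment line, then one `element x y z` line per atom), and the coordinates and label are made-up illustrative values.

```python
from dataclasses import dataclass

@dataclass
class Record:
    solute_xyz: str          # solute conformation as an x, y, z string
    solvent_xyz: str         # solvent conformation as an x, y, z string
    free_energy: float       # solvation free energy label (unit is an assumption: kJ/mol)

def parse_xyz(text):
    """Parse an XYZ-style string into element symbols and coordinates.

    Assumes the common layout: atom count, comment line, then 'El x y z' lines.
    """
    lines = text.strip().splitlines()
    count = int(lines[0])
    symbols, coords = [], []
    for line in lines[2:2 + count]:
        el, x, y, z = line.split()[:4]
        symbols.append(el)
        coords.append((float(x), float(y), float(z)))
    return symbols, coords

water = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469"""

record = Record(solute_xyz=water, solvent_xyz=water, free_energy=-1.0)  # made-up label
symbols, coords = parse_xyz(record.solute_xyz)
print(symbols, len(coords))
```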
A solvation free energy prediction model (for example, an equivariant graph neural network model) is initialized. The model consists of four parts: a molecular encoding network that converts a molecule's x, y, z string into a virtual molecular graph; an equivariant graph convolutional network that converts the virtual molecular graph into molecular features; a solvation attention network that converts the solute and solvent molecular features into a solvation feature through a matrix product and attention aggregation (the matrix product of the solute and solvent features yields atom-level features, and attention aggregation pools these atom-level features into molecule-level features through an attention mechanism); and a fully connected network that converts the solvation feature into a solvation free energy.
A loss function is set (for example, a mean squared error loss, an absolute error loss, or a Huber loss). The data set is divided into ten equal parts and the model is trained with ten-fold cross-validation until the loss on the validation set no longer decreases, that is, until the loss function converges: the difference between two successive loss values is smaller than a preset value, which may be 0.0005, 0.001, 0.002, and so on. Ten equivariant graph neural network models are obtained. It should be noted that five-fold cross-validation or, more generally, k-fold cross-validation may also be used.
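The stopping rule described here (two successive validation losses differing by less than a preset value) can be sketched as follows. The `step` callback is a hypothetical stand-in for one training epoch that returns the validation loss, and the toy loss sequence is invented for the demonstration.

```python
def has_converged(losses, tol=0.001):
    """Return True once two successive validation losses differ by less than tol."""
    return len(losses) >= 2 and abs(losses[-1] - losses[-2]) < tol

def train_until_converged(step, max_epochs=1000, tol=0.001):
    """Call step() once per epoch; step is a hypothetical function returning the validation loss."""
    history = []
    for _ in range(max_epochs):
        history.append(step())
        if has_converged(history, tol):
            break
    return history

# Toy stand-in for a training epoch: the loss halves each time it is called.
state = {"loss": 1.0}
def toy_step():
    state["loss"] *= 0.5
    return state["loss"]

history = train_until_converged(toy_step, tol=0.001)
print(len(history), history[-1])
```

A practical training loop would usually add a patience window so that a single flat epoch does not end training early; that refinement is not part of the description.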
The loss function L is calculated as shown in formula (4).
Figure PCTCN2021140134-appb-000009
where G_i,pred is the predicted solvation free energy, G_i,true is the true solvation free energy, and n is the number of solvent-solute pairs used in training.
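The description names three candidate losses (mean squared error, absolute error, and Huber). The sketch below implements all three over n solvent-solute pairs without assuming which one formula (4) denotes. The function names, the Huber threshold `delta`, and the sample values are illustrative assumptions.

```python
def mse_loss(pred, true):
    """Mean squared error over n solvent-solute pairs."""
    n = len(pred)
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / n

def mae_loss(pred, true):
    """Mean absolute error over n solvent-solute pairs."""
    n = len(pred)
    return sum(abs(p - t) for p, t in zip(pred, true)) / n

def huber_loss(pred, true, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond delta."""
    total = 0.0
    for p, t in zip(pred, true):
        e = abs(p - t)
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(pred)

G_pred = [-10.0, -22.0, -5.0]   # made-up predictions
G_true = [-11.0, -20.0, -5.0]   # made-up reference values
print(mse_loss(G_pred, G_true), mae_loss(G_pred, G_true), huber_loss(G_pred, G_true))
```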
The topology of the solvation free energy prediction model is described below by way of example.
FIG. 7 schematically shows the structure of an equivariant graph convolutional network according to an embodiment of the present application.
Referring to FIG. 7, the equivariant graph convolutional network includes num_conv convolutional layers, where num_conv is a specified number of iterations and the output of each convolutional layer serves as part of the input of the next convolutional layer.
The inputs of the first convolutional layer are the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s, and the edge vector feature E_v. Its outputs are the updated node scalar feature New_N_s and the updated node vector feature New_N_v.
The inputs of each subsequent convolutional layer (the second, third, and fourth convolutional layers, and so on) are the updated node scalar feature New_N_s, the updated node vector feature New_N_v, the edge scalar feature E_s, and the edge vector feature E_v. Its outputs are again an updated node scalar feature New_N_s and an updated node vector feature New_N_v.
The equivariant graph convolutional network thus converts atomic features into molecular features.
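The data flow between layers (node features replaced by each layer, edge features passed unchanged to every layer) can be sketched as a simple loop. The `toy_layer` below is a placeholder and deliberately not the update rule of FIG. 4. The array shapes, including edge features already aggregated per node, are assumptions made so that the example runs.

```python
import numpy as np

def run_conv_stack(layers, N_s, N_v, E_s, E_v):
    """Apply the layers in order; node features are updated, edge features are reused."""
    for layer in layers:
        N_s, N_v = layer(N_s, N_v, E_s, E_v)
    return N_s, N_v

def toy_layer(N_s, N_v, E_s, E_v):
    """Placeholder layer (NOT the patented update): mixes in edge features additively."""
    return N_s + E_s, N_v + E_v

num_conv = 3
n_atoms, F = 4, 8                       # assumed sizes
N_s = np.zeros((n_atoms, F))            # node scalar feature
N_v = np.zeros((n_atoms, 3, F))         # node vector feature
E_s = np.ones((n_atoms, F))             # edge scalar feature, assumed already aggregated per node
E_v = np.ones((n_atoms, 3, F))          # edge vector feature, assumed already aggregated per node

New_N_s, New_N_v = run_conv_stack([toy_layer] * num_conv, N_s, N_v, E_s, E_v)
print(New_N_s.shape, New_N_v.shape)
```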
Specifically, each convolutional layer can implement the feature transformation as follows.
Referring to FIG. 7 together with FIG. 4, the equivariant graph convolutional network is composed of four basic operations: Linear, ReLU, Hadamard, and Sum. Linear is a linear matrix transformation, ReLU is an activation operation, Hadamard is element-wise matrix multiplication, and Sum is matrix addition.
Specifically, the convolutional layer is configured to perform the following operations. It should be noted that the first linear operation may be implemented by a first linear layer and the second linear operation by a second linear layer, and that the first and second linear layers may be the same layer or different layers.
The first linear operation, the first activation function, and the second linear operation are applied in sequence to the node scalar feature N_s to obtain the first sub-processing result Q1, and the third linear operation is applied to the edge scalar feature E_s to obtain the second sub-processing result Q2.
A first element-wise multiplication is performed on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain the third sub-processing result Q3.
A second element-wise multiplication is performed based on the third sub-processing result Q3 and the node vector feature N_v to obtain the fourth sub-processing result Q4, and a third element-wise multiplication is performed based on the third sub-processing result Q3 and the edge vector feature E_v to obtain the fifth sub-processing result Q5.
A first matrix addition is performed on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain the sixth sub-processing result Q6.
The sixth sub-processing result Q6 is passed through the fourth linear operation and the fifth linear operation, respectively, to obtain the seventh sub-processing result Q7 and the eighth sub-processing result Q8.
The seventh sub-processing result Q7 is passed in sequence through the sixth linear operation, the second activation function, and the seventh linear operation to obtain the ninth sub-processing result Q9, and an inner product operation (Inner) is performed based on the third sub-processing result Q3, the seventh sub-processing result Q7, and the eighth sub-processing result Q8 to obtain the tenth sub-processing result Q10.
A fourth element-wise multiplication is performed on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain the eleventh sub-processing result Q11.
A second matrix addition is performed on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s.
A fifth element-wise multiplication is performed on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N_v.
For the functions and effects of the above operations, refer to the relevant content of the embodiments above; they are not repeated here.
In some embodiments, each atom of a solute molecule or solvent molecule in the training data has an F-dimensional feature.
FIG. 8 schematically shows the structure of a fully connected network according to an embodiment of the present application.
Referring to FIG. 8, the fully connected network may include, connected in sequence, a first linear layer (e.g., Linear), a first activation function layer (e.g., ReLU), a second linear layer, a second activation function layer, and a third linear layer. The output dimension of the first and second linear layers is F, and the output dimension of the third linear layer is 1. The input of the first linear layer is a 2F-dimensional row vector representing the solvation feature I_MN, which the fully connected network converts into the molecular solvation free energy.
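The layer sequence of FIG. 8 corresponds to the following forward pass. This is a NumPy sketch with randomly initialized, untrained weights; the initialization scheme, the helper names, and the random input are assumptions, so the printed number carries no physical meaning.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_linear(rng, in_dim, out_dim):
    """Random weights and zero bias for one linear layer (initialisation scheme is an assumption)."""
    return rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim)), np.zeros(out_dim)

def fully_connected(I_MN, params):
    """Linear(2F->F) -> ReLU -> Linear(F->F) -> ReLU -> Linear(F->1)."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h = relu(I_MN @ W1 + b1)
    h = relu(h @ W2 + b2)
    return (h @ W3 + b3)[0]        # single scalar: predicted solvation free energy

F = 128
rng = np.random.default_rng(0)
params = [init_linear(rng, 2 * F, F), init_linear(rng, F, F), init_linear(rng, F, 1)]
I_MN = rng.normal(size=2 * F)      # stand-in solvation feature
G_pred = fully_connected(I_MN, params)
print(float(G_pred))
```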
Another aspect of the present application provides a method for determining a solvation free energy.
In this embodiment, the method for determining the solvation free energy may include the following operation: a virtual molecular graph is processed with a solvation free energy prediction model trained by the method described above to obtain the solvation free energy for that virtual molecular graph. The virtual molecular graph is generated from the data to be processed, the data to be processed includes attribute information of each of a plurality of atoms in a target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.
In some embodiments, the solvation free energy prediction model may include at least one of the following networks.
A molecular encoding network, configured to convert each item of training data in a training data set that includes solute molecule data and/or solvent molecule data into a virtual molecular graph for the solute molecule data and/or the solvent molecule data, where the training data carries solvation free energy label information.
An equivariant graph convolutional network, configured to convert the virtual molecular graph into a solute molecular feature and/or a solvent molecular feature.
A solvation network, configured to convert the solute molecular feature and the solvent molecular feature into a solvation feature.
A fully connected network, configured to convert the solvation feature into a solvation free energy.
Correspondingly, the method may include the following operation: the data to be processed is processed with the trained solvation free energy prediction model to obtain the solvation free energy for that data, where the data to be processed includes attribute information of each of a plurality of atoms in the target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.
FIG. 9 schematically shows a flowchart of a method for determining a solvation free energy according to an embodiment of the present application.
Referring to FIG. 9, network parameters can be input into the solvation free energy prediction model so that the trained neural network can process the solvent conformation and the solute conformation. Specifically, the solvent conformation (which may be represented, for example, as an xyz string for the solvent molecule) and the solute conformation (which may be represented, for example, as an xyz string for the solute molecule) can be used as the input of the molecular encoding network.
In some embodiments, the above method may further include the following operations.
First, the virtual molecular graph or the data to be processed is input into each of a specified number of different trained solvation free energy prediction models to obtain the specified number of solvation free energies. The specified number may be the number of trained solvation free energy prediction models.
Then, the weighted average of the specified number of solvation free energies is taken as the solvation free energy corresponding to the data to be processed.
For example, the solvent molecule and the solute molecule to be predicted are each input in x, y, z format into the ten models in turn to obtain ten predicted solvation free energies, and their mean is taken as the final prediction.
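The averaging step can be sketched in a few lines. The ten "models" below are hypothetical stand-ins that return fixed numbers rather than real predictions, and `ensemble_predict` is an illustrative name. Uniform weights reproduce the plain mean of the example, while non-uniform weights give the weighted average mentioned in the preceding operation.

```python
def ensemble_predict(models, solvent_xyz, solute_xyz, weights=None):
    """Average the predictions of several models; uniform weights give the plain mean."""
    preds = [m(solvent_xyz, solute_xyz) for m in models]
    if weights is None:
        weights = [1.0] * len(preds)
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

# Ten hypothetical stand-in "models" that return fixed numbers instead of real predictions.
models = [lambda solvent, solute, v=v: float(v) for v in range(-14, -4)]

mean_pred = ensemble_predict(models, "solvent xyz string", "solute xyz string")
weighted = ensemble_predict(models, "solvent xyz string", "solute xyz string",
                            weights=[2.0] + [1.0] * 9)
print(mean_pred, weighted)
```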
In a specific embodiment, a total of 48,776 molecular conformations of 11,940 molecules are first collected (molecular conformations can be collected, for example, from online databases such as PubChem). Fifteen molecules that each have only a single conformation are selected as solvents: water, tetrahydrofuran, chloroform, dichloromethane, dioxane, toluene, methanol, acetone, n-heptane, cyclohexane, diethyl ether, acetonitrile, dimethylformamide, dimethyl sulfoxide, and methyl tert-butyl ether. Single-conformation molecules are chosen here only for computational convenience; multi-conformation molecules can also be chosen, but the results calculated for the different conformations then need to be statistically averaged. The number 15 is only an example, and more or fewer solvents may be used. COSMOtherm is used to calculate the solvation free energies of the 48,776 conformations in each of the 15 solvents, giving 731,640 solvation free energy records. The 48,776 conformations are stored in the data set in x, y, z format, and the 731,640 solvation free energy values corresponding to the solute and solvent conformations are stored in the data set as floating-point numbers. The 48,776 systems with water as the solvent are selected as the test set, and the other 682,864 systems form the training set.
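The record counts quoted in this embodiment follow directly from pairing every conformation with every solvent, as this short check shows. The variable names are illustrative only.

```python
n_conformations = 48_776
solvents = ["water", "tetrahydrofuran", "chloroform", "dichloromethane", "dioxane",
            "toluene", "methanol", "acetone", "n-heptane", "cyclohexane",
            "diethyl ether", "acetonitrile", "dimethylformamide",
            "dimethyl sulfoxide", "methyl tert-butyl ether"]

n_systems = n_conformations * len(solvents)   # one record per (conformation, solvent) pair
n_test = n_conformations                      # every conformation paired with water
n_train = n_systems - n_test                  # the remaining 14 solvents

print(len(solvents), n_systems, n_test, n_train)   # 15 731640 48776 682864
```

Because the test set holds out water entirely, the reported test error measures how well the model transfers to a solvent it never saw during training.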
Then, the feature dimension F is set to 128, the cutoff radius rcut is set to the value given in Figure PCTCN2021140134-appb-000010, the number of convolutional layers num_conv is set to 3, and the equivariant graph neural network model is initialized.
Next, the loss function is set to the mean squared error. The 682,864 systems of the training set are divided into ten parts, different random number seeds are set, and the models are trained by ten-fold cross-validation until the loss function on the validation set no longer decreases. Ten equivariant graph neural network models are obtained, and the network parameters of the ten models are output.
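The ten-fold procedure described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: `k_fold_splits` and `train_until_no_improvement` are hypothetical names, the seed is assumed to control how the systems are shuffled into parts, and the `patience` parameter is an assumption about how "no longer decreases" is detected.

```python
import random
from typing import Callable, Iterator, List, Tuple


def k_fold_splits(num_samples: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k parts of nearly equal size
    for held_out in range(k):
        validation = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, validation


def train_until_no_improvement(step: Callable[[], float], patience: int = 1, max_epochs: int = 1000) -> float:
    """Call step() once per epoch (it returns the validation loss) until the loss stops decreasing."""
    best = float("inf")
    stale = 0
    for _ in range(max_epochs):
        loss = step()
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best


# Toy run: the validation loss rises at the fourth epoch, so training stops there.
fake_losses = iter([0.9, 0.5, 0.4, 0.45, 0.3])
best_loss = train_until_no_improvement(lambda: next(fake_losses), patience=1)
print(best_loss)  # 0.4
```

In this sketch one model would be trained per fold, each on nine parts and validated on the remaining part, which yields the ten models mentioned in the text.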
Then, the solvent molecules and solute molecules corresponding to the 48,776 systems of the test set to be predicted are input, in x, y, z format, into the above ten models in turn to obtain ten predicted solvation free energies, and their average is taken as the final prediction result. For comparison, the solvent molecules and solute molecules of the training set are likewise input, in x, y, z format, into the above ten models in turn to obtain ten predicted solvation free energies, and their average is taken as the final prediction result for the training set. The correlations between the true values and the model predictions for the training set and the test set are shown in FIG. 10 and FIG. 11, respectively. Here, MAE is the mean absolute error, RMSE is the root mean square error, and R² is the coefficient of determination. The smaller the MAE and RMSE, the smaller the model error. R² is a value between 0 and 1, and the larger R² is, the better the model correlation. It can be seen that the correlation of the model on the test set is essentially the same as on the training set, and the mean absolute prediction error is below 1 kJ/mol in both cases, far lower than the error of conventional machine learning methods. The results are shown in Table 1.
Table 1
Model | CIGIN | Delfos | MPNN | This application
MAE/(kJ/mol) | 3.17 | 4.97 | 4.81 | 0.77
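The three error metrics named in this embodiment can be computed as in the sketch below. The text does not state which definition of R² was used; the sketch assumes the common form 1 − SS_res/SS_tot, and the sample values are invented purely to exercise the functions.

```python
import math
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented reference and predicted values (kJ/mol), for illustration only.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(round(mean_absolute_error(reference, predicted), 3),
      round(root_mean_square_error(reference, predicted), 3),
      round(r_squared(reference, predicted), 3))  # 0.25 0.354 0.9
```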
In another specific embodiment, first, a total of 48,776 conformations of 11,940 kinds of molecules are collected, and 15 molecules that each have only a single conformation, namely water, tetrahydrofuran, chloroform, dichloromethane, dioxane, toluene, methanol, acetone, n-heptane, cyclohexane, diethyl ether, acetonitrile, dimethylformamide, dimethyl sulfoxide and methyl tert-butyl ether, are selected as solvents. COSMOtherm is used to calculate the solvation free energy of each of the 48,776 conformations in each of the 15 solvents, giving 731,640 solvation free energy data items. The 48,776 conformations are stored in the data set in x, y, z format, and the 731,640 solvation free energy values corresponding to the solute conformation and solvent conformation pairs are stored in the data set as floating-point numbers. The 41,475 solute-solvent-solvation free energy data items for 2,765 conformations of 740 kinds of molecules are selected as the test set, and the other 690,165 systems are used as the training set.
Then, the feature dimension F is set to 128, the cutoff radius r_cut is set to the value shown in Figure PCTCN2021140134-appb-000011, and the number of convolution layers num_conv is set to 3, and the equivariant graph neural network model is initialized.
Next, the loss function is set to the mean squared error. The 690,165 systems of the training set are divided into ten parts, different random number seeds are set, and the models are trained by ten-fold cross-validation until the loss function on the validation set no longer decreases. Ten equivariant graph neural network models are obtained, and the network parameters of the ten models are output.
Then, the solvent and solute molecules of the 41,475 systems of the test set to be predicted are input, in x, y, z format, into the ten models in turn to obtain ten predicted solvation free energies, and their average is taken as the final prediction result. For comparison, the solvent and solute molecules of the training set are likewise input, in x, y, z format, into the ten models in turn to obtain ten predicted solvation free energies, and their average is taken as the final prediction result for the training set. The correlations between the true values and the model predictions for the training set and the test set are shown in FIG. 12 and FIG. 13, respectively. It can be seen that the correlation of the model on the test set is essentially the same as on the training set, and the mean absolute prediction error is below 1 kJ/mol in both cases.
In this embodiment, in view of the defects and deficiencies of the related art in predicting molecular solvation free energy, it is proposed to predict the solvation free energy based on an equivariant graph neural network. To address the problem that the related art cannot fully represent the three-dimensional features of a molecule, this embodiment uses a virtual molecular graph as the descriptor representing the molecule. To address the fact that the related art does not explicitly describe the solvent-solute interaction, this embodiment describes the solvent-solute interaction by the matrix product of the solute molecular feature vector and the solvent molecular feature vector. Specifically, the method consists of four steps: molecular encoding, equivariant graph convolution, feature interaction and free energy prediction. The molecular encoding step represents the solvent and solute molecules as virtual molecular graphs with feature encodings. The equivariant graph convolution step converts the virtual molecular graphs into feature representations in matrix form. The feature interaction step takes the matrix product of the feature representations of the solvent and the solute to obtain the feature representation of the solvation. The free energy prediction step predicts the molecular solvation free energy from the feature representation of the solvation through a fully connected neural network, which effectively improves the accuracy of the predicted molecular solvation free energy.
Another aspect of the present application further provides a design method.
FIG. 14 schematically shows a flowchart of the design method according to an embodiment of the present application.
Referring to FIG. 14, the design method may include operation S1410 and operation S1420.
In operation S1410, the solvation free energy is determined according to the method described above.
In operation S1420, drug design, material design or the like is performed based on the solvation free energy.
Another aspect of the present application further provides a data processing apparatus.
FIG. 15 schematically shows a block diagram of the data processing apparatus according to an embodiment of the present application.
Referring to FIG. 15, the data processing apparatus may include: a to-be-processed data obtaining module 1510, a set generation module 1520, a node and edge feature generation module 1530, and a virtual molecule construction module 1540.
The to-be-processed data obtaining module 1510 is configured to obtain data to be processed, the data to be processed including respective attribute information of a plurality of atoms in a target molecule.
The set generation module 1520 is configured to generate, in response to the respective attribute information of the plurality of atoms, a node set and a node position set for the target molecule, where the nodes in the node set each represent an atom of a specific atom type, and the node position set includes coordinate information of each node in the node set in a specific coordinate system.
The node and edge feature generation module 1530 is configured to generate a node scalar feature N_s and a node vector feature N_v for the node set, and to generate an edge scalar feature E_s and an edge vector feature E_v for the node set based on the coordinate information of each node in the node position set.
The virtual molecule construction module 1540 is configured to construct a virtual molecular graph based on the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v for the node set, so as to determine a molecular feature X of the target molecule based on the virtual molecular graph, such that the solvation free energy can be determined based at least on the molecular feature X of the target molecule.
In some embodiments, the target molecule includes N atoms, and each of the nodes in the node set has an F-dimensional feature.
The dimensions of the node scalar feature N_s include N×F×1, the dimensions of the node vector feature N_v include N×F×3, the dimensions of the edge scalar feature E_s include N×1×1, and the dimensions of the edge vector feature E_v include N×3×1.
In some embodiments, the apparatus 1500 may further include a cutoff radius determination module and a target node set determination module.
The cutoff radius determination module is configured to determine a cutoff radius r_cut after the node set and the node position set for the target molecule are generated in response to the respective attribute information of the plurality of atoms.
The target node set determination module is configured to determine, from the node set, target nodes whose inter-node distance is less than or equal to the cutoff radius r_cut, to obtain a target node set N_i.
Accordingly, the node and edge feature generation module 1530 is specifically configured to generate the edge scalar feature E_s and the edge vector feature E_v for the target node set N_i based on the coordinate information of the target nodes in the node position set.
In some embodiments, the target node set includes E nodes, each of which has an F-dimensional feature.
The dimensions of the node scalar feature N_s include N×F×1, the dimensions of the node vector feature N_v include N×F×3, the dimensions of the edge scalar feature E_s include E×1×1, and the dimensions of the edge vector feature E_v include E×3×1.
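How a cutoff radius reduces the edge features from N entries to E entries can be illustrated with a short sketch. Several points here are assumptions rather than statements from this section: the edge scalar feature is taken to be the inter-node distance, the edge vector feature is taken to be the unit direction vector between the two nodes, and the coordinates and the value of `r_cut` are invented for the example. `build_edges` is a hypothetical helper name.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def build_edges(positions: Sequence[Vec3], r_cut: float) -> Tuple[List[Tuple[int, int]], List[float], List[Vec3]]:
    """Return (pairs, edge_scalars, edge_vectors) for ordered node pairs within r_cut."""
    pairs: List[Tuple[int, int]] = []
    edge_scalars: List[float] = []   # assumed: one distance per edge (E x 1 x 1)
    edge_vectors: List[Vec3] = []    # assumed: one unit direction per edge (E x 3 x 1)
    for i, p_i in enumerate(positions):
        for j, p_j in enumerate(positions):
            if i == j:
                continue
            diff = (p_j[0] - p_i[0], p_j[1] - p_i[1], p_j[2] - p_i[2])
            dist = math.sqrt(sum(c * c for c in diff))
            # Skip coincident nodes to avoid dividing by zero.
            if 0.0 < dist <= r_cut:
                pairs.append((i, j))
                edge_scalars.append(dist)
                edge_vectors.append((diff[0] / dist, diff[1] / dist, diff[2] / dist))
    return pairs, edge_scalars, edge_vectors


# Three hypothetical atoms; nodes 1 and 2 are farther apart than the illustrative cutoff.
positions = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
pairs, e_s, e_v = build_edges(positions, r_cut=1.2)
print(pairs)  # [(0, 1), (0, 2), (1, 0), (2, 0)]
```

With these inputs only four of the six ordered pairs survive the cutoff, so E = 4 in this toy case.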
In some embodiments, the apparatus 1500 further includes a feature update module and a loop module.
The feature update module is configured to update the node scalar feature N_s and the node vector feature N_v based on the virtual molecular graph, to obtain an updated node scalar feature New_N_s and an updated node vector feature New_N_v.
The loop module is configured to repeat the following units until a specified number of cycles num_conv is reached, and to take the updated node scalar feature New_N_s obtained when the specified number of cycles num_conv is reached as the molecular feature X.
A feature replacement unit is configured to take the updated node scalar feature New_N_s and the updated node vector feature New_N_v as a current node scalar feature Now_N_s and a current node vector feature Now_N_v, respectively.
A feature calculation unit is configured to construct an updated virtual molecular graph using the current node scalar feature Now_N_s, the current node vector feature Now_N_v, the edge scalar feature E_s and the edge vector feature E_v.
A feature update unit is configured to update the updated node scalar feature New_N_s and the updated node vector feature New_N_v based on the updated virtual molecular graph.
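The interplay of the feature update module and the loop module can be sketched as a plain loop. The sketch assumes that the initial update counts as the first of the num_conv cycles, and it treats the update itself as an opaque callable; `run_convolutions` and `toy_update` are hypothetical names, and the toy update has nothing to do with the real layer.

```python
from typing import Any, Callable, Tuple

UpdateFn = Callable[[Any, Any, Any, Any], Tuple[Any, Any]]


def run_convolutions(n_s: Any, n_v: Any, e_s: Any, e_v: Any, update: UpdateFn, num_conv: int = 3) -> Any:
    """Apply `update` num_conv times and return the final scalar feature as molecular feature X."""
    # Feature update module: first update from the initial node features.
    new_n_s, new_n_v = update(n_s, n_v, e_s, e_v)
    for _ in range(num_conv - 1):
        # Feature replacement unit: the updated features become the current ones.
        now_n_s, now_n_v = new_n_s, new_n_v
        # Feature calculation and feature update units: edge features stay fixed.
        new_n_s, new_n_v = update(now_n_s, now_n_v, e_s, e_v)
    return new_n_s


def toy_update(s, v, es, ev):
    """Placeholder update that only adds 1 to every scalar entry."""
    return [x + 1.0 for x in s], v


print(run_convolutions([0.0, 0.0], [], [], [], toy_update, num_conv=3))  # [3.0, 3.0]
```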
In some embodiments, the feature update module is specifically configured to perform the following operations.
A first linear operation, a second activation function and a second linear operation are applied in sequence to the node scalar feature N_s to obtain a first sub-processing result Q1, and a third linear operation is applied to the edge scalar feature E_s to obtain a second sub-processing result Q2.
A first element-wise matrix multiplication operation is performed on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain a third sub-processing result Q3.
A second element-wise matrix multiplication operation is performed based on the third sub-processing result Q3 and the node vector feature N_v to obtain a fourth sub-processing result Q4, and a third element-wise matrix multiplication operation is performed based on the third sub-processing result Q3 and the edge vector feature E_v to obtain a fifth sub-processing result Q5.
A first matrix addition operation is performed on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain a sixth sub-processing result Q6.
The sixth sub-processing result Q6 is passed through a fourth linear operation and a fifth linear operation, respectively, to obtain a seventh sub-processing result Q7 and an eighth sub-processing result Q8.
The seventh sub-processing result Q7 is passed in sequence through a sixth linear operation, the second activation function and a seventh linear operation to obtain a ninth sub-processing result Q9, and an inner product operation Inner is performed based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain a tenth sub-processing result Q10.
A fourth element-wise matrix multiplication operation is performed on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain an eleventh sub-processing result Q11.
A fifth element-wise matrix multiplication operation is performed on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N_v.
A second matrix addition is performed on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s.
For the functions and effects of the above operations, reference may be made to the relevant content of the foregoing embodiments, which is not repeated here.
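The sequence of operations listed above can be restated compactly in symbols. This is only a notational summary under stated assumptions: L_1 to L_7 denote the first to seventh linear operations, σ denotes the activation function, ⊙ denotes element-wise multiplication, and any broadcasting between node and edge dimensions is left implicit, as it is in the text.

```latex
\begin{aligned}
Q_1 &= L_2\bigl(\sigma(L_1(N_s))\bigr), & Q_2 &= L_3(E_s), \\
Q_3 &= Q_1 \odot Q_2, \\
Q_4 &= Q_3 \odot N_v, & Q_5 &= Q_3 \odot E_v, \\
Q_6 &= Q_4 + Q_5, \\
Q_7 &= L_4(Q_6), & Q_8 &= L_5(Q_6), \\
Q_9 &= L_7\bigl(\sigma(L_6(Q_7))\bigr), & Q_{10} &= \mathrm{Inner}(Q_3, Q_7, Q_8), \\
Q_{11} &= Q_9 \odot Q_{10}, \\
\mathrm{New\_N}_v &= Q_8 \odot Q_9, & \mathrm{New\_N}_s &= Q_9 + Q_{11}.
\end{aligned}
```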
In some embodiments, the target molecule is a solute molecule and/or a solvent molecule.
The apparatus 1500 further includes a solute-solvent molecular feature determination module configured to determine a solute molecular feature of the solute molecule and a solvent molecular feature of at least one solvent molecule associated with the solute molecule, so that the solvation free energy is determined based on the solute molecular feature of the solute molecule and the solvent molecular feature of the at least one solvent molecule associated with the solute molecule.
In some embodiments, the apparatus 1500 further includes a solvation matrix determination module and a solvation feature determination module.
The solvation matrix determination module is configured to, after the solute molecular feature of the solute molecule and the solvent molecular feature of the at least one solvent molecule associated with the solute molecule are determined, take the matrix product of the solvent molecular feature and the solute molecular feature as a solvation matrix between the solvent molecule and the solute molecule.
The solvation feature determination module is configured to determine a solvation feature based on the solvation matrix.
In some embodiments, the solvation feature determination module includes a solvent feature determination unit, a solute feature determination unit and a solvation feature determination unit.
The solvent feature determination unit is configured to calculate, based on the solvation matrix, a solvent feature corresponding to a preset solute weight, and to calculate, based on the solvation matrix, a solute feature corresponding to a preset solvent weight.
The solute feature determination unit is configured to convert the solvent feature and the solute feature each into a one-dimensional row vector of F elements.
The solvation feature determination unit is configured to concatenate the row vectors to obtain the solvation feature.
In some embodiments, the solute feature determination unit includes an array element weight determination subunit and a weighted summation subunit.
The array element weight determination subunit is configured to determine a first array element weight for the array elements of the solvent feature that correspond to atoms of the solvent molecule, and a second array element weight for the array elements of the solute feature that correspond to atoms of the solute molecule.
The weighted summation subunit is configured to perform a weighted summation of the solvent feature based on the first array element weight to obtain a one-dimensional first row vector of F elements, and to perform a weighted summation of the solute feature based on the second array element weight to obtain a one-dimensional second row vector of F elements.
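The path from the two molecular features to the concatenated solvation feature can be sketched in plain Python. Much of the detail is assumed for illustration: each molecular feature is taken to be an (atoms × F) matrix, the solvation matrix is taken to be the product of the solvent feature with the transposed solute feature, the solvent and solute features are obtained by multiplying the solvation matrix back onto the molecular features, and a softmax over per-atom sums stands in for the array element weights, which in practice would be learned. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(values: List[float]) -> List[float]:
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_row(feature: Matrix, weights: List[float]) -> List[float]:
    """Weighted sum over atoms (rows), giving one row vector of F elements."""
    return [sum(w * row[f] for w, row in zip(weights, feature)) for f in range(len(feature[0]))]


def solvation_feature(x_solvent: Matrix, x_solute: Matrix) -> List[float]:
    interaction = matmul(x_solvent, transpose(x_solute))     # (n_solvent, n_solute)
    solvent_feat = matmul(interaction, x_solute)             # (n_solvent, F), assumed form
    solute_feat = matmul(transpose(interaction), x_solvent)  # (n_solute, F), assumed form
    # Stand-in array element weights; a learned attention network would supply these.
    w_solvent = softmax([sum(row) for row in solvent_feat])
    w_solute = softmax([sum(row) for row in solute_feat])
    # Concatenate the two F-element row vectors into one vector of 2F elements.
    return weighted_row(solvent_feat, w_solvent) + weighted_row(solute_feat, w_solute)


x_solvent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 solvent atoms, F = 2
x_solute = [[1.0, 2.0], [0.5, 0.0]]               # 2 solute atoms, F = 2
print(len(solvation_feature(x_solvent, x_solute)))  # 4
```

The result has 2F elements regardless of how many atoms either molecule contains, which is what allows a fixed-size network to consume it.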
Another aspect of the present application further provides an apparatus for training a solvation free energy prediction model.
FIG. 16 schematically shows a block diagram of the apparatus for training a solvation free energy prediction model according to an embodiment of the present application.
The apparatus 1600 includes a model training module 1610 configured to input the virtual molecular graph determined by the method described above into the solvation free energy prediction model and to adjust the model parameters until the loss function converges, to obtain a trained solvation free energy prediction model, where the virtual molecular graph has corresponding solvation free energy annotation information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy annotation information.
In some embodiments, the solvation free energy prediction model includes an equivariant graph convolution network configured to convert the virtual molecular graph into a solute molecular feature and/or a solvent molecular feature.
The equivariant graph convolution network includes num_conv convolution layers, num_conv being the specified number of cycles, where the output of the current convolution layer serves as part of the input of the adjacent next convolution layer. The input of the first convolution layer includes the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v, and the output of the first convolution layer includes the updated node scalar feature New_N_s and the updated node vector feature New_N_v. The input of each convolution layer other than the first includes the updated node scalar feature New_N_s, the updated node vector feature New_N_v, the edge scalar feature E_s and the edge vector feature E_v, and the output of each convolution layer other than the first includes the updated node scalar feature New_N_s and the updated node vector feature New_N_v.
In some embodiments, the convolution layer is configured to perform the following operations.
A first linear operation, a second activation function and a second linear operation are applied in sequence to the node scalar feature N_s to obtain a first sub-processing result Q1, and a third linear operation is applied to the edge scalar feature E_s to obtain a second sub-processing result Q2.
A first element-wise matrix multiplication operation is performed on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain a third sub-processing result Q3.
A second element-wise matrix multiplication operation is performed based on the third sub-processing result Q3 and the node vector feature N_v to obtain a fourth sub-processing result Q4, and a third element-wise matrix multiplication operation is performed based on the third sub-processing result Q3 and the edge vector feature E_v to obtain a fifth sub-processing result Q5.
A first matrix addition operation is performed on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain a sixth sub-processing result Q6.
The sixth sub-processing result Q6 is passed through a fourth linear operation and a fifth linear operation, respectively, to obtain a seventh sub-processing result Q7 and an eighth sub-processing result Q8.
The seventh sub-processing result Q7 is passed in sequence through a sixth linear operation, the second activation function and a seventh linear operation to obtain a ninth sub-processing result Q9, and an inner product operation Inner is performed based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain a tenth sub-processing result Q10.
A fourth element-wise matrix multiplication operation is performed on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain an eleventh sub-processing result Q11.
A second matrix addition is performed on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s.
A fifth element-wise matrix multiplication operation is performed on the ninth sub-processing result Q9 and the eighth sub-processing result Q8 to obtain the updated node vector feature New_N_v.
For the functions and effects of the above operations, reference may be made to the relevant content of the foregoing embodiments, which is not repeated here.
In some embodiments, the solvation free energy prediction model includes a molecular encoding network.
The molecular encoding network is configured to convert each item of training data in a training data set that includes solute molecule data and/or solvent molecule data into a virtual molecular graph for the solute molecule data and/or for the solvent molecule data, where the training data has solvation free energy annotation information, and each atom of the solute molecule or solvent molecule in the training data has an F-dimensional feature.
In some embodiments, the solvation free energy prediction model includes a solvation network.
The solvation network is configured to convert the solute molecular feature and the solvent molecular feature into the solvation feature.
For example, the solvation network includes a self-attention network configured to determine a first array element weight for the array elements of the solvent feature that correspond to atoms of the solvent molecule and a second array element weight for the array elements of the solute feature that correspond to atoms of the solute molecule, so that the array elements of the solvent feature corresponding to the atoms of the solvent molecule are fused according to the first array element weight and the array elements of the solute feature corresponding to the atoms of the solute molecule are fused according to the second array element weight. The solvent feature and the solute feature are determined based on the solvation matrix, and the solvation matrix is determined based on the solute molecular feature and the solvent molecular feature.
In some embodiments, the solvation free energy prediction model includes a fully connected network configured to convert the solvation feature into the solvation free energy.
The fully connected network includes a first linear layer, a first activation function layer, a second linear layer, a second activation function layer and a third linear layer connected in sequence, where the output dimension of the first linear layer and of the second linear layer is F, and the output dimension of the third linear layer is 1.
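The layer ordering of the fully connected network can be sketched without any deep learning library. The sketch is illustrative only: the input width of 2F (two concatenated F-element row vectors), the choice of ReLU as the activation, and the random untrained weights are all assumptions, and the helper names are hypothetical.

```python
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights[out][in], biases[out])


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> Layer:
    """Create a linear layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x: List[float], layer: Layer) -> List[float]:
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]  # assumed activation; the text does not name one


def build_head(in_dim: int, feature_dim: int, seed: int = 0) -> List[Layer]:
    rng = random.Random(seed)
    return [
        make_linear(in_dim, feature_dim, rng),       # first linear layer, output dimension F
        make_linear(feature_dim, feature_dim, rng),  # second linear layer, output dimension F
        make_linear(feature_dim, 1, rng),            # third linear layer, output dimension 1
    ]


def predict_free_energy(features: List[float], head: List[Layer]) -> float:
    hidden = relu(linear(features, head[0]))
    hidden = relu(linear(hidden, head[1]))
    return linear(hidden, head[2])[0]


F = 4
head = build_head(in_dim=2 * F, feature_dim=F, seed=0)
value = predict_free_energy([0.5] * (2 * F), head)  # untrained, so the number is meaningless
print(type(value).__name__)  # float
```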
In some embodiments, the apparatus 1600 further includes a training set splitting module and a model construction module.
The training set splitting module is configured to split the training data set into a specified number of sub-training data sets.
The model construction module is configured to construct the same number of solvation free energy prediction models as the specified number.
The model training module 1610 is specifically configured to input the training data of each sub-training data set into the molecular encoding network of a different solvation free energy prediction model, so as to train the different solvation free energy prediction models separately and obtain trained solvation free energy prediction models equal in number to the specified number.
Another aspect of the present application further provides an apparatus for determining the solvation free energy.
FIG. 17 schematically shows a block diagram of the apparatus for determining the solvation free energy according to an embodiment of the present application.
The apparatus 1700 includes a free energy prediction module 1710 configured to process the data to be processed using the trained solvation free energy prediction model to obtain the solvation free energy for the data to be processed, where the data to be processed includes respective attribute information of a plurality of atoms in a target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.
For example, the solvation free energy prediction model includes at least one of the following networks: a molecular encoding network, configured to convert each item of training data in a training data set that includes solute molecule data and/or solvent molecule data into a virtual molecular graph for the solute molecule data and/or for the solvent molecule data, where the training data has solvation free energy annotation information; an equivariant graph convolution network, configured to convert the virtual molecular graph into a solute molecular feature and/or a solvent molecular feature; a solvation network, configured to convert the solute molecular feature and the solvent molecular feature into a solvation feature; and a fully connected network, configured to convert the solvation feature into the solvation free energy.
在某些实施例中,上述装置1700还包括:多模型处理模块和加权处理模块。In some embodiments, the above-mentioned apparatus 1700 further includes: a multi-model processing module and a weighting processing module.
多模型处理模块被配置为将待处理数据分别输入经训练的不同的指定个数的溶剂化自由能预测模型,得到指定个数的溶剂化自由能;The multi-model processing module is configured to input the data to be processed into different trained solvation free energy prediction models of a specified number to obtain a specified number of solvation free energies;
加权处理模块被配置为将指定个数的溶剂化自由能的加权平均值作为与待处理数据对应的溶剂化自由能。需要说明的是,指定个数的溶剂化自由能各自的权重可以相同或不同。例如,在测试数据集上预测结果准确度高的模型得到的溶剂化自由能的权重,可以高于其它模型得到的溶剂化自由能的权重。The weighting processing module is configured to take the weighted average of the specified number of solvation free energies as the solvation free energy corresponding to the data to be processed. It should be noted that the respective weights of the specified number of solvation free energies may be the same or different. For example, the weight of the solvation free energy obtained by the model with high prediction accuracy on the test data set can be higher than the weight of the solvation free energy obtained by other models.
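The weighted averaging performed by the weighting processing module can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `weighted_average`, the normalization by the sum of the weights, and the numeric values are all assumptions made for the example.

```python
def weighted_average(predictions, weights=None):
    """Combine per-model solvation free energies into a single value.

    predictions: one predicted solvation free energy per trained model.
    weights: optional non-negative weights; equal weights are used if omitted.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("one weight is required per prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing by the total keeps the result on the scale of the inputs.
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Illustrative values only: three models, the first assumed to be the most
# accurate on the test data set and therefore weighted more heavily.
preds = [-4.0, -5.0, -6.0]
equal = weighted_average(preds)
favoured = weighted_average(preds, [2.0, 1.0, 1.0])
```

With equal weights the result is the plain mean; raising the weight of one model pulls the combined value toward that model's prediction.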
Another aspect of the present application provides a design apparatus.
Fig. 18 schematically shows a block diagram of a design apparatus according to an embodiment of the present application.
Referring to Fig. 18, the apparatus 1800 may include a solvation free energy determination module 1810 and a design module 1820.
The solvation free energy determination module 1810 is configured to determine the solvation free energy according to the method described above.
The design module 1820 is configured to perform drug design or material design based on the solvation free energy.
Regarding the apparatuses 1500, 1600, 1700 and 1800 of the above embodiments, the specific manner in which each module and unit performs its operations has been described in detail in the embodiments of the corresponding methods and is not repeated here.
Another aspect of the present application provides an electronic device.
Fig. 19 schematically shows a block diagram of an electronic device implementing an embodiment of the present application.
Referring to Fig. 19, the electronic device 1900 includes a memory 1910 and a processor 1920.
The processor 1920 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 1910 may include various types of storage units, such as system memory, read-only memory (ROM), and a permanent storage device. The ROM may store static data or instructions required by the processor 1920 or by other modules of the computer. The permanent storage device may be a readable and writable storage device, and may be a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (such as a magnetic disk, an optical disk, or flash memory) is used as the permanent storage device. In other embodiments, the permanent storage device may be a removable storage device (such as a floppy disk or an optical drive). The system memory may be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory. The system memory may store some or all of the instructions and data that the processor needs at runtime. In addition, the memory 1910 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (such as DRAM, SRAM, SDRAM, flash memory, and programmable read-only memory); magnetic disks and/or optical disks may also be used. In some embodiments, the memory 1910 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (such as DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (such as an SD card, mini SD card, or Micro-SD card), a magnetic floppy disk, and the like. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or by wire.
Executable code is stored in the memory 1910; when the executable code is processed by the processor 1920, it causes the processor 1920 to perform some or all of the methods described above.
In addition, the method according to the present application may also be implemented as a computer program or computer program product that includes computer program code instructions for performing some or all of the steps of the above method of the present application.
Alternatively, the present application may be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium, or a machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by a processor of an electronic device (or a server, etc.), it causes the processor to perform some or all of the steps of the above method according to the present application.
The embodiments of the present application have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technology available in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (38)

  1. A data processing method, characterized by comprising:
    obtaining data to be processed, the data to be processed comprising attribute information for each of a plurality of atoms in a target molecule;
    in response to the attribute information of each of the plurality of atoms, generating a node set and a node position set for the target molecule, wherein a plurality of nodes in the node set each represent an atom of a specific atom type, and the node position set comprises coordinate information of each node in the node set in a specific coordinate system;
    generating a node scalar feature N_s and a node vector feature N_v for the node set, and generating an edge scalar feature E_s and an edge vector feature E_v for the node set based on the coordinate information of each node in the node position set;
    constructing a virtual molecular graph based on the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v for the node set, so as to determine a molecular feature X of the target molecule based on the virtual molecular graph, such that a solvation free energy can be determined based at least on the molecular feature X of the target molecule.
  2. The method according to claim 1, characterized in that the target molecule comprises N atoms, and the plurality of nodes in the node set each have F-dimensional features;
    the dimensions of the node scalar feature N_s comprise N×F×1, the dimensions of the node vector feature N_v comprise N×F×3, the dimensions of the edge scalar feature E_s comprise N×1×1, and the dimensions of the edge vector feature E_v comprise N×3×1.
  3. The method according to claim 1, characterized by further comprising: after generating the node set and the node position set for the target molecule in response to the attribute information of each of the plurality of atoms,
    determining a cutoff radius r_cut;
    determining, from the node set, target nodes whose inter-node distance is less than or equal to the cutoff radius r_cut, to obtain a target node set N_i;
    wherein generating the edge scalar feature and the edge vector feature for the node set based on the coordinate information of each node in the node position set comprises:
    generating an edge scalar feature E_s and an edge vector feature E_v for the target node set N_i based on the coordinate information for the target nodes in the node position set.
  4. The method according to claim 3, characterized in that the target node set comprises E nodes, and the E nodes each have F-dimensional features;
    the dimensions of the node scalar feature N_s comprise N×F×1, the dimensions of the node vector feature N_v comprise N×F×3, the dimensions of the edge scalar feature E_s comprise E×1×1, and the dimensions of the edge vector feature E_v comprise E×3×1.
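The cutoff step of claims 3 and 4 can be sketched as follows. This is a minimal sketch under stated assumptions: the claims do not define the edge features in this passage, so the example takes the edge scalar feature to be the inter-node distance and the edge vector feature to be the unit direction between the two nodes, and it treats E as the number of retained ordered node pairs. The function name `build_edges` is hypothetical.

```python
import math


def build_edges(positions, r_cut):
    """Collect ordered node pairs whose distance is within r_cut.

    Returns (pairs, e_s, e_v), where e_s holds one distance per pair and
    e_v holds one 3-component unit vector per pair (both are assumptions
    about what the edge scalar and edge vector features contain).
    """
    pairs, e_s, e_v = [], [], []
    for i, p_i in enumerate(positions):
        for j, p_j in enumerate(positions):
            if i == j:
                continue
            delta = [b - a for a, b in zip(p_i, p_j)]
            dist = math.sqrt(sum(c * c for c in delta))
            # Skip coincident nodes to avoid dividing by zero.
            if 0.0 < dist <= r_cut:
                pairs.append((i, j))
                e_s.append(dist)
                e_v.append([c / dist for c in delta])
    return pairs, e_s, e_v


# Three illustrative nodes; only nodes 0 and 1 lie within the cutoff.
positions = [(0, 0, 0), (1, 0, 0), (0, 3, 0)]
pairs, e_s, e_v = build_edges(positions, r_cut=1.5)
```

The double loop is quadratic in the number of nodes, which is acceptable for small molecules; a neighbor list would be the usual replacement for larger systems.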
  5. The method according to claim 2 or 4, characterized in that determining the molecular feature X of the target molecule based on the virtual molecular graph comprises:
    updating the node scalar feature N_s and the node vector feature N_v based on the virtual molecular graph, to obtain an updated node scalar feature New_N_s and an updated node vector feature New_N_v;
    repeating the following operations until a specified number of cycles num_conv is reached, and taking the updated node scalar feature New_N_s obtained when the specified number of cycles num_conv is reached as the molecular feature X:
    taking the updated node scalar feature New_N_s and the updated node vector feature New_N_v as a current node scalar feature Now_N_s and a current node vector feature Now_N_v, respectively;
    constructing an updated virtual molecular graph using the current node scalar feature Now_N_s, the current node vector feature Now_N_v, the edge scalar feature E_s and the edge vector feature E_v;
    updating the updated node scalar feature New_N_s and the updated node vector feature New_N_v based on the updated virtual molecular graph.
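The control flow of claim 5 can be sketched as a simple loop. This is only an illustration of the iteration, assuming that the initial update counts as the first of the num_conv passes; `update` stands for whatever per-pass update is used and `toy_update` is a hypothetical stand-in that is not the update described in this application.

```python
def molecular_feature(n_s, n_v, e_s, e_v, update, num_conv):
    """Apply `update` num_conv times and return the final node scalar feature.

    The node features are replaced on every pass, while the edge features
    e_s and e_v are reused unchanged, mirroring the claim wording.
    """
    if num_conv < 1:
        raise ValueError("num_conv must be at least 1")
    now_n_s, now_n_v = n_s, n_v
    for _ in range(num_conv):
        # Each pass rebuilds the graph from the current node features and
        # the fixed edge features, then produces the updated node features.
        now_n_s, now_n_v = update(now_n_s, now_n_v, e_s, e_v)
    return now_n_s  # used as the molecular feature X


def toy_update(n_s, n_v, e_s, e_v):
    """Hypothetical stand-in update: shifts every node scalar by one."""
    return [s + 1.0 for s in n_s], n_v


x = molecular_feature([0.0, 1.0], [[0.0] * 3, [0.0] * 3], [], [], toy_update, 3)
```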
  6. The method according to claim 5, characterized in that updating the node scalar feature N_s and the node vector feature N_v based on the virtual molecular graph, to obtain the updated node scalar feature New_N_s and the updated node vector feature New_N_v, comprises:
    sequentially applying a first linear operation, a second activation function and a second linear operation to the node scalar feature N_s to obtain a first sub-processing result Q1, and applying a third linear operation to the edge scalar feature E_s to obtain a second sub-processing result Q2;
    performing a first element-wise matrix multiplication operation on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain a third sub-processing result Q3;
    performing a second element-wise matrix multiplication operation based on the third sub-processing result Q3 and the node vector feature N_v to obtain a fourth sub-processing result Q4, and performing a third element-wise matrix multiplication operation based on the third sub-processing result Q3 and the edge vector feature E_v to obtain a fifth sub-processing result Q5;
    performing a first matrix addition operation on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain a sixth sub-processing result Q6;
    applying a fourth linear operation and a fifth linear operation, respectively, to the sixth sub-processing result Q6 to obtain a seventh sub-processing result Q7 and an eighth sub-processing result Q8;
    sequentially applying a sixth linear operation, the second activation function and a seventh linear operation to the seventh sub-processing result Q7 to obtain a ninth sub-processing result Q9, and performing an inner product operation Inner based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain a tenth sub-processing result Q10;
    performing a fourth element-wise matrix multiplication operation on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain an eleventh sub-processing result Q11;
    performing a fifth matrix multiplication operation on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N_v;
    performing a second matrix addition operation on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s.
  7. The method according to claim 5, characterized in that the target molecule is a solute molecule and/or a solvent molecule;
    the method further comprises:
    determining a solute molecular feature of a solute molecule and a solvent molecular feature of at least one solvent molecule associated with the solute molecule, so as to determine the solvation free energy based on the solute molecular feature of the solute molecule and the solvent molecular feature of the at least one solvent molecule associated with the solute molecule.
  8. The method according to claim 7, characterized by further comprising: after determining the solute molecular feature of the solute molecule and the solvent molecular feature of the at least one solvent molecule associated with the solute molecule,
    taking the matrix product of the solvent molecular feature and the solute molecular feature as a solvation matrix between the solvent molecule and the solute molecule;
    determining a solvation feature based on the solvation matrix.
  9. The method according to claim 8, characterized in that determining the solvation feature based on the solvation matrix comprises:
    calculating, based on the solvation matrix, a solvent feature corresponding to preset solute weights, and calculating, based on the solvation matrix, a solute feature corresponding to preset solvent weights;
    converting the solvent feature and the solute feature each into a one-dimensional row vector of F elements;
    concatenating the row vectors to obtain the solvation feature.
  10. The method according to claim 9, characterized in that converting the solvent feature and the solute feature each into a one-dimensional row vector of F elements comprises:
    determining first array-element weights of the array elements in the solvent feature that correspond to atoms of the solvent molecule, and determining second array-element weights of the array elements in the solute feature that correspond to atoms of the solute molecule;
    performing a weighted sum of the solvent feature based on the first array-element weights to obtain a one-dimensional first row vector of F elements, and performing a weighted sum of the solute feature based on the second array-element weights to obtain a one-dimensional second row vector of F elements.
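Claims 8 to 10 can be illustrated with a small pure-Python sketch. Several details are assumptions, because this passage does not fix them: the solvation matrix is taken as the solvent features times the transposed solute features, the solvent feature as that matrix times the solute features (and the solute feature symmetrically), and the per-atom weights as a softmax over a dot product with a hypothetical scoring vector. None of the function names come from the application.

```python
import math


def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool(features, score_vector):
    """Weighted sum of per-atom rows into one row vector of F elements."""
    scores = [sum(f * w for f, w in zip(row, score_vector)) for row in features]
    weights = softmax(scores)  # one weight per atom (array element)
    width = len(features[0])
    return [sum(w * row[k] for w, row in zip(weights, features)) for k in range(width)]


def solvation_feature(solute_x, solvent_x, solute_score, solvent_score):
    """Return the concatenated solvent and solute row vectors (2F elements)."""
    interaction = matmul(solvent_x, transpose(solute_x))     # N_solvent x N_solute
    solvent_feat = matmul(interaction, solute_x)             # N_solvent x F
    solute_feat = matmul(transpose(interaction), solvent_x)  # N_solute x F
    return pool(solvent_feat, solvent_score) + pool(solute_feat, solute_score)


# Illustrative features with F = 2: three solute atoms, two solvent atoms.
solute_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
solvent_x = [[1.0, 2.0], [0.5, 0.5]]
# Zero scoring vectors give uniform weights, which keeps the result easy to check.
feature = solvation_feature(solute_x, solvent_x, [0.0, 0.0], [0.0, 0.0])
```

The output length is 2F regardless of how many atoms each molecule has, which is what allows a fixed-size network to consume it afterwards.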
  11. A method for training a solvation free energy prediction model, characterized by:
    inputting a virtual molecular graph determined by the method according to any one of claims 1 to 10 into the solvation free energy prediction model, and adjusting model parameters until a loss function converges, to obtain a trained solvation free energy prediction model, wherein the virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function comprise the predicted solvation free energy and the solvation free energy in the solvation free energy label information.
  12. The method according to claim 11, characterized in that the solvation free energy prediction model comprises:
    an equivariant graph convolutional network, configured to convert the virtual molecular graph into solute molecular features and/or solvent molecular features;
    wherein the equivariant graph convolutional network comprises a number of convolutional layers equal to the specified number of cycles num_conv, the output of a current convolutional layer serving as part of the input of the adjacent next convolutional layer; the input of the first convolutional layer comprises the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v, and the output of the first convolutional layer comprises the updated node scalar feature New_N_s and the updated node vector feature New_N_v; the input of each convolutional layer other than the first comprises the updated node scalar feature New_N_s, the updated node vector feature New_N_v, the edge scalar feature E_s and the edge vector feature E_v; and the output of each convolutional layer other than the first comprises the updated node scalar feature New_N_s and the updated node vector feature New_N_v.
  13. The method according to claim 12, characterized in that the convolutional layer is configured to:
    sequentially apply a first linear operation, a second activation function and a second linear operation to the node scalar feature N_s to obtain a first sub-processing result Q1, and apply a third linear operation to the edge scalar feature E_s to obtain a second sub-processing result Q2;
    perform a first element-wise matrix multiplication operation on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain a third sub-processing result Q3;
    perform a second element-wise matrix multiplication operation based on the third sub-processing result Q3 and the node vector feature N_v to obtain a fourth sub-processing result Q4, and perform a third element-wise matrix multiplication operation based on the third sub-processing result Q3 and the edge vector feature E_v to obtain a fifth sub-processing result Q5;
    perform a first matrix addition operation on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain a sixth sub-processing result Q6;
    apply a fourth linear operation and a fifth linear operation, respectively, to the sixth sub-processing result Q6 to obtain a seventh sub-processing result Q7 and an eighth sub-processing result Q8;
    sequentially apply a sixth linear operation, the second activation function and a seventh linear operation to the seventh sub-processing result Q7 to obtain a ninth sub-processing result Q9, and perform an inner product operation Inner based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain a tenth sub-processing result Q10;
    perform a fourth element-wise matrix multiplication operation on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain an eleventh sub-processing result Q11;
    perform a second matrix addition operation on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s;
    perform a fifth element-wise matrix multiplication operation on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N_v.
  14. The method according to claim 11, characterized in that:
    the solvation free energy prediction model comprises:
    a molecular encoding network, configured to convert each item of training data in a training data set that includes solute molecule data and/or solvent molecule data into a virtual molecular graph for the solute molecule data and/or for the solvent molecule data, wherein the training data carries solvation free energy label information, and the atoms of the solute molecule or the solvent molecule in the training data each have F-dimensional features;
    and/or
    the solvation free energy prediction model comprises:
    a solvation network, configured to convert solute molecular features and solvent molecular features into a solvation feature;
    wherein the solvation network comprises a self-attention network, the self-attention network being configured to determine first array-element weights of the array elements in a solvent feature that correspond to atoms of the solvent molecule, and to determine second array-element weights of the array elements in a solute feature that correspond to atoms of the solute molecule, so that the array elements in the solvent feature that correspond to the atoms of the solvent molecule are fused according to the first array-element weights and the array elements in the solute feature that correspond to the atoms of the solute molecule are fused according to the second array-element weights, wherein the solvent feature and the solute feature are determined based on a solvation matrix, and the solvation matrix is determined based on the solute molecular features and the solvent molecular features;
    and/or
    the solvation free energy prediction model comprises:
    a fully connected network, configured to convert the solvation feature into the solvation free energy; the fully connected network comprises a first linear layer, a first activation function layer, a second linear layer, a second activation function layer and a third linear layer connected in sequence, wherein the output dimension of the first linear layer and of the second linear layer is F, and the output dimension of the third linear layer is 1.
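The layer sequence of the fully connected network in claim 14 can be sketched in plain Python. The claim fixes only the order of the layers and their output dimensions (F, F, 1); the ReLU activation, the input width of 2F, the random initialization and all function names below are assumptions chosen for illustration.

```python
import random


def linear(x, weight, bias):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(rng, in_dim, out_dim):
    """Small random weights and zero biases (an arbitrary choice)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def build_head(feature_dim, input_dim, seed=0):
    """Three linear layers with output dimensions F, F and 1."""
    rng = random.Random(seed)
    return [
        init_linear(rng, input_dim, feature_dim),
        init_linear(rng, feature_dim, feature_dim),
        init_linear(rng, feature_dim, 1),
    ]


def predict(head, solvation_feature):
    """Linear, activation, linear, activation, linear; returns one scalar."""
    (w1, b1), (w2, b2), (w3, b3) = head
    hidden = relu(linear(solvation_feature, w1, b1))
    hidden = relu(linear(hidden, w2, b2))
    return linear(hidden, w3, b3)[0]


F = 4
head = build_head(feature_dim=F, input_dim=2 * F)
energy = predict(head, [0.1] * (2 * F))  # untrained, so the value is meaningless
```

The final layer has no activation, so the output is an unbounded real number, which suits a regression target such as a free energy.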
  15. The method according to claim 14, characterized by further comprising:
    dividing the training data set into a specified number of sub-training data sets;
    constructing as many solvation free energy prediction models as the specified number;
    wherein inputting the training data into the molecular encoding network comprises:
    inputting the training data of each sub-training data set into the molecular encoding network of a different solvation free energy prediction model, so that the different solvation free energy prediction models are trained separately, yielding as many trained solvation free energy prediction models as the specified number.
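The data split of claim 15 can be sketched as follows. This is a minimal sketch, assuming a shuffled partition into disjoint, near-equal parts; the claim does not say how the sub-training data sets are formed, and the function name `split_dataset` and the fixed seed are illustrative.

```python
import random


def split_dataset(samples, num_parts, seed=0):
    """Partition samples into num_parts disjoint sub-training data sets."""
    if num_parts < 1 or num_parts > len(samples):
        raise ValueError("num_parts must be between 1 and the number of samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Striding through the shuffled list keeps part sizes within one of each other.
    return [shuffled[i::num_parts] for i in range(num_parts)]


# Ten illustrative sample indices split into three parts; one model would
# then be trained per part.
subsets = split_dataset(list(range(10)), 3)
```

Each resulting part would be fed to its own model, giving as many trained models as there are parts.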
  16. A method for determining solvation free energy, characterized by comprising:
    processing a virtual molecular graph with a solvation free energy prediction model trained by the method according to any one of claims 11 to 15, to obtain the solvation free energy for the virtual molecular graph, wherein the virtual molecular graph is a graph generated based on data to be processed, the data to be processed comprises attribute information for each of a plurality of atoms in a target molecule, and the target molecule comprises a solute molecule and/or a solvent molecule.
  17. The method according to claim 16, characterized by further comprising:
    inputting the virtual molecular graph or the data to be processed into each of a specified number of different trained solvation free energy prediction models, to obtain the specified number of solvation free energies;
    taking the weighted average of the specified number of solvation free energies as the solvation free energy corresponding to the data to be processed.
  18. A design method, characterized in that the method comprises:
    determining a solvation free energy by the method according to any one of claims 1 to 17;
    performing drug design or material design based on the solvation free energy.
  19. 一种数据处理装置,其特征在于,包括:A data processing device, characterized in that it comprises:
    待处理数据获得模块用于获得待处理数据,所述待处理数据包括针对目标分子中的多个原子各自的属性信息;The data to be processed obtaining module is used to obtain the data to be processed, and the data to be processed includes attribute information for a plurality of atoms in the target molecule;
    集合生成模块用于响应于所述多个原子各自的属性信息,生成针对所述目标分子的节点集合和节点位置集合,其中,所述节点集合中的多个节点分别表征特定原子类型的原子,所述节点位置集合包括所述节点集合中各节点在特定坐标系下的坐标信息;The set generation module is used to generate a set of nodes and a set of node positions for the target molecule in response to the respective attribute information of the plurality of atoms, wherein the multiple nodes in the set of nodes respectively represent atoms of a specific atom type, The node position set includes coordinate information of each node in the node set in a specific coordinate system;
    节点和边特征生成模块用于生成针对所述节点集合的节点标量特征N s和节点矢 量特征N v,并且基于所述节点位置集合中各节点的坐标信息生成针对所述节点集合的边标量特征E s和边矢量特征E vThe node and edge feature generation module is used to generate the node scalar feature N s and the node vector feature N v for the node set, and generate the edge scalar feature for the node set based on the coordinate information of each node in the node position set E s and edge vector features E v ;
    虚拟分子构建模块用于基于针对所述节点集合的节点标量特征N s、节点矢量特征N v、边标量特征E s和边矢量特征E v构建虚拟分子图,以基于所述虚拟分子图确定所述目标分子的分子特征X,便于至少基于所述目标分子的分子特征X确定溶剂化自由能。 The virtual molecular building block is used to construct a virtual molecular graph based on the node scalar feature N s , node vector feature N v , edge scalar feature E s and edge vector feature E v for the node set, to determine all molecular graphs based on the virtual molecular graph. The molecular characteristic X of the target molecule facilitates determining the free energy of solvation based at least on the molecular characteristic X of the target molecule.
  20. The apparatus according to claim 19, wherein the target molecule comprises N atoms, and the nodes in the node set each have an F-dimensional feature;
    the dimensions of the node scalar feature N_s include N×F×1, the dimensions of the node vector feature N_v include N×F×3, the dimensions of the edge scalar feature E_s include N×1×1, and the dimensions of the edge vector feature E_v include N×3×1.
  21. The apparatus according to claim 19, further comprising:
    a cutoff radius determination module, configured to determine a cutoff radius r_cut after the node set and the node position set for the target molecule are generated in response to the respective attribute information of the plurality of atoms; and
    a target node set determination module, configured to determine, from the node set, target nodes whose inter-node distance is less than or equal to the cutoff radius r_cut, to obtain a target node set N_i;
    wherein the target node set determination module is specifically configured to generate the edge scalar feature E_s and the edge vector feature E_v for the target node set N_i based on the coordinate information for the target nodes in the node position set.
  22. The apparatus according to claim 21, wherein the target node set includes E nodes, and the E nodes each have an F-dimensional feature;
    the dimensions of the node scalar feature N_s include N×F×1, the dimensions of the node vector feature N_v include N×F×3, the dimensions of the edge scalar feature E_s include E×1×1, and the dimensions of the edge vector feature E_v include E×3×1.
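The cutoff-radius selection of claims 21 and 22 can be illustrated with a minimal NumPy sketch. The function name `build_edges`, the use of the interatomic distance as E_s and of the unit direction as E_v, and the reading of E as the number of directed node pairs within r_cut are all assumptions made for illustration; the claims fix only the shapes E×1×1 and E×3×1.

```python
import numpy as np

def build_edges(positions, r_cut):
    """Hypothetical helper: select node pairs within r_cut and build edge features.

    Returns (pairs, e_s, e_v) with shapes (E, 2), (E, 1, 1) and (E, 3, 1).
    Assumes no two distinct nodes share the same coordinates.
    """
    positions = np.asarray(positions, dtype=float)          # (N, 3)
    diff = positions[None, :, :] - positions[:, None, :]    # diff[i, j] = r_j - r_i
    dist = np.linalg.norm(diff, axis=-1)                    # (N, N) pairwise distances
    # Keep pairs with distance <= r_cut, excluding each node paired with itself.
    mask = (dist <= r_cut) & ~np.eye(len(positions), dtype=bool)
    src, dst = np.nonzero(mask)
    d = dist[src, dst]
    e_s = d.reshape(-1, 1, 1)                               # assumed edge scalar feature: distance
    e_v = (diff[src, dst] / d[:, None]).reshape(-1, 3, 1)   # assumed edge vector feature: unit direction
    return np.stack([src, dst], axis=1), e_s, e_v
```

For three atoms on a line at x = 0, 1 and 5 with r_cut = 1.5, only the first two atoms are connected, giving two directed pairs.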
  23. The apparatus according to claim 20 or 22, wherein the apparatus comprises:
    a feature update module, configured to update the node scalar feature N_s and the node vector feature N_v based on the virtual molecular graph, to obtain an updated node scalar feature New_N_s and an updated node vector feature New_N_v; and
    a loop module, configured to repeat the following units until a specified number of iterations num_conv is reached, so that the updated node scalar feature New_N_s obtained when the specified number of iterations num_conv is reached is used as the molecular feature X:
    a feature replacement unit, configured to use the updated node scalar feature New_N_s and the updated node vector feature New_N_v as a current node scalar feature Now_N_s and a current node vector feature Now_N_v, respectively;
    a feature calculation unit, configured to construct an updated virtual molecular graph using the current node scalar feature Now_N_s, the current node vector feature Now_N_v, the edge scalar feature E_s and the edge vector feature E_v; and
    a feature update unit, configured to update the updated node scalar feature New_N_s and the updated node vector feature New_N_v based on the updated virtual molecular graph.
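The loop of claim 23 can be sketched as follows. The callable `conv_layer` is a hypothetical placeholder for one feature update, and counting the initial update as the first of the num_conv iterations is an assumption; the claim does not spell out how the count is kept.

```python
def run_conv_loop(n_s, n_v, e_s, e_v, conv_layer, num_conv):
    """Apply a hypothetical update `conv_layer` num_conv times and return X.

    conv_layer(n_s, n_v, e_s, e_v) must return (new_n_s, new_n_v).
    The edge features e_s and e_v are reused unchanged in every iteration.
    """
    if num_conv < 1:
        raise ValueError("num_conv must be at least 1")
    # Initial update performed by the feature update module.
    new_n_s, new_n_v = conv_layer(n_s, n_v, e_s, e_v)
    for _ in range(num_conv - 1):
        # Feature replacement: updated features become the current features.
        now_n_s, now_n_v = new_n_s, new_n_v
        # Feature calculation and update on the rebuilt graph.
        new_n_s, new_n_v = conv_layer(now_n_s, now_n_v, e_s, e_v)
    # The final updated node scalar feature is used as the molecular feature X.
    return new_n_s
```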
  24. The apparatus according to claim 23, wherein the feature update module is specifically configured to perform the following operations:
    sequentially applying a first linear operation, a second activation function and a second linear operation to the node scalar feature N_s to obtain a first sub-processing result Q1, and applying a third linear operation to the edge scalar feature E_s to obtain a second sub-processing result Q2;
    performing a first element-wise matrix multiplication operation on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain a third sub-processing result Q3;
    performing a second element-wise matrix multiplication operation based on the third sub-processing result Q3 and the node vector feature N_v to obtain a fourth sub-processing result Q4, and performing a third element-wise matrix multiplication operation based on the third sub-processing result Q3 and the edge vector feature E_v to obtain a fifth sub-processing result Q5;
    performing a first matrix addition operation on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain a sixth sub-processing result Q6;
    applying a fourth linear operation and a fifth linear operation separately to the sixth sub-processing result Q6 to obtain a seventh sub-processing result Q7 and an eighth sub-processing result Q8;
    sequentially applying a sixth linear operation, the second activation function and a seventh linear operation to the seventh sub-processing result Q7 to obtain a ninth sub-processing result Q9, and performing an inner product operation Inner based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain a tenth sub-processing result Q10;
    performing a fourth element-wise matrix multiplication operation on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain an eleventh sub-processing result Q11;
    performing a fifth matrix multiplication operation on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N_v; and
    performing a second matrix addition operation on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s.
  25. The apparatus according to claim 23, wherein the target molecule is a solute molecule and/or a solvent molecule;
    the apparatus further comprises:
    a solute and solvent molecular feature determination module, configured to determine a solute molecular feature of a solute molecule and a solvent molecular feature of at least one solvent molecule associated with the solute molecule, so that the solvation free energy is determined based on the solute molecular feature of the solute molecule and the solvent molecular feature of the at least one solvent molecule associated with the solute molecule.
  26. The apparatus according to claim 25, further comprising:
    a solvation interaction matrix determination module, configured to, after the solute molecular feature of the solute molecule and the solvent molecular feature of the at least one solvent molecule associated with the solute molecule are determined, use the matrix product of the solvent molecular feature and the solute molecular feature as a solvation interaction matrix between the solvent molecule and the solute molecule; and
    a solvation interaction feature determination module, configured to determine a solvation interaction feature based on the solvation interaction matrix.
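As an illustration of the matrix product in claim 26, the sketch below assumes that each molecular feature is stored as a per-atom matrix of shape (number of atoms, F) and that the product is taken over the shared feature axis. Both the storage layout and the name `solvation_matrix` are assumptions, not details stated in the claim.

```python
import numpy as np

def solvation_matrix(solvent_x, solute_x):
    """Assumed form of the solvation interaction matrix.

    solvent_x: (N_solvent, F) solvent molecular feature
    solute_x:  (N_solute, F) solute molecular feature
    Returns a matrix of shape (N_solvent, N_solute) whose entry (i, j)
    is the inner product of solvent atom i and solute atom j features.
    """
    solvent_x = np.asarray(solvent_x, dtype=float)
    solute_x = np.asarray(solute_x, dtype=float)
    return solvent_x @ solute_x.T
```

Under this layout, each entry of the matrix pairs one solvent atom with one solute atom, which is what lets the later steps assign per-atom weights.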
  27. The apparatus according to claim 26, wherein the solvation interaction feature determination module comprises:
    a solvent feature determination unit, configured to calculate, based on the solvation interaction matrix, a solvent feature corresponding to a preset solute weight, and to calculate, based on the solvation interaction matrix, a solute feature corresponding to a preset solvent weight;
    a solute feature determination unit, configured to convert the solvent feature and the solute feature each into a one-dimensional row vector including F elements; and
    a solvation interaction feature determination unit, configured to concatenate the row vectors to obtain the solvation interaction feature.
  28. The apparatus according to claim 27, wherein the solute feature determination unit comprises:
    an array element weight determination subunit, configured to determine first array element weights of the array elements in the solvent feature that correspond to atoms of the solvent molecule, and to determine second array element weights of the array elements in the solute feature that correspond to atoms of the solute molecule; and
    a weighted summation subunit, configured to perform a weighted summation of the solvent feature based on the first array element weights to obtain a one-dimensional first row vector including F elements, and to perform a weighted summation of the solute feature based on the second array element weights to obtain a one-dimensional second row vector including F elements.
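The weighted summation and concatenation of claims 27 and 28 might look like the sketch below. How the array element weights are obtained is not specified in these claims; here they are assumed to come from a softmax over hypothetical per-atom scores, and the names `softmax` and `pool_and_concat` are illustrative only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / e.sum()

def pool_and_concat(solvent_feat, solute_feat, solvent_scores, solute_scores):
    """Reduce per-atom features to two F-element row vectors and concatenate them.

    solvent_feat: (N_solvent, F), solute_feat: (N_solute, F)
    solvent_scores / solute_scores: assumed per-atom scores, one per row.
    Returns a vector of 2F elements.
    """
    w1 = softmax(solvent_scores)                     # first array element weights
    w2 = softmax(solute_scores)                      # second array element weights
    row1 = w1 @ np.asarray(solvent_feat, float)      # first row vector, F elements
    row2 = w2 @ np.asarray(solute_feat, float)       # second row vector, F elements
    return np.concatenate([row1, row2])
```

With equal scores the weights are uniform, so each row vector reduces to the mean of the per-atom features.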
  29. An apparatus for training a solvation free energy prediction model, comprising:
    a model training module, configured to input a virtual molecular graph determined by the apparatus according to any one of claims 19 to 28 into the solvation free energy prediction model, and to adjust model parameters until a loss function converges, to obtain a trained solvation free energy prediction model, wherein the virtual molecular graph has corresponding solvation free energy annotation information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy annotation information.
  30. The apparatus according to claim 29, wherein the solvation free energy prediction model comprises:
    an equivariant graph convolutional network, configured to convert the virtual molecular graph into solute molecular features and/or solvent molecular features;
    wherein the equivariant graph convolutional network comprises num_conv convolutional layers, num_conv being the specified number of iterations, and the output of a current convolutional layer serves as part of the input of the adjacent next convolutional layer; the input of the first convolutional layer includes: the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v; the output of the first convolutional layer includes: the updated node scalar feature New_N_s and the updated node vector feature New_N_v; the input of each convolutional layer other than the first includes: the updated node scalar feature New_N_s, the updated node vector feature New_N_v, the edge scalar feature E_s and the edge vector feature E_v; and the output of each convolutional layer other than the first includes: the updated node scalar feature New_N_s and the updated node vector feature New_N_v.
  31. The apparatus according to claim 30, wherein the convolutional layer is configured to:
    sequentially apply a first linear operation, a second activation function and a second linear operation to the node scalar feature N_s to obtain a first sub-processing result Q1, and apply a third linear operation to the edge scalar feature E_s to obtain a second sub-processing result Q2;
    perform a first element-wise matrix multiplication operation on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain a third sub-processing result Q3;
    perform a second element-wise matrix multiplication operation based on the third sub-processing result Q3 and the node vector feature N_v to obtain a fourth sub-processing result Q4, and perform a third element-wise matrix multiplication operation based on the third sub-processing result Q3 and the edge vector feature E_v to obtain a fifth sub-processing result Q5;
    perform a first matrix addition operation on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain a sixth sub-processing result Q6;
    apply a fourth linear operation and a fifth linear operation separately to the sixth sub-processing result Q6 to obtain a seventh sub-processing result Q7 and an eighth sub-processing result Q8;
    sequentially apply a sixth linear operation, the second activation function and a seventh linear operation to the seventh sub-processing result Q7 to obtain a ninth sub-processing result Q9, and perform an inner product operation Inner based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain a tenth sub-processing result Q10;
    perform a fourth element-wise matrix multiplication operation on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain an eleventh sub-processing result Q11;
    perform a second matrix addition operation on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s; and
    perform a fifth element-wise matrix multiplication operation on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N_v.
  32. The apparatus according to claim 29, wherein:
    the solvation free energy prediction model comprises:
    a molecular encoding network, configured to convert each piece of training data in a training data set including solute molecule data and/or solvent molecule data into a virtual molecular graph for the solute molecule data and/or for the solvent molecule data, wherein the training data has solvation free energy annotation information, and the atoms of the solute molecule or the solvent molecule in the training data each have an F-dimensional feature;
    and/or
    the solvation free energy prediction model comprises:
    a solvation interaction network, configured to convert solute molecular features and solvent molecular features into a solvation interaction feature;
    wherein the solvation interaction network comprises a self-attention network, the self-attention network being configured to determine first array element weights of the array elements in a solvent feature that correspond to atoms of the solvent molecule, and to determine second array element weights of the array elements in a solute feature that correspond to atoms of the solute molecule, so that the array elements in the solvent feature corresponding to the atoms of the solvent molecule are fused according to the first array element weights and the array elements in the solute feature corresponding to the atoms of the solute molecule are fused according to the second array element weights, wherein the solvent feature and the solute feature are determined based on a solvation interaction matrix, and the solvation interaction matrix is determined based on the solute molecular features and the solvent molecular features;
    and/or
    the solvation free energy prediction model comprises:
    a fully connected network, configured to convert the solvation interaction feature into a solvation free energy; the fully connected network comprises, connected in sequence, a first linear layer, a first activation function layer, a second linear layer, a second activation function layer and a third linear layer, wherein the output dimension of the first linear layer and of the second linear layer is F, and the output dimension of the third linear layer is 1.
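The fully connected network described above (linear, activation, linear, activation, linear, with output dimensions F, F and 1) can be sketched in NumPy as follows. The ReLU activation, the random initialization, and the input width are assumptions; the claim names only the layer order and the output dimensions.

```python
import numpy as np

def relu(x):
    """Assumed activation function; the claim does not name one."""
    return np.maximum(x, 0.0)

def init_mlp(in_dim, F, rng):
    """Create illustrative parameters for the three linear layers."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, F)), "b1": np.zeros(F),
        "W2": rng.normal(scale=0.1, size=(F, F)),      "b2": np.zeros(F),
        "W3": rng.normal(scale=0.1, size=(F, 1)),      "b3": np.zeros(1),
    }

def mlp_forward(params, x):
    """Map a solvation interaction feature to a single free energy value."""
    h = relu(x @ params["W1"] + params["b1"])   # first linear layer (F outputs) + activation
    h = relu(h @ params["W2"] + params["b2"])   # second linear layer (F outputs) + activation
    return h @ params["W3"] + params["b3"]      # third linear layer (1 output)
```

If the solvation interaction feature is the concatenation of two F-element row vectors, `in_dim` would be 2F, for example `init_mlp(2 * F, F, np.random.default_rng(0))`.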
  33. The apparatus according to claim 32, further comprising:
    a training set splitting module, configured to split the training data set into a specified number of sub-training data sets; and
    a model building module, configured to build as many solvation free energy prediction models as the specified number;
    wherein the model training module is specifically configured to input the training data in each sub-training data set into the molecular encoding network of a different solvation free energy prediction model, so as to train the different solvation free energy prediction models separately and obtain as many trained solvation free energy prediction models as the specified number.
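A minimal sketch of the split-and-train scheme in claim 33 follows. The shuffling before the split, the seed, and the callables `make_model` and `train_fn` are hypothetical; the claim states only that the training set is divided into a specified number of subsets and that one model is trained per subset.

```python
import numpy as np

def split_training_set(samples, num_parts, seed=0):
    """Shuffle the samples and split them into num_parts near-equal subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))
    return [[samples[i] for i in part] for part in np.array_split(order, num_parts)]

def train_ensemble(samples, num_parts, make_model, train_fn):
    """Train one model per subset.

    make_model() builds a fresh, untrained model (hypothetical callable).
    train_fn(model, subset) returns the trained model (hypothetical callable).
    """
    subsets = split_training_set(samples, num_parts)
    return [train_fn(make_model(), subset) for subset in subsets]
```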
  34. An apparatus for determining a solvation free energy, comprising:
    a free energy prediction module, configured to process a virtual molecular graph using a solvation free energy prediction model trained by the apparatus according to any one of claims 29 to 33, to obtain a solvation free energy for the virtual molecular graph, wherein the virtual molecular graph is a graph generated based on data to be processed, the data to be processed includes respective attribute information of a plurality of atoms in a target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.
  35. The apparatus according to claim 34, further comprising:
    a multi-model processing module, configured to input the virtual molecular graph or the data to be processed into each of a specified number of different trained solvation free energy prediction models, to obtain the specified number of solvation free energies; and
    a weighting module, configured to use a weighted average of the specified number of solvation free energies as the solvation free energy corresponding to the data to be processed.
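The weighted averaging of claim 35 reduces to a single call in NumPy. The function name `ensemble_free_energy` and the fallback to equal weights are assumptions; the claim does not say how the weights are chosen.

```python
import numpy as np

def ensemble_free_energy(predictions, weights=None):
    """Combine per-model solvation free energies into one value.

    predictions: one predicted free energy per trained model.
    weights: optional per-model weights; equal weights are assumed if omitted.
    """
    return float(np.average(np.asarray(predictions, dtype=float), weights=weights))
```

For two models predicting -3.0 and -5.0 with weights 1 and 3, the combined value is (-3.0 - 15.0) / 4 = -4.5.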
  36. A design apparatus, comprising:
    a solvation free energy determination module, configured to determine a solvation free energy by means of the apparatus according to any one of claims 19 to 35; and
    a design module, configured to perform drug design or material design based on the solvation free energy.
  37. An electronic device, comprising:
    a processor; and
    a memory storing executable code which, when executed by the processor, causes the processor to perform the method according to any one of claims 1 to 17.
  38. A computer-readable storage medium storing executable code which, when executed by a processor of an electronic device, causes the processor to perform the method according to any one of claims 1 to 17.
PCT/CN2021/140134 2021-12-21 2021-12-21 Data processing method and apparatus, model training method and free energy prediction method WO2023115343A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/140134 WO2023115343A1 (en) 2021-12-21 2021-12-21 Data processing method and apparatus, model training method and free energy prediction method

Publications (1)

Publication Number Publication Date
WO2023115343A1 true WO2023115343A1 (en) 2023-06-29

Family

ID=86900930

Country Status (1)

Country Link
WO (1) WO2023115343A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200365235A1 (en) * 2019-05-15 2020-11-19 International Business Machines Corporation Feature vector feasibility estimation
CN113571122A (en) * 2021-02-02 2021-10-29 腾讯科技(深圳)有限公司 Electronic density map determining method and device, electronic equipment and storage medium
CN113011282A (en) * 2021-02-26 2021-06-22 腾讯科技(深圳)有限公司 Graph data processing method and device, electronic equipment and computer storage medium
CN113409893A (en) * 2021-06-25 2021-09-17 成都职业技术学院 Molecular feature extraction and performance prediction method based on image convolution
CN113707235A (en) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 Method, device and equipment for predicting properties of small drug molecules based on self-supervision learning
CN113707236A (en) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 Method, device and equipment for predicting properties of small drug molecules based on graph neural network

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116705197A (en) * 2023-08-02 2023-09-05 北京深势科技有限公司 Method and device for processing synthetic and inverse synthetic molecular diagram prediction model
CN116705197B (en) * 2023-08-02 2023-11-17 北京深势科技有限公司 Method and device for processing synthetic and inverse synthetic molecular diagram prediction model
CN116991459A (en) * 2023-08-18 2023-11-03 中南大学 Software multi-defect information prediction method and system
CN116991459B (en) * 2023-08-18 2024-04-26 中南大学 Software multi-defect information prediction method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21968495

Country of ref document: EP

Kind code of ref document: A1