WO2023115343A1 - Data processing method and apparatus, model training method and free energy prediction method

Info

Publication number: WO2023115343A1
Application number: PCT/CN2021/140134
Authority: WO
Inventors: 付文博; 曾群
Original assignee: 深圳晶泰科技有限公司
Priority date: 2021-12-21
Filing date: 2021-12-21
Publication date: 2023-06-29

Abstract

A data processing method and apparatus, a model training method, and a free energy prediction method. The data processing method comprises: obtaining data to be processed, the data to be processed comprising attribute information for each of multiple atoms in a target molecule; in response to the attribute information of each of the multiple atoms, generating a node set and a node position set for the target molecule; generating a node scalar feature N_s and a node vector feature N_v for the node set, and generating an edge scalar feature E_s and an edge vector feature E_v for the node set on the basis of coordinate information of each node in the node position set; and constructing a virtual molecular graph on the basis of the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v for the node set, so as to determine a molecular feature X of the target molecule on the basis of the virtual molecular graph and to determine the solvation free energy on the basis of the molecular feature X of the target molecule. The present application can improve the accuracy of the determined solvation free energy.

Description

Data processing method and apparatus, model training method and free energy prediction method

Technical field

The present application relates to the technical field of computer simulation, in particular to a data processing method, device, model training method and free energy prediction method.

Background art

With the rapid development of computer technology and artificial intelligence technology, computer simulation technology has been applied to more and more scenarios, such as material design, drug design, etc.

However, the applicant found that the accuracy of molecular solvation free energy obtained by related techniques is low.

Summary of the invention

In order to solve or partially solve the problems existing in related technologies, this application provides a data processing method, device, model training method and free energy prediction method, which can effectively improve the accuracy of the obtained molecular solvation free energy.

The first aspect of the present application provides a data processing method, including: obtaining data to be processed, where the data to be processed includes attribute information for each of a plurality of atoms in a target molecule; in response to the attribute information of each of the plurality of atoms, generating a node set and a node position set for the target molecule, where the nodes in the node set each represent an atom of a specific atom type, and the node position set includes coordinate information of each node in the node set in a specific coordinate system; generating a node scalar feature N_s and a node vector feature N_v for the node set, and generating an edge scalar feature E_s and an edge vector feature E_v for the node set based on the coordinate information of each node in the node position set; and constructing a virtual molecular graph based on the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v for the node set, so as to determine a molecular feature X of the target molecule based on the virtual molecular graph, which facilitates determining the solvation free energy based at least on the molecular feature X of the target molecule.

The second aspect of the present application provides a method for training a solvation free energy prediction model, including: inputting the virtual molecular graph determined by the above method into the solvation free energy prediction model, and adjusting the model parameters until the loss function converges, to obtain a trained solvation free energy prediction model, where the virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.

The third aspect of the present application provides a method for determining the solvation free energy, including: processing a virtual molecular graph with a trained solvation free energy prediction model to obtain the solvation free energy for the virtual molecular graph, where the virtual molecular graph is a graph generated from data to be processed, the data to be processed includes attribute information for each of a plurality of atoms in a target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.

The fourth aspect of the present application provides a design method, including: determining the free energy of solvation according to the above-mentioned method; performing drug design or material design based on the free energy of solvation.

The fifth aspect of the present application provides a data processing device, including: a to-be-processed data obtaining module, configured to obtain data to be processed, where the data to be processed includes attribute information for each of a plurality of atoms in a target molecule; a set generation module, configured to generate a node set and a node position set for the target molecule in response to the attribute information of each of the plurality of atoms, where the nodes in the node set each represent an atom of a specific atom type, and the node position set includes coordinate information of each node in the node set in a specific coordinate system; a node and edge feature generation module, configured to generate a node scalar feature N_s and a node vector feature N_v for the node set, and to generate an edge scalar feature E_s and an edge vector feature E_v for the node set based on the coordinate information of each node in the node position set; and a virtual molecular graph construction module, configured to construct a virtual molecular graph based on the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v for the node set, so as to determine a molecular feature X of the target molecule based on the virtual molecular graph, which facilitates determining the solvation free energy based at least on the molecular feature X of the target molecule.

The sixth aspect of the present application provides a device for training a solvation free energy prediction model, including: a model training module, configured to input the virtual molecular graph determined by the above method into the solvation free energy prediction model and to adjust the model parameters until the loss function converges, to obtain a trained solvation free energy prediction model, where the virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.

The seventh aspect of the present application provides a device for determining the solvation free energy, including: a free energy prediction module, configured to process a virtual molecular graph with a trained solvation free energy prediction model to obtain the solvation free energy for the virtual molecular graph, where the virtual molecular graph is a graph generated from data to be processed, the data to be processed includes attribute information for each of a plurality of atoms in a target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.

The eighth aspect of the present application provides a design device, including: a solvation free energy determination module, configured to determine the solvation free energy according to the above method; and a design module, configured to perform drug design or material design based on the solvation free energy.

A ninth aspect of the present application provides an electronic device, including: a processor; and a memory, on which executable code is stored, and when the executable code is executed by the processor, the processor is made to execute the above method.

The tenth aspect of the present application also provides a computer-readable storage medium, on which executable codes are stored, and when the executable codes are executed by a processor of an electronic device, the processor is made to execute the above method.

The eleventh aspect of the present application further provides a computer program product, including executable codes, and the above method is implemented when the executable codes are executed by a processor.

The data processing method, device, model training method and free energy prediction method provided by the present application convert the data to be processed into a node set and a node position set for the target molecule, so that the node scalar feature N_s and the node vector feature N_v for the node set can be generated, and the edge scalar feature E_s and the edge vector feature E_v for the node set can be generated based on the coordinate information of each node in the node position set. Compared with the low-dimensional descriptors in related technologies, these descriptors, which can represent the three-dimensional features of molecules, represent the characteristics of the target molecule more completely and effectively improve the accuracy of the determined solvation free energy.

In addition, the solvent-solute interaction is described by the matrix product of the solute molecule feature vector and the solvent molecule feature vector, which expresses the solvent-solute interaction more intuitively as a formula and effectively improves the accuracy of the determined solvation free energy.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Brief description of the drawings

Fig. 1 schematically shows an exemplary system architecture to which a data processing method, device, model training method and free energy prediction method can be applied according to an embodiment of the present application;

Fig. 2 schematically shows a flow chart of a data processing method according to an embodiment of the present application;

Fig. 3 schematically shows a flow chart of a method for determining the molecular feature of a target molecule based on a virtual molecular graph according to an embodiment of the present application;

Fig. 4 schematically shows a logic diagram for updating node scalar features and node vector features based on a virtual molecular graph according to an embodiment of the present application;

Fig. 5 schematically shows a flow chart of another data processing method according to an embodiment of the present application;

Fig. 6 schematically shows a flow chart of a method for training a solvation free energy prediction model according to an embodiment of the present application;

Fig. 7 schematically shows a schematic structural diagram of an equivariant graph convolutional network according to an embodiment of the present application;

Fig. 8 schematically shows a schematic structural diagram of a fully connected network according to an embodiment of the present application;

Fig. 9 schematically shows a flow chart of a method for determining the solvation free energy according to an embodiment of the present application;

Fig. 10 schematically shows a correlation plot between the solvation free energy predicted by the model and the true solvation free energy on the training set, for training with the data set divided by solvent type, according to an embodiment of the present application;

Fig. 11 schematically shows a correlation plot between the solvation free energy predicted by the model and the true solvation free energy on the test set, for training with the data set divided by solvent type, according to an embodiment of the present application;

Fig. 12 schematically shows a correlation plot between the solvation free energy predicted by the model and the true solvation free energy on the training set, for training with the data set divided by solute type, according to an embodiment of the present application;

Fig. 13 schematically shows a correlation plot between the solvation free energy predicted by the model and the true solvation free energy on the test set, for training with the data set divided by solute type, according to an embodiment of the present application;

Fig. 14 schematically shows a flow chart of a design method according to an embodiment of the present application;

Fig. 15 schematically shows a block diagram of a data processing device according to an embodiment of the present application;

Fig. 16 schematically shows a block diagram of a device for training a solvation free energy prediction model according to an embodiment of the present application;

Fig. 17 schematically shows a block diagram of a device for determining the free energy of solvation according to an embodiment of the present application;

Fig. 18 schematically shows a block diagram of a design device according to an embodiment of the present application;

Fig. 19 schematically shows a block diagram of an electronic device according to an embodiment of the present application.

Detailed description

Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. Although embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this application will be thorough and complete, and will fully convey the scope of this application to those skilled in the art.

The terminology used in this application is for the purpose of describing particular embodiments only, and is not intended to limit the application. The terms "comprising", "including", etc. used herein indicate the presence of features, steps, operations and/or components, but do not exclude the presence or addition of one or more other features, steps, operations or components.

All terms (including technical and scientific terms) used herein have the meaning commonly understood by one of ordinary skill in the art, unless otherwise defined. It should be noted that the terms used herein should be interpreted to have a meaning consistent with the context of this specification, and not be interpreted in an idealized or overly rigid manner.

It should be understood that although the terms "first", "second", "third" and so on may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Thus, a feature defined as "first" and "second" may explicitly or implicitly include one or more of these features. In the description of the present application, "plurality" means two or more, unless otherwise specifically defined.

Before describing the technical solution of the application, some technical terms in the field involved in the application will be explained first.

A molecular descriptor is a representation of a molecule as a data structure that a computer program can process.

A virtual molecular graph is a molecular descriptor that represents atoms as nodes and the relationships between atoms as edges. Unlike an ordinary molecular graph, which creates edges from the bonding information between atoms, a virtual molecular graph creates edges based on a cutoff radius.

Cutoff radius: for a given atom in a molecule, creating edges to all other atoms would produce too many edges and make the computation too expensive. Because atoms farther away have less influence on the given atom, a cutoff radius is chosen, and the atom creates edges only with atoms whose distance from it is smaller than the cutoff radius; interactions with atoms outside the cutoff radius are ignored.

A coordinate system (frame) is a reference for describing the position and attitude of an object, so it is also called a frame of reference or a reference system. For example, the coordinate system may be one created during simulation, such as a Cartesian coordinate system.

Coordinates represent the absolute position of an object in a specific coordinate system. Mathematically, coordinates are an ordered tuple of numbers.

Solvation is a reaction process driven by the interaction between solute molecules and solvent molecules. It is a key step in processes involved in drug research and development, such as crystal nucleation, chemical reactions, drug metabolism, drug interactions and drug-receptor interactions. The strength of solvation is usually characterized by the solvation free energy, so predicting the solvation free energy quickly and accurately is of great significance in drug development. In the related art, the solvation free energy can be predicted by two routes.

The first route uses an empirical force field fitted to experimental and computational data, and applies free energy perturbation or thermodynamic integration in molecular dynamics simulations. In practice, the applicant found that although the solvation free energy obtained this way has a small error relative to experimental results, free energy perturbation and thermodynamic integration require long simulations and are computationally expensive, which makes fast prediction of the solvation free energy difficult to achieve.

The second route uses machine learning methods to build a structure-to-solvation-free-energy model from existing experimental and computational data. This route can predict the solvation free energy quickly, but in practice the applicant found that in some cases the predicted solvation free energy is not accurate enough to solve solvation-related problems.

After extensive research and analysis, the applicant found that the reasons for the low accuracy of predictions made with machine learning methods include the following two. First, the machine learning methods in related technologies represent molecules with low-dimensional descriptors such as SMILES, MACCS, Morgan and hybrid fingerprints, which cannot fully represent the three-dimensional features of molecules. Second, when describing the solvent-solute interaction, the models built by these methods simply concatenate, arrange or sum the molecular features of the solvent and solute molecules, without explicitly describing the solvent-solute interaction in a physically meaningful framework.

To predict the molecular solvation free energy quickly and accurately, the embodiments of the present application design descriptors that can represent the three-dimensional features of molecules and/or machine learning models that can explicitly describe solvent-solute interactions, so as to improve the accuracy of the predicted solvation free energy.

The data processing method, device, model training method and free energy prediction method of the embodiments of the present application will be described in detail below with reference to Figs. 1 to 19.

Fig. 1 schematically shows an exemplary system architecture to which a data processing method, an apparatus, a model training method and a free energy prediction method can be applied according to an embodiment of the present application. It should be noted that Figure 1 is only an example of the system architecture to which the embodiment of the present application can be applied, to help those skilled in the art understand the technical content of the present application, but it does not mean that the embodiment of the present application cannot be used in other device, system, environment or scenario.

Referring to Fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.

Users can use the terminal devices 101, 102, 103 to interact with other terminal devices and the server 105 through the network 104 to receive or send information, such as sending model training requests and free energy prediction requests, and receiving model training results, solvation free energies, and so on.

The terminal devices 101, 102, 103 can be installed with various communication client applications, for example, drug development applications, material design applications, web browser applications, database applications, search applications, instant messaging tools, email clients, social platform software and other applications.

The terminal devices 101, 102, 103 include, but are not limited to, smart desktop computers, tablet computers, laptop computers, and other electronic devices that support functions such as Internet access, modeling, analysis and calculation, and design.

The server 105 can receive model training requests, solvation free energy requests, and so on; adjust model parameters; store the model topology and model parameters; predict the solvation free energy; and send the solvation free energy to the terminal devices 101, 102, 103. For example, the server 105 may be a background management server, a server cluster, and the like.

It should be noted that the numbers of terminal devices, networks and servers are only illustrative. Depending on implementation requirements, there can be any number of terminal devices, networks and servers.

Fig. 2 schematically shows a flowchart of a data processing method according to an embodiment of the present application.

As shown in Figure 2, this embodiment provides a method for data processing, the method includes operation S210 to operation S240, specifically as follows:

In operation S210, data to be processed is obtained, where the data to be processed includes attribute information for each of a plurality of atoms in the target molecule.

In this embodiment, the data to be processed may be a character string. The attribute information can be used to characterize attributes of the target molecule and of at least some of the atoms in the target molecule. The attributes include, but are not limited to, spatial position, molecule type, atom type, and so on. The spatial position attribute can be coordinates in a Cartesian or polar coordinate system. Molecule types may include solute molecules and solvent molecules. The atom type can be determined from the number of protons and/or neutrons in the atom. For example, protium, deuterium and tritium can be treated either as the same atom type or as different atom types.

Specifically, the data to be processed may be the three-dimensional conformations of the solute molecule and the solvent molecule, represented as strings in xyz format. For example, a string can include the three-dimensional conformation of a molecule, with the x, y and z coordinates of each atom in the molecule.

For example, reading the xyz string of a molecule yields the atom type information and atom position information of the molecule.
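As an illustration of reading atom types and positions from such a string, the following is a minimal sketch. It assumes the widely used XYZ text layout (first line: atom count; second line: free-form comment; then one "symbol x y z" line per atom), which the application itself does not spell out. The function name `parse_xyz` and the sample water geometry are hypothetical.

```python
from typing import List, Tuple

Position = Tuple[float, float, float]


def parse_xyz(text: str) -> Tuple[List[str], List[Position]]:
    """Return (atom types, atom positions) read from an XYZ-style string.

    Assumed layout (not specified in the source): line 1 is the atom count,
    line 2 is a comment, and each following line is "symbol x y z".
    """
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].strip())
    atom_types: List[str] = []
    positions: List[Position] = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atom_types.append(symbol)
        positions.append((float(x), float(y), float(z)))
    if len(atom_types) != n_atoms:
        raise ValueError("atom count does not match the number of atom lines")
    return atom_types, positions


# Hypothetical example input: an approximate water geometry in angstroms.
WATER_XYZ = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""

atom_types, positions = parse_xyz(WATER_XYZ)
```

The two returned lists correspond to the node set and the node position set generated in the next operation.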

In addition, the same coordinate system, especially the same spatial coordinate system, can be used in the calculation of the molecular solvation free energy and in the drug design process. For example, some or all of the geometric center coordinates of the solute molecule, the atomic coordinates in the solute molecule, the geometric center coordinates of the solvent molecule, and the atomic coordinates in the solvent molecule are coordinates in the same coordinate system. It should be understood that some or all of these coordinates may also be expressed in different coordinate systems, provided that the coordinates in the different coordinate systems can be converted to one another.

In operation S220, in response to the attribute information of each of the plurality of atoms, a node set and a node position set for the target molecule are generated, where the nodes in the node set each represent an atom of a specific atom type, and the node position set includes the coordinate information of each node in the node set in a specific coordinate system.

In this embodiment, a node set {A_i} and a node position set {(x_i, y_i, z_i)} are generated correspondingly, where i = 1, 2, ..., N, N represents the number of atoms contained in the molecule, and A represents the atom type.

In operation S230, the node scalar feature N_s and the node vector feature N_v for the node set are generated, and the edge scalar feature E_s and the edge vector feature E_v for the node set are generated based on the coordinate information of each node in the node position set.

In this embodiment, the target molecule includes N atoms, and each node in the node set has F-dimensional features.

Correspondingly, the dimensions of the node scalar feature N_s include N×F×1, the dimensions of the node vector feature N_v include N×F×3, the dimensions of the edge scalar feature E_s include N×1×1, and the dimensions of the edge vector feature E_v include N×3×1.

Specifically, a feature dimension F is set; for example, F is not less than 64. Optionally, F is an integer power of 2. The elements of the node set are embedding-encoded according to node type (such as atom type), so that the node set is expressed as an N×F×1 matrix representing the node scalar feature N_s; at the same time, an N×F×3 matrix is initialized to represent the node vector feature N_v. Here, embedding encoding may mean randomly expressing each node as an F-dimensional vector, which is then updated during subsequent model training.
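The initialization described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the applicant's implementation: the function name `init_node_features`, the Gaussian random embedding, the fixed seed, and the all-zero initial node vector feature are choices made here, since the text only says that nodes are randomly expressed as F-dimensional vectors and that an N×F×3 matrix is initialized.

```python
import random
from typing import Dict, List


def init_node_features(atom_types: List[str], feature_dim: int = 64, seed: int = 0):
    """Build N_s (N x F x 1) and N_v (N x F x 3) as nested lists.

    Assumptions: one random F-dimensional embedding per atom type, shared by
    all atoms of that type, and a zero-initialized node vector feature.
    """
    rng = random.Random(seed)
    embedding_table: Dict[str, List[float]] = {}
    for atom_type in atom_types:
        if atom_type not in embedding_table:
            embedding_table[atom_type] = [
                rng.gauss(0.0, 1.0) for _ in range(feature_dim)
            ]

    # Node scalar feature N_s: N x F x 1.
    node_scalar = [[[value] for value in embedding_table[t]] for t in atom_types]
    # Node vector feature N_v: N x F x 3 (zero start is an assumption).
    node_vector = [
        [[0.0, 0.0, 0.0] for _ in range(feature_dim)] for _ in atom_types
    ]
    return node_scalar, node_vector


node_scalar, node_vector = init_node_features(["O", "H", "H"], feature_dim=64)
```

In a trainable model the lookup table would be a learnable parameter, for example a `torch.nn.Embedding` layer in PyTorch, so that the vectors can be updated during training as the text describes.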

Next, for each element i in the node set {A_i}, an edge can be created between element i and at least some of the remaining elements j, forming an edge set {(i, j)}, where i = 1, 2, ..., N, j∈A_i, and the number of edges is denoted E.

For each edge in the edge set, the distance and the position vector between the two nodes corresponding to the edge are calculated. The set of edge distances is shown in formula (1).

$\{ r_{s,ij} \mid r_{s,ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \}$ formula (1)

The set of edge position vectors is shown in formula (2).

$\{ r_{v,ij} \mid r_{v,ij} = (x_i - x_j,\ y_i - y_j,\ z_i - z_j) \}$ formula (2)

Further, the set of edge distances is expressed as an E×1×1 matrix to represent the edge scalar feature E_s, and the set of edge position vectors is expressed as an E×3×1 matrix to represent the edge vector feature E_v.
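The edge features can be illustrated with a short sketch that computes, for each edge (i, j), the distance between the two nodes and the position vector (x_i - x_j, y_i - y_j, z_i - z_j), and stores them with shapes E×1×1 and E×3×1. The function name `edge_features` and the nested-list layout are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def edge_features(
    positions: Sequence[Tuple[float, float, float]],
    edges: Sequence[Tuple[int, int]],
):
    """Return (E_s, E_v) as nested lists of shape E x 1 x 1 and E x 3 x 1."""
    edge_scalar: List[List[List[float]]] = []
    edge_vector: List[List[List[float]]] = []
    for i, j in edges:
        xi, yi, zi = positions[i]
        xj, yj, zj = positions[j]
        # Position vector from node j to node i, as in formula (2).
        r_v = (xi - xj, yi - yj, zi - zj)
        # Edge distance: Euclidean norm of the position vector.
        r_s = math.sqrt(sum(c * c for c in r_v))
        edge_scalar.append([[r_s]])
        edge_vector.append([[c] for c in r_v])
    return edge_scalar, edge_vector


# Two nodes 5 units apart, with one directed edge in each direction.
E_s, E_v = edge_features([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)], [(0, 1), (1, 0)])
```

Note that the edge (i, j) and the edge (j, i) share the same distance but have opposite position vectors, which is why directed edges are kept separately here.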

In some embodiments, to reduce the computing resources consumed when determining the molecular solvation free energy, an adjacent node set {N_i} that satisfies the cutoff radius requirement can be selected from the node set {A_i}, and the edge set is then determined for the adjacent node set {N_i}.

Specifically, the above method may further include the following operations.

First, after the node set and the node position set for the target molecule are generated in response to the attribute information of each of the plurality of atoms, a cutoff radius r_cut is determined. The cutoff radius r_cut can be a preset value determined from expert experience or simulation results, such as 3 angstroms, among other values.

Then, the target nodes whose inter-node distance is less than or equal to the cutoff radius r_cut are determined from the node set, giving the target node set N_i. Here, the target node set N_i is the adjacent node set {N_i}.

Specifically, the cutoff radius r_cut can be set first, and then, for each element i in the node set, the adjacent node set N_i of element i within the cutoff radius is determined, as shown in formula (3).

Edges are created between element i and all nodes j in its adjacent node set, forming an edge set {(i, j)}, where i = 1, 2, ..., N, j∈N_i, and the number of edges is denoted E.
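The cutoff-based edge construction can be sketched as follows. The function name `build_cutoff_edges` is hypothetical, the default of 3.0 follows the 3 angstrom example in the text, and excluding a node from its own neighbor set is an assumption, since the text does not state it explicitly. The brute-force pair loop is chosen for clarity, not efficiency.

```python
import math
from typing import Dict, List, Sequence, Tuple


def build_cutoff_edges(
    positions: Sequence[Tuple[float, float, float]],
    r_cut: float = 3.0,
):
    """Return (neighbors, edges) for a virtual molecular graph.

    neighbors[i] lists every node j != i with distance(i, j) <= r_cut;
    edges is the resulting list of directed pairs (i, j).
    """
    neighbors: Dict[int, List[int]] = {}
    edges: List[Tuple[int, int]] = []
    for i, p_i in enumerate(positions):
        neighbors[i] = [
            j
            for j, p_j in enumerate(positions)
            if j != i and math.dist(p_i, p_j) <= r_cut
        ]
        edges.extend((i, j) for j in neighbors[i])
    return neighbors, edges


# Three collinear nodes: the third one lies outside the cutoff of the others.
neighbors, edges = build_cutoff_edges(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)], r_cut=3.0
)
```

The number of edges E is `len(edges)`; it grows with the cutoff radius, which is the trade-off between cost and completeness that motivates choosing r_cut.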

Correspondingly, generating the edge scalar feature and the edge vector feature for the node set based on the coordinate information of each node in the node position set includes:

generating the edge scalar feature E_s and the edge vector feature E_v for the target node set N_i based on the coordinate information of the target nodes in the node position set.

In some embodiments, because the adjacent node set {N_i} is selected from the node set {A_i}, the dimensions of the edge scalar feature E_s and the edge vector feature E_v change accordingly.

Specifically, the target node set includes E nodes, and each of the E nodes has F-dimensional features. The dimensions of the node scalar feature N_s include N×F×1, the dimensions of the node vector feature N_v include N×F×3, the dimensions of the edge scalar feature E_s include E×1×1, and the dimensions of the edge vector feature E_v include E×3×1.

In operation S240, a virtual molecular graph is constructed based on the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v for the node set, so as to determine the molecular feature X of the target molecule based on the virtual molecular graph, which facilitates determining the solvation free energy based at least on the molecular feature X of the target molecule.

In this way, the virtual molecular graph is composed of the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v. The molecular feature X can be determined with a variety of related techniques, for example a method similar to those that extract a molecular feature X from a molecular graph.

The virtual molecular graph includes scalar and vector information for the atoms, as well as scalar and vector information between atoms. It is a general and accurate descriptor that can effectively improve the accuracy of the determined molecular solvation free energy.

In this embodiment, the three-dimensional information of a molecule is expressed as a virtual molecular graph. This not only preserves the three-dimensional information of the molecule but also strictly distinguishes different molecular conformations, so it describes molecules more accurately than two-dimensional descriptors.

Fig. 3 schematically shows a flowchart of a method for determining the molecular feature of a target molecule based on a virtual molecular graph according to an embodiment of the present application.

Referring to Fig. 3, the process of determining the molecular feature of the target molecule based on the virtual molecular graph may include operation S310 to operation S340.

In operation S310, the node scalar feature N_s and the node vector feature N_v are updated based on the virtual molecular graph, and the updated node scalar feature New_N_s and the updated node vector feature New_N_v are obtained.

In operation S320, the updated node scalar feature New_N_s and the updated node vector feature New_N_v are used as the current node scalar feature Now_N_s and the current node vector feature Now_N_v, respectively.

In operation S330, an updated virtual molecular graph is constructed using the current node scalar feature Now_N_s, the current node vector feature Now_N_v, the edge scalar feature E_s, and the edge vector feature E_v.

In operation S340, the updated node scalar feature New_N_s and the updated node vector feature New_N_v are updated based on the updated virtual molecular graph.

Operations S320 to S340 are performed repeatedly until the specified number of cycles num_conv is reached, and the updated node scalar feature New_N_s obtained at that point is used as the molecular feature X.

Specifically, the number of convolutional layers num_conv can be set. For the virtual molecular graph of the input (solvent or solute) molecule, E_s and E_v are kept unchanged, N_s and N_v are replaced by New_N_s and New_N_v respectively, and this is iterated num_conv times. The last dimension of the New_N_s obtained after num_conv iterations is then squeezed out, giving an N×F matrix that represents the molecular feature X.
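The iteration described above can be sketched as follows. This is a minimal illustration, not the implementation of the present application: `toy_conv_layer` is a hypothetical placeholder that only preserves tensor shapes, the edge count E and the random inputs are assumptions, and NumPy is used purely for convenience.

```python
import numpy as np

def toy_conv_layer(n_s, n_v, e_s, e_v):
    """Hypothetical stand-in for one convolutional layer.

    It does not implement the update of FIG. 4; it only returns
    arrays with the same shapes as the node features it receives.
    """
    new_n_s = n_s + 0.1 * np.tanh(n_s)  # placeholder scalar update
    new_n_v = 0.9 * n_v                 # placeholder vector update
    return new_n_s, new_n_v

def molecular_feature(n_s, n_v, e_s, e_v, num_conv, conv_layer=toy_conv_layer):
    """Iterate the node update num_conv times with E_s and E_v held fixed."""
    for _ in range(num_conv):
        n_s, n_v = conv_layer(n_s, n_v, e_s, e_v)
    # Squeeze the trailing dimension: (N, F, 1) -> (N, F).
    return np.squeeze(n_s, axis=-1)

rng = np.random.default_rng(0)
N, F, E = 5, 8, 12                  # assumed sizes for illustration
n_s = rng.normal(size=(N, F, 1))    # node scalar feature N_s
n_v = rng.normal(size=(N, F, 3))    # node vector feature N_v
e_s = rng.normal(size=(E, 1, 1))    # edge scalar feature E_s
e_v = rng.normal(size=(E, 3, 1))    # edge vector feature E_v

X = molecular_feature(n_s, n_v, e_s, e_v, num_conv=3)
```

Here X is an N×F matrix, matching the shape stated for the molecular feature.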

The following is an exemplary description of how the updated node scalar feature New_N_s and the updated node vector feature New_N_v are obtained.

Fig. 4 schematically shows a logic diagram for updating node scalar features and node vector features based on a virtual molecular graph according to an embodiment of the present application.

Referring to Figure 4, the update process can be composed of four basic operations: the matrix linear transformation operation (Linear), the activation operation (such as ReLU), the matrix corresponding multiplication operation (such as the Hadamard product), and the matrix sum operation (Sum). These four basic operations are already maturely implemented in frameworks such as PyTorch.

Among them, the matrix linear transformation operation (Linear) transforms the input into a feature space, extracts useful information in the input and retains it.

The activation operation (such as ReLU) is a nonlinear mapping that endows the network with nonlinear expressiveness.

The matrix corresponding multiplication operation is the element-wise product of two matrices with the same dimensions, which plays the role of feature scaling.

The matrix sum operation (Sum) is the element-wise sum of two matrices with the same dimensions, which plays the role of feature fusion.

The inner product operation (Inner), which is also used in the update, is the inner product of two vectors and converts vector information into a scalar.
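For illustration only, the basic operations described above can be written as small NumPy functions. The function names and the sample values are assumptions chosen for this sketch; a practical implementation would more likely use the corresponding PyTorch modules.

```python
import numpy as np

def linear(x, weight, bias):
    """Linear: affine transformation of the last axis of x."""
    return x @ weight + bias

def relu(x):
    """Activation: element-wise max(x, 0)."""
    return np.maximum(x, 0.0)

def hadamard(a, b):
    """Matrix corresponding multiplication: element-wise product (feature scaling)."""
    return a * b

def matrix_sum(a, b):
    """Sum: element-wise addition (feature fusion)."""
    return a + b

def inner(u, v):
    """Inner: inner product over the last (x, y, z) axis, vector -> scalar."""
    return np.sum(u * v, axis=-1, keepdims=True)

x = np.array([[1.0, -2.0], [3.0, 4.0]])
w = np.array([[1.0, 0.0], [0.0, -1.0]])
b = np.zeros(2)

h = relu(linear(x, w, b))   # Linear followed by ReLU
scaled = hadamard(h, x)     # scale one feature by another
fused = matrix_sum(h, x)    # fuse two features
dot = inner(np.array([[1.0, 2.0, 3.0]]), np.array([[4.0, 5.0, 6.0]]))
```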

Specifically, updating the node scalar feature N_s and the node vector feature N_v based on the virtual molecular graph to obtain the updated node scalar feature New_N_s and the updated node vector feature New_N_v may include the following process.

First, the first linear operation, the first activation function, and the second linear operation are applied in sequence to the node scalar feature N_s to obtain the first sub-processing result Q1, and the third linear operation is applied to the edge scalar feature E_s to obtain the second sub-processing result Q2. Obtaining Q1 extracts the useful information in N_s and maps it nonlinearly to the feature space. Obtaining Q2 extracts the useful information in E_s and maps it linearly to the feature space.

Then, the first matrix corresponding multiplication operation is performed on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain the third sub-processing result Q3. This operation uses the scalar feature of the edge to scale the feature of the node and integrates the edge information into the node, making the node feature more expressive.

Then, the second matrix corresponding multiplication operation is performed based on the third sub-processing result Q3 and the node vector feature N_v to obtain the fourth sub-processing result Q4, and the third matrix corresponding multiplication operation is performed based on the third sub-processing result Q3 and the edge vector feature E_v to obtain the fifth sub-processing result Q5. Obtaining Q4 uses the vector feature of the node to scale the node feature and integrates the vector information into the node. Obtaining Q5 uses the vector feature of the edge to scale the node feature and integrates the vector information of the edge into the node. Both make the node feature more expressive.

Then, perform the first matrix addition operation on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain the sixth sub-processing result Q6. Among them, the operation of obtaining the sixth sub-processing result Q6 realizes the fusion of vector features of nodes and edges.

Next, the sixth sub-processing result Q6 is respectively subjected to the fourth linear operation and the fifth linear operation to obtain the seventh sub-processing result Q7 and the eighth sub-processing result Q8. Wherein, the seventh sub-processing result Q7 is a vector feature, which is used to interact with a scalar feature. The eighth sub-processing result Q8 is a vector feature for interacting with the vector feature.

Then, the sixth linear operation, the second activation function, and the seventh linear operation are applied in sequence to the seventh sub-processing result Q7 to obtain the ninth sub-processing result Q9; the inner product operation (Inner) is performed based on the third sub-processing result Q3, the seventh sub-processing result Q7, and the eighth sub-processing result Q8 to obtain the tenth sub-processing result Q10; and the fifth matrix corresponding multiplication operation is performed on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N_v. Obtaining Q9 updates scalar features with vector information. Obtaining Q10 converts vector information into scalar information. Obtaining New_N_v updates the vector feature with scalar information.

Next, the fourth matrix corresponding multiplication operation is performed on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain the eleventh sub-processing result Q11. The operation of obtaining the eleventh sub-processing result Q11 realizes scaling the scalar feature by using the scalar information obtained by the vector inner product operation.

Then, the second matrix addition operation is performed on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s.

In this embodiment, the scalar and vector features of the edges interact with the scalar and vector features of the nodes to update the node features and output new scalar and vector features of the nodes. Specifically, the edge information is fused into the node information to form new features, which improves the ability of the features New_N_s and New_N_v to represent the structure, makes it easier for the model to extract information related to the solvation free energy, and ultimately makes the prediction results more accurate. It should be noted that the logic for updating the updated node scalar feature New_N_s and the updated node vector feature New_N_v based on the updated virtual molecular graph is similar to the logic shown in FIG. 4 and is not described in detail here.

Through the method shown above, the molecule can be represented by a virtual molecular graph as a descriptor, and the virtual molecular graph includes relatively complete three-dimensional characteristics of the molecule, which helps to improve the accuracy of the determined free energy of solvation of the molecule.

In this embodiment, in addition to scalar features, vector features are also used in convolution, which makes extracting molecular features easier and more accurate than the method of using only scalar features in the related art.

In some embodiments, the solute molecular feature of a solute molecule and the solvent molecular feature of a solvent molecule can each be determined in the manner described above. It should be noted that a solute molecule can interact with multiple adjacent solvent molecules. The interaction between the solute molecule and each solvent molecule can be determined first, and the solvation free energy of the solute molecule relative to the multiple solvent molecules can then be determined. Alternatively, the solvation free energy of the solute molecule relative to the multiple adjacent solvent molecules can be determined directly.

Specifically, the target molecule may be a solute molecule and/or a solvent molecule, and the data to be processed may include molecular attribute information; for example, the data to be processed is data of solute molecules and/or data of solvent molecules.

Correspondingly, the above method may also include the following operation: determining the solute molecular feature of the solute molecule and the solvent molecular feature of at least one solvent molecule associated with the solute molecule, so that the solvation free energy can be determined based on the solute molecular feature and the solvent molecular feature of the at least one solvent molecule. It should be noted that the feature dimensions of the solute molecular feature and the solvent molecular feature may be the same.

Fig. 5 schematically shows a flowchart of another data processing method according to an embodiment of the present application.

Referring to FIG. 5 , the above method may further include operation S510 to operation S520.

In operation S510, after the solute molecular feature of the solute molecule and the solvent molecular feature of at least one solvent molecule associated with the solute molecule are determined, the matrix product of the solvent molecular feature and the solute molecular feature is used as the solvation matrix between the solvent molecule and the solute molecule.

In operation S520, the solvation feature is determined based on the solvation matrix.

The related art does not describe the solvent-solute interaction explicitly. In this embodiment, the solvent-solute interaction is described explicitly by the matrix product of the solute molecular feature and the solvent molecular feature, which helps to improve the accuracy of the solvation feature and, in turn, of the molecular solvation free energy.

In some embodiments, the above-mentioned determination of solvation characteristics based on the solvation matrix may include the following operations.

First, the solvent characteristics corresponding to the preset solute weights are calculated based on the solvation matrix, and the solute characteristics corresponding to the preset solvent weights are calculated based on the solvation matrix.

Then, the solvent feature and the solute feature are respectively converted into one-dimensional row vectors including F elements.

Next, concatenate the row vectors to obtain the solvation feature.

For example, first, read the solvent molecular feature X_M and the solute molecular feature X_N, where X_M is an M×F matrix, X_N is an N×F matrix, and M and N are the numbers of atoms contained in the solvent molecule and the solute molecule, respectively.

Compute the matrix product of the solvent molecular feature and the solute molecular feature, X_MN = X_M · X_N^T, which is an M×N matrix.

Compute the solvent feature under the solute weight, X′_M = X_MN · X_N (an M×F matrix), and the solute feature under the solvent weight, X′_N = X_MN^T · X_M (an N×F matrix).

X′_M and X′_N are each weighted and summed over their array elements according to the array element weights, which converts each of them into a one-dimensional row vector containing F elements. Finally, the two row vectors are concatenated into a 2F-dimensional row vector I_MN, which is the solvation feature. Suppose the row vector obtained from X′_M is (1,2,3,…,F) and the row vector obtained from X′_N is (1,2,3,…,F); then I_MN = (1,2,3,…,F,1,2,3,…,F). The array element weights may be determined based on an attention mechanism.

In some embodiments, converting the solvent feature and the solute feature into a one-dimensional row vector including F elements may include the following operations.

First, a first array element weight is determined for the array element corresponding to each atom of the solvent molecule in the solvent feature, and a second array element weight is determined for the array element corresponding to each atom of the solute molecule in the solute feature.

Then, the solvent feature is weighted and summed based on the first array element weights to obtain a one-dimensional first row vector including F elements, and the solute feature is weighted and summed based on the second array element weights to obtain a one-dimensional second row vector including F elements.

For example, the attention coefficient of each atom in X′_M and X′_N is calculated through the attention mechanism, X′_M and X′_N are each summed with the attention coefficients as weights, converting each into a one-dimensional row vector containing F elements, and finally the two row vectors are concatenated into a 2F-dimensional row vector representing the solvation feature I_MN.
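The computation of the solvation feature can be sketched as follows, under stated assumptions. The solvation matrix is taken to be X_M · X_N^T (M×N), the only product of an M×F and an N×F matrix whose shape is consistent with X′_M = X_MN · X_N, and the solute feature under the solvent weight is taken to be X_MN^T · X_M. The attention scoring by a vector followed by a softmax (`score_m`, `score_n`) is a hypothetical choice, since the text does not specify how the attention coefficients are computed.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attention_pool(x, score_vec):
    """Weighted sum over atoms (rows); returns a row vector with F elements."""
    weights = softmax(x @ score_vec)  # one attention coefficient per atom
    return weights @ x

def solvation_feature(x_m, x_n, score_m, score_n):
    x_mn = x_m @ x_n.T        # solvation matrix, shape (M, N)
    x_m_prime = x_mn @ x_n    # solvent feature under solute weight, (M, F)
    x_n_prime = x_mn.T @ x_m  # solute feature under solvent weight, (N, F)
    row_m = attention_pool(x_m_prime, score_m)
    row_n = attention_pool(x_n_prime, score_n)
    return np.concatenate([row_m, row_n])  # I_MN with 2F elements

rng = np.random.default_rng(0)
M, N, F = 3, 4, 6                # assumed atom counts and feature size
x_m = rng.normal(size=(M, F))    # solvent molecular feature X_M
x_n = rng.normal(size=(N, F))    # solute molecular feature X_N
score_m = rng.normal(size=F)     # hypothetical attention parameters
score_n = rng.normal(size=F)

i_mn = solvation_feature(x_m, x_n, score_m, score_n)
```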

In some embodiments, the molecular solvation free energy can be obtained from the input solvation feature I_MN by applying processing such as weighted summation and biasing to each element of I_MN. For example, the molecular solvation free energy can be obtained by processing the solvation feature I_MN with a fully connected network.

In this embodiment, the solvation attention network multiplies the solvent feature and the solute feature as matrices and then sums them with an attention mechanism. This embodies the physical meaning of the solute weight and the solvent weight and describes the solvation effect explicitly, which improves the accuracy of the predicted solvation free energy.

Another aspect of the present application also provides a method for training a solvation free energy prediction model.

In this embodiment, the method for training the solvation free energy prediction model may include: inputting the virtual molecular graph determined by the above method into the solvation free energy prediction model, and adjusting the model parameters so that the loss function converges, to obtain the trained solvation free energy prediction model. The virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.

In some embodiments, the solvation free energy prediction model may include at least one of the following networks.

A molecular encoding network configured to convert each piece of training data in a training data set, which includes solute molecular data and/or solvent molecular data, into a virtual molecular graph for the solute molecular data and/or the solvent molecular data, wherein the training data has solvation free energy label information.

An equivariant graph convolutional network configured to convert virtual molecular graphs into solute molecular features and/or solvent molecular features.

A solvation network configured to convert solute molecular features and solvent molecular features into solvation features.

A fully connected network configured to convert solvation features into solvation free energies.

Correspondingly, the above training method may include: inputting the training data into the molecular encoding network, and adjusting the model parameters (such as network parameters) so that the loss function converges, wherein the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.

In some embodiments, the solvation network includes a self-attention network configured to determine a first array element weight for the array element corresponding to each atom of the solvent molecule in the solvent feature, and a second array element weight for the array element corresponding to each atom of the solute molecule in the solute feature. The array elements corresponding to the atoms of the solvent molecule in the solvent feature are then fused according to the first array element weights, and the array elements corresponding to the atoms of the solute molecule in the solute feature are fused according to the second array element weights. The solvent feature and the solute feature are determined based on the solvation matrix, and the solvation matrix is determined based on the solute molecular feature and the solvent molecular feature. For details, refer to the relevant part of the data processing method, which is not repeated here.

In some embodiments, the above method may further include the following operations.

First, the training data set is divided into a specified number of sub-training data sets. The specified number can be determined, for example, based on expert experience or on the accuracy of the predicted molecular solvation free energy, and can be 3, 5, 8, 10, 13, 18, 20, etc.

Then, as many solvation free energy prediction models are built as the specified number. In this way, multiple solvation free energy prediction models can be trained, and either a model with good accuracy in predicting solvation free energy is selected from among them, or the average of the outputs of the multiple models is used as the final prediction result.

Correspondingly, inputting the training data into the molecular encoding network includes: inputting the training data in each sub-training data set into the molecular encoding network of a different solvation free energy prediction model, so as to train the different solvation free energy prediction models separately and obtain as many trained solvation free energy prediction models as the specified number.

Fig. 6 schematically shows a flowchart of a method for training a solvation free energy prediction model according to an embodiment of the present application.

Referring to Fig. 6, several pieces of data are collected, each giving the true value of the solvation free energy of a solute in a solvent obtained by experimental measurement or theoretical calculation. For each piece of data, the three-dimensional conformations of the solute molecule and the solvent molecule are stored in the data set as strings in x, y, z format (for example, the three-dimensional conformation of a molecule is stored as the x, y, z coordinates of each atom), and the corresponding solvation free energy is stored in the same data set as a floating point number.
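One possible layout for such a data set entry is sketched below. The line format "element x y z", the dictionary keys, the water coordinates, and the label value are all assumptions made for illustration; the text only specifies that conformations are stored as x, y, z strings and that the solvation free energy is stored as a floating point number.

```python
def parse_xyz_string(xyz_string):
    """Parse lines of the assumed form 'element x y z' into (element, (x, y, z))."""
    atoms = []
    for line in xyz_string.strip().splitlines():
        element, x, y, z = line.split()
        atoms.append((element, (float(x), float(y), float(z))))
    return atoms

# Illustrative water geometry (placeholder coordinates, not from the application).
water_xyz = """O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469"""

# One data set entry: two conformation strings plus a floating-point label.
record = {
    "solute_xyz": water_xyz,
    "solvent_xyz": water_xyz,
    "solvation_free_energy": -1.0,  # placeholder label, not a measured value
}

solute_atoms = parse_xyz_string(record["solute_xyz"])
```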

Initialize the solvation free energy prediction model (such as the equivariant graph neural network model). The model consists of four parts: a molecular encoding network that converts the x, y, z strings of molecules into virtual molecular graphs; an equivariant graph convolutional network that converts virtual molecular graphs into molecular features; a solvation attention network that converts the features of solute molecules and solvent molecules into solvation features through matrix product and attention aggregation (the matrix product of the solute and solvent features is taken at the atomic level, and attention aggregation combines these atomic-level features into molecular-level features through the attention mechanism); and a fully connected network that converts the solvation features into the solvation free energy.

Set the loss function (such as the mean square error loss function, the absolute difference loss function, the Huber loss function, etc.). The data set is divided equally into ten parts, and the model is trained by ten-fold cross-validation until the loss function on the validation set no longer decreases, that is, until the loss function converges: the difference between two successive loss values is less than a preset value, which can be 0.0005, 0.001, 0.002, etc. Ten equivariant graph network models are obtained. It should be noted that 5-fold cross-validation or, more generally, k-fold cross-validation may also be used.
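The data split and the convergence criterion can be sketched as follows. This is a minimal illustration that assumes the usual k-fold scheme, in which each of the k models is validated on one part and trained on the remaining parts; the function names and the default tolerance are hypothetical.

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Return k (train_indices, validation_indices) pairs over shuffled indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits

def has_converged(previous_loss, current_loss, preset_value=0.001):
    """Converged when two successive validation losses differ by less than preset_value."""
    return abs(previous_loss - current_loss) < preset_value

splits = k_fold_splits(25, k=5, seed=0)
```

Each model in the ensemble would then be trained on one `train` list and monitored on the matching `validation` list until `has_converged` returns true.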

The calculation method of the loss function L is shown in formula (4).

Here, G_i,pred is the predicted value of the solvation free energy, G_i,true is the true value of the solvation free energy, and n is the number of solvent-solute pairs used in training.
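Formula (4) itself is not reproduced in this text. Purely as an illustration, and assuming the mean square error form that the specific embodiments select, a loss over n solvent-solute pairs could be computed as below; the Huber variant and its threshold `delta` are likewise assumptions based on the loss functions listed above.

```python
def mse_loss(g_pred, g_true):
    """Mean square error over n solvent-solute pairs (assumed form of the loss)."""
    n = len(g_pred)
    return sum((p - t) ** 2 for p, t in zip(g_pred, g_true)) / n

def huber_loss(g_pred, g_true, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    n = len(g_pred)
    total = 0.0
    for p, t in zip(g_pred, g_true):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / n

# Placeholder numbers chosen for easy arithmetic, not data from the application.
g_pred = [1.0, 2.0, 4.0]
g_true = [1.0, 3.0, 2.0]
loss_mse = mse_loss(g_pred, g_true)
loss_huber = huber_loss(g_pred, g_true)
```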

The topology of the solvation free energy prediction model is illustrated below.

Fig. 7 schematically shows a schematic structural diagram of an equivariant graph convolutional network according to an embodiment of the present application.

Referring to FIG. 7, the equivariant graph convolutional network includes num_conv convolutional layers, where num_conv is the specified number of cycles, and the output of the current convolutional layer is used as part of the input of the next, adjacent convolutional layer.

The input of the first convolutional layer (see the first convolutional layer in FIG. 7) includes: the node scalar feature N_s, node vector feature N_v, edge scalar feature E_s, and edge vector feature E_v. The output of the first convolutional layer includes: the updated node scalar feature New_N_s and the updated node vector feature New_N_v.

The input of each convolutional layer other than the first (see the second convolutional layer, the third convolutional layer, the fourth convolutional layer, etc.) includes: the updated node scalar feature New_N_s, the updated node vector feature New_N_v, the edge scalar feature E_s, and the edge vector feature E_v. The output of each convolutional layer other than the first includes: the updated node scalar feature New_N_s and the updated node vector feature New_N_v.

Atomic features can be transformed into molecular features through equivariant graph convolutional networks.

Specifically, each convolutional layer can implement feature transformation as follows.

Please refer to Figure 7 and Figure 4 together. The equivariant graph convolutional network is composed of four basic operations Linear, ReLU, Hadamard and Sum. Among them, Linear is a matrix linear transformation operation, ReLU is an activation operation, Hadamard is a matrix corresponding multiplication operation, and Sum is a matrix addition operation.

Specifically, the convolutional layer is configured to perform the following operations. It should be noted that the first linear operation may be implemented by the first linear layer, and the second linear operation may be implemented by the second linear layer. Wherein, the first linear layer and the second linear layer may be the same layer or different layers.

The first linear operation, the first activation function, and the second linear operation are applied in sequence to the node scalar feature N_s to obtain the first sub-processing result Q1, and the third linear operation is applied to the edge scalar feature E_s to obtain the second sub-processing result Q2.

The first matrix corresponding multiplication operation is performed on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain the third sub-processing result Q3.

The second matrix corresponding multiplication operation is performed based on the third sub-processing result Q3 and the node vector feature N_v to obtain the fourth sub-processing result Q4, and the third matrix corresponding multiplication operation is performed based on the third sub-processing result Q3 and the edge vector feature E_v to obtain the fifth sub-processing result Q5.

The first matrix addition operation is performed on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain the sixth sub-processing result Q6.

The sixth sub-processing result Q6 is respectively subjected to the fourth linear operation and the fifth linear operation to obtain the seventh sub-processing result Q7 and the eighth sub-processing result Q8.

The seventh sub-processing result Q7 is sequentially subjected to the sixth linear operation, the second activation function, and the seventh linear operation to obtain the ninth sub-processing result Q9, and the inner product operation (Inner) is performed based on the third sub-processing result Q3, the seventh sub-processing result Q7, and the eighth sub-processing result Q8 to obtain the tenth sub-processing result Q10.

The fourth matrix corresponding multiplication operation is performed on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain the eleventh sub-processing result Q11.

The second matrix addition operation is performed on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s.

The fifth matrix corresponding multiplication operation is performed on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N_v.

For the functions and effects of the above operations, reference may be made to relevant content in the above embodiments, and details will not be described here.

In some embodiments, atoms of solute molecules or solvent molecules in the training data respectively have F-dimensional features.

Fig. 8 schematically shows a schematic structural diagram of a fully connected network according to an embodiment of the present application.

Referring to Figure 8, the fully connected network may include, connected in sequence, a first linear layer (such as Linear), a first activation function layer (such as ReLU), a second linear layer, a second activation function layer, and a third linear layer. The output dimension of the first linear layer and of the second linear layer is F, and the output dimension of the third linear layer is 1. The input to the first linear layer is the 2F-dimensional row vector representing the solvation feature I_MN. The fully connected network thus converts the input solvation feature I_MN into the molecular solvation free energy.
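The layer dimensions described above can be illustrated with the following sketch. The random initialization, the use of ReLU for both activation layers, and the NumPy implementation are assumptions; the weights here are untrained, so the output value has no physical meaning.

```python
import numpy as np

def init_linear(rng, in_dim, out_dim):
    """Hypothetical initializer for one linear layer (weight, bias)."""
    weight = rng.normal(scale=1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))
    bias = np.zeros(out_dim)
    return weight, bias

def fully_connected(i_mn, params):
    """Linear(2F->F), ReLU, Linear(F->F), ReLU, Linear(F->1)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = np.maximum(i_mn @ w1 + b1, 0.0)  # first linear layer + activation
    h = np.maximum(h @ w2 + b2, 0.0)     # second linear layer + activation
    return (h @ w3 + b3).item()          # third linear layer, scalar output

rng = np.random.default_rng(0)
F = 4                                    # assumed feature dimension for the demo
params = [
    init_linear(rng, 2 * F, F),
    init_linear(rng, F, F),
    init_linear(rng, F, 1),
]
i_mn = rng.normal(size=2 * F)            # solvation feature I_MN (2F elements)
delta_g = fully_connected(i_mn, params)
```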

Another aspect of the present application provides a method of determining the free energy of solvation.

In this embodiment, the method for determining the solvation free energy may include the following operation: using the solvation free energy prediction model trained according to the above method to process a virtual molecular graph and obtain the solvation free energy for the virtual molecular graph, wherein the virtual molecular graph is generated based on the data to be processed, the data to be processed includes attribute information of each of multiple atoms in the target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.

Correspondingly, the above method may include the following operation: using the trained solvation free energy prediction model to process the data to be processed and obtain the solvation free energy for the data to be processed, wherein the data to be processed includes attribute information of each of multiple atoms in the target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.

Fig. 9 schematically shows a flowchart of a method for determining the free energy of solvation according to an embodiment of the present application.

Referring to FIG. 9, network parameters can be input into the solvation free energy prediction model so that the solvent conformation and the solute conformation can be processed by the trained neural network. Specifically, the solvent conformation (e.g., expressed as an xyz string for the solvent molecule) and the solute conformation (e.g., expressed as an xyz string for the solute molecule) can be used as the input of the molecular encoding network.

First, the virtual molecular graph or the data to be processed is input into each of a specified number of different trained solvation free energy prediction models to obtain the specified number of solvation free energies. The specified number may be the number of trained solvation free energy prediction models.

Then, the weighted average of the specified number of solvation free energies is taken as the solvation free energy corresponding to the data to be processed.

For example, the solvent molecules and solute molecules to be predicted are input into each of the ten models in x, y, z format, ten predicted values of the solvation free energy are obtained, and their average is taken as the final prediction result.
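The averaging step can be sketched as follows. The models are represented here by hypothetical callables that return a constant, and the function name and the optional `weights` argument are assumptions; with equal weights the weighted average reduces to the plain average used in the example above.

```python
def ensemble_predict(models, solvent_xyz, solute_xyz, weights=None):
    """Weighted average of the solvation free energies predicted by several models."""
    predictions = [model(solvent_xyz, solute_xyz) for model in models]
    if weights is None:
        weights = [1.0] * len(predictions)  # equal weights: plain average
    total_weight = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total_weight

# Stand-in "models" returning fixed placeholder values (not real predictions).
models = [
    lambda solvent, solute: -10.0,
    lambda solvent, solute: -12.0,
    lambda solvent, solute: -14.0,
]

mean_prediction = ensemble_predict(models, "solvent xyz", "solute xyz")
weighted_prediction = ensemble_predict(
    models, "solvent xyz", "solute xyz", weights=[1.0, 1.0, 2.0]
)
```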

In a specific embodiment, first, a total of 48,776 molecular conformations of 11,940 molecules are collected (for example, molecular conformations can be collected from online databases such as PubChem). From these, 15 single-conformation molecules are selected as solvents: water, tetrahydrofuran, chloroform, dichloromethane, dioxane, toluene, methanol, acetone, n-heptane, cyclohexane, diethyl ether, acetonitrile, dimethylformamide, dimethyl sulfoxide, and methyl tert-butyl ether. Single-conformation molecules are chosen here only for computational convenience; molecules with multiple conformations can also be chosen, but the results calculated for the different conformations then need to be statistically averaged. The number 15 is likewise only an example, and more or fewer solvents may be used. COSMOtherm is used to calculate the solvation free energies of the 48,776 conformations in the 15 solvents, giving 731,640 pieces of solvation free energy data. The 48,776 conformations are stored in the data set in x, y, z format, and the 731,640 solvation free energies corresponding to each solute conformation and solvent conformation are stored in the data set as floating point numbers. The 48,776 systems with water as the solvent are selected as the test set, and the other 682,864 systems are used as the training set.

Then, set the feature dimension F to 128, and the truncation radius rcut to

The number of convolutional layers num_conv is 3, and the equivariant graph neural network model is initialized.

Next, set the loss function to mean square error. Divide the 682,864 systems in the training set into ten parts, set different random number seeds, and train the model with ten-fold cross-validation until the loss function on the validation set no longer decreases. This yields ten equivariant graph neural network models, that is, ten sets of network parameters for the model.

Then, the solvent molecules and solute molecules of the 48,776 systems in the test set are input into each of the above ten models in x, y, z format to obtain ten predicted values of the solvation free energy, and their average is taken as the final prediction result. For comparison, the solvent molecules and solute molecules of the training set are likewise input into the ten models in x, y, z format to obtain ten predicted values of the solvation free energy, and their average is taken as the final prediction result for the training set. The correlation between the true values and the model predictions for the training set and the test set is shown in Figure 10 and Figure 11, respectively, where MAE is the mean absolute error, RMSE is the root mean square error, and R² is the coefficient of determination. The smaller the MAE and RMSE, the smaller the model error. R² typically lies between 0 and 1, and the larger R² is, the better the correlation. It can be seen that the model's correlation on the test set and on the training set is basically the same, and the mean absolute error of prediction is less than 1 kJ/mol, which is much lower than the error of traditional machine learning methods. The results are shown in Table 1.

Table 1

Model	CIGIN	Delfos	MPNN	This application
MAE (kJ/mol)	3.17	4.97	4.81	0.77
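The averaging over the ten models and the three metrics named above (MAE, RMSE and R²) can be sketched as follows. This is a generic NumPy illustration on toy numbers, not the evaluation code of the application; the function names are hypothetical.

```python
import numpy as np


def ensemble_mean(predictions):
    """Average predictions of shape (n_models, n_systems) over the models."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and the coefficient of determination R^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    ss_res = float(np.sum(err ** 2))                         # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    return mae, rmse, 1.0 - ss_res / ss_tot


# Toy example with two models and four systems (values are made up).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = ensemble_mean([[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]])
mae, rmse, r2 = regression_metrics(y_true, y_pred)
```

With this definition of R², a model that predicts worse than the mean of the reference values gives a negative value, so the range of 0 to 1 applies to models that are at least as good as that baseline.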

In another specific embodiment, first, a total of 48,776 conformations of 11,940 molecules are collected, and water, tetrahydrofuran, chloroform, dichloromethane, dioxane, toluene, methanol, acetone, n-heptane, cyclohexane, diethyl ether, acetonitrile, dimethylformamide, dimethyl sulfoxide and methyl tert-butyl ether are used as solvents. COSMOtherm is used to calculate the solvation free energies of the 48,776 conformations in the 15 solvents, giving 731,640 solvation free energy data items. The 48,776 conformations are stored in the data set in x, y, z format, and the 731,640 solvation free energy values, each corresponding to a solute conformation and a solvent conformation, are stored in the data set as floating-point numbers. The 41,475 solute-solvent-solvation free energy data items of 2,765 conformations of 740 molecules are selected as the test set, and the other 690,165 systems are used as the training set.

Then, set the feature dimension F to 128, and the truncation radius r _cut to

Next, the loss function is set to the mean square error. The 690,165 systems in the training set are divided into ten parts, different random number seeds are set, and the model is trained with ten-fold cross-validation until the loss on the validation set no longer decreases. This yields ten equivariant graph neural network models, that is, ten sets of network parameters for the model.

Then, the solvent and solute molecules of the 41,475 test-set systems to be predicted are input into the ten models in turn in x, y, z format to obtain ten predicted values of the solvation free energy, and their average is taken as the final prediction result. For comparison, the solvent and solute molecules of the training set are likewise input into the ten models in x, y, z format to obtain ten predicted values of the solvation free energy, and their average is taken as the final prediction result for the training set. The correlation between the true values and the model predictions is shown in Figure 12 for the training set and in Figure 13 for the test set. It can be seen that the model performs essentially the same on the test set as on the training set, and the mean absolute error of prediction is less than 1 kJ/mol.

This embodiment addresses the defects and shortcomings of the related art in predicting molecular solvation free energy by proposing a prediction method based on an equivariant graph neural network. Because the related art cannot fully represent the three-dimensional characteristics of molecules, this embodiment uses virtual molecular graphs as descriptors to represent molecules. Because the related art does not explicitly describe solvent-solute interactions, this embodiment describes the solvent-solute interaction by the matrix product of the solute molecular feature vectors and the solvent molecular feature vectors. Specifically, the method consists of four steps: molecular encoding, equivariant graph convolution, feature interaction and free energy prediction. The molecular encoding step represents the solvent and solute molecules as virtual molecular graphs with feature encodings. The equivariant graph convolution step transforms each virtual molecular graph into a feature representation in matrix form. The feature interaction step matrix-multiplies the feature representations of the solvent and the solute to obtain the feature representation of solvation. The free energy prediction step predicts the molecular solvation free energy from the feature representation of solvation through a fully connected neural network, which effectively improves the accuracy of the predicted molecular solvation free energy.

Another aspect of the present application also provides a design method.

Fig. 14 schematically shows a flowchart of a design method according to an embodiment of the present application.

Referring to FIG. 14, the design method may include operation S1410 and operation S1420.

In operation S1410, according to the method shown above, the free energy of solvation is determined.

In operation S1420, drug design or material design, etc. are performed based on the free energy of solvation.

Another aspect of the present application also provides a data processing device.

Fig. 15 schematically shows a block diagram of a data processing device according to an embodiment of the present application.

Referring to FIG. 15, the data processing device 1500 may include: a to-be-processed data obtaining module 1510, a set generation module 1520, a node and edge feature generation module 1530, and a virtual molecule construction module 1540.

The to-be-processed data obtaining module 1510 is used to obtain the data to be processed, and the data to be processed includes attribute information for each of a plurality of atoms in the target molecule.

The set generation module 1520 is used to generate a node set and a node position set for the target molecule in response to the respective attribute information of the plurality of atoms, wherein the plurality of nodes in the node set respectively represent atoms of a specific atom type, and the node position set includes the coordinate information of each node in the node set in a specific coordinate system.
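As an illustration of what the set generation module produces, the following Python sketch builds a node set and a node position set from a text block. It assumes that the "x, y, z format" mentioned in the embodiments is the common XYZ file layout (atom count, comment line, then one `symbol x y z` line per atom) and that atom types are encoded as atomic numbers; both are assumptions, and `parse_xyz`, `ATOMIC_NUMBERS` and the water geometry are illustrative only.

```python
import numpy as np

# Small lookup for illustration only; a real table would cover all elements.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8}


def parse_xyz(text):
    """Return (node_set, node_positions) from XYZ-format text."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    nodes, positions = [], []
    for line in lines[2:2 + n_atoms]:          # skip the count and comment lines
        symbol, x, y, z = line.split()[:4]
        nodes.append(ATOMIC_NUMBERS[symbol])   # atom type of this node
        positions.append([float(x), float(y), float(z)])
    return np.array(nodes, dtype=int), np.array(positions, dtype=float)


WATER_XYZ = """3
water (illustrative geometry)
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""
node_set, node_positions = parse_xyz(WATER_XYZ)
```

Here `node_set` holds one atom type per node and `node_positions` holds the Cartesian coordinates of the same nodes in the same order.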

The node and edge feature generation module 1530 is used to generate the node scalar feature N_s and the node vector feature N_v for the node set, and to generate the edge scalar feature E_s and the edge vector feature E_v for the node set based on the coordinate information of each node in the node position set.

The virtual molecule construction module 1540 is used to construct a virtual molecular graph based on the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v for the node set, so as to determine the molecular feature X of the target molecule based on the virtual molecular graph, so that the solvation free energy can be determined based at least on the molecular feature X of the target molecule.

In some embodiments, the target molecule includes N atoms, and the plurality of nodes in the node set each have F-dimensional features.

The dimensions of the node scalar feature N_s include N×F×1, the dimensions of the node vector feature N_v include N×F×3, the dimensions of the edge scalar feature E_s include N×1×1, and the dimensions of the edge vector feature E_v include N×3×1.

In some embodiments, the above apparatus 1500 may further include: a cutoff radius determination module and a target node set determination module.

The cutoff radius determination module is configured to determine the cutoff radius r_cut after the node set and the node position set for the target molecule have been generated in response to the respective attribute information of the plurality of atoms.

The target node set determination module is configured to determine, from the node set, the target nodes whose inter-node distance is less than or equal to the cutoff radius r_cut, to obtain the target node set N_i.

Correspondingly, the node and edge feature generation module 1530 is specifically configured to generate the edge scalar feature E_s and the edge vector feature E_v for the target node set N_i based on the coordinate information of the target nodes in the node position set.

In some embodiments, the target node set includes E nodes, each of which has F-dimensional features.

The dimensions of the node scalar feature N_s include N×F×1, the dimensions of the node vector feature N_v include N×F×3, the dimensions of the edge scalar feature E_s include E×1×1, and the dimensions of the edge vector feature E_v include E×3×1.
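The cutoff selection and the resulting edge feature shapes can be illustrated with a short NumPy sketch. It assumes that E counts ordered node pairs (i, j) with i ≠ j within the cutoff, that E_s holds the interatomic distance and that E_v holds the unnormalized relative position vector; the application does not spell out these details, and the r_cut value used in the example is arbitrary.

```python
import numpy as np


def build_edges(positions, r_cut):
    """Return (pairs, E_s, E_v) for node pairs within the cutoff radius.

    pairs: (E, 2) indices (i, j) with i != j and distance <= r_cut
    E_s:   (E, 1, 1) interatomic distances
    E_v:   (E, 3, 1) relative position vectors r_j - r_i
    """
    positions = np.asarray(positions, dtype=float)
    diff = positions[None, :, :] - positions[:, None, :]   # diff[i, j] = r_j - r_i
    dist = np.linalg.norm(diff, axis=-1)
    mask = (dist <= r_cut) & ~np.eye(len(positions), dtype=bool)
    i_idx, j_idx = np.nonzero(mask)
    pairs = np.stack([i_idx, j_idx], axis=1)
    e_s = dist[i_idx, j_idx].reshape(-1, 1, 1)
    e_v = diff[i_idx, j_idx].reshape(-1, 3, 1)
    return pairs, e_s, e_v


# Three collinear nodes; only the first two lie within the illustrative cutoff.
pairs, E_s, E_v = build_edges([[0, 0, 0], [1, 0, 0], [5, 0, 0]], r_cut=2.0)
```

A smaller cutoff radius reduces E and therefore the cost of each graph update, at the price of ignoring longer-range pairs.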

In some embodiments, the above apparatus 1500 further includes a feature update module and a loop module.

The feature updating module is configured to update the node scalar feature N_s and the node vector feature N_v based on the virtual molecular graph, to obtain the updated node scalar feature New_N_s and the updated node vector feature New_N_v.

The loop module is configured to invoke the following units repeatedly until the specified number of cycles num_conv is reached, and to use the updated node scalar feature New_N_s obtained when the specified number of cycles num_conv is reached as the molecular feature X.

The feature replacement unit is configured to use the updated node scalar feature New_N_s and the updated node vector feature New_N_v as the current node scalar feature Now_N_s and the current node vector feature Now_N_v, respectively.

The feature calculation unit is configured to use the current node scalar feature Now_N_s, the current node vector feature Now_N_v, the edge scalar feature E_s and the edge vector feature E_v to construct an updated virtual molecular graph.

The feature updating unit is configured to update the updated node scalar feature New_N_s and the updated node vector feature New_N_v based on the updated virtual molecular graph.
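The control flow of the feature update module and the loop module can be sketched as below. The function `placeholder_update` is a stand-in that only preserves array shapes; it is not the update operation of the application. The sketch also assumes that num_conv counts the total number of updates, including the first one.

```python
import numpy as np


def placeholder_update(n_s, n_v, e_s, e_v):
    """Stand-in for one graph update; NOT the update defined in the application.

    It only preserves shapes so that the looping logic can be exercised.
    """
    return np.tanh(n_s), 0.5 * n_v


def molecular_feature(n_s, n_v, e_s, e_v, num_conv, update=placeholder_update):
    """Apply `update` num_conv times and return the final New_N_s as X."""
    new_n_s, new_n_v = update(n_s, n_v, e_s, e_v)          # first update
    for _ in range(num_conv - 1):                          # remaining cycles
        now_n_s, now_n_v = new_n_s, new_n_v                # feature replacement
        new_n_s, new_n_v = update(now_n_s, now_n_v, e_s, e_v)
    return new_n_s


# Shapes follow the text: N atoms, F-dimensional features (values are dummies).
N, F = 4, 8
X = molecular_feature(np.ones((N, F, 1)), np.zeros((N, F, 3)), None, None, num_conv=3)
```

The edge features are passed unchanged to every cycle, while the node features produced by one cycle become the inputs of the next.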

In some embodiments, the feature update module is specifically configured to perform the following operations.

In certain embodiments, target molecules are solute molecules and/or solvent molecules.

The above-mentioned apparatus 1500 further includes: a solute-solvent molecular feature determination module configured to determine a solute molecular feature of the solute molecule and a solvent molecular feature of at least one solvent molecule associated with the solute molecule, so that the solvation free energy is determined based on the solute molecular feature of the solute molecule and the solvent molecular feature of the at least one solvent molecule associated with the solute molecule.

In some embodiments, the above-mentioned apparatus 1500 further includes: a solvation matrix determination module and a solvation feature determination module.

The solvation matrix determination module is configured to, after the solute molecular feature of the solute molecule and the solvent molecular feature of the at least one solvent molecule associated with the solute molecule have been determined, use the matrix product of the solvent molecular feature and the solute molecular feature as the solvation matrix between the solvent molecule and the solute molecule.

The solvation feature determination module is configured to determine a solvation feature based on the solvation matrix.

In certain embodiments, the solvation feature determination module includes: a solvent feature determination unit, a solute feature determination unit, and a solvation feature determination unit.

The solvent feature determination unit is configured to calculate, based on the solvation matrix, the solvent feature corresponding to the preset solute weight, and to calculate, based on the solvation matrix, the solute feature corresponding to the preset solvent weight.

The solute feature determination unit is configured to convert the solvent feature and the solute feature each into a one-dimensional row vector including F elements.

The solvation feature determination unit is configured to concatenate the row vectors to obtain the solvation feature.

In some embodiments, the solute feature determination unit includes an array element weight determination subunit and a weighted summation subunit.

The array element weight determination subunit is configured to determine the first array element weight of the array element corresponding to each atom of the solvent molecule in the solvent feature, and to determine the second array element weight of the array element corresponding to each atom of the solute molecule in the solute feature.

The weighted summation subunit is configured to perform a weighted summation on the solvent feature based on the first array element weights to obtain a one-dimensional first row vector including F elements, and to perform a weighted summation on the solute feature based on the second array element weights to obtain a one-dimensional second row vector including F elements.
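One possible reading of the solvation matrix, the solvent and solute features, and the weighted summation is sketched below in NumPy. The matrix shapes, the way the solvation matrix is applied to obtain the two features, and the softmax scoring with the vectors `w_solvent` and `w_solute` are all assumptions made for illustration; the application does not fix these details in this passage.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def solvation_feature(x_solvent, x_solute, w_solvent, w_solute):
    """Combine per-atom features into one solvation feature of length 2F.

    x_solvent: (N_solvent, F) solvent molecular feature
    x_solute:  (N_solute, F) solute molecular feature
    w_solvent, w_solute: (F,) hypothetical scoring vectors for the atom weights
    """
    m = x_solvent @ x_solute.T              # solvation matrix, (N_solvent, N_solute)
    solvent_feat = m @ x_solute             # per solvent atom: solute features mixed by m
    solute_feat = m.T @ x_solvent           # per solute atom: solvent features mixed by m
    a_solvent = softmax(solvent_feat @ w_solvent)   # first array element weights
    a_solute = softmax(solute_feat @ w_solute)      # second array element weights
    row_solvent = a_solvent @ solvent_feat  # weighted sum over solvent atoms, (F,)
    row_solute = a_solute @ solute_feat     # weighted sum over solute atoms, (F,)
    return np.concatenate([row_solvent, row_solute])


rng = np.random.default_rng(0)
F = 4
feature = solvation_feature(rng.normal(size=(3, F)), rng.normal(size=(5, F)),
                            rng.normal(size=F), rng.normal(size=F))
```

Because the two row vectors each have F elements regardless of the atom counts, the concatenated solvation feature has a fixed length and can be fed to a fully connected network.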

Another aspect of the present application also provides a device for training a solvation free energy prediction model.

Fig. 16 schematically shows a block diagram of an apparatus for training a solvation free energy prediction model according to an embodiment of the present application.

The above-mentioned device 1600 includes: a model training module 1610, which is used to input the virtual molecular graph determined based on the above-mentioned method into the solvation free energy prediction model, and to adjust the model parameters so that the loss function converges, so as to obtain the trained solvation free energy prediction model, where the virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.

In some embodiments, the above-mentioned solvation free energy prediction model includes: an equivariant graph convolutional network configured to convert a virtual molecular graph into solute molecular features and/or solvent molecular features.

The equivariant graph convolutional network includes num_conv convolutional layers, num_conv being the specified number of cycles, where the output of the current convolutional layer is used as part of the input of the next adjacent convolutional layer. The input of the first convolutional layer includes: the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v; the output of the first convolutional layer includes: the updated node scalar feature New_N_s and the updated node vector feature New_N_v. The input of each convolutional layer other than the first includes: the updated node scalar feature New_N_s, the updated node vector feature New_N_v, the edge scalar feature E_s and the edge vector feature E_v; the output of each convolutional layer other than the first includes: the updated node scalar feature New_N_s and the updated node vector feature New_N_v.

In some embodiments, the convolutional layer is configured to perform the following operations.

The fifth matrix corresponding multiplication operation is performed on the ninth sub-processing result Q9 and the eighth sub-processing result Q8 to obtain the updated node vector feature New_N_v.

In certain embodiments, the solvation free energy prediction model includes: a molecular encoding network.

The molecular encoding network is configured to convert each item of training data in a training data set including solute molecular data and/or solvent molecular data into a virtual molecular graph for the solute molecular data and/or the solvent molecular data, wherein the training data has solvation free energy label information, and the atoms of the solute molecules or solvent molecules in the training data each have F-dimensional features.

In certain embodiments, the solvation free energy prediction model includes: a solvation network.

The solvation network is configured to convert solute molecular features and solvent molecular features into solvation features.

For example, the solvation network includes a self-attention network configured to determine the first array element weight of the array element corresponding to each atom of the solvent molecule in the solvent feature, and to determine the second array element weight of the array element corresponding to each atom of the solute molecule in the solute feature, so that the array elements corresponding to the atoms of the solvent molecule in the solvent feature are fused according to the first array element weights and the array elements corresponding to the atoms of the solute molecule in the solute feature are fused according to the second array element weights, wherein the solvent feature and the solute feature are determined based on a solvation matrix, and the solvation matrix is determined based on the solute molecular feature and the solvent molecular feature.

In some embodiments, the solvation free energy prediction model includes: a fully connected network. A fully connected network is configured to convert solvation features into solvation free energies.

The fully connected network includes a first linear layer, a first activation function layer, a second linear layer, a second activation function layer and a third linear layer connected in sequence, wherein the output dimension of the first linear layer and of the second linear layer is F, and the output dimension of the third linear layer is 1.
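The layer sequence described above can be written out as a small NumPy forward pass. The input width of 2F (a concatenation of two F-element row vectors), the ReLU activation and the random initialization are assumptions for illustration; the passage above specifies only the output dimensions F, F and 1.

```python
import numpy as np


def init_mlp(in_dim, feat_dim, seed=0):
    """Create weights for Linear(in_dim->F), Linear(F->F) and Linear(F->1)."""
    rng = np.random.default_rng(seed)
    dims = [(in_dim, feat_dim), (feat_dim, feat_dim), (feat_dim, 1)]
    return [(rng.normal(0.0, 1.0 / np.sqrt(i), size=(i, o)), np.zeros(o))
            for i, o in dims]


def relu(x):
    """Assumed activation; the passage does not name the activation function."""
    return np.maximum(x, 0.0)


def mlp_forward(params, x, activation=relu):
    """Linear -> activation -> Linear -> activation -> Linear."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = activation(x @ w1 + b1)      # first linear layer, output dimension F
    h = activation(h @ w2 + b2)      # second linear layer, output dimension F
    return h @ w3 + b3               # third linear layer, output dimension 1


F = 128
params = init_mlp(in_dim=2 * F, feat_dim=F)
batch = np.zeros((2, 2 * F))         # two dummy solvation features
energies = mlp_forward(params, batch)
```

The final layer has a single output per input row, which is interpreted as the predicted solvation free energy.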

In some embodiments, the above-mentioned apparatus 1600 further includes: a training set segmentation module and a model building module.

The training set splitting module is configured to split the training data set into a specified number of sub-training data sets.

The model building module is configured to build the specified number of solvation free energy prediction models.

The model training module 1610 is specifically configured to input the training data in each sub-training data set into the molecular encoding networks of the different solvation free energy prediction models, so as to train the different solvation free energy prediction models separately and obtain the specified number of trained solvation free energy prediction models.

Another aspect of the present application also provides an apparatus for determining the free energy of solvation.

Fig. 17 schematically shows a block diagram of an apparatus for determining the free energy of solvation according to an embodiment of the present application.

The above device 1700 includes: a free energy prediction module 1710, configured to use the trained solvation free energy prediction model to process the data to be processed to obtain the solvation free energy for the data to be processed, wherein the data to be processed includes attribute information for each of a plurality of atoms in a target molecule, and the target molecule includes a solute molecule and/or a solvent molecule.

For example, the solvation free energy prediction model includes at least one of the following networks: a molecular encoding network configured to convert each item of training data in a training data set including solute molecular data and/or solvent molecular data into a virtual molecular graph for the solute molecular data and/or the solvent molecular data, wherein the training data has solvation free energy label information; an equivariant graph convolutional network configured to convert the virtual molecular graph into solute molecular features and/or solvent molecular features; a solvation network configured to convert the solute molecular features and the solvent molecular features into solvation features; and a fully connected network configured to convert the solvation features into the solvation free energy.

In some embodiments, the above-mentioned apparatus 1700 further includes: a multi-model processing module and a weighting processing module.

The multi-model processing module is configured to input the data to be processed into each of a specified number of different trained solvation free energy prediction models to obtain the specified number of solvation free energies.

The weighting processing module is configured to take the weighted average of the specified number of solvation free energies as the solvation free energy corresponding to the data to be processed. It should be noted that the respective weights of the specified number of solvation free energies may be the same or different. For example, the weight of the solvation free energy obtained by the model with high prediction accuracy on the test data set can be higher than the weight of the solvation free energy obtained by other models.
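The weighted averaging performed by the weighting processing module can be sketched with NumPy as follows. The function name and the example weights are illustrative; how the weights are derived from test-set accuracy is not specified above and is left open here.

```python
import numpy as np


def weighted_free_energy(predictions, weights=None):
    """Weighted average of per-model solvation free energies.

    predictions: (n_models,) or (n_models, n_systems)
    weights:     (n_models,) non-negative; equal weights when None
    """
    predictions = np.asarray(predictions, dtype=float)
    if weights is None:
        return predictions.mean(axis=0)
    return np.average(predictions, axis=0,
                      weights=np.asarray(weights, dtype=float))


equal = weighted_free_energy([-10.0, -12.0])                    # same weights
favoured = weighted_free_energy([-10.0, -12.0], weights=[3, 1]) # first model trusted more
```

With equal weights this reduces to the plain average used in the specific embodiments; unequal weights shift the result toward the models judged more accurate.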

Another aspect of the present application also provides a design device.

Fig. 18 schematically shows a block diagram of a design device according to an embodiment of the present application.

Referring to FIG. 18, the apparatus 1800 may include: a solvation free energy determination module 1810 and a design module 1820.

Wherein, the solvation free energy determination module 1810 is configured to determine the solvation free energy according to the above method.

Design module 1820 is used for drug design or material design based on free energy of solvation.

Regarding the devices 1500, 1600, 1700 and 1800 in the above embodiments, the specific manner in which each module and unit performs its operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

Another aspect of the present application also provides an electronic device.

Fig. 19 schematically shows a block diagram of an electronic device implementing an embodiment of the present application.

Referring to FIG. 19, an electronic device 1900 includes a memory 1910 and a processor 1920.

The processor 1920 can be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.

The memory 1910 may include various types of storage units, such as system memory, read-only memory (ROM), and a persistent storage device. The ROM can store static data or instructions required by the processor 1920 or other modules of the computer. The persistent storage device may be a readable and writable storage device. The persistent storage device may be a non-volatile storage device that does not lose stored instructions and data even if the computer is powered off. In some embodiments, a mass storage device (such as a magnetic disk, an optical disk or flash memory) is used as the persistent storage device. In some other embodiments, the persistent storage device may be a removable storage device (such as a floppy disk or an optical drive). The system memory can be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory. The system memory can store some or all of the instructions and data that the processor needs at runtime. In addition, the memory 1910 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic and/or optical disks may also be used. In some embodiments, the memory 1910 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, a super density disc, a flash memory card (such as an SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, etc. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or by wire.

Executable code is stored in the memory 1910, and when the executable code is executed by the processor 1920, the processor 1920 may perform part or all of the methods mentioned above.

In addition, the method according to the present application can also be implemented as a computer program or computer program product, the computer program or computer program product including computer program code instructions for executing some or all of the steps in the above method of the present application.

Alternatively, the present application may also be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium or a machine-readable storage medium) on which executable code (or a computer program or computer instruction code) is stored; when the executable code (or computer program or computer instruction code) is executed by the processor of an electronic device (or server, etc.), the processor is caused to perform part or all of the steps of the above-mentioned method according to the present application.

Having described various embodiments of the present application above, the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and alterations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principle of each embodiment, practical application or improvement of technology in the market, or to enable other ordinary skilled in the art to understand each embodiment disclosed herein.

Claims

1. A data processing method, characterized by comprising:

obtaining data to be processed, the data to be processed including attribute information for each of a plurality of atoms in a target molecule;

in response to the respective attribute information of the plurality of atoms, generating a node set and a node position set for the target molecule, wherein the plurality of nodes in the node set respectively represent atoms of a specific atom type, and the node position set includes coordinate information of each node in the node set in a specific coordinate system;

generating a node scalar feature N_s and a node vector feature N_v for the node set, and generating an edge scalar feature E_s and an edge vector feature E_v for the node set based on the coordinate information of each node in the node position set;

constructing a virtual molecular graph based on the node scalar feature N_s, the node vector feature N_v, the edge scalar feature E_s and the edge vector feature E_v for the node set, so as to determine a molecular feature X of the target molecule based on the virtual molecular graph, so that a solvation free energy can be determined based at least on the molecular feature X of the target molecule.
2. The method according to claim 1, wherein the target molecule comprises N atoms, and the plurality of nodes in the node set each have F-dimensional features;

the dimensions of the node scalar feature N_s include N×F×1, the dimensions of the node vector feature N_v include N×F×3, the dimensions of the edge scalar feature E_s include N×1×1, and the dimensions of the edge vector feature E_v include N×3×1.
3. The method according to claim 1, further comprising: after generating the node set and the node position set for the target molecule in response to the respective attribute information of the plurality of atoms,

determining a cutoff radius r_cut;

determining, from the node set, target nodes whose inter-node distance is less than or equal to the cutoff radius r_cut, to obtain a target node set N_i;

wherein the generating of the edge scalar feature and the edge vector feature for the node set based on the coordinate information of each node in the node position set includes:

generating the edge scalar feature E_s and the edge vector feature E_v for the target node set N_i based on the coordinate information of the target nodes in the node position set.
4. The method according to claim 3, wherein the target node set includes E nodes, and each of the E nodes has F-dimensional features;

the dimensions of the node scalar feature N_s include N×F×1, the dimensions of the node vector feature N_v include N×F×3, the dimensions of the edge scalar feature E_s include E×1×1, and the dimensions of the edge vector feature E_v include E×3×1.
5. The method according to claim 2 or 4, wherein the determining of the molecular feature X of the target molecule based on the virtual molecular graph comprises:

updating the node scalar feature N_s and the node vector feature N_v based on the virtual molecular graph to obtain an updated node scalar feature New_N_s and an updated node vector feature New_N_v;

repeating the following operations until the specified number of cycles num_conv is reached, so that the updated node scalar feature New_N_s obtained when the specified number of cycles num_conv is reached is used as the molecular feature X:

using the updated node scalar feature New_N_s and the updated node vector feature New_N_v as the current node scalar feature Now_N_s and the current node vector feature Now_N_v, respectively;

using the current node scalar feature Now_N_s, the current node vector feature Now_N_v, the edge scalar feature E_s and the edge vector feature E_v to construct an updated virtual molecular graph;

updating the updated node scalar feature New_N_s and the updated node vector feature New_N_v based on the updated virtual molecular graph.
6. The method according to claim 5, wherein the updating of the node scalar feature N_s and the node vector feature N_v based on the virtual molecular graph to obtain the updated node scalar feature New_N_s and the updated node vector feature New_N_v includes:

performing the first linear operation, the second activation function and the second linear operation in sequence on the node scalar feature N_s to obtain the first sub-processing result Q1, and performing the third linear operation on the edge scalar feature E_s to obtain the second sub-processing result Q2;

performing a first matrix corresponding multiplication operation on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain a third sub-processing result Q3;

performing a second matrix corresponding multiplication operation based on the third sub-processing result Q3 and the node vector feature N_v to obtain a fourth sub-processing result Q4, and performing a third matrix corresponding multiplication operation based on the third sub-processing result Q3 and the edge vector feature E_v to obtain a fifth sub-processing result Q5;

performing a first matrix addition operation on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain a sixth sub-processing result Q6;

performing the fourth linear operation and the fifth linear operation respectively on the sixth sub-processing result Q6 to obtain the seventh sub-processing result Q7 and the eighth sub-processing result Q8;

performing the sixth linear operation, the second activation function and the seventh linear operation in sequence on the seventh sub-processing result Q7 to obtain the ninth sub-processing result Q9, and performing an inner product operation Inner based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain the tenth sub-processing result Q10;

performing a fourth matrix corresponding multiplication operation on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain an eleventh sub-processing result Q11;

performing a fifth matrix corresponding multiplication operation on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N_v;

performing a second matrix addition operation on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N_s.
7. The method according to claim 5, wherein the target molecule is a solute molecule and/or a solvent molecule;

the method further includes:

determining a solute molecular feature of the solute molecule and a solvent molecular feature of at least one solvent molecule associated with the solute molecule, so that the solvation free energy is determined based on the solute molecular feature of the solute molecule and the solvent molecular feature of the at least one solvent molecule associated with the solute molecule.
The method according to claim 7, further comprising: after said determining the solute molecular characteristic of the solute molecule and the solvent molecular characteristic of at least one solvent molecule associated with the solute molecule,

using the matrix product of the solvent molecular characteristics and the solute molecular characteristics as the solvation matrix between the solvent molecules and the solute molecules;

determining a solvation feature based on the solvation matrix.
The method according to claim 8, wherein said determining solvation characteristics based on said solvation matrix comprises:

calculating solvent characteristics corresponding to preset solute weights based on the solvation matrix, and calculating solute characteristics corresponding to preset solvent weights based on the solvation matrix;

converting the solvent feature and the solute feature into a one-dimensional row vector including F elements;

The row vectors are concatenated to obtain the solvation feature.
The method according to claim 9, wherein said converting said solvent feature and said solute feature into a one-dimensional row vector comprising F elements comprises:

determining a first element weight of an element corresponding to an atom of the solvent molecule in the solvent feature, and determining a second element weight of an element corresponding to an atom of the solute molecule in the solute feature;

performing weighted summation on the solvent feature based on the first element weights to obtain a one-dimensional first row vector including F elements, and performing weighted summation on the solute feature based on the second element weights to obtain a one-dimensional second row vector including F elements.
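By way of non-limiting illustration, the steps from the solvation matrix to the concatenated solvation feature may be sketched as follows. The claims do not fix how the solvent feature and solute feature are computed from the solvation matrix, nor how the element weights are produced; the sketch assumes that the solvation matrix is applied as a weighting between the two molecules and that the element weights come from a softmax over a linear score. The names `solvation_feature`, `w_solvent` and `w_solute` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def solvation_feature(x_solvent, x_solute, w_solvent, w_solute):
    """x_solvent: (Nv, F) solvent molecular feature, x_solute: (Nu, F) solute
    molecular feature, w_*: (F,) hypothetical scoring vectors."""
    # Solvation matrix: matrix product of solvent and solute molecular features.
    m = x_solvent @ x_solute.T                      # (Nv, Nu)
    # Assumption: each feature is re-expressed through the solvation matrix.
    solvent_feat = m @ x_solute                     # (Nv, F)
    solute_feat = m.T @ x_solvent                   # (Nu, F)
    # Assumption: one element weight per atom, from a softmax over atoms.
    a_solvent = softmax(solvent_feat @ w_solvent, axis=0)   # (Nv,)
    a_solute = softmax(solute_feat @ w_solute, axis=0)      # (Nu,)
    # Weighted summation over atoms gives two row vectors of F elements.
    row_solvent = a_solvent @ solvent_feat          # (F,)
    row_solute = a_solute @ solute_feat             # (F,)
    # Concatenation gives the solvation feature with 2F elements.
    return np.concatenate([row_solvent, row_solute])
```

Under these assumptions the result has 2F elements regardless of the atom counts of the two molecules, so solute and solvent molecules of different sizes map to a feature of fixed length.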
A method for training a solvation free energy prediction model, characterized in that:

inputting the virtual molecular graph determined by the method according to any one of claims 1 to 10 into the solvation free energy prediction model, and adjusting the model parameters until the loss function converges to obtain a trained solvation free energy prediction model, wherein the virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.
The method according to claim 11, wherein the solvation free energy prediction model comprises:

an equivariant graph convolutional network configured to convert said virtual molecular graph into solute molecular features and/or solvent molecular features;

wherein the equivariant graph convolutional network includes num_conv convolutional layers, num_conv being the specified number of cycles, and the output of each convolutional layer is used as part of the input of the next adjacent convolutional layer; the input of the first convolutional layer includes: the node scalar feature N s, the node vector feature N v, the edge scalar feature E s and the edge vector feature E v; the output of the first convolutional layer includes: an updated node scalar feature New_N s and an updated node vector feature New_N v; the input of each convolutional layer other than the first convolutional layer includes: the updated node scalar feature New_N s, the updated node vector feature New_N v, the edge scalar feature E s and the edge vector feature E v; and the output of each convolutional layer other than the first convolutional layer includes: an updated node scalar feature New_N s and an updated node vector feature New_N v.
The method according to claim 12, wherein the convolutional layer is configured as:

Performing a first linear operation, a second activation function and a second linear operation on the node scalar feature N s in sequence to obtain a first sub-processing result Q1, and performing a third linear operation on the edge scalar feature E s to obtain a second sub-processing result Q2;

Performing a first matrix corresponding multiplication operation on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain a third sub-processing result Q3;

Performing a second matrix corresponding multiplication operation on the third sub-processing result Q3 and the node vector feature N v to obtain a fourth sub-processing result Q4, and performing a third matrix corresponding multiplication operation on the third sub-processing result Q3 and the edge vector feature E v to obtain a fifth sub-processing result Q5;

performing a first matrix addition operation on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain a sixth sub-processing result Q6;

The sixth sub-processing result Q6 is respectively subjected to the fourth linear operation and the fifth linear operation to obtain the seventh sub-processing result Q7 and the eighth sub-processing result Q8;

Performing a sixth linear operation, a second activation function and a seventh linear operation on the seventh sub-processing result Q7 in sequence to obtain a ninth sub-processing result Q9, and performing an inner product operation Inner based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain a tenth sub-processing result Q10;

Performing a fourth matrix corresponding multiplication operation on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain an eleventh sub-processing result Q11;

Performing a second matrix addition operation on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N s;

Performing a fifth matrix corresponding multiplication operation on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N v.
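By way of non-limiting illustration, one possible reading of the data flow Q1 to Q11 of the convolutional layer may be sketched as follows. The claim leaves tensor shapes and several details open, so the sketch rests on assumptions that are not part of the claim: the features are stored as arrays of shape (N, F), (N, F, 3), (N, 1) and (N, 3) with trailing singleton axes dropped; "matrix corresponding multiplication" is read as element-wise multiplication with broadcasting; the linear operations are bias-free; the activation function is SiLU; the norm of Q7 over the spatial axis is taken before the sixth linear operation so that Q9 is a scalar quantity; and the inner product uses only Q7 and Q8. The names `conv_layer`, `init_params` and `W1` to `W7` are hypothetical.

```python
import numpy as np

def silu(x):
    # Assumed activation function; the claim does not name one.
    return x / (1.0 + np.exp(-x))

def init_params(F, rng):
    """Hypothetical parameter container holding seven weight matrices."""
    p = {f"W{i}": rng.normal(scale=0.3, size=(F, F)) for i in (1, 2, 4, 5, 6, 7)}
    p["W3"] = rng.normal(scale=0.3, size=(1, F))   # lifts the edge scalar to F channels
    return p

def conv_layer(Ns, Nv, Es, Ev, p):
    """Ns: (N, F), Nv: (N, F, 3), Es: (N, 1), Ev: (N, 3)."""
    Q1 = silu(Ns @ p["W1"]) @ p["W2"]              # linear, activation, linear
    Q2 = Es @ p["W3"]                              # third linear operation
    Q3 = Q1 * Q2                                   # first element-wise product, (N, F)
    Q4 = Q3[:, :, None] * Nv                       # second element-wise product, (N, F, 3)
    Q5 = Q3[:, :, None] * Ev[:, None, :]           # third element-wise product, (N, F, 3)
    Q6 = Q4 + Q5                                   # first matrix addition
    Q7 = np.einsum("nfx,fg->ngx", Q6, p["W4"])     # fourth linear operation (channel mixing)
    Q8 = np.einsum("nfx,fg->ngx", Q6, p["W5"])     # fifth linear operation (channel mixing)
    # Assumption: the norm of Q7 is used so that Q9 does not depend on orientation.
    Q9 = silu(np.linalg.norm(Q7, axis=-1) @ p["W6"]) @ p["W7"]   # (N, F)
    Q10 = np.sum(Q7 * Q8, axis=-1)                 # inner product over x, y, z, (N, F)
    Q11 = Q9 * Q10                                 # fourth element-wise product
    new_Ns = Q9 + Q11                              # second matrix addition
    new_Nv = Q8 * Q9[:, :, None]                   # fifth element-wise product
    return new_Ns, new_Nv
```

With these choices, rotating the node vector feature and the edge vector feature by the same rotation leaves the updated scalar feature unchanged and rotates the updated vector feature accordingly, which is the property an equivariant layer is expected to have.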
The method according to claim 11, characterized in that:

The solvation free energy prediction model includes:

a molecular encoding network configured to convert each piece of training data in a training data set including solute molecular data and/or solvent molecular data into a virtual molecular graph for the solute molecular data and/or the solvent molecular data, wherein the training data has solvation free energy label information, and each atom of the solute molecules or solvent molecules in the training data has F-dimensional features;

and / or

The solvation free energy prediction model includes:

a solvation network configured to convert solute molecular signatures and solvent molecular signatures into solvation signatures;

wherein the solvation network includes a self-attention network configured to determine a first element weight of the element corresponding to each atom of the solvent molecule in the solvent feature, and a second element weight of the element corresponding to each atom of the solute molecule in the solute feature, so that the elements corresponding to the atoms of the solvent molecule in the solvent feature are fused according to the first element weights and the elements corresponding to the atoms of the solute molecule in the solute feature are fused according to the second element weights, wherein the solvent feature and the solute feature are determined based on a solvation matrix, and the solvation matrix is determined based on the solute molecular features and the solvent molecular features;

and / or

The solvation free energy prediction model includes:

a fully connected network configured to convert the solvation feature into a solvation free energy, the fully connected network including, connected in sequence, a first linear layer, a first activation function layer, a second linear layer, a second activation function layer and a third linear layer, wherein the output dimension of the first linear layer and of the second linear layer is F, and the output dimension of the third linear layer is 1.
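By way of non-limiting illustration, the fully connected network with layer widths F, F and 1 may be sketched as follows. The input width `in_dim`, the ReLU activation and the random initialization are assumptions for illustration only; the claim specifies only the order of the layers and their output dimensions. The names `init_fc` and `fc_forward` are hypothetical.

```python
import numpy as np

def relu(x):
    # Assumed activation function; the claim does not name one.
    return np.maximum(x, 0.0)

def init_fc(in_dim, F, rng):
    """Parameters of three linear layers with output dimensions F, F and 1."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, F)), "b1": np.zeros(F),
        "W2": rng.normal(scale=0.1, size=(F, F)), "b2": np.zeros(F),
        "W3": rng.normal(scale=0.1, size=(F, 1)), "b3": np.zeros(1),
    }

def fc_forward(x, p):
    """x: (batch, in_dim) solvation features; returns (batch, 1) free energies."""
    h = relu(x @ p["W1"] + p["b1"])    # first linear layer, first activation layer
    h = relu(h @ p["W2"] + p["b2"])    # second linear layer, second activation layer
    return h @ p["W3"] + p["b3"]       # third linear layer, output dimension 1
```

If the solvation feature is the concatenation of two row vectors of F elements each, `in_dim` would be 2F.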
The method according to claim 14, further comprising:

dividing the training data set into sub-training data sets of a specified number;

constructing as many solvation free energy prediction models as the specified number;

Said inputting said training data into said molecular encoding network comprises:

inputting the training data in each sub-training data set into the molecular encoding network of a different solvation free energy prediction model, so as to train the different solvation free energy prediction models separately and obtain the specified number of trained solvation free energy prediction models.
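By way of non-limiting illustration, dividing the training data set and training one model per sub-training data set may be sketched as follows. The random shuffle before splitting is an assumption, and `train_one_model` is only a stand-in: it fits an ordinary least-squares model in place of the solvation free energy prediction model so that the example stays self-contained. All function names are hypothetical.

```python
import numpy as np

def split_training_set(n_samples, n_parts, seed=0):
    """Return n_parts disjoint index arrays that together cover all samples."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_parts)

def train_one_model(features, labels):
    """Stand-in for training one prediction model: least squares with a bias term."""
    X = np.hstack([features, np.ones((len(features), 1))])
    coef, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return coef

def train_ensemble(features, labels, n_parts):
    """Train one model on each sub-training data set."""
    parts = split_training_set(len(features), n_parts)
    return [train_one_model(features[idx], labels[idx]) for idx in parts]
```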
A method for determining free energy of solvation, comprising:

processing a virtual molecular graph using a solvation free energy prediction model trained by the method according to any one of claims 11 to 15 to obtain the solvation free energy for the virtual molecular graph, wherein the virtual molecular graph is a graph generated based on data to be processed, the data to be processed includes respective attribute information for a plurality of atoms in a target molecule, and the target molecule includes solute molecules and/or solvent molecules.
The method according to claim 16, further comprising:

inputting the virtual molecular graph or the data to be processed into a specified number of different trained solvation free energy prediction models to obtain the specified number of solvation free energies;

The weighted average of the specified number of solvation free energies is used as the solvation free energy corresponding to the data to be processed.
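By way of non-limiting illustration, combining the predictions of the trained models may be sketched as follows. The claim does not state how the weights are chosen; the sketch assumes equal weights when none are supplied. The function name `ensemble_free_energy` is hypothetical.

```python
import numpy as np

def ensemble_free_energy(predictions, weights=None):
    """Weighted average of the solvation free energies predicted by each model."""
    predictions = np.asarray(predictions, dtype=float)
    if weights is None:
        # Assumption: equal weights when none are given.
        weights = np.ones_like(predictions)
    return float(np.average(predictions, weights=weights))
```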
A design method, characterized in that the method comprises:

Determining the free energy of solvation according to the method of any one of claims 1 to 17;

Drug design or material design is performed based on the solvation free energy.
A data processing device, characterized in that it comprises:

A to-be-processed data obtaining module, configured to obtain data to be processed, the data to be processed including respective attribute information for a plurality of atoms in a target molecule;

The set generation module is used to generate a set of nodes and a set of node positions for the target molecule in response to the respective attribute information of the plurality of atoms, wherein the multiple nodes in the set of nodes respectively represent atoms of a specific atom type, The node position set includes coordinate information of each node in the node set in a specific coordinate system;

The node and edge feature generation module is used to generate the node scalar feature N s and the node vector feature N v for the node set, and generate the edge scalar feature for the node set based on the coordinate information of each node in the node position set E s and edge vector features E v ;

A virtual molecular graph construction module, configured to construct a virtual molecular graph based on the node scalar feature N s, the node vector feature N v, the edge scalar feature E s and the edge vector feature E v for the node set, so as to determine the molecular feature X of the target molecule based on the virtual molecular graph and to determine the solvation free energy based at least on the molecular feature X of the target molecule.
The device according to claim 19, wherein the target molecule comprises N atoms, and each of the multiple nodes in the node set has F-dimensional features;

The dimensions of the node scalar feature N s include N×F×1, the dimensions of the node vector feature N v include N×F×3, the dimensions of the edge scalar feature E s include N×1×1, and the dimensions of the edge vector feature E v include N×3×1.
The device according to claim 19, further comprising:

The truncation radius determination module is used to determine the truncation radius r cut after generating the node set and node position set for the target molecule in response to the respective attribute information of the plurality of atoms;

The target node set determination module is used to determine the target nodes whose distance between nodes is less than or equal to the truncation radius r cut from the node set to obtain the target node set N i ;

The target node set determining module is specifically configured to generate an edge scalar feature E s and an edge vector feature E v for the target node set N i based on the coordinate information for the target node in the node position set.
The device according to claim 21, wherein the target node set includes E nodes, and each of the E nodes has F-dimensional features;

The dimensions of the node scalar feature N s include N×F×1, the dimensions of the node vector feature N v include N×F×3, the dimensions of the edge scalar feature E s include E×1×1, and the dimensions of the edge vector feature E v include E×3×1.
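By way of non-limiting illustration, selecting neighbours within the truncation radius r cut and deriving edge features from the node coordinates may be sketched as follows. The claims state only that the edge features are generated from coordinate information; the sketch assumes that the edge scalar feature is the inter-node distance, that the edge vector feature is the unit direction between the two nodes, and that E counts directed neighbour pairs. The function name `build_edges` is hypothetical.

```python
import numpy as np

def build_edges(positions, r_cut):
    """positions: (N, 3) node coordinates.
    Returns source indices, target indices, Es of shape (E, 1), Ev of shape (E, 3)."""
    diff = positions[None, :, :] - positions[:, None, :]   # diff[i, j] = r_j - r_i
    dist = np.linalg.norm(diff, axis=-1)                   # (N, N) pairwise distances
    # Keep pairs within the truncation radius, excluding each node paired with itself.
    mask = (dist <= r_cut) & ~np.eye(len(positions), dtype=bool)
    src, dst = np.nonzero(mask)
    Es = dist[src, dst][:, None]     # assumed edge scalar feature: distance
    Ev = diff[src, dst] / Es         # assumed edge vector feature: unit direction
    return src, dst, Es, Ev
```

The sketch assumes that no two distinct atoms share the same coordinates; otherwise the normalisation would divide by zero.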
The device according to claim 20 or 22, wherein the device comprises:

The feature update module is used to update the node scalar feature N s and the node vector feature N v based on the virtual molecular graph, and obtain the updated node scalar feature New_N s and the updated node vector feature New_N v ;

The cycle module is used to repeat the following units until the specified number of cycles num_conv is reached, so that the updated node scalar feature New_N s obtained when the specified cycle number num_conv is reached is used as the molecular feature X:

A feature replacement unit configured to use the updated node scalar feature New_N s and the updated node vector feature New_N v as the current node scalar feature Now_N s and the current node vector feature Now_N v respectively;

A feature calculation unit configured to use the current node scalar feature Now_N s , the current node vector feature Now_N v , the edge scalar feature E s and the edge vector feature E v to construct and update a virtual molecular graph;

A feature updating unit configured to update the updated node scalar feature New_N s and the updated node vector feature New_N v based on the updated virtual molecular graph.
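By way of non-limiting illustration, the interplay of the feature update module and the cycle module may be sketched as follows. The callable `update_fn` is a hypothetical placeholder for one feature update based on the virtual molecular graph, and counting the initial update as the first of the num_conv cycles is an assumption.

```python
def molecular_feature(Ns, Nv, Es, Ev, update_fn, num_conv):
    """Apply update_fn num_conv times and return the final scalar feature as X."""
    # Initial update performed by the feature update module.
    new_Ns, new_Nv = update_fn(Ns, Nv, Es, Ev)
    for _ in range(num_conv - 1):
        # Feature replacement: the updated features become the current features.
        now_Ns, now_Nv = new_Ns, new_Nv
        # Feature calculation and update with the unchanged edge features.
        new_Ns, new_Nv = update_fn(now_Ns, now_Nv, Es, Ev)
    # The updated node scalar feature of the last cycle serves as molecular feature X.
    return new_Ns
```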
The device according to claim 23, wherein the feature update module is specifically configured to perform the following operations:

Performing a first linear operation, a second activation function and a second linear operation on the node scalar feature N s in sequence to obtain a first sub-processing result Q1, and performing a third linear operation on the edge scalar feature E s to obtain a second sub-processing result Q2;

Performing a first matrix corresponding multiplication operation on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain a third sub-processing result Q3;

Performing a second matrix corresponding multiplication operation on the third sub-processing result Q3 and the node vector feature N v to obtain a fourth sub-processing result Q4, and performing a third matrix corresponding multiplication operation on the third sub-processing result Q3 and the edge vector feature E v to obtain a fifth sub-processing result Q5;

performing a first matrix addition operation on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain a sixth sub-processing result Q6;

The sixth sub-processing result Q6 is respectively subjected to the fourth linear operation and the fifth linear operation to obtain the seventh sub-processing result Q7 and the eighth sub-processing result Q8;

Performing a sixth linear operation, a second activation function and a seventh linear operation on the seventh sub-processing result Q7 in sequence to obtain a ninth sub-processing result Q9, and performing an inner product operation Inner based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain a tenth sub-processing result Q10;

Performing a fourth matrix corresponding multiplication operation on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain an eleventh sub-processing result Q11;

Performing a fifth matrix corresponding multiplication operation on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N v;

Performing a second matrix addition operation on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N s.
The device according to claim 23, wherein the target molecule is a solute molecule and/or a solvent molecule;

The device also includes:

a solute and solvent molecular characteristic determination module, configured to determine a solute molecular characteristic of the solute molecule and a solvent molecular characteristic of at least one solvent molecule associated with the solute molecule, so that the solvation free energy is determined based on the solute molecular characteristic of the solute molecule and the solvent molecular characteristic of the at least one solvent molecule associated with the solute molecule.
The device according to claim 25, further comprising:

a solvation matrix determination module, configured to, after the solute molecular feature of the solute molecule and the solvent molecular feature of the at least one solvent molecule are determined, use the matrix product of the solvent molecular feature and the solute molecular feature as the solvation matrix between the solvent molecule and the solute molecule;

a solvation feature determination module, configured to determine a solvation feature based on the solvation matrix.
The device according to claim 26, wherein the solvation feature determination module comprises:

a solvent feature determination unit, configured to calculate the solvent feature corresponding to preset solute weights based on the solvation matrix, and to calculate the solute feature corresponding to preset solvent weights based on the solvation matrix;

a solute feature determination unit, configured to convert the solvent feature and the solute feature into one-dimensional row vectors each including F elements;

a solvation feature determination unit, configured to concatenate the row vectors to obtain the solvation feature.
The device according to claim 27, wherein the solute feature determination unit comprises:

an element weight determining subunit, configured to determine a first element weight of the element corresponding to each atom of the solvent molecule in the solvent feature, and to determine a second element weight of the element corresponding to each atom of the solute molecule in the solute feature;

a weighted summation subunit, configured to perform weighted summation on the solvent feature based on the first element weights to obtain a one-dimensional first row vector including F elements, and to perform weighted summation on the solute feature based on the second element weights to obtain a one-dimensional second row vector including F elements.
A device for training a solvation free energy prediction model, characterized in that:

a model training module, configured to input the virtual molecular graph determined by the device according to any one of claims 19 to 28 into the solvation free energy prediction model, and to adjust the model parameters until the loss function converges to obtain a trained solvation free energy prediction model, wherein the virtual molecular graph has corresponding solvation free energy label information, and the inputs of the loss function include the predicted solvation free energy and the solvation free energy in the solvation free energy label information.
The device according to claim 29, wherein the solvation free energy prediction model comprises:

an equivariant graph convolutional network configured to convert said virtual molecular graph into solute molecular features and/or solvent molecular features;

wherein the equivariant graph convolutional network includes num_conv convolutional layers, num_conv being the specified number of cycles, and the output of each convolutional layer is used as part of the input of the next adjacent convolutional layer; the input of the first convolutional layer includes: the node scalar feature N s, the node vector feature N v, the edge scalar feature E s and the edge vector feature E v; the output of the first convolutional layer includes: an updated node scalar feature New_N s and an updated node vector feature New_N v; the input of each convolutional layer other than the first convolutional layer includes: the updated node scalar feature New_N s, the updated node vector feature New_N v, the edge scalar feature E s and the edge vector feature E v; and the output of each convolutional layer other than the first convolutional layer includes: an updated node scalar feature New_N s and an updated node vector feature New_N v.
The device according to claim 30, wherein the convolutional layer is configured as:

Performing a first linear operation, a second activation function and a second linear operation on the node scalar feature N s in sequence to obtain a first sub-processing result Q1, and performing a third linear operation on the edge scalar feature E s to obtain a second sub-processing result Q2;

Performing a first matrix corresponding multiplication operation on the first sub-processing result Q1 and the second sub-processing result Q2 to obtain a third sub-processing result Q3;

Performing a second matrix corresponding multiplication operation on the third sub-processing result Q3 and the node vector feature N v to obtain a fourth sub-processing result Q4, and performing a third matrix corresponding multiplication operation on the third sub-processing result Q3 and the edge vector feature E v to obtain a fifth sub-processing result Q5;

performing a first matrix addition operation on the fourth sub-processing result Q4 and the fifth sub-processing result Q5 to obtain a sixth sub-processing result Q6;

The sixth sub-processing result Q6 is respectively subjected to the fourth linear operation and the fifth linear operation to obtain the seventh sub-processing result Q7 and the eighth sub-processing result Q8;

Performing a sixth linear operation, a second activation function and a seventh linear operation on the seventh sub-processing result Q7 in sequence to obtain a ninth sub-processing result Q9, and performing an inner product operation Inner based on the third sub-processing result Q3, the seventh sub-processing result Q7 and the eighth sub-processing result Q8 to obtain a tenth sub-processing result Q10;

Performing a fourth matrix corresponding multiplication operation on the ninth sub-processing result Q9 and the tenth sub-processing result Q10 to obtain an eleventh sub-processing result Q11;

Performing a second matrix addition operation on the ninth sub-processing result Q9 and the eleventh sub-processing result Q11 to obtain the updated node scalar feature New_N s;

Performing a fifth matrix corresponding multiplication operation on the eighth sub-processing result Q8 and the ninth sub-processing result Q9 to obtain the updated node vector feature New_N v.
The device according to claim 29, characterized in that:

The solvation free energy prediction model includes:

a molecular encoding network configured to convert each piece of training data in a training data set including solute molecular data and/or solvent molecular data into a virtual molecular graph for the solute molecular data and/or the solvent molecular data, wherein the training data has solvation free energy label information, and each atom of the solute molecules or solvent molecules in the training data has F-dimensional features;

and / or

The solvation free energy prediction model includes:

a solvation network configured to convert solute molecular signatures and solvent molecular signatures into solvation signatures;

wherein the solvation network includes a self-attention network configured to determine a first element weight of the element corresponding to each atom of the solvent molecule in the solvent feature, and a second element weight of the element corresponding to each atom of the solute molecule in the solute feature, so that the elements corresponding to the atoms of the solvent molecule in the solvent feature are fused according to the first element weights and the elements corresponding to the atoms of the solute molecule in the solute feature are fused according to the second element weights, wherein the solvent feature and the solute feature are determined based on a solvation matrix, and the solvation matrix is determined based on the solute molecular features and the solvent molecular features;

and / or

The solvation free energy prediction model includes:

a fully connected network configured to convert the solvation feature into a solvation free energy, the fully connected network including, connected in sequence, a first linear layer, a first activation function layer, a second linear layer, a second activation function layer and a third linear layer, wherein the output dimension of the first linear layer and of the second linear layer is F, and the output dimension of the third linear layer is 1.
The apparatus of claim 32, further comprising:

A training set segmentation module, configured to divide the training data set into sub-training data sets of a specified number of copies;

A model construction module, configured to construct as many solvation free energy prediction models as the specified number;

The model training module is specifically configured to input the training data in each sub-training data set into the molecular encoding network of a different solvation free energy prediction model, so as to train the different solvation free energy prediction models separately and obtain the specified number of trained solvation free energy prediction models.
A device for determining free energy of solvation, characterized in that it comprises:

A free energy prediction module, configured to process a virtual molecular graph using a solvation free energy prediction model trained by the device according to any one of claims 29 to 33 to obtain the solvation free energy for the virtual molecular graph, wherein the virtual molecular graph is a graph generated based on data to be processed, the data to be processed includes respective attribute information for a plurality of atoms in a target molecule, and the target molecule includes solute molecules and/or solvent molecules.
The apparatus of claim 34, further comprising:

A multi-model processing module, configured to input the virtual molecular graph or the data to be processed into a specified number of different trained solvation free energy prediction models to obtain the specified number of solvation free energies;

A weighting processing module, configured to use the weighted average of the specified number of solvation free energies as the solvation free energy corresponding to the data to be processed.
A design device, characterized in that said device comprises:

A solvation free energy determination module, configured to determine the solvation free energy using the device according to any one of claims 19 to 35;

The design module is used for drug design or material design based on the solvation free energy.
An electronic device, characterized in that it comprises:

processor; and

A memory on which executable code is stored, which, when executed by the processor, causes the processor to perform the method according to any one of claims 1-17.
A computer-readable storage medium, characterized in that executable code is stored thereon, and when the executable code is executed by a processor of an electronic device, the processor is caused to perform the method according to any one of claims 1-17.