CN114664391A

CN114664391A - Molecular feature determination method, related device and equipment

Info

Publication number: CN114664391A
Application number: CN202011538932.3A
Authority: CN
Inventors: 乔楠; 林歆远; 徐迟
Original assignee: Huawei Cloud Computing Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2020-12-23
Filing date: 2020-12-23
Publication date: 2022-06-24

Abstract

The embodiments of the application disclose a molecular feature determination method, a related device and equipment, which are used for improving the accuracy of the determined molecular features. In the method, a target molecular graph is first obtained, and N node features and M edge features are then obtained based on the target molecular graph, where N and M are integers greater than 1. The N node features and the M edge features are processed by a pre-trained molecular characterization model to obtain the target molecular features. The pre-trained molecular characterization model comprises an encoder and a decoder; when the molecular characterization model is pre-trained, the encoder encodes the features of a training molecular graph, the decoder reconstructs the encoding result of the encoder into a simplified molecular-input line-entry system (SMILES) expression, and the parameters of the encoder and the decoder are adjusted iteratively.

Description

Molecular feature determination method, related device and equipment

Technical Field

The embodiment of the application relates to the field of artificial intelligence, in particular to a molecular feature determination method, a related device and equipment.

Background

With continuous breakthroughs in the medical and pharmaceutical fields, human life expectancy has increased remarkably; meanwhile, with the continuous deterioration of the natural environment, new complex diseases have also emerged. Drug development nevertheless has an extremely low success rate despite costing billions in funding and decades of time, which urgently calls for new technical means to improve development efficiency and success rate. With the popularity of deep learning technology, applying artificial intelligence to innovate in various industries has become a development trend. Deep learning has shown overwhelming advantages over traditional methods in fields such as image processing and natural language processing. Using deep learning techniques to accelerate drug development would therefore be of great significance.

At present, a molecule can be encoded into a vector of a specific length (a hidden variable) through reconstruction training of molecules based on an autoencoder (AE), so as to achieve feature extraction and dimension reduction. To reduce the dimension of molecules, a model based on simplified molecular-input line-entry system (SMILES) expressions is generally used, and the model is trained with the SMILES expression as both its input and its output.

However, the SMILES expression lacks a representation of the chemical features of atoms and chemical bonds, and the structure of the molecule is only implicit in the character sequence, so that the hidden-space representation of the molecule obtained by the model cannot describe the chemical characteristics of the molecule well.

Disclosure of Invention

The embodiment of the application provides a molecular feature determination method, a related device and equipment, and the molecular characterization model pre-trained by the application can be used for more accurately characterizing molecules of a molecular graph, so that the accuracy of the determined molecular features is improved.

In a first aspect of embodiments of the present application, a method for determining molecular characteristics is provided. The method may be executed by a terminal device or a server, or may also be executed by a chip configured in the terminal device or the server, which is not limited in this application. In the method, a molecular feature determination device firstly needs to acquire a target molecular graph, then acquires N node features and M edge features based on the target molecular graph, wherein N and M are integers larger than 1, and then processes the N node features and the M edge features by using a pre-trained molecular characterization model to obtain the target molecular features, wherein the pre-trained molecular characterization model comprises an encoder and a decoder, the encoder is used for encoding the features of the training molecular graph when the molecular characterization model is pre-trained, the decoder is used for reconstructing the encoding result of the encoder into a SMILES expression, and parameters of the encoder and the decoder are adjusted in an iterative manner.

In this embodiment, the molecular graph is used to provide more features for the pre-trained molecular characterization model, so as to improve the accuracy of the determined molecular features, and further, the molecular features are reconstructed into a SMILES expression by the pre-trained molecular characterization model, so that the error of the molecular graph with a complex structure can be reduced due to the SMILES expression, thereby improving the performance of the pre-trained molecular characterization model.

In one implementation of the embodiments of the present application, the target molecular graph represents structural information of a chemical molecule, and the target molecular features are used for drug screening and/or molecular property prediction.

In this embodiment, since the target molecular graph can represent the structural information of a chemical molecule, the statistical correlation between the chemical structure and a classification (active, inactive, toxic, non-toxic, etc.) can be determined from the obtained structural information, and the determined molecular features of the chemical molecule can be used for drug screening and/or molecular property prediction by means of regression and classification techniques, thereby improving the feasibility of the scheme.

In an implementation manner of the embodiment of the present application, the node features and the edge features may be obtained by one-hot encoding, so the node features and the edge features are sparse. In addition, because the input dimensions of the node features and the edge features differ, the node features and the edge features may be projected to the same dimension so that operations such as attention convolution can be performed in subsequent steps. Based on this, the molecular feature determination device performs projection processing on the N node features through a first fully connected layer of the pre-trained molecular characterization model to obtain N processed node features, performs projection processing on the M edge features through a second fully connected layer of the pre-trained molecular characterization model to obtain M processed edge features, and then encodes the N processed node features and the M processed edge features by using the encoder in the pre-trained molecular characterization model to obtain the target molecular features. It should be understood that the processed node features have the same dimension as the processed edge features.

In this embodiment, the node features and the edge features are projected to the same dimension, so that processed node features and processed edge features of the same dimension are obtained, which facilitates subsequent operations such as attention convolution and thereby improves the feasibility and reliability of the scheme.
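
The one-hot encoding and the two projections described above can be illustrated with a minimal sketch. The vocabularies `ATOM_TYPES` and `BOND_TYPES`, the hidden size `HIDDEN`, the random weight initialization and the helper names are assumptions made for illustration only; the application does not specify them.

```python
import random

ATOM_TYPES = ["C", "N", "O"]        # hypothetical node vocabulary
BOND_TYPES = ["single", "double"]   # hypothetical edge vocabulary
HIDDEN = 4                          # assumed common dimension after projection


def one_hot(value, vocabulary):
    """Return a sparse 0/1 vector with a single 1 at the position of `value`."""
    return [1.0 if item == value else 0.0 for item in vocabulary]


def make_linear(in_dim, out_dim, rng):
    """Create the weights (out_dim x in_dim) and bias of a fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a fully connected layer: y = W x + b."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


rng = random.Random(0)
node_fc = make_linear(len(ATOM_TYPES), HIDDEN, rng)  # plays the role of the first fully connected layer
edge_fc = make_linear(len(BOND_TYPES), HIDDEN, rng)  # plays the role of the second fully connected layer

# Toy graph C-C-O: N = 3 node features (dimension 3), M = 2 edge features (dimension 2).
node_features = [one_hot(atom, ATOM_TYPES) for atom in ["C", "C", "O"]]
edge_features = [one_hot(bond, BOND_TYPES) for bond in ["single", "single"]]

projected_nodes = [linear(x, node_fc) for x in node_features]
projected_edges = [linear(e, edge_fc) for e in edge_features]
```

Although the inputs have dimensions 3 and 2, every vector in `projected_nodes` and `projected_edges` has length `HIDDEN`, which is what allows node and edge features to be combined in a later attention convolution.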

In an implementation manner of the embodiment of the present application, for the hierarchy of the graph structure, an attention graph convolution may be used to perform local computation to obtain features of neighboring nodes, and after the attention convolution operation is performed on each node, information of more neighboring nodes is fused. Therefore, the molecular feature determination device performs attention convolution operation on the N processed node features and the M processed edge features through the convolution layer of the encoder to obtain N attention convolved node features, then stacks the N processed node features and the N attention convolved node features through the node embedding layer of the encoder to obtain N stacked node features, and then aggregates the N stacked node features through the aggregation layer of the encoder to obtain the target molecular feature.

In the embodiment, the stacked node features are obtained by stacking the node features and the node features of the adjacent nodes, so that the pre-trained molecular characterization model learns more information in different subspaces represented by the stacked node features, the diversification and the accuracy of the information are improved, and the aggregation processing is performed, so that the accuracy of the obtained target molecular features is improved.
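
The three encoder steps described above (attention convolution, stacking, aggregation) can be sketched as follows. The attention score (a dot product between a node and each incoming message followed by a softmax), the message form (neighbor feature plus edge feature) and the sum aggregation are assumptions chosen for illustration; the application does not fix these formulas here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_conv(nodes, edges, edge_index):
    """One attention convolution: each node becomes an attention-weighted sum
    of messages, where a message is (neighbor feature + connecting edge feature)."""
    incoming = [[] for _ in nodes]
    for (u, v), edge in zip(edge_index, edges):
        # Bonds are treated as undirected, so a message flows in both directions.
        incoming[v].append([a + b for a, b in zip(nodes[u], edge)])
        incoming[u].append([a + b for a, b in zip(nodes[v], edge)])
    convolved = []
    for node, messages in zip(nodes, incoming):
        if not messages:  # isolated node: keep its own feature
            convolved.append(list(node))
            continue
        scores = [sum(q * k for q, k in zip(node, msg)) for msg in messages]
        weights = softmax(scores)
        convolved.append(
            [sum(w * msg[d] for w, msg in zip(weights, messages)) for d in range(len(node))]
        )
    return convolved


def stack(*feature_sets):
    """Node embedding step: concatenate the per-node vectors of each feature set."""
    count = len(feature_sets[0])
    return [[x for features in feature_sets for x in features[i]] for i in range(count)]


def aggregate(stacked):
    """Aggregation step: sum over nodes to get one fixed-length molecular feature."""
    return [sum(column) for column in zip(*stacked)]


# Toy inputs that are already projected to the same dimension (2).
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [[0.5, 0.5], [0.5, 0.5]]
edge_index = [(0, 1), (1, 2)]  # edge k connects the node pair edge_index[k]

convolved = attention_conv(nodes, edges, edge_index)
stacked = stack(nodes, convolved)
molecule_feature = aggregate(stacked)
```

Whatever the number of nodes, `molecule_feature` has a fixed length (here 4), which is the property that makes it usable as input to downstream regression or classification.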

In an implementation manner of the embodiment of the present application, after performing the attention convolution operation on the N processed node features and the M processed edge features through the convolution layer of the encoder to obtain the N attention-convolved node features, the molecular feature determination device may further perform the attention convolution operation again, through the convolution layer, on the N attention-convolved node features and the M processed edge features to obtain N node features after attention re-convolution, and then stack the N processed node features, the N attention-convolved node features, and the N node features after attention re-convolution through the node embedding layer to obtain the N stacked node features.

It should be understood that, in an actual application, after the N node features after attention re-convolution (that is, after a second attention convolution) are obtained, if there are still neighboring node features to be fused, the attention convolution operation may be performed once more on those N node features and the M processed edge features to obtain N node features after a further attention convolution, and so on until all neighboring node features are obtained.

In this embodiment, the attention convolution layer is used to extract node features within different radius ranges, feature extraction is performed separately at each radius, and the results are finally spliced into stacked node features that reflect both the features of each node and the features of its neighboring nodes, so that more levels of molecular information are provided for the pre-trained molecular characterization model, and the accuracy of the obtained node features is further improved.
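
The effect of repeating the convolution can be illustrated with a small sketch showing how the radius of information reaching a node grows with each pass. For brevity the attention convolution is replaced here by a plain neighbor average (`neighbour_mean_conv`), which is an assumed stand-in and not the layer of the application; the same function is reused on every pass, mirroring the idea of a single shared convolution layer.

```python
def neighbour_mean_conv(features, adjacency):
    """Stand-in for the shared convolution layer: average the neighbors' features."""
    convolved = []
    for i, neighbours in enumerate(adjacency):
        dim = len(features[i])
        convolved.append(
            [sum(features[j][d] for j in neighbours) / len(neighbours) for d in range(dim)]
        )
    return convolved


def multi_radius_stack(features, adjacency, num_passes):
    """Apply the same convolution num_passes times and concatenate every level per node."""
    layers = [features]
    for _ in range(num_passes):
        layers.append(neighbour_mean_conv(layers[-1], adjacency))
    stacked = [[x for layer in layers for x in layer[i]] for i in range(len(features))]
    return layers, stacked


# Path graph 0 - 1 - 2 - 3; each node starts with an indicator feature of itself,
# so a non-zero entry at position j shows that information from node j has arrived.
adjacency = [[1], [0, 2], [1, 3], [2]]
features = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

layers, stacked = multi_radius_stack(features, adjacency, num_passes=3)
```

In this toy graph, node 0 only receives information from node 3 (three bonds away) in `layers[3]`, while `stacked[0]` keeps the features from every radius side by side.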

In an implementation manner of the embodiment of the application, after the N node features and the M edge features are processed by using the pre-trained molecular characterization model to obtain the target molecular features, a first message indicating that the target molecular features are subjected to the re-parameterization processing may be further obtained, the target molecular features are subjected to the re-parameterization processing based on the first message, and at least one of the re-parameterized target molecular features is used as an input of a decoder included in the pre-trained molecular characterization model, so that at least one SMILES expression is obtained, and then the SMILES expression is used for generating similar compounds, that is, different chemical molecules are chemically synthesized based on the SMILES expression.

In the embodiment, at least one SMILES expression is obtained based on the obtained target molecule characteristics, different chemical molecules are chemically synthesized based on the SMILES expression so as to be used for generating similar compounds, and the scheme is applied to different service scenes, so that the flexibility of the scheme is improved.

In an implementation manner of the embodiment of the present application, the parameters corresponding to the attention convolution operations are the same, that is, every pass through the convolution layer uses the same parameters. In addition, a multi-head attention mechanism is also applied to the model.

In this embodiment, reusing the same convolution layer (parameter sharing) can improve the consistency of the feature aggregation operation across layers and reduce the number of parameters of the model without significantly affecting its performance. In addition, a multi-head attention mechanism can be applied to allow the model to learn more information in different representation subspaces, so that more feature information is retained and the accuracy of the obtained node features is improved.

A second aspect of embodiments of the present application provides a molecular feature determination apparatus, including:

the acquisition module is used for acquiring a target molecular graph;

the acquisition module is further used for acquiring N node features and M edge features based on the target molecular graph, wherein N and M are integers larger than 1;

and the processing module is used for processing the N node characteristics and the M edge characteristics by utilizing the pre-trained molecular characterization model to obtain target molecular characteristics, wherein the pre-trained molecular characterization model comprises an encoder and a decoder, the encoder is used for encoding the characteristics of the training molecular graph when the molecular characterization model is pre-trained, the decoder is used for reconstructing the encoding result of the encoder into an SMILES expression, and the parameters of the encoder and the decoder are adjusted in an iterative manner.

In an implementation manner of the embodiment of the present application, the processing module is specifically configured to:

performing projection processing on the N node features through a first fully connected layer of the pre-trained molecular characterization model to obtain N processed node features;

performing projection processing on the M edge features through a second fully connected layer of the pre-trained molecular characterization model to obtain M processed edge features;

and encoding the N processed node features and the M processed edge features by using the encoder in the pre-trained molecular characterization model to obtain the target molecular features.

In an implementation manner of the embodiment of the present application, the processing module is specifically configured to:

performing attention convolution operation on the N processed node features and the M processed edge features through a convolution layer of the encoder to obtain N attention-convolved node features;

stacking the N processed node characteristics and the N attention convolved node characteristics through a node embedding layer of the encoder to obtain N stacked node characteristics;

and aggregating the N stacked node features through an aggregation layer of the encoder to obtain the target molecular features.

In an implementation manner of the embodiment of the present application, the processing module is further configured to, after performing, by using the convolution layer of the encoder, the attention convolution operation on the N processed node features and the M processed edge features to obtain the N attention-convolved node features, perform, by using the convolution layer, the attention convolution operation again on the N attention-convolved node features and the M processed edge features to obtain N node features after attention re-convolution;

and the processing module is specifically used for stacking the N processed node features, the N attention-convolved node features and the N node features after attention re-convolution through the node embedding layer to obtain the N stacked node features.

In an implementation manner of the embodiment of the application, the obtaining module is further configured to obtain a first message after the N node features and the M edge features are processed by using a pre-trained molecular characterization model to obtain target molecular features;

the processing module is also used for performing re-parameterization processing on the target molecular features based on the first message to obtain at least one re-parameterized target molecular feature;

and the acquisition module is further used for acquiring at least one SMILES expression through the decoder included in the pre-trained molecular characterization model based on the at least one re-parameterized target molecular feature, wherein the at least one SMILES expression is used for generating similar compounds.

In a third aspect of the embodiments of the present application, there is provided a terminal device, which may be the molecular feature determination apparatus designed in the above method, or a chip disposed in the molecular feature determination apparatus. The terminal device includes a processor, coupled to a memory, and configured to execute the instructions in the memory to implement the method performed by the molecular feature determination apparatus in the first aspect and any one of the possible implementations thereof. Optionally, the terminal device further comprises the memory. Optionally, the terminal device further comprises a communication interface, the processor being coupled to the communication interface.

When the terminal device is the molecular feature determination apparatus, the communication interface may be a transceiver, or an input/output interface.

When the terminal device is a chip provided in the molecular feature determination apparatus, the communication interface may be an input/output interface.

Alternatively, the transceiver may be a transmit-receive circuit. Alternatively, the input/output interface may be an input/output circuit.

In a fourth aspect of the embodiments of the present application, there is provided a server, which may be the molecular feature determination apparatus designed in the above method, or a chip provided in the molecular feature determination apparatus. The server includes a processor, coupled to a memory, and configured to execute the instructions in the memory to implement the method performed by the molecular feature determination apparatus in the first aspect and any one of the possible implementations thereof. Optionally, the server further comprises the memory. Optionally, the server further comprises a communication interface, the processor being coupled to the communication interface.

When the server is the molecular feature determination apparatus, the communication interface may be a transceiver, or an input/output interface.

When the server is a chip provided in the molecular feature determination apparatus, the communication interface may be an input/output interface.

In a fifth aspect of embodiments of the present application, a program is provided, which, when executed by a processor, is configured to perform the method of the first aspect or any one of the possible implementations of the first aspect.

A sixth aspect of the embodiments of the present application provides a computer program product (or computer program) storing one or more computer-executable instructions; when the computer program product is executed by a processor, the processor executes the method in the first aspect or any one of the possible implementation manners of the first aspect.

A seventh aspect of the embodiments of the present application provides a chip system, where the chip system includes at least one processor configured to support a terminal device in implementing the functions recited in the first aspect or any one of the possible implementation manners of the first aspect. In one possible design, the chip system may further include at least one memory, the at least one processor being communicatively coupled to the at least one memory, and the at least one memory being configured to store the program instructions and data necessary for the terminal device and the server. Optionally, the chip system further includes an interface circuit, and the interface circuit provides program instructions and/or data for the at least one processor.

In an eighth aspect of embodiments of the present application, a computer-readable storage medium is provided, where the computer-readable storage medium stores a program, and the program enables a terminal device to execute any one of the methods in the first aspect and the possible implementation manners.

It should be noted that beneficial effects brought by the embodiments of the second aspect to the eighth aspect of the present application and descriptions of the embodiments of the aspects may be understood by referring to the embodiments of the first aspect, and therefore, repeated descriptions are omitted.

Drawings

FIG. 1 is a schematic diagram of a VAE network structure in an embodiment of the present application;

FIG. 2 is a schematic diagram of an embodiment of a method for molecular feature determination in an embodiment of the present application;

FIG. 3 is a schematic diagram of another embodiment of a method for molecular feature determination in an embodiment of the present application;

FIG. 4 is a schematic diagram of an embodiment of obtaining the target molecule characteristics after the re-parameter processing in the embodiment of the present application;

FIG. 5 is a schematic diagram of yet another embodiment of a method for molecular feature determination in an embodiment of the present application;

FIG. 6 is a schematic diagram of an embodiment of obtaining processed node features in the embodiment of the present application;

FIG. 7 is a diagram of an embodiment of obtaining processed edge features in an embodiment of the present application;

FIG. 8 is a schematic diagram of an embodiment of obtaining node features after attention convolution in an embodiment of the present application;

FIG. 9 is a schematic diagram of an embodiment of obtaining node features after attention re-convolution in the embodiment of the present application;

FIG. 10 is a schematic diagram of a convolutional layer of an encoder in the embodiment of the present application;

FIG. 11 is a schematic diagram of an embodiment of obtaining stacked node features in an embodiment of the present application;

FIG. 12 is a diagram illustrating another embodiment of obtaining stacked node features according to an embodiment of the present application;

FIG. 13 is a schematic flowchart illustrating a process of acquiring a target SMILES expression in the embodiment of the present application;

FIG. 14 is a schematic flow chart illustrating molecular property prediction based on a pre-trained molecular characterization model according to an embodiment of the present application;

FIG. 15 is a schematic diagram of an embodiment of a performance evaluation result in the embodiment of the present application;

FIG. 16 is a schematic view of an embodiment of a molecular characteristics determining apparatus according to the embodiment of the present application;

FIG. 17 is a schematic structural diagram of an embodiment of the molecular feature determination apparatus according to the embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Additionally, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

For better understanding of the method for determining molecular characteristics, the related apparatus and device disclosed in the embodiments of the present application, the technical solutions in the present application will be described below with reference to the drawings in the present application.

First, some terms or concepts related to the embodiments of the present application are explained to facilitate understanding by those skilled in the art.

One, deep learning

Deep learning is a machine learning technology based on a deep neural network algorithm, and is mainly characterized in that multiple nonlinear transformation structures are used for processing and analyzing data. The method is mainly applied to scenes such as perception, decision and the like in the field of artificial intelligence, such as image and voice recognition, natural language translation, computer gaming and the like.

Two, multilayer perceptron (MLP)

MLP is a feedforward artificial neural network that maps a set of input vectors to a set of output vectors. An MLP can be viewed as a directed graph consisting of multiple layers of nodes, each layer being fully connected to the next. Each node, except the input nodes, is a neuron (or processing unit) with a nonlinear activation function. A supervised learning method called the back-propagation algorithm is often used to train MLPs. The MLP is a generalization of the perceptron and overcomes the perceptron's inability to recognize linearly inseparable data.
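
As a small illustration of the last point, XOR is a classic example of data that is not linearly separable: no single perceptron can compute it, but an MLP with one hidden layer can. The weights below are hand-picked for illustration rather than learned by back-propagation, and the function names are hypothetical.

```python
def step(x):
    """Nonlinear activation: fire (1) when the weighted input is positive."""
    return 1 if x > 0 else 0


def mlp_xor(a, b):
    """Two-layer perceptron with hand-picked weights that computes XOR."""
    hidden_or = step(a + b - 0.5)    # hidden neuron 1 behaves like OR
    hidden_and = step(a + b - 1.5)   # hidden neuron 2 behaves like AND
    return step(hidden_or - hidden_and - 0.5)  # output: OR and not AND


truth_table = {(a, b): mlp_xor(a, b) for a in (0, 1) for b in (0, 1)}
```

Each hidden neuron draws one straight decision line; combining the two lines in the output neuron separates the XOR classes, which a single line cannot do.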

Three, autoencoder (AE)

An AE is a neural network that uses the back-propagation algorithm to make its output values approximate its input values: it first compresses the input into a hidden-space representation and then reconstructs the output from this representation. The autoencoder consists of an encoder and a decoder.

Four, Variational Auto Encoder (VAE)

A VAE is a type of autoencoder. In the encoder part, the input is compressed into two vectors of size n, which represent a mean vector μ and a standard deviation vector σ respectively; re-parameterization is then performed according to the mean vector μ and the standard deviation vector σ to generate a new hidden variable, and the decoder then reconstructs an output of the same size as the input.
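
The re-parameterization step is commonly written as z = μ + σ·ε with ε drawn from a standard normal distribution. A minimal sketch follows; the numeric values of `mu` and `sigma` and the function name are assumptions for illustration, and the encoder and decoder themselves are omitted.

```python
import random


def reparameterize(mu, sigma, rng):
    """Sample a hidden variable z = mu + sigma * eps, with eps ~ N(0, 1) per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


rng = random.Random(42)
mu = [0.2, -1.0, 0.5]      # hypothetical mean vector from an encoder (n = 3)
sigma = [0.1, 0.3, 0.05]   # hypothetical standard deviation vector

# Several different hidden variables can be drawn around the same molecule.
samples = [reparameterize(mu, sigma, rng) for _ in range(3)]

# With zero standard deviation the sample collapses to the mean.
deterministic = reparameterize(mu, [0.0, 0.0, 0.0], rng)
```

Because the randomness is isolated in ε, the sampled hidden variable stays a differentiable function of μ and σ, which is what allows back-propagation through the sampling step.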

Five, Graph Neural Network (GNN)

The GNN is a neural network based on a graph structure and is used for learning data features under a non-Euclidean space. More data representations in the real world are not simple arrangements of sequences or planes, but rather represent more complex graph structures, such as social networks, commodity-store-human relationships, molecular structures, and so forth.

Six, Message Passing Neural Network (MPNN)

MPNN is not a specific model but a class of models, namely a formalized framework for spatial convolution. It decomposes spatial convolution into two processes: message passing and state update. More specifically, after a message (a feature of a node or of an edge) is passed along the direction of an edge, the messages are reduced (Reduce) at the end point of the edge and the state of the end point itself is updated. One key advantage of MPNN is that, for a given task, suitable models for learning graph features can be built by defining the message passing and update behaviors.
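
The decomposition into message passing, reduction and state update can be sketched with pluggable functions. The scalar node states, the toy edge list and the particular choices of `message_fn`, `reduce_fn` and `update_fn` below are assumptions for illustration only.

```python
def message_passing_step(states, edges, message_fn, reduce_fn, update_fn):
    """Run one MPNN round: send messages along edges, reduce them at each
    end point, then update the end point's own state."""
    inbox = {node: [] for node in states}
    for source, target, edge_feature in edges:
        inbox[target].append(message_fn(states[source], edge_feature))
    return {
        node: update_fn(states[node], reduce_fn(inbox[node])) if inbox[node] else states[node]
        for node in states
    }


# Toy directed graph with scalar node states and scalar edge features.
states = {"a": 1.0, "b": 2.0, "c": 3.0}
edges = [("a", "b", 0.5), ("c", "b", 1.0), ("b", "c", 0.0)]

new_states = message_passing_step(
    states,
    edges,
    message_fn=lambda source_state, edge_feature: source_state + edge_feature,
    reduce_fn=sum,
    update_fn=lambda old_state, reduced: old_state + reduced,
)
```

Swapping in different message, reduce and update functions yields different concrete models within the same framework.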

Seven, simplified molecular-input line-entry system (SMILES)

SMILES is a specification for explicitly describing a molecular structure using American Standard Code for Information Interchange (ASCII) character strings. An important feature of SMILES compared to most other methods of representing structures is that it is very memory-space efficient and therefore very suitable for transport and storage.
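
For illustration, ethanol is written `CCO` and acetyl chloride `CC(=O)Cl` in SMILES. A model that reads or reconstructs SMILES expressions typically works on tokens rather than raw characters; the simplified tokenizer below is an assumption for illustration and not part of the application, and it is not a full SMILES parser (it only keeps bracket atoms and the two-letter halogens Cl and Br together).

```python
import re

# Bracket atoms such as [NH4+], then the two-letter halogens, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into atom, bond, branch and ring-closure tokens."""
    return SMILES_TOKEN.findall(smiles)


ethanol_tokens = tokenize_smiles("CCO")
acetyl_chloride_tokens = tokenize_smiles("CC(=O)Cl")
ammonium_tokens = tokenize_smiles("[NH4+]")
```

Note how the branch `(=O)` and the atom `Cl` appear as tokens in a flat sequence: the molecular structure is only implicit in the order of the characters.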

Eight, Quantitative Structure-Activity Relationship (QSAR)

QSAR is an important branch of chemometrics and the most widely used drug design method. It aims to establish quantitative relationships between the properties of a compound and its structure through suitable mathematical and statistical methods. The corresponding properties of other compounds are then predicted from these quantitative relationships, guiding designers to purposefully modify the structure of physiologically active substances, which greatly shortens the development cycle of high-performance compounds and saves development cost.

Nine, molecular similarity search

Molecular similarity search is a process that starts with a known active compound, quantifies similarity to the known compound in a molecular database, and returns results from high to low in similarity.

Ten, Virtual Screening (VS)

VS is a computational technique used in drug discovery that searches libraries of small molecules to identify the structures most likely to bind to a drug target (usually a protein receptor or enzyme). As the accuracy of the method has improved, virtual screening has become an indispensable part of the drug discovery process.

Eleven, Molecular Fingerprint

A molecular fingerprint is a way of characterizing a molecule: a vector of a specific length that describes the chemical structure of the molecule. A common type of fingerprint encodes a molecular structure as a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. The similarity between two molecules can be calculated quickly by comparing their fingerprints. In addition, fingerprints can be used in scenarios such as structure-activity modeling, similarity searching and virtual screening.
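
A fingerprint comparison and a similarity search can be sketched as follows. The Tanimoto coefficient is used here as one commonly used similarity measure for binary fingerprints (the application does not name a specific metric), and the 8-bit fingerprints and molecule names are made-up toy data.

```python
def tanimoto(fingerprint_a, fingerprint_b):
    """Tanimoto similarity: shared on-bits divided by on-bits present in either fingerprint."""
    both = sum(1 for a, b in zip(fingerprint_a, fingerprint_b) if a and b)
    either = sum(1 for a, b in zip(fingerprint_a, fingerprint_b) if a or b)
    return both / either if either else 0.0


def similarity_search(query, database):
    """Score every database molecule against the query and sort from high to low."""
    scored = [(name, tanimoto(query, fingerprint)) for name, fingerprint in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical known active compound and a tiny molecule database.
query = [1, 1, 0, 0, 1, 0, 0, 1]
database = {
    "mol_A": [1, 1, 0, 0, 1, 0, 0, 0],
    "mol_B": [1, 0, 0, 0, 0, 0, 0, 1],
    "mol_C": [0, 0, 1, 1, 0, 1, 0, 0],
}

ranking = similarity_search(query, database)
```

The result lists the database molecules from most to least similar to the query, matching the description of molecular similarity search given earlier.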

With continuous breakthroughs in the medical and pharmaceutical fields, human life expectancy has increased remarkably; at the same time, new complex diseases have emerged as the natural environment continues to deteriorate. Drug development nevertheless still has a very low success rate despite costing billions in funding and decades of time, which urgently calls for new technical means to improve development efficiency and success rate. With the popularity of deep learning technology, applying artificial intelligence to innovate in various industries has become a development trend, and using deep learning techniques to accelerate drug development would be of great significance. However, using neural networks to predict the properties of molecules is comparatively complex, because the predictor variables (molecules) can be of arbitrary size and shape, whereas most machine learning methods can only handle fixed-size inputs. The most common technique at present is to project the molecule into a feature vector of fixed dimension and then feed these features, as the representation of the molecule, into a deep neural network through a fully connected layer. As can be seen from the foregoing explanation of terms, a molecular fingerprint is a vector of a specific length that describes the chemical structure of a molecule, and the extended-connectivity fingerprint (ECFP) is a conventional molecular fingerprint used to capture molecular features related to molecular activity. The ECFP has advantages such as fast calculation, a theoretically unlimited number of representable features, and strong interpretability, and is widely applied in drug research.
However, because the ECFP is a binary vector and has no learning ability, its application is also limited to a certain extent, and a new deep learning network needs to be constructed to generate molecular fingerprints with continuous values.

VS is often applied in the medical and pharmaceutical field, that is, the molecular features of compounds are used for drug screening. Specifically, molecular features can be used for quantitative structure-activity relationship (QSAR) modeling (i.e., molecular property prediction), and QSAR modeling can serve as one of the tools of virtual drug screening. VS screens compound libraries for compounds of interest with the desired properties, and those compounds can then be tested experimentally. VS can therefore speed up the discovery process, reduce the number of compounds that must be tested experimentally, and allow for combinatorial selection. Further, with QSAR, relevant chemogenomics data can be collected from databases and the literature, and chemical descriptors can be calculated at different levels of molecular structure representation. The embodiments of the present application can then be used to determine the molecular features of compounds by correlating chemical structures with biological properties. Because QSAR can find statistical correlations between chemical structures and classifications (active, inactive, toxic, non-toxic, etc.), drug screening can be performed on the determined molecular features using regression and classification techniques.

When deep learning is used for drug screening, molecules can be represented directly by SMILES expressions or by molecular graphs. At present, an AE trained to reconstruct molecules can encode a molecule into a vector (hidden variable) of a specific length, achieving feature extraction and dimension reduction. A SMILES-based model is usually used, trained with the SMILES expression as both the input and the output of the model. However, a SMILES expression lacks a representation of the chemical features of atoms and chemical bonds, and the structure of the molecule is hidden in the character sequence. As a result, the hidden-space representation learned during training cannot describe the chemical characteristics of molecules well, which reduces the accuracy of the determined molecular features and the performance of the trained target model. In addition, the length and complexity of SMILES expressions and molecular graphs vary with molecular structure, so using them as input increases the complexity of the model and therefore the time required to design, train and verify it. ECFP, as a fixed-length representation, is easy to use, but because of problems such as discrete values and bit collisions, the information contained in the molecular features obtained with ECFP is ambiguous and hard to quantify, which reduces the accuracy of QSAR modeling for drugs.
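The bit collision problem mentioned above can be illustrated with a minimal sketch. It assumes a fingerprint built by folding integer substructure identifiers into a fixed number of bits with a modulo operation; the function name `fold_to_fingerprint`, the identifiers and the length 1024 are hypothetical values chosen for illustration, not taken from the description.

```python
def fold_to_fingerprint(substructure_ids, n_bits):
    """Fold integer substructure identifiers into a fixed-length bit vector."""
    bits = [0] * n_bits
    for identifier in substructure_ids:
        # The bit position is the identifier modulo the fingerprint length,
        # so different identifiers can map to the same position.
        bits[identifier % n_bits] = 1
    return bits


# Three distinct, made-up substructure identifiers.
ids = [5, 1029, 77]
fingerprint = fold_to_fingerprint(ids, n_bits=1024)

# 5 and 1029 share bit 5, so three substructures set only two bits.
bits_set = sum(fingerprint)
```

Once two substructures share a bit, the fingerprint can no longer tell which of them is present, which is one reason the information in such a fingerprint is ambiguous.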

A molecule is a whole formed by its constituent atoms combined in a certain bonding sequence and spatial arrangement; this bonding sequence and spatial arrangement is called the molecular structure. Because of the interactions between atoms within a molecule, the physical and chemical properties of a molecule depend not only on the type and number of its constituent atoms but also on its structure. Although molecules can be recorded and transmitted as text (e.g., SMILES), the nature of a molecule is better represented by a molecular graph, in which atoms correspond to vertices and chemical bonds correspond to edges: a molecule can have many different SMILES representations, whereas its molecular graph representation is unique. Based on these characteristics of molecules, and to solve the above problems, the embodiments of the present application provide a molecular feature determination method that improves the accuracy of the determined molecular features and the performance of the pre-trained molecular characterization model.

In addition, the technical solution of the present application can also be applied to generating similar compounds in the medical and pharmaceutical field. After the molecular features are obtained, a message instructing reparameterization of the molecular features is obtained, the molecular features are reparameterized based on that message, and the reparameterized molecular features are used as the input of the decoder to obtain the corresponding SMILES expressions. These SMILES expressions are used to generate similar compounds, that is, different chemical molecules are chemically synthesized based on them.

First, the network structure of the VAE used in the embodiments of the present application is introduced. Referring to fig. 1, fig. 1 is a schematic diagram of the network structure of the VAE in the embodiments of the present application. As shown in fig. 1, A1 indicates a molecular graph and A2 indicates a SMILES expression. Node features and edge features are first obtained from the molecular graph by the pre-trained molecular characterization model. The encoder of the model then compresses the node features and the edge features into two vectors of size n, which represent a mean vector μ and a standard deviation vector σ respectively. A hidden variable is generated by sampling according to μ and σ, and the decoder of the model then reconstructs the hidden variable into the SMILES expression corresponding to the molecular graph. On the premise of providing more effective feature information to the pre-trained molecular characterization model, this avoids calculation errors in the process of obtaining the molecular features.

On this basis, the molecular feature determination method of the embodiments of the present application is first described in detail for drug screening. Referring to fig. 2, fig. 2 is a schematic diagram of an embodiment of the molecular feature determination method in the embodiments of the present application. As shown in fig. 2, when applied to drug screening, the method includes the following steps.

S101, acquiring a target molecular graph;

In this embodiment, the molecular feature determination device first acquires a target molecular graph. The target molecular graph may be received by the device through network communication or may be stored by the device itself. The specific manner of acquiring the target molecular graph does not limit the embodiments of the present application.

S102, acquiring N node characteristics and M edge characteristics based on a target molecular graph;

In this embodiment, the molecular feature determination device uses the target molecular graph as the input of the pre-trained molecular characterization model, so as to provide the model with more information, including the molecular structure. The pre-trained molecular characterization model can thereby obtain N node features and M edge features, where N and M are integers greater than 1.

Specifically, the node features indicate atomic features in the target molecular graph, and the edge features indicate chemical bond features in the target molecular graph. Further, in the embodiments of the present application, the target molecular graph includes, but is not limited to, 8 atomic features and 4 chemical bond features, which describe atoms and their local environment. The 8 atomic features include adjacent heavy atoms, number of covalent bonds of the atom, formal charge, number of radical electrons, hybridization orbital type, aromaticity and chirality, and the 4 chemical bond features include chemical bond type, stereochemistry type (the nature of the bond's stereochemistry), conjugation and whether the bond is in a ring. It should be understood that these atomic features and chemical bond features are given only to aid understanding of the present solution and should not be construed as limiting it.
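To make the node and edge features concrete, the following minimal sketch encodes a toy molecular graph (the heavy atoms of ethanol). It assumes one-hot encoding of categorical features and covers only a small subset of the features listed above; the vocabularies, dictionary keys and function names are hypothetical and chosen for illustration.

```python
# Hypothetical vocabularies for two of the categorical features.
HYBRIDIZATIONS = ["sp", "sp2", "sp3"]
BOND_TYPES = ["single", "double", "triple", "aromatic"]


def one_hot(value, vocabulary):
    """Return a list with a 1 at the position of value and 0 elsewhere."""
    return [1 if value == item else 0 for item in vocabulary]


def atom_features(atom):
    # Hybridization type (one-hot) followed by an aromaticity flag.
    return one_hot(atom["hybridization"], HYBRIDIZATIONS) + [int(atom["aromatic"])]


def bond_features(bond):
    # Bond type (one-hot) followed by conjugation and ring-membership flags.
    return one_hot(bond["type"], BOND_TYPES) + [int(bond["conjugated"]), int(bond["in_ring"])]


# Toy molecular graph: heavy atoms of ethanol, C-C-O.
atoms = [
    {"symbol": "C", "hybridization": "sp3", "aromatic": False},
    {"symbol": "C", "hybridization": "sp3", "aromatic": False},
    {"symbol": "O", "hybridization": "sp3", "aromatic": False},
]
bonds = [
    {"begin": 0, "end": 1, "type": "single", "conjugated": False, "in_ring": False},
    {"begin": 1, "end": 2, "type": "single", "conjugated": False, "in_ring": False},
]

node_features = [atom_features(a) for a in atoms]  # N = 3 node features
edge_features = [bond_features(b) for b in bonds]  # M = 2 edge features
```

Note that the node features and the edge features have different lengths here (4 and 6), which is the usual situation when atoms and bonds are described by different feature sets.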

S103, processing the N node characteristics and the M edge characteristics by using the pre-trained molecular characterization model to obtain target molecular characteristics.

In this embodiment, the molecular feature determining apparatus processes the N node features and the M edge features using a pre-trained molecular characterization model to obtain target molecular features. Specifically, the pre-trained molecular characterization model comprises an encoder and a decoder, wherein when the molecular characterization model is pre-trained, the encoder is used for encoding the features of the training molecular graph, the decoder is used for reconstructing the encoding result of the encoder into a SMILES expression, and the parameters of the encoder and the decoder are adjusted iteratively.

Illustratively, as shown in fig. 1, the encoder in the pre-trained molecular characterization model compresses the node features and the edge features into two vectors of size n, which represent a mean vector μ and a standard deviation vector σ respectively, and the target molecular features are then generated by sampling according to μ and σ.

Next, since the molecular feature determination method of the embodiments of the present application can also be applied to generating similar compounds, that scenario is described in detail below. Referring to fig. 3, fig. 3 is a schematic diagram of another embodiment of the molecular feature determination method in the embodiments of the present application. As shown in fig. 3, when applied to generating similar compounds, the method includes the following steps.

S201, acquiring a target molecular graph;

In this embodiment, the manner in which the molecular feature determination device acquires the target molecular graph is similar to that in step S101 and is not described again here.

S202, acquiring N node characteristics and M edge characteristics based on the target molecular graph;

In this embodiment, the manner in which the molecular feature determination device acquires the N node features and the M edge features based on the target molecular graph is similar to that in step S102 and is not described again here.

S203, processing the N node characteristics and the M edge characteristics by using the pre-trained molecular characterization model to obtain target molecular characteristics;

In this embodiment, the manner in which the molecular feature determination device processes the N node features and the M edge features using the pre-trained molecular characterization model to obtain the target molecular features is similar to that in step S103 and is not described again here.

S204, acquiring a first message;

In this embodiment, after obtaining the target molecular features through step S203, the molecular feature determination device can further be applied to the similar compound generation scenario. It therefore further acquires a first message, which instructs it to perform reparameterization on the target molecular features.

S205, performing reparameterization on the target molecular features based on the first message to obtain at least one reparameterized target molecular feature;

In this embodiment, the molecular feature determination device performs reparameterization on the target molecular features based on the first message to obtain at least one reparameterized target molecular feature. For ease of understanding, refer to fig. 4, which is a schematic diagram of an embodiment of obtaining the reparameterized target molecular features in the embodiments of the present application. As shown in fig. 4, the target molecular feature (N_layer × 2N_latent) is split into two matrices of the hidden space, the mean (N_layer × N_latent) and the standard deviation (N_layer × N_latent). Reparameterization is then performed on the mean and the standard deviation to obtain the reparameterized target molecular feature (N_layer × N_latent).
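The split and reparameterization described for fig. 4 can be sketched as follows. The sketch rests on two assumptions that the description does not state: that the first N_latent columns of each row hold the mean and the remaining columns hold the standard deviation, and that reparameterization takes the form z = μ + σ·ε with ε drawn from a standard normal distribution, as is usual for a VAE. The function names are hypothetical.

```python
import random


def split_mean_std(feature, n_latent):
    """Split an N_layer x 2*N_latent matrix into mean and std matrices (assumed layout)."""
    means = [row[:n_latent] for row in feature]
    stds = [row[n_latent:] for row in feature]
    return means, stds


def reparameterize(means, stds, rng):
    """Sample z = mean + std * eps with eps ~ N(0, 1), element by element."""
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean_row, std_row)]
        for mean_row, std_row in zip(means, stds)
    ]


# Toy target molecular feature with N_layer = 2 and N_latent = 2.
feature = [
    [1.0, 2.0, 0.1, 0.1],
    [3.0, 4.0, 0.2, 0.2],
]
means, stds = split_mean_std(feature, n_latent=2)

# Drawing several samples yields "at least one" reparameterized feature,
# each of shape N_layer x N_latent.
rng = random.Random(0)
samples = [reparameterize(means, stds, rng) for _ in range(3)]
```

Each sample is a slightly different point in the hidden space around the same mean, which is what allows the decoder to produce several similar, but not identical, SMILES expressions from one target molecule.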

S206, obtaining at least one simplified molecular-input line-entry system (SMILES) expression, based on the at least one reparameterized target molecular feature, through the decoder included in the pre-trained molecular characterization model.

In this embodiment, the molecular feature determination device uses the at least one reparameterized target molecular feature as the input of the decoder included in the pre-trained molecular characterization model. The decoder reconstructs each such feature into a SMILES expression, so that at least one SMILES expression output by the decoder is obtained; the at least one SMILES expression is used for generating similar compounds.

A VAE model usually uses a fixed-length vector as its hidden-space representation, and in the molecular characterization problem the learning goal of the pre-trained molecular characterization model is molecular reconstruction. As the embodiments of fig. 2 and fig. 3 show, the molecular feature determination method of the embodiments of the present application uses the target molecular features obtained from the target molecular graph to reconstruct the SMILES expression of the molecule, so the hidden-space distribution of the VAE tends to favor SMILES generation. Even with the target molecular graph as the input of the pre-trained molecular characterization model, the ability of the model's hidden space to represent molecular properties therefore remains relatively limited. To retain more layer information of the molecule, so that the molecular representation better characterizes molecular properties, the embodiments of the present application further provide an encoder for encoding the target molecular graph, which retains more layer information in the target molecular features by means of hard segmentation. For ease of understanding, refer to fig. 5, which is a schematic diagram of another embodiment of the molecular feature determination method in the embodiments of the present application. As shown in fig. 5, the method includes the following steps.

S301, acquiring a target molecular graph;

S302, acquiring N node characteristics and M edge characteristics based on a target molecular graph;

S303, performing projection processing on the N node features through a first fully connected layer of the pre-trained molecular characterization model to obtain N processed node features;

In this embodiment, the node features and the edge features may be obtained through one-hot encoding, in which case they may be sparse. In addition, the input dimensions of the node features and the edge features differ, so both are projected to the same dimension to allow operations such as attention convolution in subsequent steps.

Based on this, the molecular feature determination device performs projection processing on the N node features through the first fully connected layer of the pre-trained molecular characterization model to obtain the N processed node features. For ease of understanding, refer to fig. 6, which is a schematic diagram of an embodiment of obtaining the processed node features in the embodiments of the present application. As shown in fig. 6, B1 indicates the target molecular graph. After the node features corresponding to the target molecular graph B1 are obtained through step S302, they are used as the input of the first fully connected layer in the encoder, which performs projection processing on them and outputs the processed node features.

S304, performing projection processing on the M edge features through a second fully connected layer of the pre-trained molecular characterization model to obtain M processed edge features;

In this embodiment, the node features and the edge features may be obtained through one-hot encoding, in which case they may be sparse. In addition, the input dimensions of the node features and the edge features differ, so both are projected to the same dimension to allow operations such as attention convolution in subsequent steps.

Based on this, the molecular feature determination device performs projection processing on the M edge features through the second fully connected layer of the pre-trained molecular characterization model to obtain the M processed edge features. For ease of understanding, refer to fig. 7, which is a schematic diagram of an embodiment of obtaining the processed edge features in the embodiments of the present application. As shown in fig. 7, C1 indicates the target molecular graph. After the edge features corresponding to the target molecular graph C1 are obtained through step S302, they are used as the input of the second fully connected layer in the encoder, which performs projection processing on them and outputs the processed edge features.

Specifically, the processed node features obtained in step S303 and the processed edge features obtained in step S304 have the same dimension. In addition, there is no required order between step S303 and step S304: they may be performed simultaneously or one after the other in either order, which is not limited in the embodiments of the present application.
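The projection in steps S303 and S304 can be sketched with two small fully connected layers that map inputs of different sizes to the same dimension. This is only an illustration under stated assumptions: the layers are taken to be plain linear maps y = W·x + b without activation, and the weights, input sizes (4 and 3) and output size (2) are made-up values, whereas in the described model the weights would be learned during pre-training.

```python
def linear(x, weight, bias):
    """Fully connected layer without activation: y = W x + b."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


# Sparse one-hot inputs of different dimensions (4 for a node, 3 for an edge).
node_feature = [0, 0, 1, 0]
edge_feature = [1, 0, 0]

# First fully connected layer: 4 -> 2 (made-up weights).
W_node = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8]]
b_node = [0.0, 0.0]

# Second fully connected layer: 3 -> 2 (made-up weights).
W_edge = [[0.9, 0.1, 0.2],
          [0.3, 0.4, 0.5]]
b_edge = [0.0, 0.0]

projected_node = linear(node_feature, W_node, b_node)  # dense, dimension 2
projected_edge = linear(edge_feature, W_edge, b_edge)  # dense, dimension 2
```

With a one-hot input, the layer simply selects one column of the weight matrix, so the sparse vector becomes a dense one, and because both outputs have the same dimension they can be added together in the subsequent attention convolution.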

S305, performing attention convolution operation on the N processed node features and the M processed edge features through a convolution layer of the encoder to obtain N attention-convolved node features;

In this embodiment, to account for the hierarchical nature of the graph structure, the embodiments of the present application use attention graph convolution to perform local computation and obtain the features of neighboring nodes; each time the attention convolution operation is performed on a node, the node fuses information from more neighboring nodes. Based on this, the molecular feature determination device performs the attention convolution operation on the N processed node features and the M processed edge features through the convolution layer of the encoder to obtain N attention-convolved node features.

For ease of understanding, refer to fig. 8, which is a schematic diagram of an embodiment of obtaining the attention-convolved node features in the embodiments of the present application. As shown in fig. 8, D1 indicates the target molecular graph. The processed node features and the processed edge features corresponding to the target molecular graph D1 are obtained through the steps described in the foregoing embodiments and are then used as the input of the convolution layer of the encoder in the pre-trained molecular characterization model. The convolution layer performs the attention convolution operation on them and outputs the attention-convolved node features.

Specifically, the convolution layer performs the attention convolution operation on the node features and the edge features as follows. First, for each directed atom pair connected by a chemical bond, the attention convolution operation includes the following steps:

(1) Adding the edge feature to each adjacent node feature and splicing (concatenating) the results along the chemical bond direction:

he′_uv＝[h_u+he_uv，h_v+he_uv]； (1)

u∈N(v)； (2)

where h denotes node features, he denotes edge features, and u and v denote nodes; u∈N(v) in formula (2) indicates that node u is a neighboring node of node v; h_u denotes the node feature of node u, h_v denotes the node feature of node v, he_uv denotes the edge feature between node u and node v, and he′_uv denotes the result of adding the edge feature to the node feature of node u and to the node feature of node v and splicing the two sums along the chemical bond direction.

(2) Calculating an attention weight value of a directed atom pair:

w_uv＝W·he′_uv； (3)

where u and v denote nodes, w_uv denotes the attention weight value between node u and node v, W denotes the weight matrix of the linear projection from he′_uv to w_uv, and he′_uv denotes the result of adding the edge feature to the node feature of node u and to the node feature of node v and splicing the two sums along the chemical bond direction.

Second, for each atom, together with all chemical bonds pointing to it and the atoms they connect, the attention convolution operation includes the following steps:

(1) Normalizing the weights of the neighboring nodes:

where a_uv denotes the normalized weight between node u and node v, u and v denote nodes, u∈N(v) indicates that node u is a neighboring node of node v, and w_uv denotes the attention weight value between node u and node v.

(2) Aggregation, projection and updating of node features:

h_v＝W·{h_v+∑_{u∈N(v)}[a_uv·(h_u+he_uv)]}； (5)

where h_u denotes the node feature of node u, h_v denotes the node feature of node v, W denotes the weight matrix of the linear projection of the node features, a_uv denotes the normalized weight between node u and node v, u∈N(v) indicates that node u is a neighboring node of node v, and he_uv denotes the edge feature between node u and node v.

It is understood that the foregoing steps and formulas are only used to understand the present solution, and the specific way to perform the attention convolution operation should be flexibly determined according to actual situations.
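The steps above can be sketched for a single node v as follows. The sketch follows formulas (1), (2), (3) and (5) as written; the normalization formula is not reproduced in the text, so a softmax over the neighbors of v is used here as an assumption. It is also assumed that the W of formula (3) and the W of formula (5) are two different matrices (named `w_att` and `w_out` below, both hypothetical names), since they act on vectors of different sizes, and that node and edge features already share one dimension after projection.

```python
import math


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_conv(v, h, he, neighbors, w_att, w_out):
    """One attention convolution update of node v; returns (new h_v, weights a_uv)."""
    scores, messages = [], []
    for u in neighbors[v]:                      # formula (2): u in N(v)
        edge = he[(u, v)]
        message = add(h[u], edge)               # h_u + he_uv
        spliced = message + add(h[v], edge)     # formula (1): concatenation
        scores.append(dot(w_att, spliced))      # formula (3): w_uv = W . he'_uv
        messages.append(message)
    if not scores:                              # isolated node: projection only
        return [dot(row, h[v]) for row in w_out], []
    # Normalization over the neighbors of v (softmax is an assumption).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Formula (5): aggregate the weighted messages, add h_v, then project.
    aggregated = list(h[v])
    for a_uv, message in zip(alphas, messages):
        aggregated = add(aggregated, [a_uv * x for x in message])
    return [dot(row, aggregated) for row in w_out], alphas


# Toy graph: node 0 has two neighbors, 1 and 2 (made-up 2-dimensional features).
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 1.0]}
he = {(1, 0): [0.0, 0.0], (2, 0): [0.0, 0.0]}
neighbors = {0: [1, 2]}
w_att = [0.1, 0.2, 0.3, 0.4]                    # projects a 4-vector to a scalar
w_out = [[1.0, 0.0], [0.0, 1.0]]                # identity, for readability

new_h0, weights = attention_conv(0, h, he, neighbors, w_att, w_out)
```

In this toy case the two neighbors are identical, so they receive equal weights, and the updated feature of node 0 combines its own feature with the weighted average of the neighbor messages.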

S306, performing the attention convolution operation on the N attention-convolved node features and the M processed edge features through the convolution layer to obtain N node features after attention convolution again;

In this embodiment, each time the attention convolution operation is performed on a node, the node fuses information from more neighboring nodes, and the output of each layer is passed to the next layer. Therefore, after obtaining the N attention-convolved node features through step S305, the molecular feature determination device performs the attention convolution operation again, on the N attention-convolved node features and the M processed edge features, through the convolution layer of the pre-trained molecular characterization model, to obtain N node features after attention convolution again. It should be understood that, in practice, if further neighboring node features remain after the N node features after attention convolution again are obtained, the attention convolution operation may be performed once more on those N node features and the M processed edge features, and so on, until all neighboring node features have been obtained.

For ease of understanding, refer to fig. 9, which is a schematic diagram of an embodiment of obtaining the node features after attention convolution again in the embodiments of the present application. As shown in fig. 9, E1 indicates the target molecular graph. The processed node features and the processed edge features corresponding to the target molecular graph E1 are obtained through the steps described in the foregoing embodiments, and the attention-convolved node features are then obtained through step S305. The attention-convolved node features and the processed edge features are used again as the input of the convolution layer of the encoder in the pre-trained molecular characterization model; the convolution layer performs the attention convolution operation on them and outputs the node features after attention convolution again.

Specifically, the convolution layer uses a message passing mechanism based on a message passing neural network (MPNN) to make joint use of the node features and the edge features. For ease of understanding, refer to fig. 10, which is a schematic structural diagram of the convolution layer of the encoder in the embodiments of the present application. As shown in fig. 10, F1 indicates the target molecular graph, F2 indicates the node feature of one node in the target molecular graph, F3 indicates the edge feature corresponding to the node feature F2, F41 and F42 indicate the node features of neighboring nodes, F5 indicates the attention convolution operation, and F6 indicates the vector splicing operation. After the node features and the edge features are obtained from the target molecular graph F1, the node features are first projected through the first fully connected layer to obtain the processed node features, and the processed edge features are obtained in the same way. The attention convolution operation F5 is then performed on the processed node features and the processed edge features, and likewise on the node features F41 and F42 of the neighboring nodes together with the processed edge features. Finally, the vector splicing operation F6 is performed on all the node features obtained, yielding node features that contain more information related to the node.

Optionally, the same parameters are used each time the attention convolution operation is performed, that is, all the convolution layers share the same parameters. Reusing the same convolution layer cyclically (parameter sharing) keeps the feature aggregation operation consistent across layers and reduces the number of model parameters without noticeably affecting model performance. In addition, a multi-head attention mechanism is applied in the model so that it can learn more information in different representation subspaces, which further increases the feature information retained and improves the accuracy of the obtained target node features.

S307, stacking the N processed node features, the N attention-convolved node features and the N node features after attention convolution again through the node embedding layer to obtain N stacked node features;

In this embodiment, to account for the hierarchical nature of the graph structure, the embodiments of the present application use attention graph convolution to perform local computation and obtain the features of neighboring nodes. Each time the attention convolution operation is performed on a node, the node fuses information from more neighboring nodes; the output of each layer is passed to the next layer, and the node features obtained at each layer are stacked. The molecular feature determination device therefore stacks the N processed node features, the N attention-convolved node features and the N node features after attention convolution again through the node embedding layer of the pre-trained molecular characterization model to obtain N stacked node features. It should be understood that the stacked node features described in the embodiments of the present application are graph features.

It should be understood that the embodiments of the present application describe only two rounds of the attention convolution operation. In practice, if further neighboring node features remain after the N node features after attention convolution again are obtained, the attention convolution operation may be performed once more on those N node features and the M processed edge features to obtain a further set of N node features. The node embedding layer then needs to stack all of these: the N processed node features, the N attention-convolved node features, the N node features after attention convolution again, and the node features from any further rounds. The stacking of three sets of node features described in this embodiment should therefore not be construed as limiting the present application.

For ease of understanding, refer to fig. 11, which is a schematic diagram of an embodiment of obtaining the stacked node features in the embodiments of the present application. As shown in fig. 11, G1 indicates the target molecular graph. The processed node features, the attention-convolved node features and the node features after attention convolution again corresponding to the target molecular graph G1 are obtained through the steps described in the foregoing embodiments and are used as the input of the node embedding layer of the encoder in the pre-trained molecular characterization model. The node embedding layer stacks them and outputs the stacked node features.

Further, refer to fig. 12, which is a schematic diagram of another embodiment of obtaining the stacked node features in the embodiments of the present application. As shown in fig. 12, H1 and H11 to H18 indicate carbon atoms, and H21 to H24 indicate oxygen atoms. Carbon atom H1 is taken as the central node, that is, the stacked node features obtained indicate all node features related to carbon atom H1. As fig. 12 shows, the nodes adjacent to carbon atom H1 are carbon atoms H11, H12 and H13, so performing the attention convolution operation on the processed node features (the node features of carbon atom H1) and the edge features yields attention-convolved node features that include the node features of carbon atoms H11, H12 and H13. Further, the nodes adjacent to carbon atom H11 include oxygen atoms H21 and H22, the nodes adjacent to carbon atom H12 include carbon atom H14, and the nodes adjacent to carbon atom H13 include carbon atom H15 and oxygen atom H23. Performing the attention convolution operation on the attention-convolved node features (the node features of carbon atoms H11, H12 and H13) and the edge features therefore yields node features after attention convolution again, which include the node features of oxygen atoms H21 and H22, carbon atoms H14 and H15, and oxygen atom H23.
Similarly, the node features after the next attention convolution include the node features of carbon atoms H16 and H17, and the node features after the last attention convolution include the node features of carbon atom H18 and oxygen atom H24. The processed node features, the attention-convolved node features, the node features after attention convolution again, the node features after the next attention convolution and the node features after the last attention convolution are then stacked to obtain the stacked node features shown in fig. 12, which include the node features of all the atoms shown in fig. 12.

It should be understood that the example of fig. 12 is only used for understanding the present solution, and the node characteristics after specific stacking need to be flexibly determined according to the adjacent relationship between the nodes and the specific node characteristics.
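The hop-by-hop expansion described for fig. 12 can be illustrated with a small sketch. The adjacency data below encodes only the bonds stated explicitly in the text (H1 to H11, H12 and H13; H11 to H21 and H22; H12 to H14; H13 to H15 and H23); the deeper bonds are not given in the description and are omitted. The function names `build_adjacency` and `hops_from` are hypothetical, and a breadth-first traversal is used here only to show which atoms each successive attention convolution brings within reach of the central node, not as the implementation of the embodiment.

```python
from collections import deque

# Bonds stated in the description of fig. 12 (treated as undirected edges).
BONDS = [
    ("H1", "H11"), ("H1", "H12"), ("H1", "H13"),
    ("H11", "H21"), ("H11", "H22"),
    ("H12", "H14"),
    ("H13", "H15"), ("H13", "H23"),
]

def build_adjacency(bonds):
    """Build an undirected adjacency map from a list of bonded atom pairs."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def hops_from(center, adjacency, max_hops):
    """Return a list whose k-th entry holds the atoms exactly k bonds from center."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the requested number of hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    layers = [[] for _ in range(max_hops + 1)]
    for node, d in distance.items():
        layers[d].append(node)
    return [sorted(layer) for layer in layers]

# Hop 0 is the central node, hop 1 its neighbors, hop 2 their neighbors.
layers = hops_from("H1", build_adjacency(BONDS), max_hops=2)
```

Here `layers[1]` corresponds to the atoms covered by the first attention convolution and `layers[2]` to those covered by the second.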

S308, aggregating the N stacked node characteristics through an aggregation layer of the encoder to obtain target molecular characteristics;

In this embodiment, the molecular feature determination device uses the N stacked node features as the input of the aggregation layer of the encoder in the pre-trained molecular characterization model, and the aggregation layer aggregates the N stacked node features to obtain the target molecular feature and outputs the target molecular feature.
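As a minimal sketch of stacking followed by aggregation, the example below makes two assumptions that this section does not state: "stacking" is read as concatenating, for each node, its feature vectors from every layer, and "aggregation" is read as an element-wise sum over the N nodes. The function names and the toy feature values are hypothetical.

```python
def stack_node_features(per_layer_features):
    """Concatenate, for each node, its feature vectors from every layer.

    per_layer_features is a list of layers; each layer is a list of N
    feature vectors. Concatenation is an assumed reading of "stacking".
    """
    num_nodes = len(per_layer_features[0])
    stacked = []
    for i in range(num_nodes):
        row = []
        for layer in per_layer_features:
            row.extend(layer[i])
        stacked.append(row)
    return stacked

def aggregate(stacked):
    """Sum the stacked node features element-wise (an assumed aggregator)."""
    return [sum(column) for column in zip(*stacked)]

# Toy features for N = 3 nodes and three layers of 2-dimensional features.
processed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
conv_once = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
conv_twice = [[0.25, 0.75], [0.5, 0.5], [1.0, 1.0]]

stacked = stack_node_features([processed, conv_once, conv_twice])
molecule_feature = aggregate(stacked)
```

With a sum or another permutation-invariant aggregator, the resulting molecule-level vector has a fixed length regardless of N, which is what allows molecules of different sizes to share one downstream decoder or predictor.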

S309, acquiring a first message;

In this embodiment, the manner in which the molecular feature determination device acquires the first message is similar to that in step S204, and is not described herein again.

S310, carrying out re-parameter processing on the target molecule characteristics based on the first message to obtain at least one re-parameter-processed target molecule characteristic;

in this embodiment, the manner in which the molecular feature determining apparatus performs the re-parameter processing on the target molecular feature based on the first message to obtain at least one re-parameter-processed target molecular feature is similar to that in step S205, and is not described herein again.
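Step S205 is not reproduced in this section, so the following is only a hedged sketch of one common form of re-parameter processing: adding scaled Gaussian noise to the target molecular feature so that several nearby feature vectors are available for decoding. It assumes, hypothetically, that the first message supplies a noise scale and a sample count; the function name `reparameterize` and its parameters are illustrative and are not taken from the application.

```python
import random

def reparameterize(feature, noise_scale, num_samples, seed=None):
    """Return num_samples perturbed copies of feature.

    Each copy is feature + noise_scale * eps, with eps drawn independently
    from a standard normal distribution for every dimension. Both
    noise_scale and num_samples are assumed to come from the first message.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        samples.append([x + noise_scale * rng.gauss(0.0, 1.0) for x in feature])
    return samples

# Four perturbed versions of a toy 3-dimensional target molecular feature.
perturbed = reparameterize([0.0, 1.0, 2.0], noise_scale=0.1, num_samples=4, seed=7)
```

Each perturbed vector could then be passed to the decoder to obtain one SMILES expression; a larger noise scale would be expected to give candidates that are less similar to the original molecule.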

S311, based on the at least one re-parameter-processed target molecule characteristic, obtaining at least one simplified molecular linear input specification SMILES expression through the decoder included in the pre-trained molecular characterization model.

In this embodiment, the manner in which the molecular feature determination device obtains the at least one simplified molecular linear input specification SMILES expression through the decoder included in the pre-trained molecular characterization model, based on the at least one re-parameter-processed target molecular feature, is similar to that in step S206, and is not described herein again.

For further understanding of the present solution, refer to fig. 13, which is a schematic flowchart of obtaining a target SMILES expression in the embodiment of the present application. As shown in fig. 13, F1 indicates the target molecular graph, and F2 indicates the SMILES expression. Node features and edge features are obtained from the target molecular graph F1. In a manner similar to steps S303 and S304, the node features and the edge features are passed through the first fully connected layer and the second fully connected layer to obtain processed node features and processed edge features. The processed node features and the processed edge features are used as the input of the convolutional layer, which outputs the node features after attention convolution. The node features after attention convolution and the processed edge features are then used as the input of the convolutional layer again, which outputs the node features after attention convolution again. The processed node features, the node features after attention convolution, and the node features after attention convolution again are used as the input of the node embedding layer, which outputs the stacked node features. The stacked node features are used as the input of the aggregation layer, which outputs the target molecular feature. Finally, the target molecular feature is used as the input of the decoder in the pre-trained molecular characterization model, which reconstructs it into the SMILES expression F2 corresponding to the target molecular graph F1 and outputs the SMILES expression F2.

The foregoing mainly describes in detail the molecular feature determination method used in the embodiments of the present application. The following describes how to pre-train the molecular characterization model and how to use the solution provided in the present application in a molecular property prediction scenario. For ease of understanding, the present embodiment uses a general-purpose data set, ZINC15, for pre-training the molecular characterization model, and takes as an example 4 common molecular property prediction data sets for evaluating the model's ability to characterize molecular properties at different levels: a solubility data set, a photovoltaic efficiency data set, a malaria bioactivity data set, and a biotoxicity data set (Tox21). These data sets cover three levels of molecular property prediction: physicochemical, biophysical, and physiological. Solubility and photovoltaic efficiency are prediction tasks for the physicochemical properties of molecules, malaria bioactivity is a prediction task at the biophysical level, and Tox21 is a prediction task at the physiological level. Referring to fig. 14, which is a schematic flowchart of molecular property prediction based on a pre-trained molecular characterization model according to an embodiment of the present application, the method for molecular property prediction based on a pre-trained molecular characterization model includes the following steps.

S401, pre-training a molecular characterization model;

In this embodiment, since the selected data set ZINC15 contains more than 230 million compounds, about 600k molecules are selected as the training set and 15k molecules are selected as the validation set. The molecular characterization model is then pre-trained (with a target molecular feature of 1024 dimensions), and the pre-trained molecular characterization model with the smallest error on the validation set of 15k molecules is kept for the subsequent tests.
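The subset selection and model selection in this step can be sketched as follows. The sketch assumes that the two subsets are drawn at random and do not overlap, which the text does not state; the pool size is scaled down for illustration, and the checkpoint names and error values are placeholders rather than reported results. The function names are hypothetical.

```python
import random

def sample_disjoint_subsets(pool_size, train_size, valid_size, seed=0):
    """Draw non-overlapping training and validation index sets from a pool."""
    if train_size + valid_size > pool_size:
        raise ValueError("pool is too small for the requested subsets")
    rng = random.Random(seed)
    # Sampling without replacement guarantees the two subsets are disjoint.
    chosen = rng.sample(range(pool_size), train_size + valid_size)
    return chosen[:train_size], chosen[train_size:]

def best_checkpoint(validation_errors):
    """Return the name of the checkpoint with the smallest validation error."""
    return min(validation_errors, key=validation_errors.get)

# Scaled-down stand-in for choosing 600k + 15k molecules from ZINC15.
train_idx, valid_idx = sample_disjoint_subsets(10_000, 600, 15)

# Placeholder validation errors for three hypothetical checkpoints.
selected = best_checkpoint({"epoch_1": 0.42, "epoch_2": 0.31, "epoch_3": 0.35})
```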

It should be appreciated that since each dataset is used for different attributes and levels of predictive tasks, the molecular characterization models corresponding to the solubility dataset, the photovoltaic efficiency dataset, the malaria bioactivity dataset, and the biotoxicity dataset are pre-trained separately.

S402, splitting a prediction task data set;

In this example, the solubility data set, the photovoltaic efficiency data set, the malaria bioactivity data set, and the biotoxicity data set are each divided into 5 equal parts for 5-fold cross validation. Then, based on the molecular characterization model pre-trained in step S401, 5-fold cross validation is performed on the solubility data set, the photovoltaic efficiency data set, the malaria bioactivity data set, and the biotoxicity data set, respectively.
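A minimal sketch of a 5-fold split is given below. It partitions sample indices into 5 near-equal contiguous folds and uses each fold once as the held-out part. Shuffling the indices before splitting is common practice but is omitted here because the text does not describe it; the function name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(num_samples, k=5):
    """Yield (train_indices, test_indices) for each of k near-equal folds."""
    indices = list(range(num_samples))
    # The first (num_samples % k) folds receive one extra sample.
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Ten samples split into five folds of two samples each.
splits = list(k_fold_splits(10, k=5))
```

Each sample appears in exactly one held-out fold, so averaging the metric over the 5 folds uses every sample once for evaluation.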

S403, training a multi-layer perceptron MLP;

in this embodiment, the prediction task of the physicochemical properties of the molecules, the prediction task of the biophysical level, and the prediction task of the physiological level are predicted separately using one and the same multilayer perceptron (MLP) including the convolutional layer shown in fig. 10. In the training process of the prediction task, the method only optimizes the parameters of the MLP of the prediction task, and does not optimize the model parameters of the pre-trained molecular characterization model.

And S404, evaluating the performance.

In this example, for the three regression prediction tasks of solubility, photovoltaic efficiency, and malaria bioactivity, the mean squared error (MSE) is used as the loss function, and the mean absolute error (MAE) is used to evaluate the difference between the predicted value and the true value. For the Tox21 classification task, cross entropy is used as the loss function, and the area under the curve (AUC) of the receiver operating characteristic (ROC) is used to evaluate the Tox21 classification performance. 5-fold cross validation is used for each task to obtain the final performance evaluation result.
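The three quantities named above can be written out directly. The sketch below uses the standard definitions of MSE and MAE, and computes the ROC AUC as the fraction of positive/negative pairs in which the positive sample receives the higher score (ties count as one half), which is equivalent to the area under the ROC curve. The function names and the toy inputs are illustrative only.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy values, not results from the application.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise AUC computation is quadratic in the number of samples; it is used here for clarity rather than efficiency.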

Specifically, in this embodiment, the performance of molecular property prediction is evaluated for three different molecular characterization manners: ECFP, the SMILES-based VAE, and the pre-trained molecular characterization model introduced in this embodiment. Referring to fig. 15, which is a schematic diagram of an example of the performance evaluation results in the embodiment of the present application, graph (A) in fig. 15 indicates the performance evaluation for malaria bioactivity prediction, graph (B) for solubility prediction, graph (C) for photovoltaic efficiency prediction, and graph (D) for biotoxicity prediction. As can be seen from graph (A) in fig. 15, in malaria bioactivity prediction, the performance evaluation result obtained by the pre-trained molecular characterization model (AutoMol) described in this embodiment is significantly different from that of the 512-dimensional ECFP6, and is not significantly different from those of the 1024-dimensional ECFP6 and the 2048-dimensional ECFP6. As can be seen from graph (B) in fig. 15, in solubility prediction, the performance evaluation result obtained by AutoMol is significantly improved compared to the 2048-dimensional ECFP6. As can be seen from graph (C) in fig. 15, in photovoltaic efficiency prediction, the performance evaluation result obtained by AutoMol is also significantly improved compared to the 2048-dimensional ECFP6. As can be seen from graph (D) in fig. 15, in biotoxicity prediction, the performance evaluation result obtained by AutoMol is significantly improved compared to the 2048-dimensional ECFP6. Therefore, as can be seen from the performance evaluation results shown in fig. 15, because the pre-trained molecular characterization model encodes molecules by making maximal use of the information contained in the chemical molecules, the encoded structural information of the chemical molecules is more accurate; applying the pre-trained molecular characterization model to molecular property prediction can therefore improve the prediction performance.

The foregoing mainly describes the solution provided by the embodiments of the present application from the perspective of the method. It is to be understood that, in order to perform the above functions, the molecular feature determination device includes corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the modules and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments of the present application, the molecular feature determination device may be divided into function modules based on the above method examples, for example, each function module may be divided for each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.

The molecular feature determination device in the present application is described in detail below. Referring to fig. 16, which is a schematic diagram of an embodiment of the molecular feature determination device in the present application, the molecular feature determination device 1700 includes:

an obtaining module 1701 for obtaining a target molecular graph;

the obtaining module 1701 is further configured to obtain N node features and M edge features based on the target molecular graph, where N and M are integers greater than 1;

a processing module 1702, configured to process the N node features and the M edge features by using a pre-trained molecular characterization model to obtain target molecular features, where the pre-trained molecular characterization model includes an encoder and a decoder, and when the molecular characterization model is pre-trained, the encoder is configured to encode the features of the training molecular graph, and the decoder is configured to reconstruct an encoding result of the encoder into a simplified molecular linear input specification SMILES expression, and parameters of the encoder and the decoder are iteratively adjusted.

In some optional embodiments of the present application, the target molecular graph represents structural information of a chemical molecule, and the target molecular features are used for drug screening and/or molecular property prediction.

In some optional embodiments of the present application, the processing module 1702 is specifically configured to:

In some optional embodiments of the present application, the processing module 1702 is further configured to, after performing, by the convolutional layer of the encoder, an attention convolution operation on the N processed node features and the M processed edge features to obtain N attention convolved node features, perform, by the convolutional layer, an attention convolution operation on the N attention convolved node features and the M processed edge features to obtain N re-attention-convolved node features;

the processing module 1702 is specifically configured to stack, through the node embedding layer, the N processed node features, the N attention convolved node features, and the N re-attention-convolved node features to obtain N stacked node features.

In some optional embodiments of the present application, the obtaining module 1701 is further configured to obtain a first message after the pre-trained molecular characterization model is used to process the N node features and the M edge features to obtain target molecular features;

the processing module 1702 is further configured to perform a re-parameter processing on the target molecule feature based on the first message, so as to obtain at least one re-parameter-processed target molecule feature;

the obtaining module 1701 is further configured to obtain, through a decoder included in the pre-trained molecular characterization model, at least one simplified molecular linear input specification SMILES expression based on the at least one re-parameter-processed target molecule feature, where the at least one SMILES expression is used for similar compound generation.

The molecular feature determination device in the embodiment of the present application may be deployed in a terminal device, may also be deployed in a server, and may also be a chip applied in the terminal device or the server, or other combined devices and components that can implement the functions of the terminal device. When the molecular feature determination device is a terminal device, the obtaining module 1701 and the processing module 1702 may be implemented by a processor executing code; for example, the processor may be an application chip of a certain type. When the molecular feature determination device is a component having the above-described terminal device function, the obtaining module 1701 and the processing module 1702 may be implemented by a processor executing code. When the molecular feature determination device is a system-on-a-chip, the obtaining module 1701 and the processing module 1702 may be processors of the system-on-a-chip.

Specifically, referring to fig. 17, which is a schematic structural diagram of an embodiment of a molecular feature determination device according to the present application, the molecular feature determination device 1800 includes a processor 1810, a memory 1820 coupled to the processor 1810, and an input/output port 1830. In some implementations, they may be coupled together by a bus. The molecular feature determination device 1800 may be a server or a terminal device. The processor 1810 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. The processor may also be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof. The processor 1810 may refer to one processor or may include a plurality of processors. The memory 1820 may include volatile memory, such as random access memory (RAM). The memory 1820 may also include non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 1820 may also include combinations of the above types of memory.

The memory 1820 has stored therein computer readable instructions for performing any of the methods of the possible embodiments described above. Processor 1810 may carry out corresponding operations as instructed by the computer-readable instructions when the computer-readable instructions are executed. Further, after the processor 1810 executes the computer readable instructions in the memory 1820, all operations that the server or the terminal device can perform can be performed according to the instructions of the computer readable instructions.

The input/output port 1830 includes a port for outputting data and, in some cases, a port for inputting data. The processor 1810 may call the input/output port 1830 by executing code to obtain a target molecular graph, and in some cases, the processor 1810 may also call the input/output port 1830 by executing code to obtain a target molecular graph from another device.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes, descriptions of the working processes and technical effects of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments, and no further description is provided herein.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims

1. A method of determining a characteristic of a molecule, comprising:

acquiring a target molecular graph;

acquiring N node features and M edge features based on the target molecular graph, wherein N and M are integers greater than 1;

and processing the N node features and the M edge features by using a pre-trained molecular characterization model to obtain target molecular features, wherein the pre-trained molecular characterization model comprises an encoder and a decoder, the encoder is used for encoding the features of the training molecular graph when the molecular characterization model is pre-trained, the decoder is used for reconstructing the encoding result of the encoder into a simplified molecular linear input specification SMILES expression, and parameters of the encoder and the decoder are adjusted in an iterative manner.

2. The method of claim 1, wherein the target molecular graph represents structural information of a chemical molecule, the target molecular features being used for drug screening and/or molecular property prediction.

3. The method according to claim 1 or 2, wherein the processing the N node features and the M edge features using the pre-trained molecular characterization model to obtain target molecular features comprises:

and encoding the N processed node characteristics and the M processed edge characteristics by using an encoder in the pre-trained molecular characterization model to obtain the target molecular characteristics.

4. The method of claim 3, wherein the encoding the N processed node features and the M processed edge features with an encoder in the pre-trained molecular characterization model to obtain the target molecular feature comprises:

stacking the N processed node features and the N attention convolved node features through a node embedding layer of the encoder to obtain N stacked node features;

and aggregating the N stacked node characteristics through an aggregation layer of the encoder to obtain the target molecular characteristics.

5. The method of claim 4, wherein after performing an attention convolution operation on the N processed node features and the M processed edge features by the convolutional layer of the encoder to obtain N attention convolved node features, the method further comprises:

performing attention convolution operation on the N attention convolved node features and the M processed edge features through the convolution layer to obtain N re-attention-convolved node features;

stacking the N processed node features and the N attention convolved node features by the node embedding layer of the encoder to obtain N stacked node features, including:

and stacking the N processed node features, the N attention convolved node features and the N re-attention-convolved node features through the node embedding layer to obtain the N stacked node features.

6. The method according to any one of claims 1 to 5, wherein after the processing the N node features and the M edge features using the pre-trained molecular characterization model to obtain target molecular features, the method further comprises:

acquiring a first message;

carrying out re-parameter processing on the target molecule characteristics based on the first message to obtain at least one re-parameter-processed target molecule characteristic;

obtaining, by the decoder comprised by the pre-trained molecular characterization model, at least one simplified molecular linear input specification SMILES expression based on the at least one re-parameter-processed target molecule feature, wherein the at least one SMILES expression is used for similar compound generation.

7. A molecular feature determination device, comprising:

the acquisition module is used for acquiring a target molecular graph;

the obtaining module is further configured to obtain N node features and M edge features based on the target molecular graph, where N and M are integers greater than 1;

and the processing module is used for processing the N node features and the M edge features by utilizing a pre-trained molecular characterization model to obtain target molecular features, wherein the pre-trained molecular characterization model comprises an encoder and a decoder, the encoder is used for encoding the features of the training molecular graph when the molecular characterization model is pre-trained, the decoder is used for reconstructing an encoding result of the encoder into a simplified molecular linear input specification SMILES expression, and parameters of the encoder and the decoder are adjusted in an iterative manner.

8. The molecular feature determination device according to claim 7, wherein the target molecular graph represents structural information of a chemical molecule, and the target molecular features are used for drug screening and/or molecular property prediction.

9. The molecular feature determination device of claim 7 or 8, wherein the processing module is specifically configured to:

10. The molecular feature determination device of claim 9, wherein the processing module is specifically configured to:

11. The molecular feature determination device of claim 10, wherein the processing module is further configured to, after performing an attention convolution operation on the N processed node features and the M processed edge features by a convolution layer of the encoder to obtain N attention-convolved node features, perform an attention convolution operation on the N attention-convolved node features and the M processed edge features by the convolution layer to obtain N re-attention-convolved node features;

the processing module is specifically configured to stack the N processed node features, the N attention convolved node features, and the N re-attention-convolved node features through the node embedding layer, so as to obtain the N stacked node features.

12. The molecular feature determination device according to any one of claims 7 to 11, wherein the obtaining module is further configured to obtain a first message after the N node features and the M edge features are processed by using a pre-trained molecular characterization model to obtain target molecular features;

the processing module is further configured to perform re-parameter processing on the target molecule feature based on the first message to obtain at least one re-parameter-processed target molecule feature;

the obtaining module is further configured to obtain, by the decoder included in the pre-trained molecular characterization model, at least one simplified molecular linear input specification SMILES expression based on the at least one re-parameter-processed target molecule feature, where the at least one SMILES expression is used for similar compound generation.

13. A computing device, comprising:

a processor and a memory;

the processor performs the method of any of claims 1 to 6 by executing code in the memory.

14. A computer-readable storage medium comprising instructions that, when executed by a computer, cause the computer to perform the method of any of claims 1-6.