CN115274008A - Molecular property prediction method and system based on graph neural network - Google Patents

Molecular property prediction method and system based on graph neural network

Info

Publication number
CN115274008A
Authority
CN
China
Prior art keywords
molecular
graph
data
predicted
property prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210944763.6A
Other languages
Chinese (zh)
Inventor
司马鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Chuangteng Software Co ltd
Original Assignee
Suzhou Chuangteng Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Chuangteng Software Co ltd filed Critical Suzhou Chuangteng Software Co ltd
Priority to CN202210944763.6A priority Critical patent/CN115274008A/en
Publication of CN115274008A publication Critical patent/CN115274008A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/80 Data visualisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the invention disclose a molecular property prediction method and system based on a graph neural network. The method comprises: acquiring a data file of a molecule to be predicted and converting the data file into graph data, wherein the graph data comprises a plurality of nodes and a plurality of edges, the nodes represent the atoms forming the molecule to be predicted, and the edges represent the chemical bonds of the molecule to be predicted; and inputting the graph data into a pre-trained molecular property prediction model to obtain the molecular characteristics of the molecule to be predicted. The molecular property prediction model is obtained by training on molecular graph samples; each molecular graph sample is an undirected graph formed by converting a data file sample, nodes in the undirected graph represent the atoms forming the molecular sample, and edges in the undirected graph represent the chemical bonds of the molecular sample. The method and system are tailored to the characteristics of molecular property prediction and can output molecular properties rapidly and accurately by using the pre-trained molecular property prediction model.

Description

Molecular property prediction method and system based on graph neural network
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a molecular property prediction method and system based on a graph neural network.
Background
Drug development is an expensive and time-consuming process that requires testing thousands of compounds to find a safe and effective drug. Candidate molecules are filtered through a series of progressive tests that determine their properties, effectiveness, and toxicity at successive stages.
QSAR/QSPR models are currently widely adopted. Applications of machine learning in drug development include, but are not limited to: prediction of biological activity or physicochemical properties; prediction of drug-protein and drug-drug interactions; de novo molecular design to produce molecular structures with desirable pharmacological properties; prediction of synthetic accessibility; and prediction of the products of synthetic reactions. Since traditional machine learning methods can only handle fixed-size inputs, most early drug discovery work relied on feature engineering, i.e., the generation and use of problem-specific molecular descriptors as features for a task. Commonly used descriptors include: (1) molecular fingerprints, which encode a molecular structure as a series of binary digits, each representing the presence of a particular substructure; (2) descriptors derived from quantum chemistry, physical chemistry, and differential topology, curated by statisticians and chemists; (3) SMILES strings, which represent the structure of a molecule in line notation. Given these predefined predictor variables, a classification or prediction model is then constructed and learned by a machine learning algorithm.
In recent years, more and more large chemical databases have become available for drug development, and new attempts to apply deep neural networks to drug development have emerged. An advantage of deep learning is that it can learn complex relationships between input features and output decisions from large-scale data. Its use in drug discovery and molecular informatics is still in its infancy but has shown great potential. Several common deep architectures have been used in drug-related work and have brought substantial improvements over traditional machine learning approaches. However, deep models still have limitations, for the following reasons. First, most current deep models are still based on hand-crafted features or predefined descriptors, which prevents them from learning structural information directly from the original input. Second, existing architectures are not well suited to structured data such as molecules, and the internal structural information is neither considered nor fully used in the feature extraction process of these architectures.
Disclosure of Invention
To this end, embodiments of the present invention provide a molecular property prediction method and system based on graph neural network to at least partially solve the above technical problems.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
a method of molecular property prediction based on a graph neural network, the method comprising:
acquiring a data file of a molecule to be predicted, and converting the data file into graph data, wherein the graph data comprises a plurality of nodes and a plurality of edges, the nodes represent atoms forming the molecule to be predicted, and the edges represent chemical bonds of the molecule to be predicted;
inputting the graph data into a pre-trained molecular property prediction model to obtain the molecular characteristics of the molecule to be predicted;
the molecular property prediction model is obtained by training on molecular graph samples, each molecular graph sample is an undirected graph formed by converting a data file sample, nodes in the undirected graph represent atoms forming the molecular sample, and edges in the undirected graph represent chemical bonds of the molecular sample.
In some embodiments, the method further comprises:
obtaining the structure data name of a molecule to be predicted;
and inputting the structure data name into a pre-trained molecular property prediction model to obtain the molecular structure of the molecule to be predicted.
In some embodiments, obtaining the molecular property of the molecule to be predicted further comprises:
determining a target atom in the molecule to be predicted and a target characteristic in the molecular characteristics in response to an input operation;
calculating a contribution value of the target atom to the target characteristic;
the contribution values are output graphically.
In some embodiments, the training process of the molecular property prediction model specifically includes:
acquiring a large number of data file samples, and converting each data file sample into a molecular graph sample to form a data sample set;
dividing the data sample set into a training set and a test set;
extracting feature data of all molecular graph samples in the training set, and training a pre-constructed graph neural network on the feature data to obtain the molecular property prediction model;
wherein the feature data comprises node features characterizing the atoms making up the molecular sample, and edge features characterizing the chemical bonds of the molecular sample.
In some embodiments, the format of the data file sample is any one of:
a mol structure file, an sdf structure file, or a table file containing SMILES strings.
In some embodiments, the network structure of the graph neural network comprises convolutional layers, a gather (aggregation) layer, and an output layer, wherein there are at most 3 convolutional layers, and the activation function is relu, sigmoid, or tanh.
The invention also provides a molecular property prediction device based on a graph neural network, which comprises:
a data acquisition unit, configured to acquire a data file of a molecule to be predicted and convert the data file into graph data, wherein the graph data comprises a plurality of nodes and a plurality of edges, the nodes represent atoms forming the molecule to be predicted, and the edges represent chemical bonds of the molecule to be predicted;
a result output unit, configured to input the graph data into a pre-trained molecular property prediction model to obtain the molecular characteristics of the molecule to be predicted;
the molecular property prediction model is obtained by training on molecular graph samples, each molecular graph sample is an undirected graph formed by converting a data file sample, nodes in the undirected graph represent atoms forming the molecular sample, and edges in the undirected graph represent chemical bonds of the molecular sample.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method as described above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method as described above.
According to the molecular property prediction method and device based on the graph neural network, a data file of a molecule to be predicted is acquired and converted into graph data, wherein the graph data comprises a plurality of nodes and a plurality of edges, the nodes represent atoms forming the molecule to be predicted, and the edges represent chemical bonds of the molecule to be predicted; the graph data is input into a pre-trained molecular property prediction model to obtain the molecular characteristics of the molecule to be predicted; the molecular property prediction model is obtained by training on molecular graph samples, each molecular graph sample is an undirected graph formed by converting a data file sample, nodes in the undirected graph represent atoms forming the molecular sample, and edges in the undirected graph represent chemical bonds of the molecular sample. The method and device are tailored to the characteristics of molecular property prediction and can output molecular properties rapidly and accurately by using the pre-trained molecular property prediction model.
In some embodiments, with the method provided herein, deep learning in the field of drug discovery enables large-scale prediction of chemical properties and activity in a relatively short time, automating and accelerating the drug discovery process; the introduction of the graph convolutional network provides more accurate predictions than conventional methods by taking the intrinsic molecular structure into account; furthermore, when combined with other mechanisms, the graph convolutional network produces biologically interpretable results.
In some embodiments, the method provided by the invention predicts the properties of small drug molecules, covering both regression and classification tasks, by building a GCN workflow and adjusting the corresponding parameters. The workflow includes data set loading, graph data preprocessing, data set division, GCN model building, model training, model evaluation, model extraction, prediction, and the like. An atomic contribution map of each molecule for each property is also given, which improves model interpretability.
In some embodiments, by integrating the ADMET property prediction GCN models, the method provided by the invention can rapidly predict a variety of properties in the ADMET data sets, such as absorption, distribution, metabolism, toxicity, and physicochemical properties; meanwhile, models for data sets with poor results can be retrained, thereby improving model accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary and that other implementation drawings may be derived from the provided drawings by those of ordinary skill in the art without inventive effort.
The structures, proportions, sizes, and the like shown in this specification are only intended to accompany the disclosed content so that those skilled in the art can understand and read it; they do not limit the conditions under which the present invention can be implemented and therefore have no substantive technical significance. Any modification of structure, change of proportion, or adjustment of size that does not affect the efficacy and achievable purpose of the present invention shall still fall within the scope covered by the technical content disclosed herein.
FIG. 1 is a flow chart of a molecular property prediction method based on graph neural networks according to the present invention;
FIG. 2 is a second flowchart of the molecular property prediction method based on graph neural network according to the present invention;
FIG. 3 is a third flowchart of the method for predicting molecular properties based on graph neural networks according to the present invention;
FIG. 4 is a block diagram of the molecular property prediction device based on graph neural network provided in the present invention;
FIG. 5 is a block diagram of a computer device according to the present invention.
Detailed Description
The present invention is described below by way of particular embodiments; other advantages and features of the invention will become apparent to those skilled in the art from this disclosure. It is to be understood that the described embodiments are merely exemplary and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Referring to FIG. 1, FIG. 1 is a flowchart of a molecular property prediction method based on a graph neural network according to the present invention.
In one embodiment, the method for predicting molecular properties based on the graph neural network provided by the invention comprises the following steps:
S101: acquiring a data file of a molecule to be predicted, and converting the data file into graph data, wherein the graph data comprises a plurality of nodes and a plurality of edges, the nodes represent atoms forming the molecule to be predicted, and the edges represent chemical bonds of the molecule to be predicted;
S102: inputting the graph data into a pre-trained molecular property prediction model to obtain the molecular characteristics of the molecule to be predicted;
the molecular property prediction model is obtained by training on molecular graph samples, each molecular graph sample is an undirected graph formed by converting a data file sample, nodes in the undirected graph represent atoms forming the molecular sample, and edges in the undirected graph represent chemical bonds of the molecular sample.
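As an illustration of step S101, the following is a minimal sketch of turning a mol-format data file into an undirected graph. It uses only the Python standard library and handles only the atom and bond blocks of a V2000 mol block; the names `MolGraph` and `parse_molblock` and the hand-written ethanol block are assumptions made for this example, not part of the disclosed implementation.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Undirected molecular graph: atoms are nodes, chemical bonds are edges."""
    atoms: list                                  # element symbol per node
    bonds: dict = field(default_factory=dict)    # {(i, j): bond_order}, i < j

    def neighbors(self, i):
        """Return the sorted indices of the atoms bonded to atom i."""
        return sorted(b if a == i else a for (a, b) in self.bonds if i in (a, b))


def parse_molblock(text):
    """Parse the atom and bond blocks of a V2000 mol block (minimal subset)."""
    lines = text.splitlines()
    counts = lines[3]                            # 4th line: atom and bond counts
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = [lines[4 + k].split()[3] for k in range(n_atoms)]
    graph = MolGraph(atoms)
    for line in lines[4 + n_atoms: 4 + n_atoms + n_bonds]:
        a, b, order = int(line[0:3]), int(line[3:6]), int(line[6:9])
        i, j = sorted((a - 1, b - 1))            # mol files index atoms from 1
        graph.bonds[(i, j)] = order              # one entry per undirected edge
    return graph


# Hand-written heavy-atom skeleton of ethanol (coordinates are illustrative).
ETHANOL = "\n".join([
    "ethanol",
    "  hand-written example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2000    1.2000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])

graph = parse_molblock(ETHANOL)
print(graph.atoms)         # ['C', 'C', 'O']
print(graph.bonds)         # {(0, 1): 1, (1, 2): 1}
print(graph.neighbors(1))  # [0, 2]
```

Storing each bond once under a sorted index pair is one simple way to keep the graph undirected; a production pipeline would more likely rely on a cheminformatics toolkit for parsing.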
Structured data, in principle, calls for specially structured deep neural networks. For example, images have been successfully processed by convolutional neural networks (CNNs), which achieve state-of-the-art performance in image-related tasks because the convolution operator can automatically extract task-relevant features from the raw image. Drugs and small molecules, which consist of atoms and chemical bonds, have a different type of structure, namely a graph, in which each atom is a node and each chemical bond is an edge. One simple attempt is to adapt grid-style convolution to graphs. However, unlike images, graphs have irregular shapes and sizes; there is no spatial order on the nodes, and the neighbors of a node are also position dependent. Therefore, conventional convolution on regular grid-like structures cannot be directly applied to graphs. In fact, much structured data in the real world takes the form of graphs rather than images, which means that developing methods for handling irregular structures is both important and urgently needed.
Graph convolutional networks (GCNs) are among the most advanced methods for drug-related tasks because they: (1) extract features by taking the data structure into account; and (2) automatically extract features from the raw input rather than relying on hand-crafted features, which may lose important information because of expert bias. The advantages are: (1) deep learning in the field of drug discovery enables large-scale prediction of chemical properties and activity in a relatively short time, automating and accelerating the drug discovery process; (2) the introduction of the graph convolutional network provides more accurate predictions than conventional methods by taking the intrinsic molecular structure into account; (3) furthermore, when combined with other mechanisms, the graph convolutional network produces biologically interpretable results. Despite the recent success of graph convolutional networks, challenges remain in fully releasing their potential in drug discovery.
In some embodiments, the method further comprises:
obtaining the structure data name of a molecule to be predicted;
and inputting the structure data name into a pre-trained molecular property prediction model to obtain the molecular structure of the molecule to be predicted.
That is, in addition to the molecular properties, the molecular structure can also be obtained through the molecular property prediction model.
Further, in step S102, as shown in FIG. 2, the method further includes the following steps:
S201: in response to an input operation, determining a target atom in the molecule to be predicted and a target characteristic among the molecular characteristics;
S202: calculating a contribution value of the target atom to the target characteristic;
S203: outputting the contribution values graphically.
In an actual use scenario, every atom in the molecule to be predicted can be treated as a target atom, and the target characteristic can be obtained by manual input. A contribution value to the target characteristic is calculated for each atom, a graph is generated, and the graph is output on a screen to display the contribution values visually.
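Steps S201 to S203 can be sketched as follows. The source does not state how the contribution value is calculated, so this example assumes a simple masking scheme: the contribution of an atom is the model output for the whole molecule minus the output with that atom masked. The names `atom_contributions`, `toy_predict`, `TOY_WEIGHTS`, and `render` are hypothetical, the toy scorer has no chemical meaning, and the text bars stand in for the graphical output.

```python
def atom_contributions(atoms, predict):
    """Contribution of atom i = prediction(all atoms) - prediction(atom i masked)."""
    full = predict(atoms, frozenset())
    return [full - predict(atoms, frozenset({i})) for i in range(len(atoms))]


# Hypothetical stand-in for a trained property model: a per-element additive score.
TOY_WEIGHTS = {"C": 0.5, "O": -1.0, "N": -0.75}


def toy_predict(atoms, masked):
    """Score a molecule while ignoring the atoms whose indices are in `masked`."""
    return sum(TOY_WEIGHTS.get(sym, 0.0)
               for i, sym in enumerate(atoms) if i not in masked)


def render(atoms, contributions, width=8):
    """Draw one text bar per atom: '+' for positive, '-' for negative contributions."""
    scale = max(abs(c) for c in contributions) or 1.0
    rows = []
    for i, (sym, c) in enumerate(zip(atoms, contributions)):
        bar = ("+" if c >= 0 else "-") * round(abs(c) / scale * width)
        rows.append(f"{i:>2} {sym:<2} {c:+.2f} {bar}")
    return "\n".join(rows)


atoms = ["C", "C", "O"]                 # heavy atoms of ethanol
contribs = atom_contributions(atoms, toy_predict)
print(contribs)                         # [0.5, 0.5, -1.0]
print(render(atoms, contribs))
```

Because the model is passed in as a callable, the same masking loop works with any predictor that accepts a set of masked atom indices.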
Specifically, one-dimensional text sequences and two-dimensional images are Euclidean data, from which information can be extracted with a convolution kernel. Small drug molecules are non-Euclidean data: they lack translation invariance, so a fixed convolution kernel cannot extract the same structural information from them, and a conventional convolutional neural network cannot be applied. Networks for processing such data, namely graph neural networks (GNNs), were developed for this reason. A molecular graph is composed of nodes and edges. In the present invention, each node is described by atom type, element, number of hydrogen atoms, valence, aromaticity, and other characteristics; these descriptors are one-hot encoded for each node, and the connectivity between atom pairs is represented by an adjacency matrix. In this embodiment, the SMILES of each compound is processed with RDKit to obtain a molecular graph and a Morgan fingerprint, which are used by the GNN. The technical solution predicts small-molecule properties with a graph neural network in which the graph convolution operation is a conceptual extension of the conventional convolution operation to topological graphs; the resulting network is called a graph convolutional network (GCN), and the graph convolution is implemented using the Laplacian matrix of the graph and the graph Fourier transform. Besides property prediction, other tasks can be realized: 1. predicting the labels of nodes or edges, which can be used for reaction prediction and retrosynthesis; 2. learning a latent representation of the graph, which can be used for one-shot generation and optimization of molecules; 3. learning the transformation rules of the graph, which can be used for iterative generation and optimization of molecules.
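The node encoding described above can be sketched in plain Python. The embodiment obtains these descriptors with RDKit; here the atoms of ethanol are written out by hand so the example stays self-contained. The element vocabulary `ELEMENTS`, the hydrogen-count cap `MAX_H`, and the reduced feature set (element, hydrogen count, aromaticity) are assumptions chosen for brevity.

```python
ELEMENTS = ["C", "N", "O", "S", "other"]   # assumed element vocabulary
MAX_H = 4                                  # hydrogen counts 0..4, last slot = "4 or more"


def one_hot(value, choices):
    """Return a 0/1 list with a single 1; values outside `choices` use the last slot."""
    idx = choices.index(value) if value in choices else len(choices) - 1
    return [1 if k == idx else 0 for k in range(len(choices))]


def atom_features(symbol, num_h, aromatic):
    """Concatenate one-hot element, one-hot hydrogen count, and an aromaticity flag."""
    return (one_hot(symbol, ELEMENTS)
            + one_hot(num_h, list(range(MAX_H + 1)))
            + [int(aromatic)])


def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 matrix from a list of (i, j) bonded atom pairs."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = adj[j][i] = 1          # undirected: set both directions
    return adj


# Ethanol, CH3-CH2-OH: (element, attached hydrogens, aromatic?)
atoms = [("C", 3, False), ("C", 2, False), ("O", 1, False)]
bonds = [(0, 1), (1, 2)]

X = [atom_features(*a) for a in atoms]
A = adjacency_matrix(len(atoms), bonds)
print(X[0])   # [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(A)      # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

The feature matrix `X` has one row per atom and the adjacency matrix `A` is symmetric, which together form the usual input pair for a graph convolution.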
As shown in FIG. 3, the training process of the molecular property prediction model specifically includes the following steps:
S301: acquiring a large number of data file samples, and converting each data file sample into a molecular graph sample to form a data sample set;
S302: dividing the data sample set into a training set and a test set;
S303: extracting feature data of all molecular graph samples in the training set, and training a pre-constructed graph neural network on the feature data to obtain the molecular property prediction model;
wherein the feature data comprises node features characterizing the atoms making up the molecular sample, and edge features characterizing the chemical bonds of the molecular sample.
Wherein the format of the data file sample is any one of:
a mol structure file, an sdf structure file, or a table file containing SMILES strings.
Wherein the network structure of the graph neural network comprises convolutional layers, a gather (aggregation) layer, and an output layer, wherein there are at most 3 convolutional layers, and the activation function is relu, sigmoid, or tanh.
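The layer structure above can be sketched as a forward pass in plain Python. This is an illustration under assumptions rather than the disclosed model: the convolution uses the widely used first-order propagation rule with a symmetrically normalized adjacency matrix, the gather layer sums over atoms, the weights are random and untrained, and the class name `TinyGCN` is hypothetical.

```python
import math
import random

ACTIVATIONS = {
    "relu": lambda x: max(0.0, x),
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
}


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, with D the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyGCN:
    """Convolutional layers (at most 3), a sum gather layer, and an output layer."""

    def __init__(self, in_dim, hidden_dims, activation="relu", seed=0):
        if not 1 <= len(hidden_dims) <= 3:
            raise ValueError("use between 1 and 3 convolutional layers")
        rng = random.Random(seed)
        dims = [in_dim] + list(hidden_dims)
        self.conv_weights = [random_matrix(dims[k], dims[k + 1], rng)
                             for k in range(len(hidden_dims))]
        self.out_weights = random_matrix(dims[-1], 1, rng)
        self.act = ACTIVATIONS[activation]

    def forward(self, adj, features, task="regression"):
        prop = normalized_adjacency(adj)
        h = features
        for w in self.conv_weights:                    # convolutional layers
            h = [[self.act(v) for v in row] for row in matmul(matmul(prop, h), w)]
        pooled = [[sum(col) for col in zip(*h)]]       # gather: sum over atoms
        score = matmul(pooled, self.out_weights)[0][0] # output layer
        return ACTIVATIONS["sigmoid"](score) if task == "classification" else score


adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]        # ethanol heavy-atom skeleton
feats = [[1, 0], [1, 0], [0, 1]]               # one-hot element: [C, O]
model = TinyGCN(in_dim=2, hidden_dims=[4, 4], activation="tanh")
prob = model.forward(adj, feats, task="classification")
print(0.0 < prob < 1.0)                        # True
```

Summing over atoms in the gather layer makes the molecule-level output independent of atom ordering, and the constructor rejects more than three convolutional layers in line with the limit stated above.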
Specifically, the functions of building, training, evaluating, and predicting with the graph neural network model are realized mainly through component and workflow technology. The research objects are mainly small drug molecules. The basic principle is that chemical graph theory describes molecules as undirected graphs, in which nodes and edges correspond to atoms and chemical bonds respectively; neural networks operating on such graphs can be regarded as generalized GNNs, which can be used to predict various molecular characteristics depending on the task target.
The input data can be small-molecule structure files such as mol and sdf files, or table files containing SMILES; label data is required when training on a data set. The data can be loaded with the data reading component among the common components of the MaxFlow platform. After input, the data set can be divided into a training set and a test set, used for training and evaluation respectively, with a data set partitioning component that allows the split ratio and random seed to be selected. Whether the input is structure data or table data, it must be graph-feature encoded before it can be fed to the graph neural network, that is, converted into arrays, where X denotes the input feature array, y the label array, w the weight array, and ids the small-molecule identifier array (generally SMILES); this can be done with the graph data preprocessing component. The graph feature encoding can then be connected to a GCN model, for which the MaxFlow platform provides two modes. The first is a "black box" GCN model, which can be used as soon as the data is loaded and has the advantages of simple operation and rapid convergence of the training loss. The second is a freely built GCN model comprising convolutional layers, a gather layer, and an output layer; to avoid the over-smoothing problem of graph neural networks, at most 3 convolutional layers are recommended, and the activation function can be relu, sigmoid, tanh, and so on. For both models the task type, classification or regression, must be selected. A model training component and a model evaluation component can then be connected; the epoch and batch_size parameters can be set in the training component, and evaluation results including accuracy, precision, recall, MSE, and R2 score can be viewed after training is completed.
At this point, the GCN model training and evaluation workflow has been constructed.
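The evaluation scores named above have standard definitions, sketched here in plain Python for binary classification and for regression. The function names and the small label and prediction lists are made up for this example and are not results from the described models.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def regression_metrics(y_true, y_pred):
    """Mean squared error and coefficient of determination (R2)."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return {
        "mse": ss_res / n,
        "r2": 1.0 - ss_res / ss_tot if ss_tot else 0.0,
    }


# Made-up labels and predictions, used only to exercise the functions.
cls = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(cls)   # accuracy, precision, and recall are all 0.75 here
print(reg)   # mse is 0.125 and r2 is about 0.9
```

Classification tasks would report the first group of scores and regression tasks the second, matching the task type selected for the model.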
For a trained model, extraction and integration can be performed. For example, the invention downloads part of the ADMET data sets from the TDC database, 29 data sets in total, including Absorption: pgp-inhibitor, pgp-substrate; Distribution: BBB_peer; Metabolism: CYP1A2 inhibitor, CYP1A2 substrate, CYP2C9 inhibitor, CYP2C9 substrate, CYP3A4 inhibitor, CYP3A4 substrate; Physical property: logP, logS; Toxicity: Eye_Corroson, FDAMDD, NR-AhR, NR-AR-LBD, NR-AR, NR-Aromatase, NR-ER-LBD, NR-ER, NR-PPAR-gamma, skin_sensitivity, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, SR-p53, biocontrol_Factor, LC50DM, LC50FM, and the like.
After the components are assembled into a workflow, the 29 data sets are trained one by one. Once the evaluation scores reach a sufficiently high level, the 29 models are integrated and encapsulated into an ADMET (GNN) component, which can directly predict molecules with unknown properties. Multiple models can be selected at the same time, and the prediction result shows the probability or regression value for each selected property together with an atom contribution map of each predicted molecule for each property, in which red represents atoms with an inhibiting effect and green represents atoms with a contributing effect. The atom-level contributions thus give the molecular property prediction a degree of interpretability.
The workflow layout process on the platform of the method provided by the invention is briefly described below by taking a specific use scenario as an example.
Taking a MaxFlow platform as an example, a workflow meeting the requirements can be built in a dragging assembly mode, and the GNN assembly is formed by splitting a plurality of functional modules for deep learning. The following may present embodiments in two respects: 1. a training workflow of small molecule property datasets; 2. a predictive workflow of a small molecule property model.
In the training workflow of the small molecule property data set, the input of the GCN model can adopt a black box GCN model and a user-defined GCN model. The GCN model of the black box only has one component of GraphConvModel, and has the advantages of simple operation and rapid convergence of the training loss function; the self-defined GCN model comprises three components, namely GClayer, GCGather and GCOutput, and has the advantages that more tuning modes are provided, in order to prevent the problem of excessive smoothness of a graph neural network, the convolutional layer recommends at most 3 layers, and the activation function can select relu, sigmoid, tanh and the like; common to both GCN models is the need to input the task type (classification or regression). In the deep learning model training process, a data set is required to be input besides a model required to be built, two branches are connected with a model training component, the upper end is the model, and the lower end is data. The input of the small molecules is generally a mol and sdf structure file or a table file containing smiles, the data set of the invention takes the table file as an example, the 'read data file' component can load an original file of the data set, the 'target variable y' component can acquire a label name, and the 'acquire structure data name' component can convert the smiles into the mol structure file and display the mol structure file to a page. 
The assembly for dividing the training set and the test set can divide the data set in proportion, the training set can be converted into node and edge characteristic data after entering the assembly for preprocessing the graph data, and then enters the assembly for training the model together with the built model, and the epoch and the batch _ size can be selected for debugging; and the test set enters a model evaluation component, is converted into node and edge characteristic data, is finished in the evaluation component, outputs a model after the model training component finishes training, enters the model evaluation component together with the test set for evaluation, and finally gives evaluation scores including accuracy, precision, recall rate, MSE, R2 score and the like. When a better evaluation score is obtained, the model can be saved for later molecular property prediction.
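The proportional split with a selectable ratio and random seed can be sketched as follows, carrying the X, y, w, and ids arrays together so they stay aligned. The `Dataset` class, the `train_test_split` function, and the placeholder features and labels are assumptions for illustration; the SMILES strings serve only as identifiers.

```python
import random
from dataclasses import dataclass


@dataclass
class Dataset:
    """Parallel arrays: features X, labels y, sample weights w, molecule ids."""
    X: list
    y: list
    w: list
    ids: list

    def __len__(self):
        return len(self.ids)

    def subset(self, indices):
        """Return a new Dataset containing only the rows at `indices`."""
        return Dataset([self.X[i] for i in indices], [self.y[i] for i in indices],
                       [self.w[i] for i in indices], [self.ids[i] for i in indices])


def train_test_split(dataset, train_fraction=0.8, seed=0):
    """Shuffle row indices with a fixed seed, then cut at the requested ratio."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * len(dataset))
    return dataset.subset(indices[:cut]), dataset.subset(indices[cut:])


smiles = ["CCO", "CC", "C", "CCC", "CO", "CCN", "c1ccccc1", "CC(=O)O", "O", "N"]
data = Dataset(
    X=[[len(s)] for s in smiles],        # placeholder features, not real descriptors
    y=[i % 2 for i in range(len(smiles))],   # placeholder labels
    w=[1.0] * len(smiles),
    ids=smiles,
)

train, test = train_test_split(data, train_fraction=0.8, seed=42)
print(len(train), len(test))             # 8 2
```

Fixing the seed makes the split reproducible, so a model that is retrained later is evaluated on the same held-out molecules.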
In the prediction workflow of the small molecule property model, the 29 ADMET data sets in the TDC database are each used to train a graph convolutional neural network, and the 29 models with the better evaluation scores are extracted and integrated to form an 'ADMET (GNN version)' component for predicting the ADMET properties of unknown molecules. The ADMET prediction workflow needs far fewer components than the training workflow: one 'read data file' component and one 'ADMET (GNN version)' component are enough to complete the prediction. If the molecular structure needs to be viewed, an 'obtain structure data name' component is connected between them. The 'ADMET (GNN version)' component can predict several selected properties simultaneously. In addition, the GCN can calculate the contribution value of each atom of a molecule to a specific property, and the corresponding atom contribution graph can be drawn with RDKit.
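The idea of one integrated component that predicts several selected properties at once can be sketched as a registry of per-property models. Everything named here is hypothetical: `predict_properties`, the property keys and the toy models are placeholders for illustration and do not correspond to the actual 29 trained models.

```python
def predict_properties(graph, models, selected):
    """Run only the selected per-property models on one molecular graph.

    models   -- dict mapping a property name to a callable model
    selected -- property names chosen by the user
    """
    unknown = [name for name in selected if name not in models]
    if unknown:
        raise KeyError(f"no trained model for: {unknown}")
    return {name: models[name](graph) for name in selected}

# Toy stand-ins for trained models; real models would be loaded from disk.
toy_models = {
    "toy_atom_count": lambda g: float(len(g["nodes"])),
    "toy_bond_count": lambda g: float(len(g["edges"])),
}
```

A call such as `predict_properties(graph, toy_models, ["toy_atom_count", "toy_bond_count"])` returns one value per selected property in a single pass over the input.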
In addition to the above method, the present invention provides a molecular property prediction apparatus based on a graph neural network, as shown in fig. 4, the apparatus comprising:
a data obtaining unit 401, configured to obtain a data file of a molecule to be predicted, and convert the data file into graph data, where the graph data includes a plurality of nodes and a plurality of edges, the nodes represent atoms constituting the molecule to be predicted, and the edges represent chemical bonds of the molecule to be predicted;
a result output unit 402, configured to input the graph data into a pre-trained molecular property prediction model to obtain a molecular characteristic of the molecule to be predicted;
the molecular property prediction model is obtained by training on molecular graph samples, each molecular graph sample is an undirected graph formed by converting a data file sample, nodes in the undirected graph represent atoms forming the molecular sample, and edges in the undirected graph represent chemical bonds of the molecular sample.
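The graph data handled by the units above can be sketched as a small undirected graph structure, with atoms as nodes and chemical bonds as edges. The function name `build_molecular_graph` and the dictionary layout are assumptions chosen for illustration; the text does not prescribe a concrete data structure.

```python
def build_molecular_graph(atoms, bonds):
    """Build an undirected graph from atoms and bonds.

    atoms -- list of element symbols, one per node
    bonds -- list of (i, j, order) tuples with 0-based atom indices
    """
    neighbours = {i: set() for i in range(len(atoms))}
    edges = {}
    for i, j, order in bonds:
        if i == j or not (0 <= i < len(atoms) and 0 <= j < len(atoms)):
            raise ValueError(f"invalid bond ({i}, {j})")
        # Undirected: store each edge once under a sorted key.
        edges[(min(i, j), max(i, j))] = order
        neighbours[i].add(j)
        neighbours[j].add(i)
    return {
        "nodes": list(atoms),
        "edges": edges,
        "neighbours": {i: sorted(n) for i, n in neighbours.items()},
    }
```

Storing each edge under a sorted index pair means a bond listed as (2, 1) and one listed as (1, 2) refer to the same edge, which is what makes the graph undirected.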
In some embodiments, the method further comprises:
obtaining the structure data name of a molecule to be predicted;
and inputting the structural data name into a pre-trained molecular property prediction model to obtain the molecular structure of the molecule to be predicted.
In some embodiments, obtaining the molecular characteristics of the molecule to be predicted further comprises:
in response to an input operation, determining a target atom in the molecule to be predicted and a target characteristic in the molecular characteristics;
calculating a contribution value of the target atom to the target characteristic;
outputting the contribution value graphically.
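The text does not state how the contribution value is computed. One common approach, shown here purely as an assumed illustration, is occlusion: predict the property for the whole molecule, predict again with the target atom removed, and take the difference. The model `toy_predict` and its weights are invented placeholders, and the text bar stands in for a drawn contribution graph.

```python
def remove_atom(atoms, bonds, target):
    """Delete one atom and its bonds, re-indexing the remaining atoms."""
    kept = [i for i in range(len(atoms)) if i != target]
    remap = {old: new for new, old in enumerate(kept)}
    new_atoms = [atoms[i] for i in kept]
    new_bonds = [(remap[i], remap[j]) for i, j in bonds if target not in (i, j)]
    return new_atoms, new_bonds

def atom_contribution(predict, atoms, bonds, target):
    """Occlusion-style contribution: full prediction minus masked prediction."""
    return predict(atoms, bonds) - predict(*remove_atom(atoms, bonds, target))

def toy_predict(atoms, bonds):
    """Placeholder model with made-up weights, not a trained GCN."""
    weights = {"C": 0.5, "O": -1.0, "N": -0.75}
    return sum(weights.get(a, 0.0) for a in atoms) + 0.25 * len(bonds)

def text_bar(label, value, scale=4):
    """Render one contribution as a signed text bar."""
    sign = "-" if value < 0 else "+"
    return f"{label:>4} {sign}{'#' * round(abs(value) * scale)} {value:+.2f}"
```

With a real trained model in place of `toy_predict`, the per-atom values could be handed to a drawing toolkit to colour the atoms instead of printing text bars.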
In some embodiments, the training process of the molecular property prediction model specifically includes:
acquiring a large number of data file samples, and respectively converting the data file samples into the molecular graph samples to form a data sample set;
dividing the data sample set into a training set and a testing set;
extracting feature data of all molecular graph samples in the training set, and training a pre-constructed graph neural network based on the feature data to obtain the molecular property prediction model;
wherein the feature data comprises node features characterizing atoms comprising the molecular sample, and edge features characterizing chemical bonds of the molecular sample.
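The node and edge features mentioned above can be illustrated with a simple encoding. The specific choices here are assumptions for the sake of example: a one-hot element vector over a small hypothetical vocabulary plus the atom degree for nodes, and a one-hot bond order for edges. The actual feature set used for training is not listed in the text.

```python
ELEMENTS = ["C", "N", "O", "S"]   # hypothetical vocabulary
BOND_ORDERS = [1, 2, 3]           # single, double, triple

def one_hot(value, vocabulary):
    """One slot per known value plus a final 'other' slot."""
    encoded = [1.0 if value == v else 0.0 for v in vocabulary]
    encoded.append(0.0 if value in vocabulary else 1.0)
    return encoded

def node_features(atoms, bonds):
    """Per-atom vector: one-hot element followed by the atom's degree."""
    degree = [0] * len(atoms)
    for i, j, _order in bonds:
        degree[i] += 1
        degree[j] += 1
    return [one_hot(a, ELEMENTS) + [float(d)] for a, d in zip(atoms, degree)]

def edge_features(bonds):
    """Per-bond vector keyed by a sorted (undirected) index pair."""
    return {(min(i, j), max(i, j)): one_hot(order, BOND_ORDERS)
            for i, j, order in bonds}
```

The trailing "other" slot lets elements or bond types outside the vocabulary be encoded without failing, at the cost of treating them all alike.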
In some embodiments, the format of the data file sample is any one of:
a mol structure file, an sdf structure file, or a table file containing SMILES.
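As a rough illustration of reading the structure file formats listed above, the following sketch extracts atoms and bonds from a V2000 mol block using its fixed-column layout and splits an sdf file into records at the `$$$$` delimiter. The function names are hypothetical, V3000 files and all optional fields (charges, stereo flags, data items) are ignored, and a production workflow would more likely rely on a cheminformatics toolkit.

```python
def parse_mol_block(text):
    """Return (atoms, bonds) from a V2000 mol block.

    atoms -- list of element symbols
    bonds -- list of (i, j, bond_type) with 0-based atom indices
    """
    lines = text.splitlines()
    counts = lines[3]                      # line 4 is the counts line
    if "V2000" not in counts:
        raise ValueError("only V2000 mol blocks are handled in this sketch")
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    # Atom lines: x, y, z in columns 0-29, element symbol in columns 31-33.
    atoms = [lines[4 + k][31:34].strip() for k in range(n_atoms)]
    bonds = []
    for k in range(n_bonds):
        row = lines[4 + n_atoms + k]
        # Bond lines: first atom, second atom, bond type (3 columns each).
        bonds.append((int(row[0:3]) - 1, int(row[3:6]) - 1, int(row[6:9])))
    return atoms, bonds

def split_sdf(text):
    """Split an sdf file into records at the '$$$$' delimiter lines."""
    records = text.replace("\r\n", "\n").split("$$$$\n")
    return [record for record in records if record.strip()]
```

Each record returned by `split_sdf` starts with its own three header lines, so it can be passed directly to `parse_mol_block`.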
In some embodiments, the network structure of the graph neural network comprises a convolutional layer, a gather layer and an output layer, wherein there are at most 3 convolutional layers, and the activation function is relu, sigmoid or tanh.
The molecular property prediction method and device based on the graph neural network have the following technical effects:
1. Predicting the absorption, distribution, metabolism, excretion and toxicity (ADMET) properties of a drug is important for its effectiveness and safety, and ADMET property prediction in the early stage of drug discovery can greatly reduce the probability of drug development failure. Traditional ADMET property prediction is based on molecular fingerprint vectorization, whereas in recent years graph convolution methods have shown great advantages in processing molecular graphs. The invention systematically compares a deep learning method based on graph convolution with a traditional random forest (RF) method based on molecular fingerprints on 29 ADMET data sets. The results show that the prediction accuracy of the graph convolutional neural network is significantly improved. To further determine the generalization ability of the model, we obtained external test set data from the academic literature, including membrane permeability and logD data for macrocyclic compounds; this data set was chosen because its authors performed extensive, consistent measurements on high molecular weight compounds. For membrane permeability, the random forest R2 score was 0.15 and the GCN R2 score was 0.381; for logD, the random forest R2 score was 0.394 and the GCN R2 score was 0.603. The results show that the prediction model using graph convolution for deep learning substantially outperforms the RF prediction model based on molecular fingerprints.
2. Besides the advantages of the GCN model in predicting small molecule properties, the method supports free construction of the workflow and free selection of parameters, and replaces code with components, which greatly simplifies the complex processes of deep learning model construction, training, evaluation and prediction and lowers the learning threshold for the user. The method and the device are tailored to the characteristics of molecular property prediction, and molecular properties can be output rapidly and accurately by using a pre-trained molecular property prediction model.
3. With the method provided by the invention, deep learning in the field of drug discovery can predict chemical properties and activity on a large scale in a relatively short time, automating and accelerating the drug discovery process. Compared with conventional methods, the graph convolutional network provides more accurate predictions by taking the intrinsic molecular structure into account. Furthermore, when combined with other mechanisms, the graph convolutional network produces biologically interpretable results.
4. With the method provided by the invention, prediction of the properties of small drug molecules, covering both regression and classification tasks, is realized by building a GCN workflow and adjusting the corresponding parameters. The workflow includes data set loading, graph data preprocessing, data set division, GCN model building, model training, model evaluation, model extraction and prediction, and the like. An atom contribution graph of the corresponding molecule for the property is also given, which improves model interpretability.
5. With the method provided by the invention, the GCN models for ADMET property prediction are integrated, so that rapid prediction of the various properties covered by the ADMET data sets, such as absorption, distribution, metabolism, toxicity and physicochemical properties, can be realized; meanwhile, data sets with poor results can be retrained to improve model accuracy.
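The R2 scores quoted in effect 1 are not defined in the text. Assuming they are the usual coefficient of determination, with $y_i$ the measured value, $\hat{y}_i$ the predicted value and $\bar{y}$ the mean of the measured values over the $n$ compounds of the external test set, the score would be:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
               {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2},
\qquad
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
```

Under this definition a score of 0 corresponds to always predicting the mean and a score of 1 to a perfect prediction, so a higher value indicates that the model explains more of the variance in the measured data.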
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a model prediction. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The model prediction of the computer device is used to store static information and dynamic information data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to carry out the steps in the above-described method embodiments.
It will be appreciated by those skilled in the art that the configuration shown in fig. 5 is a block diagram of only the portion of the configuration relevant to the inventive arrangements and does not limit the computing devices to which the inventive arrangements may be applied; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
Corresponding to the above embodiments, embodiments of the present invention also provide a computer storage medium containing one or more program instructions, wherein the one or more program instructions are used by a molecular property prediction system to perform the method described above.
The invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, a computer is able to perform the above method.
In an embodiment of the invention, the processor may be an integrated circuit chip having signal processing capability. The Processor may be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The processor reads the information in the storage medium and completes the steps of the method in combination with its hardware.
The storage medium may be a memory, for example, which may be volatile memory or nonvolatile memory, or which may include both volatile and nonvolatile memory.
The nonvolatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory.
The volatile Memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
The storage media described in connection with the embodiments of the invention are intended to comprise, without being limited to, these and any other suitable types of memory.
Those skilled in the art will appreciate that, in one or more of the examples described above, the functionality described in the present invention may be implemented in a combination of hardware and software. When implemented in software, the corresponding functionality may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
The above embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above embodiments are only examples of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for molecular property prediction based on graph neural networks, the method comprising:
acquiring a data file of a molecule to be predicted, and converting the data file into graph data, wherein the graph data comprises a plurality of nodes and a plurality of edges, the nodes represent atoms forming the molecule to be predicted, and the edges represent chemical bonds of the molecule to be predicted;
inputting the graph data into a pre-trained molecular property prediction model to obtain the molecular characteristics of the molecules to be predicted;
the molecular property prediction model is obtained by training on molecular graph samples, each molecular graph sample is an undirected graph formed by converting a data file sample, nodes in the undirected graph represent atoms forming the molecular sample, and edges in the undirected graph represent chemical bonds of the molecular sample.
2. The molecular property prediction method of claim 1, further comprising:
obtaining the structure data name of a molecule to be predicted;
and inputting the structural data name into a pre-trained molecular property prediction model to obtain the molecular structure of the molecule to be predicted.
3. The method of claim 1, wherein obtaining the molecular property of the molecule to be predicted further comprises:
determining a target atom in the molecule to be predicted and a target characteristic in the molecular characteristics in response to an input operation;
calculating a contribution value of the target atom to the target characteristic;
outputting the contribution value graphically.
4. The molecular property prediction method of any one of claims 1-3, wherein the training process of the molecular property prediction model specifically comprises:
acquiring a large number of data file samples, and respectively converting the data file samples into the molecular graph samples to form a data sample set;
dividing the data sample set into a training set and a testing set;
extracting feature data of all molecular graph samples in the training set, and training a graph neural network which is constructed in advance based on the feature data to obtain a molecular property prediction model;
wherein the feature data comprises node features characterizing atoms comprising the molecular sample, and edge features characterizing chemical bonds of the molecular sample.
5. The molecular property prediction method of claim 4, wherein the format of the data file sample is any one of:
a mol structure file, an sdf structure file, or a table file containing SMILES.
6. The molecular property prediction method of claim 4, wherein the network structure of the graph neural network comprises a convolutional layer, a gather layer and an output layer, wherein there are at most 3 convolutional layers, and the activation function is relu, sigmoid or tanh.
7. A molecular property prediction apparatus based on a graph neural network, the apparatus comprising:
the data acquisition unit is used for acquiring a data file of the molecule to be predicted and converting the data file into graph data, wherein the graph data comprises a plurality of nodes and a plurality of edges, the nodes represent atoms forming the molecule to be predicted, and the edges represent chemical bonds of the molecule to be predicted;
the result output unit is used for inputting the graph data into a pre-trained molecular property prediction model so as to obtain the molecular characteristics of the molecules to be predicted;
the molecular property prediction model is obtained by training on molecular graph samples, each molecular graph sample is an undirected graph formed by converting a data file sample, nodes in the undirected graph represent atoms forming the molecular sample, and edges in the undirected graph represent chemical bonds of the molecular sample.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 6 when executing the program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method according to any one of claims 1 to 6 when executed by a processor.
CN202210944763.6A 2022-08-08 2022-08-08 Molecular property prediction method and system based on graph neural network Pending CN115274008A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210944763.6A CN115274008A (en) 2022-08-08 2022-08-08 Molecular property prediction method and system based on graph neural network

Publications (1)

Publication Number Publication Date
CN115274008A true CN115274008A (en) 2022-11-01

Family

ID=83749907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210944763.6A Pending CN115274008A (en) 2022-08-08 2022-08-08 Molecular property prediction method and system based on graph neural network

Country Status (1)

Country Link
CN (1) CN115274008A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115497576A (en) * 2022-11-17 2022-12-20 苏州创腾软件有限公司 Polymer property prediction method and system based on graph neural network
CN116312862A (en) * 2023-05-19 2023-06-23 苏州创腾软件有限公司 Molecular property prediction method and device based on converter framework
CN117935971A (en) * 2024-03-22 2024-04-26 中国石油大学(华东) Deep drilling fluid treatment agent performance prediction evaluation method based on graphic neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020016579A2 (en) * 2018-07-17 2020-01-23 Gtn Ltd Machine learning based methods of analysing drug-like molecules
CN110348573A (en) * 2019-07-16 2019-10-18 腾讯科技(深圳)有限公司 The method of training figure neural network, figure neural network unit, medium
CN113707236A (en) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 Method, device and equipment for predicting properties of small drug molecules based on graph neural network
CN113707235A (en) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 Method, device and equipment for predicting properties of small drug molecules based on self-supervision learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
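宫宇翀: "Research on miRNA-drug resistance association prediction based on a double-layer graph neural network", China Master's Theses Full-text Database, pages 054 - 38 *
胡继敏: "Molecular property prediction and uncertainty analysis based on graph convolutional neural networks", China Master's Theses Full-text Database, pages 017 - 49 *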

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination