CN114386694B - Drug molecular property prediction method, device and equipment based on contrast learning - Google Patents

Publication number
CN114386694B
Authority
CN
China
Prior art keywords
neural network
network model
feature vector
target
molecular
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210026795.8A
Other languages
Chinese (zh)
Other versions
CN114386694A (en)
Inventor
王俊
叶贤斌
高鹏
谢国彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210026795.8A priority Critical patent/CN114386694B/en
Publication of CN114386694A publication Critical patent/CN114386694A/en
Priority to PCT/CN2022/089691 priority patent/WO2023134063A1/en
Application granted granted Critical
Publication of CN114386694B publication Critical patent/CN114386694B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medicinal Chemistry (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Pharmacology & Pharmacy (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)

Abstract

The application discloses a drug molecular property prediction method, device and equipment based on contrast learning. It relates to the technical field of artificial intelligence and addresses the low efficiency and poor prediction performance of existing drug molecular property prediction. The method comprises: generating a target molecular graph structure of a target drug molecule according to its chemical molecular structure, and generating a target three-dimensional conformation of the target drug molecule; determining a first feature vector corresponding to the target molecular graph structure by using a trained graph neural network model; determining a second feature vector corresponding to the target three-dimensional conformation by using a trained convolutional neural network model, wherein the graph neural network model and the convolutional neural network model are jointly trained through contrast learning on positive sample pairs and negative sample pairs; and constructing a third feature vector from the first feature vector and the second feature vector, and inputting the third feature vector into a trained property prediction model to obtain a property prediction result for the target drug molecule.

Description

Drug molecular property prediction method, device and equipment based on contrast learning
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a drug molecular property prediction method, device and equipment based on contrast learning.
Background
Drug research and development takes a long time, requires large investment and carries very high risk. To uncover the rules underlying drug molecules and speed up drug development, researchers in the field have tried since the beginning of this century to introduce machine learning methods into medicinal chemistry research. An accurate and efficient molecular property prediction model can greatly reduce dependence on experiments, reduce cost and accelerate progress.
At present, the properties of drug molecules can be predicted with methods based on molecular fingerprints and molecular descriptors. However, these methods require a great deal of expertise to design well, and they lack generality and extensibility. Selecting molecular descriptors is a tedious and time-consuming process, and the selected descriptors impose a strong prior on the model, which biases the model and degrades its prediction performance.
Disclosure of Invention
In view of the above, the application provides a method, a device and equipment for predicting the properties of drug molecules based on contrast learning, which can be used for solving the technical problems of low efficiency and poor prediction performance of the existing prediction of the properties of drug molecules.
According to one aspect of the present application, there is provided a method of predicting properties of a drug molecule based on contrast learning, the method comprising:
generating a target molecular graph structure of a target drug molecule according to the chemical molecular structure, and generating a target three-dimensional conformation of the target drug molecule;
determining a first feature vector corresponding to the target molecular graph structure by utilizing a pre-trained graph neural network model;
determining a second feature vector corresponding to the target three-dimensional conformation by utilizing a pre-trained convolutional neural network model, wherein the graph neural network model and the convolutional neural network model are jointly trained through contrast learning on positive sample pairs and negative sample pairs;
and constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.
According to another aspect of the present application, there is provided a drug molecule property prediction device based on contrast learning, the device comprising:
the first generation module is used for generating a target molecular graph structure of a target drug molecule according to the chemical molecular structure and generating a target three-dimensional conformation of the target drug molecule;
the first determining module is used for determining a first feature vector corresponding to the target molecular graph structure by utilizing a pre-trained graph neural network model;
the second determining module is used for determining a second feature vector corresponding to the target three-dimensional conformation by utilizing a pre-trained convolutional neural network model, wherein the graph neural network model and the convolutional neural network model are jointly trained through contrast learning on positive sample pairs and negative sample pairs;
and the input module is used for constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.
According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which when executed by a processor implements the above-described contrast learning-based drug molecule property prediction method.
According to yet another aspect of the present application, there is provided a computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, the processor implementing the above-mentioned contrast learning based drug molecule property prediction method when executing the program.
With the above technical scheme, and in contrast to the traditional approach of predicting drug molecular properties from molecular fingerprints and molecular descriptors, the drug molecular property prediction method, device and equipment based on contrast learning first construct positive and negative sample pairs, use them to jointly train a graph neural network model and a convolutional neural network model through dual-view (2D and 3D) contrast learning, and then apply the pre-trained graph neural network model and convolutional neural network model to drug molecular property prediction. Specifically, when predicting the properties of a drug molecule, a target molecular graph structure of the target drug molecule is first generated according to its chemical molecular structure, together with a target three-dimensional conformation of the target drug molecule; a first feature vector corresponding to the target molecular graph structure is then determined by using the pre-trained graph neural network model, and a second feature vector corresponding to the target three-dimensional conformation is determined by using the pre-trained convolutional neural network model; finally, a third feature vector is constructed from the first feature vector and the second feature vector and input into a pre-trained property prediction model to obtain a property prediction result for the target drug molecule. The technical scheme thus provides a pre-training strategy that jointly trains on 2D molecular graph structure data and 3D conformation data, so that key 2D and 3D structural information can be learned while computation remains efficient.
Because positive and negative sample pairs are constructed for pre-training, information about both the planar structure and the three-dimensional structure of compounds can be learned from large-scale unlabeled data, and a model obtained in this way usually generalizes better. When a specific downstream task needs to be solved, the pre-trained model can be fine-tuned directly. This avoids the insufficient generalization that results from training a deep learning model in scenarios where labeled drug molecules are scarce, improves the efficiency of drug molecular property prediction, and helps ensure the accuracy of the predicted properties.
The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present application are easier to understand, a detailed description of the present application is given below.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the present application. In the drawings:
FIG. 1 is a schematic flow chart of a method for predicting properties of drug molecules based on contrast learning according to an embodiment of the present application;
FIG. 2 is a flow chart of another method for predicting properties of drug molecules based on contrast learning according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a device for predicting properties of drug molecules based on contrast learning according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of another drug molecule property prediction device based on contrast learning according to an embodiment of the present application.
Detailed Description
The embodiments of the application can predict drug molecular properties based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly cover computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
In recent years, graph neural networks, an emerging technology in deep learning, have shown excellent performance on graph data. Graph neural networks based on supervised learning have achieved tremendous success over the past few years, but they rely on a large amount of manually labeled graph data for optimization in order to learn strong representations. Large-scale labeled graph data, particularly labeled data in the field of medicinal chemistry, is often difficult to obtain, and labeling such data often requires expert knowledge of biochemistry. In most cases it is difficult to acquire a large amount of labeled data, so a graph neural network based on supervised learning can hardly exploit its strong learning ability. How to pre-train on large-scale unlabeled molecular data so that the graph network learns latent features and information is therefore both a focus and a difficulty of current research.
Similar to the pre-training tasks of the BERT language model (Bidirectional Encoder Representations from Transformers), many researchers have proposed pre-training strategies based on molecular graph data, in which self-supervised pre-training is first performed at the node level of the graph, followed by multi-task pre-training at the global level of the graph. After the graph neural network (GNN) model has been pre-trained on a large amount of unlabeled graph data, the pre-trained GNN model is fine-tuned on a downstream task. Specifically, a linear classifier is added on top of the graph-level representation to predict the downstream graph labels. End-to-end fine-tuning is then performed on the entire model, that is, on the pre-trained GNN and the downstream linear classifier.
The pre-training strategy described above operates on molecular graphs, that is, the 2D planar structure of a molecule is treated as graph data with atoms as nodes and chemical bonds as edges. A large number of unlabeled molecules are each converted into graph data and fed into the GNN model for pre-training. However, molecular graph data based on 2D planar structures ignores the 3D structural information of chemical molecules, that is, their stereochemical information. Because such pre-training strategies lack this stereochemical information, the resulting GNN models do not capture general information about the chemicals well.
In view of these drawbacks, the present application proposes a pre-training strategy based on dual-view contrast learning driven by both molecular 3D conformation data and molecular 2D graph data. Specifically, for each piece of unlabeled molecular data, three-dimensional conformation (3D pixel) data of the optimal conformation and graph data are constructed. Features are extracted from the three-dimensional conformation (3D pixel) data by a simple convolutional neural network (CNN) model and from the graph data by a graph neural network (GNN) model, and the models are trained based on the idea of contrast learning. Once pre-training is complete, the resulting CNN and GNN models are used for specific downstream tasks: the features of the 3D conformation and of the 2D graph data are fused, and a property prediction model then predicts the properties of the drug molecule from the fused features.
The present application will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments and features of the embodiments in the present application may be combined with each other.
Aiming at the technical problems of low efficiency and poor prediction performance of the existing prediction of the molecular property of the drug, the application provides a drug molecular property prediction method based on contrast learning, as shown in figure 1, which comprises the following steps:
101. Generating a target molecular graph structure of the target drug molecule according to the chemical molecular structure, and generating a target three-dimensional conformation of the target drug molecule.
Drugs can generally be classified into small-molecule chemical drugs and macromolecular biological drugs. Small-molecule drugs are chemically synthesized small molecules of active substances; how such a drug affects a receptor depends on its affinity for the receptor and its efficacy, and these properties are determined by chemical structure. In the application, the target drug molecules are such chemically synthesized small molecules of active substances, with a relative molecular mass between 200 and 700, and unknown drug properties are predicted through intelligent analysis of the chemical structure of the small drug molecule.
In a specific application scenario, before the steps of this embodiment are performed, the chemical molecular structure of the target drug molecule may be extracted in advance, so that a target molecular graph structure and a target three-dimensional conformation of the target drug molecule can be generated from it. In this embodiment, each atom in a drug molecule can be represented as a node in the molecular graph structure, with the forces between atoms represented by edges between nodes. Nodes can carry different information to express different atomic symbols, and edges can carry different information to express different types of force, so that the chemical molecular structure is expressed in a computer as a molecular graph structure. The target three-dimensional conformation may be generated using the conventional distance geometry method, as follows: the distance bounds matrix of the molecule is generated from the connection table information in the chemical molecular structure of the target drug molecule; the bounds matrix is smoothed using a triangle-bounds smoothing algorithm; a random distance matrix that satisfies the bounds matrix is generated; the generated distance matrix is embedded in three-dimensional space and coordinates are calculated for each atom; and the calculated coordinates are roughly optimized using a force field and the bounds matrix to obtain the target three-dimensional conformation of the target drug molecule.
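The first two ideas in the paragraph above, a molecular graph with labeled nodes and edges and the triangle smoothing of a bounds matrix, can be sketched in plain Python. This is only an illustration under stated assumptions: the ethanol connection table, the function names and the bound values (1.6 and 100.0) are hypothetical, only upper bounds are smoothed (lower bounds are omitted), and the later distance geometry steps (random distance matrix, embedding, force-field cleanup) are not shown.

```python
from itertools import product

# Hypothetical hand-written connection table for ethanol (heavy atoms only).
atoms = ["C", "C", "O"]                       # node labels (atomic symbols)
bonds = [(0, 1, "single"), (1, 2, "single")]  # edges with a bond-type label


def build_graph(atoms, bonds):
    """Return an adjacency list {node: [(neighbor, bond_type), ...]}."""
    graph = {i: [] for i in range(len(atoms))}
    for a, b, kind in bonds:
        graph[a].append((b, kind))
        graph[b].append((a, kind))
    return graph


def initial_upper_bounds(n, bonds, bonded=1.6, unknown=100.0):
    """Upper distance bounds; the numeric values are illustrative only."""
    upper = [[0.0 if i == j else unknown for j in range(n)] for i in range(n)]
    for a, b, _ in bonds:
        upper[a][b] = upper[b][a] = bonded
    return upper


def smooth_upper_bounds(upper):
    """Tighten bounds until upper[i][j] <= upper[i][k] + upper[k][j]."""
    n = len(upper)
    # k is the slowest-varying index, as in the Floyd-Warshall algorithm.
    for k, i, j in product(range(n), repeat=3):
        if upper[i][k] + upper[k][j] < upper[i][j]:
            upper[i][j] = upper[i][k] + upper[k][j]
    return upper


graph = build_graph(atoms, bonds)
upper = smooth_upper_bounds(initial_upper_bounds(len(atoms), bonds))
```

After smoothing, the bound between the two terminal atoms is no longer the placeholder value but the sum of the two bonded bounds along the path between them.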
The method may be executed by a device for predicting drug molecular properties, which can be configured on the client side or the server side. The graph neural network model and the convolutional neural network model are jointly trained in advance through contrast learning on positive sample pairs and negative sample pairs. After the target molecular graph structure and the target three-dimensional conformation of the target drug molecule are determined, a first feature vector corresponding to the target molecular graph structure is determined by using the pre-trained graph neural network model; a second feature vector corresponding to the target three-dimensional conformation is determined by using the pre-trained convolutional neural network model; and finally a third feature vector is constructed from the first feature vector and the second feature vector and input into a pre-trained property prediction model to obtain a property prediction result for the target drug molecule.
102. Determining a first feature vector corresponding to the target molecular graph structure by using the pre-trained graph neural network model.
This embodiment can use a graph neural network to extract the first feature vector of the target drug molecule. The input to a graph neural network is typically a graph structure with node or edge attributes as described above, that is, the adjacency matrix A of the graph and the corresponding attribute information X. The final output generally depends on the specific task: node classification outputs node labels, graph classification outputs graph labels, and link prediction outputs whether a link exists. Taking a chemical molecular graph as an example, in each iteration the GNN uses the adjacency matrix of the molecular graph, the attributes of each node (atom) and the information of the edges (chemical bonds) between nodes to update each node's information by aggregating the previous-layer features of its neighboring nodes and of the node itself; a nonlinear transformation is typically also applied to the aggregated information. By stacking multiple layers, each node can obtain information from neighbor nodes within the corresponding number of hops. For chemical molecular graphs, the hidden vectors of individual nodes alone cannot represent a molecule well, so the information of the whole molecule should be represented from the topology of the graph. A vector representation of the whole graph can finally be obtained by a common readout such as average pooling, that is, hidden variables rich in structural information are used to represent the graph as a whole.
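The aggregation, nonlinear transformation and average-pooling readout described above can be illustrated with a small plain-Python sketch. It is not the model of the application: mean aggregation, the ReLU nonlinearity, the identity weight matrix and the two-dimensional one-hot atom features are all assumptions chosen to keep the example short, and edge attributes are ignored.

```python
def gnn_layer(adj, feats, weight):
    """One message-passing step: average each node's own and neighbor
    features, apply a linear map, then a ReLU nonlinearity."""
    n, dim, out_dim = len(adj), len(feats[0]), len(weight[0])
    updated = []
    for i in range(n):
        group = [i] + [j for j in range(n) if adj[i][j]]
        mean = [sum(feats[j][d] for j in group) / len(group)
                for d in range(dim)]
        updated.append([max(0.0, sum(mean[d] * weight[d][k]
                                     for d in range(dim)))
                        for k in range(out_dim)])
    return updated


def mean_pool(feats):
    """Graph-level readout: average the node vectors."""
    n = len(feats)
    return [sum(row[d] for row in feats) / n for d in range(len(feats[0]))]


# Hypothetical 3-atom chain C-C-O with one-hot atom features [is_C, is_O].
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # identity weights, purely for illustration

one_hop = gnn_layer(A, X, W)         # each node sees its 1-hop neighbors
two_hop = gnn_layer(A, one_hop, W)   # stacking layers widens the view
first_vector = mean_pool(one_hop)    # graph-level representation
```

After one layer the first carbon still has a zero oxygen component, while after two layers it does not, which shows how stacking layers lets a node receive information from atoms two hops away.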
Before the graph neural network is applied, it needs to be pre-trained for the task scenario. In the past, training graph neural networks for medicinal chemistry tasks depended on a specific task and a large amount of corresponding labeled data. Drug molecule discovery based on supervised learning has been very successful over the past few years, and many studies indicate that graph neural networks can process medicinal chemistry molecular data well and extract useful representations from it. However, large-scale labeled data, especially in the field of medicinal chemistry, is often difficult to obtain, and labeling it requires expert knowledge of biochemistry. The fields of natural language processing and computer vision face the same problem.
Fortunately, a vast amount of raw chemical molecular data can be obtained relatively easily. These data are unlabeled and therefore fall into the category of unsupervised learning. How to train on these unlabeled data and obtain a pre-trained model with strong generalization capability is a difficulty of current research. In the application, a pre-trained model can be obtained by self-supervised learning: during learning, a supervisory signal is constructed from the input data itself and used to supervise the model, so that latent features and information in the data are learned effectively. Methodologically, the current mainstream self-supervised pre-training methods fall into two main categories, generation-based and contrast-based. The application adopts contrast learning, whose main idea is to construct positive and negative samples from the input data so that the model can distinguish them in the latent representation space. Self-supervised learning of the graph neural network model is thus achieved by using the positive and negative samples to construct a pre-training task, namely a supervisory signal, from unlabeled input data.
Correspondingly, for the embodiment, after the graph neural network model is obtained by training, the molecular graph structure of the target drug molecule can be input into the graph neural network model to obtain the first feature vector under the corresponding molecular scale.
103. Determining a second feature vector corresponding to the target three-dimensional conformation by using the pre-trained convolutional neural network model, wherein the graph neural network model and the convolutional neural network model are jointly trained through contrast learning on positive sample pairs and negative sample pairs.
In this embodiment, a convolutional neural network model can be used to extract the second feature vector of the target drug molecule. A convolutional neural network is a feedforward neural network that performs convolution computations and has a deep structure. Its hidden layers can include three common kinds of structure, namely convolutional layers, pooling layers and fully connected layers, and some more modern architectures also contain more complex structures such as Inception modules and residual blocks.
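To make the convolution and pooling steps concrete, the following plain-Python sketch turns atom coordinates into a small occupancy grid and applies one convolutional layer followed by global average pooling. Everything specific here is an assumption for illustration: the voxelization scheme, the grid size, the coordinates, the two constant kernels and the function names are hypothetical, coordinates are assumed non-negative, and the operation is the unflipped cross-correlation commonly called "convolution" in deep learning.

```python
from itertools import product


def voxelize(coords, size=4, spacing=1.0):
    """Mark the grid cells that contain an atom (coordinates must be >= 0)."""
    grid = [[[0.0] * size for _ in range(size)] for _ in range(size)]
    for x, y, z in coords:
        i, j, k = (min(size - 1, int(c / spacing)) for c in (x, y, z))
        grid[i][j][k] = 1.0
    return grid


def conv3d_valid(grid, kernel):
    """Slide a cubic kernel over a cubic grid (no padding), then apply ReLU."""
    n, m = len(grid), len(kernel)
    span = n - m + 1
    out = [[[0.0] * span for _ in range(span)] for _ in range(span)]
    for i, j, k in product(range(span), repeat=3):
        acc = sum(grid[i + a][j + b][k + c] * kernel[a][b][c]
                  for a, b, c in product(range(m), repeat=3))
        out[i][j][k] = max(0.0, acc)
    return out


def global_avg_pool(volume):
    """Pooling step: reduce one feature volume to a single number."""
    values = [v for plane in volume for row in plane for v in row]
    return sum(values) / len(values)


# Hypothetical coordinates of three atoms, already shifted to be non-negative.
coords = [(0.2, 0.3, 0.1), (1.5, 0.4, 0.2), (2.4, 1.3, 0.3)]
grid = voxelize(coords)

# Two illustrative 2x2x2 kernels; a trained model would learn these values.
ones = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
halves = [[[0.5] * 2 for _ in range(2)] for _ in range(2)]

# One pooled value per kernel gives a (very small) feature vector.
second_vector = [global_avg_pool(conv3d_valid(grid, k)) for k in (ones, halves)]
```

A practical model would stack several such layers and finish with fully connected layers; the sketch only shows how a 3D grid becomes a fixed-length vector.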
It should be noted that, before steps 102 and 103 of the embodiment are executed, positive sample pairs and negative sample pairs may be constructed in advance from unlabeled drug molecules. A positive sample pair consists of the molecular graph structure and the three-dimensional conformation of the same drug molecule, and a negative sample pair consists of a molecular graph structure and a three-dimensional conformation that belong to different drug molecules. The graph neural network model and the convolutional neural network model can then be jointly trained through contrast learning on the positive and negative sample pairs, so that the distance between the embedded vectors produced by the two models is smaller for a positive sample pair and larger for a negative sample pair. The purpose of contrast learning is to shorten the distance between similar samples and increase the distance between dissimilar samples, where the similarity between the embedded vectors of a sample pair is measured by the inner product of the vectors. Through joint training, the positions in vector space of the hidden vectors output by the two models are adjusted, so that the distance between homologous vectors is reduced and the distance between non-homologous vectors is increased.
104. Constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.
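One common way to turn this objective into a loss function is an InfoNCE-style cross-entropy over inner-product scores. The text above does not name a specific loss, so the choice of InfoNCE, the temperature parameter and the function names below are assumptions; the sketch also scores only the graph-to-conformation direction, whereas a symmetric variant would average both directions.

```python
import math


def dot(u, v):
    """Inner product, used as the similarity score between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(graph_vecs, conf_vecs, temperature=1.0):
    """InfoNCE-style loss. Entry i of each list comes from molecule i, so
    (graph i, conformation i) is the positive pair and every
    (graph i, conformation j != i) is a negative pair."""
    total = 0.0
    for i, g in enumerate(graph_vecs):
        scores = [dot(g, c) / temperature for c in conf_vecs]
        peak = max(scores)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_sum - scores[i]  # equals -log softmax(scores)[i]
    return total / len(graph_vecs)


# Hypothetical embeddings for a batch of two molecules.
graph_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned_confs = [[1.0, 0.0], [0.0, 1.0]]   # positives agree with the graphs
swapped_confs = [[0.0, 1.0], [1.0, 0.0]]   # positives disagree

loss_aligned = contrastive_loss(graph_vecs, aligned_confs)
loss_swapped = contrastive_loss(graph_vecs, swapped_confs)
```

The loss is lower when each graph embedding has a larger inner product with its own conformation embedding than with those of other molecules, which is the behavior joint training is meant to encourage.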
The property prediction model may correspond to any one of the existing neural network models, for example, may be a linear regression model, a decision tree model, a neural network model, a support vector machine model, a hidden markov model, etc., and is not specifically limited in the present application; the property prediction result may specifically include one or more of prediction of target binding property, activity prediction, toxicity prediction, efficacy prediction, water solubility prediction, adverse reaction prediction, treatment effect prediction for a certain disease, and the like, and the property prediction type may specifically be set according to the actual application prediction scenario, which is not specifically limited in this scheme. It should be noted that, before executing the steps of this embodiment, the property prediction model needs to be trained in advance by using the label sample, so as to implement property prediction of the target drug molecule by using the property prediction model that is trained in advance.
For this embodiment, after determining that the first feature vector of the target drug molecule corresponding to the target molecule graph structure and the second feature vector of the target three-dimensional conformation are obtained based on the steps 102 and 103 of the embodiment, the third feature vector obtained by fusion may be input into the pre-trained property prediction model to determine and obtain the property prediction result of the target drug molecule.
According to the contrastive-learning-based drug molecular property prediction method of this embodiment, positive and negative sample pairs are first constructed and used to jointly train the graph neural network model and the convolutional neural network model through dual-view contrastive learning, so that the pre-trained models can then be applied to drug molecular property prediction. When the properties of a drug molecule are predicted, a target molecular graph structure and a target three-dimensional conformation of the target drug molecule are first generated from its chemical molecular structure; a first feature vector corresponding to the target molecular graph structure is then determined with the pre-trained graph neural network model, and a second feature vector corresponding to the target three-dimensional conformation is determined with the pre-trained convolutional neural network model; finally, a third feature vector is constructed from the first and second feature vectors and input into a pre-trained property prediction model to obtain the property prediction result of the target drug molecule. This technical scheme thus provides a pre-training strategy that trains jointly on two views, the 2D molecular graph structure and the 3D conformation, and can learn key 2D and 3D structural information while remaining computationally efficient.
Pre-training on constructed positive and negative sample pairs has the advantage that information about both the planar structure and the three-dimensional structure of compounds can be learned from large-scale unlabeled data, and the resulting model generally generalizes better. When a specific downstream task needs to be solved, the pre-trained model can be fine-tuned directly, which avoids the poor generalization that results from training a deep learning model in scenarios where labeled drug molecules are scarce, improves the efficiency of drug molecular property prediction, and helps ensure the accuracy of the predictions.
Further, as a refinement and extension of the foregoing embodiment, and in order to describe the implementation process completely, another contrastive-learning-based drug molecular property prediction method is provided. As shown in fig. 2, the method includes:
201. Based on the positive sample pairs and the negative sample pairs, jointly training the graph neural network model and the convolutional neural network model through contrastive learning.
In a specific application scenario, step 201 of this embodiment may specifically include: obtaining an unlabeled first drug molecule and an unlabeled second drug molecule, where the chemical molecular structures of the first and second drug molecules are different; generating a first molecular graph structure and a first three-dimensional conformation of the first drug molecule, and a second molecular graph structure and a second three-dimensional conformation of the second drug molecule; constructing a positive sample pair from the first molecular graph structure and the first three-dimensional conformation and/or from the second molecular graph structure and the second three-dimensional conformation, and constructing a negative sample pair from the first molecular graph structure and the second three-dimensional conformation and/or from the second molecular graph structure and the first three-dimensional conformation; and jointly training the graph neural network model and the convolutional neural network model on the positive and negative sample pairs, adjusting the model parameters of the graph neural network model and/or the convolutional neural network model so that the embedding vector distance between the two models is smaller than a first preset threshold for a positive sample pair and larger than a second preset threshold for a negative sample pair, where the second preset threshold is larger than the first preset threshold.
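The pairing rule described above can be sketched as follows. This is only an illustrative sketch: the function name `build_pairs`, the dictionary layout and the placeholder strings standing in for graph and conformation objects are assumptions, not part of the original scheme.

```python
from itertools import product

def build_pairs(molecules):
    """Split all (graph, conformation) combinations into positive and negative pairs.

    molecules: dict mapping a molecule name to a (graph, conformation) tuple.
    A pair is positive when both members come from the same molecule.
    """
    positives, negatives = [], []
    for (name_g, (graph, _)), (name_c, (_, conf)) in product(molecules.items(), repeat=2):
        pair = (graph, conf)
        if name_g == name_c:
            positives.append(pair)   # same molecule: correctly matched pair
        else:
            negatives.append(pair)   # different molecules: mismatched pair
    return positives, negatives

# Placeholder strings stand in for real graph / conformation objects.
mols = {
    "first": ("graph_1", "conf_1"),
    "second": ("graph_2", "conf_2"),
}
pos, neg = build_pairs(mols)
```

With two molecules this yields the two positive pairs and the two negative pairs listed in the step above.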
For this embodiment, during training, in addition to judging whether the two members of a sample pair are homologous (derived from the same drug molecule), the objective function is defined in terms of maximizing mutual information, and InfoNCE is used as the loss function to estimate the distance between the embedding vectors of the positive sample pairs and of the negative sample pairs, with the distance computed from the inner product of the vectors. By minimizing the distance within positive sample pairs and maximizing the distance within negative sample pairs, the generalization performance of the model is optimized and the model can fully learn the mutual information between the two views of a molecule.
The formula of the InfoNCE loss function is as follows:
where f denotes the graph neural network with trainable parameters, x denotes the original data (the original molecular graph data, commonly called the anchor data point), x^+ denotes data similar or equal to x (the 3D conformation data), x_j denotes the j-th constructed negative sample, and N denotes the number of negative samples. The purpose of contrastive learning is to shorten the distance between similar samples and increase the distance between dissimilar samples, where the closeness of each positive or negative sample to the anchor data point is measured by the inner product of their vectors. Minimizing the InfoNCE loss is equivalent to maximizing a lower bound on the mutual information between the positive sample and the anchor data, so that the graph neural network can learn the information shared by the local 3D conformation and the 2D planar graph data.
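The formula itself is not reproduced in this text. A commonly used form of the InfoNCE loss that is consistent with the symbols defined above (f, x, x^+, x_j, N, and inner-product similarity) is given below as a reference sketch; it is an assumption that the original formula has this form.

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\,\mathbb{E}\left[
      \log \frac{\exp\!\big(f(x)^{\top} f(x^{+})\big)}
                {\exp\!\big(f(x)^{\top} f(x^{+})\big)
                 + \sum_{j=1}^{N} \exp\!\big(f(x)^{\top} f(x_{j})\big)}
    \right]
```

In this form the loss decreases as the inner product between the anchor and its positive sample grows relative to the inner products between the anchor and the N negative samples.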
In addition, this scheme designs a corresponding self-supervised training task for constructing the positive and negative sample pairs. In the preprocessing step, the graph data of the first drug molecule and/or the second drug molecule and the corresponding 3D conformation data form positive sample pairs; correspondingly, the graph data of the first drug molecule and the 3D conformation data of the second drug molecule form a negative sample pair, or the graph data of the first drug molecule and the graph data of the second drug molecule form a negative sample pair. As an alternative, when negative sample pairs are generated, a corresponding number of different graph data may be randomly selected from the graph database with a probability of 50% to form negative sample pairs with the 3D conformation data of the first drug molecule and/or the second drug molecule; with the remaining 50% probability, a corresponding number of different 3D conformations are randomly selected from the 3D conformation database to form negative sample pairs with the graph data of the first drug molecule and/or the second drug molecule. The way negative samples are constructed is thus adjusted randomly and dynamically during training. Specifically, one original molecular graph data item and its corresponding 3D conformation data form a positive sample pair, and either a corresponding number of randomly selected graph data items or a corresponding number of randomly selected 3D conformation data items form the negative sample pairs.
It should be noted that, when the graph neural network model and the convolutional neural network model are jointly trained, both a correctly paired positive sample pair and an incorrectly paired negative sample pair contain one molecular graph structure and one three-dimensional conformation. During contrastive learning, the molecular graph structure of a sample pair (positive or negative) is input into the graph neural network model while the three-dimensional conformation of the same sample pair is input into the convolutional neural network model, and each model outputs an embedding vector for the pair. Because it is known whether the pair is correctly or incorrectly matched, the state of the joint training can be judged from the distance computed between the two embedding vectors, and the model parameters of the graph neural network model and the convolutional neural network model are adjusted accordingly, so that for the finally trained models the embedding vector distance is minimized for positive sample pairs and maximized for negative sample pairs.
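The 50/50 negative sampling alternative described above might be sketched like this. The function name `sample_negatives`, the dictionary-based "databases" and the string placeholders are illustrative assumptions; the source does not specify the data structures.

```python
import random

def sample_negatives(anchor_id, graph_db, conf_db, k, rng):
    """Return k negative (graph, conformation) pairs for the molecule anchor_id.

    With probability 0.5 the graph side is replaced by graphs of other molecules;
    otherwise the conformation side is replaced by other molecules' conformations.
    """
    others = [m for m in graph_db if m != anchor_id]
    chosen = rng.sample(others, k)  # k distinct other molecules
    if rng.random() < 0.5:
        # Other molecules' graphs paired with the anchor's own 3D conformation.
        return [(graph_db[m], conf_db[anchor_id]) for m in chosen]
    # The anchor's own graph paired with other molecules' 3D conformations.
    return [(graph_db[anchor_id], conf_db[m]) for m in chosen]

graph_db = {"a": "g_a", "b": "g_b", "c": "g_c", "d": "g_d"}
conf_db = {"a": "c_a", "b": "c_b", "c": "c_c", "d": "c_d"}
negs = sample_negatives("a", graph_db, conf_db, k=2, rng=random.Random(0))
```

Because the branch is drawn anew on every call, the way negatives are built varies randomly over the course of training, as the paragraph above describes.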
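To make the role of the embedding vectors concrete, the following minimal sketch computes an InfoNCE-style loss for one anchor from already computed embeddings, using the inner product as the similarity. The embeddings are hard-coded toy values; in the scheme they would come from the graph neural network model and the convolutional neural network model. The function names are assumptions.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives):
    """InfoNCE loss for one anchor embedding with inner-product similarity."""
    pos_score = dot(anchor, positive)
    scores = [pos_score] + [dot(anchor, n) for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score

# Toy embeddings: the positive is aligned with the anchor, the negatives are not.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

The loss is lower when the positive pair has the larger inner product, which is the direction in which the model parameters are adjusted during joint training.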
202. Generating a target molecular graph structure of the target drug molecule according to the chemical molecular structure, and generating a target three-dimensional conformation of the target drug molecule.
The target molecular graph structure carries an adjacency matrix and attribute information. The adjacency matrix is an n × n matrix representing the connection relations between nodes, in which an element is 1 if the corresponding pair of nodes is connected and 0 otherwise, and n is the number of nodes contained in the target small molecule. The attribute information includes node initial feature vectors (for the atoms) and edge initial feature vectors, which are determined according to preset vector generation rules. The node initial feature vector is generated according to a first preset vector generation rule, which may be as shown in table 1; the node initial feature vector may be a 27-dimensional vector composed of 6 dimensions for the number of chemical bonds, 5 for formal charge, 4 for atomic chirality, 5 for the number of bonded hydrogen atoms, 5 for atomic orbital hybridization, 1 for aromaticity and 1 for atomic mass. The edge initial feature vector is generated according to a second preset vector generation rule, which may be as shown in table 2; the edge initial feature vector may be a 12-dimensional vector composed of 4 dimensions for chemical bond type, 1 for conjugation, 1 for ring membership and 6 for stereo configuration.
TABLE 1
TABLE 2
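As an illustration of the adjacency matrix and of the feature vector layouts described above, the sketch below builds a 0/1 adjacency matrix from a bond list and checks that the stated field widths add up to 27 and 12. The field names and the three-atom example are assumptions made for illustration; the actual encoding rules are those of tables 1 and 2.

```python
# Field widths taken from the description; the field names are illustrative.
NODE_FIELDS = [("bond_count", 6), ("formal_charge", 5), ("chirality", 4),
               ("num_hydrogens", 5), ("hybridization", 5),
               ("aromatic", 1), ("atomic_mass", 1)]
EDGE_FIELDS = [("bond_type", 4), ("conjugated", 1), ("in_ring", 1), ("stereo", 6)]

def adjacency_matrix(n, bonds):
    """Build a symmetric n x n matrix with 1 where two atoms are bonded."""
    adj = [[0] * n for _ in range(n)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj

def feature_length(fields):
    """Total length of a feature vector made of the given fields."""
    return sum(width for _, width in fields)

# Heavy atoms of ethanol as a toy example: C0 - C1 - O2.
adj = adjacency_matrix(3, [(0, 1), (1, 2)])
node_dim = feature_length(NODE_FIELDS)
edge_dim = feature_length(EDGE_FIELDS)
```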
203. Inputting the target molecular graph structure, together with the adjacency matrix and attribute information it carries, into a pre-trained graph neural network model to obtain node hidden vectors of all nodes in the target molecular graph structure.
For this embodiment, the target molecular graph structure and the adjacency matrix and attribute information it carries can be input into the graph neural network model, and the node hidden vector of each node in the target molecular graph structure is obtained through the iterative learning of the graph neural network model.
Specifically, a graph neural network model learns mainly by iteratively aggregating and updating the neighbor information of the nodes in the graph data. In one iteration, each node updates its own information by aggregating the features of its neighboring nodes with its own features from the previous layer, and a nonlinear transformation is usually applied to the aggregated information. By stacking multiple network layers, each node can obtain information about the neighbor nodes within the corresponding number of hops.
Understood in terms of node message passing, the learning of the graph neural network model involves two phases: a message passing phase and a readout phase. The message passing phase is the forward propagation phase; it runs for T steps, obtaining messages through the message function M_t and updating the nodes through the update function U_t.
The message function M_t and the update function U_t are given by the formulas:
where e_vw denotes the feature vector of the edge from node v to node w.
The readout phase computes a feature vector that represents the entire graph, using a function R given by the formula:
where T denotes the total number of time steps. The functions M_t, U_t and R may use different model settings, such as a graph convolutional network (Graph Convolutional Network, GCN) or a graph attention network (Graph Attention Network, GAT).
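The formulas referred to above are not reproduced in this text. In the standard message passing neural network formulation, which uses the same symbols (M_t, U_t, R, e_vw, T), they read as follows; it is an assumption that the original formulas match this standard form. Here h_v^t denotes the hidden state of node v at step t, N(v) the neighbors of v, and G the graph.

```latex
m_v^{t+1} = \sum_{w \in N(v)} M_t\!\left(h_v^{t},\, h_w^{t},\, e_{vw}\right),
\qquad
h_v^{t+1} = U_t\!\left(h_v^{t},\, m_v^{t+1}\right),
\qquad
\hat{y} = R\!\left(\left\{\, h_v^{T} \mid v \in G \,\right\}\right)
```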
The central idea of using a graph neural network model for molecular representation learning can be understood as follows: if the different nodes and the different edges are each expressed by initial feature vectors, a final, stable feature vector for each node can be found through iterative message propagation. After a fixed number of steps, for example T steps, the feature vector of each node tends toward an equilibrium and no longer changes appreciably. Compared with the original node feature vector, the final stable feature vector of each node therefore also contains information about its neighboring nodes and about the entire graph (for example, if a certain atomic node in a chemical molecule contributes most to some property of the molecule, this will be expressed more specifically in its final feature vector).
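The iteration can be sketched in a few lines. This is a deliberately simplified illustration: the message is the plain sum of neighbor states and the update is a tanh of the node's own state plus the message, with no trainable weights and no edge features. These choices and the function name `message_passing` are assumptions, not the model of the scheme.

```python
import math

def message_passing(adj, h, steps):
    """Run `steps` rounds of sum-aggregation message passing.

    adj: n x n 0/1 adjacency matrix; h: list of n node feature vectors.
    Returns the node hidden vectors after the final round.
    """
    n, dim = len(h), len(h[0])
    for _ in range(steps):
        new_h = []
        for v in range(n):
            # Message: element-wise sum of the neighbors' current states.
            msg = [0.0] * dim
            for w in range(n):
                if adj[v][w]:
                    for k in range(dim):
                        msg[k] += h[w][k]
            # Update: nonlinear transform of own state plus aggregated message.
            new_h.append([math.tanh(h[v][k] + msg[k]) for k in range(dim)])
        h = new_h
    return h

# Path graph 0 - 1 - 2 where only node 0 starts with a non-zero feature.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
h0 = [[1.0], [0.0], [0.0]]
h1 = message_passing(adj, h0, 1)
h2 = message_passing(adj, h0, 2)
```

After one round node 2 has still received nothing from node 0, while after two rounds it has, which illustrates how stacking layers extends the reach of each node by one hop per layer.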
204. Generating a first feature vector of the target drug molecule from the node hidden vectors of the nodes.
For this embodiment, after the node hidden vector of each node in the target molecular graph structure has been determined in step 203, a vector representation of the entire target molecular graph structure can be obtained from the node hidden vectors (for example, a molecule-level representation of the whole compound is extracted from the features of the atomic nodes and the information of the chemical bonds between atoms). As a preferred manner, this step may specifically include: calculating the average of the node hidden vectors and determining this average as the first feature vector of the target drug molecule; or determining the node hidden vector with the largest hidden vector value as the first feature vector.
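The two readout options can be sketched as follows. The mean readout follows the description directly. For the second option the text does not say how the largest hidden vector value is determined, so the sketch assumes element-wise max pooling over the nodes, which is one common reading; the function names are illustrative.

```python
def mean_readout(node_vectors):
    """Average the node hidden vectors dimension by dimension."""
    n = len(node_vectors)
    return [sum(col) / n for col in zip(*node_vectors)]

def max_readout(node_vectors):
    """Element-wise maximum over the node hidden vectors (assumed interpretation)."""
    return [max(col) for col in zip(*node_vectors)]

nodes = [[1.0, 4.0], [3.0, 2.0]]
mean_vec = mean_readout(nodes)
max_vec = max_readout(nodes)
```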
205. Determining a second feature vector corresponding to the target three-dimensional conformation by using the pre-trained convolutional neural network model.
The convolutional neural network model comprises a data input layer, a convolution calculation layer, a pooling layer and a fully connected layer. For this embodiment, the target three-dimensional conformation is passed through the data input layer to the convolution calculation layer, where a convolution operation produces a feature map, and the pooling layer then applies a pooling operation to the feature map. After iterative convolution and pooling through multiple convolution calculation layers and pooling layers, a plurality of (for example, 5) feature maps, that is, a plurality of (for example, 5) matrices, are obtained. These matrices are flattened row by row and concatenated into a vector, which is passed into the fully connected layer; the fully connected layer is a BP (back propagation) neural network, and each feature map can be regarded as neurons arranged in matrix form. After the multiple convolution layers, a hidden variable representation of the 3D conformation (the second feature vector) is obtained, and this hidden variable represents the features of the conformation well. Accordingly, step 205 of this embodiment may specifically include: inputting the target three-dimensional conformation through the data input layer into the pre-trained convolutional neural network model, and obtaining feature maps through the iterative convolution and pooling of the convolution calculation layer and the pooling layer; and flattening the feature maps row by row and passing them into the fully connected layer to obtain the second feature vector corresponding to the target three-dimensional conformation.
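The convolution, pooling, row-wise flattening and fully connected steps can be illustrated with a tiny pure-Python sketch. The source does not specify how the three-dimensional conformation is encoded as network input, so a small 2D grid is used here purely for illustration; the kernel, the weights and all function names are assumptions.

```python
def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2D convolution in the cross-correlation sense."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def flatten_rows(feature_maps):
    """Flatten each feature map row by row and concatenate the results."""
    return [x for fmap in feature_maps for row in fmap for x in row]

def dense(vector, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

image = [[i + j for j in range(5)] for i in range(5)]   # toy 5 x 5 input grid
fmap = conv2d(image, [[1, 0], [0, 1]])                  # 4 x 4 feature map
pooled = max_pool(fmap)                                 # 2 x 2 after pooling
flat = flatten_rows([pooled])
embedding = dense(flat, [[0.25, 0.25, 0.25, 0.25]], [0.0])
```

A real model would stack several such convolution and pooling layers with learned kernels and produce several feature maps before flattening.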
206. Constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.
In a specific application scenario, before this step is executed, the method further includes: training a preset property prediction model using, as training samples, sample feature vectors that match the preset property prediction task corresponding to the target drug molecule; and calculating the loss function of the property prediction model, and judging that training of the property prediction model is complete when the loss function is smaller than a third preset threshold. The loss function represents the prediction error of the model's prediction result relative to the sample labels. The preset threshold takes a value between 0 and 1 and characterizes the training accuracy of the property prediction model; the closer the preset threshold is to 1, the higher the training accuracy of the property prediction model. The specific value of the preset threshold can be set according to the actual application scenario and is not specifically limited here. The property prediction model may be any existing machine learning model, for example a linear regression model, a decision tree model, a neural network model, a support vector machine model or a hidden Markov model, and may be selected according to actual application requirements; it is not specifically limited in the present application.
Accordingly, as a preferred manner, step 206 of this embodiment may specifically include: performing vector fusion on the first feature vector and the second feature vector according to a preset vector fusion rule to obtain the third feature vector; and inputting the third feature vector, as the input feature, into the pre-trained property prediction model to obtain the property prediction result of the target drug molecule. The preset vector fusion rule may include: splicing the first feature vector onto the second feature vector to obtain the third feature vector; or splicing the second feature vector onto the first feature vector to obtain the third feature vector; or adding the first feature vector and the second feature vector to obtain the third feature vector; and so on.
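The stop-when-the-loss-falls-below-a-threshold rule can be sketched with the simplest of the listed model types, a linear regression fitted by gradient descent. The mean squared error loss, the learning rate, the epoch cap and the toy data are assumptions chosen for illustration only.

```python
def train_until_threshold(samples, labels, threshold, lr=0.1, max_epochs=10000):
    """Fit a linear model by gradient descent and stop once the MSE < threshold."""
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for epoch in range(max_epochs):
        preds = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in samples]
        errors = [p - y for p, y in zip(preds, labels)]
        loss = sum(e * e for e in errors) / len(errors)
        if loss < threshold:
            return w, b, loss, epoch          # training judged complete
        for k in range(dim):
            grad = 2 * sum(e * x[k] for e, x in zip(errors, samples)) / len(errors)
            w[k] -= lr * grad
        b -= lr * 2 * sum(errors) / len(errors)
    return w, b, loss, max_epochs             # threshold not reached within the cap

# Toy labeled samples following y = 2x + 1.
w, b, loss, epoch = train_until_threshold([[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0],
                                          threshold=1e-4)
```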
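The fusion rules listed above can be sketched as a single helper. The rule names and the function name `fuse` are illustrative assumptions; note that element-wise addition only makes sense when the two feature vectors have the same length.

```python
def fuse(first, second, rule="first_then_second"):
    """Fuse two feature vectors by concatenation (either order) or addition."""
    if rule == "first_then_second":
        return list(first) + list(second)
    if rule == "second_then_first":
        return list(second) + list(first)
    if rule == "add":
        if len(first) != len(second):
            raise ValueError("element-wise addition needs vectors of equal length")
        return [a + b for a, b in zip(first, second)]
    raise ValueError(f"unknown fusion rule: {rule}")

first_vec = [1.0, 2.0]
second_vec = [3.0, 4.0]
third_concat = fuse(first_vec, second_vec)
third_sum = fuse(first_vec, second_vec, rule="add")
```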
Further, as a specific implementation of the methods shown in fig. 1 and fig. 2, an embodiment of the present application provides a contrastive-learning-based drug molecular property prediction device. As shown in fig. 3, the device includes: a first generation module 31, a first determination module 32, a second determination module 33 and an input module 34;
a first generation module 31, operable to generate a target molecular graph structure of the target drug molecule according to the chemical molecular structure, and to generate a target three-dimensional conformation of the target drug molecule;
a first determination module 32, operable to determine a first feature vector corresponding to the target molecular graph structure by using a pre-trained graph neural network model;
a second determination module 33, operable to determine a second feature vector corresponding to the target three-dimensional conformation by using a pre-trained convolutional neural network model, where the graph neural network model and the convolutional neural network model are obtained by joint training through contrastive learning on positive sample pairs and negative sample pairs;
an input module 34, operable to construct a third feature vector according to the first feature vector and the second feature vector, and to input the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule.
In a specific application scenario, in order to jointly train the graph neural network model and the convolutional neural network model through contrastive learning, as shown in fig. 4, the device further includes: an acquisition module 35, a second generation module 36, a construction module 37 and a first training module 38;
an acquisition module 35, operable to obtain an unlabeled first drug molecule and an unlabeled second drug molecule, where the chemical molecular structures of the first drug molecule and the second drug molecule are different;
a second generation module 36, operable to generate a first molecular graph structure and a first three-dimensional conformation of the first drug molecule, and a second molecular graph structure and a second three-dimensional conformation of the second drug molecule;
a construction module 37, operable to construct a positive sample pair from the first molecular graph structure and the first three-dimensional conformation and/or from the second molecular graph structure and the second three-dimensional conformation, and to construct a negative sample pair from the first molecular graph structure and the second three-dimensional conformation and/or from the second molecular graph structure and the first three-dimensional conformation;
a first training module 38, operable to jointly train the graph neural network model and the convolutional neural network model on the positive and negative sample pairs, and to adjust the model parameters of the graph neural network model and/or the convolutional neural network model so that the embedding vector distance between the two models is smaller than a first preset threshold for a positive sample pair and larger than a second preset threshold for a negative sample pair, where the second preset threshold is larger than the first preset threshold.
In a specific application scenario, the target molecular graph structure carries an adjacency matrix and attribute information, where the attribute information includes node initial feature vectors and edge initial feature vectors determined according to preset vector generation rules; correspondingly, the first determination module 32 is specifically operable to input the target molecular graph structure, the adjacency matrix and the attribute information into the pre-trained graph neural network model to obtain the node hidden vector of each node in the target molecular graph structure, and to generate the first feature vector of the target drug molecule from the node hidden vectors.
In a specific application scenario, when generating the first feature vector of the target drug molecule from the node hidden vectors, the first determination module 32 is specifically operable to calculate the average of the node hidden vectors and determine this average as the first feature vector of the target drug molecule; or to determine the node hidden vector with the largest hidden vector value as the first feature vector.
In a specific application scenario, the convolutional neural network model comprises a data input layer, a convolution calculation layer, a pooling layer and a fully connected layer; correspondingly, the second determination module 33 is specifically operable to input the target three-dimensional conformation through the data input layer into the pre-trained convolutional neural network model and obtain feature maps through the iterative convolution and pooling of the convolution calculation layer and the pooling layer, and to flatten the feature maps row by row and pass them into the fully connected layer to obtain the second feature vector corresponding to the target three-dimensional conformation.
In a specific application scenario, when constructing the third feature vector from the first feature vector and the second feature vector and inputting it into the pre-trained property prediction model to obtain the property prediction result of the target drug molecule, the input module 34 is specifically operable to perform vector fusion on the first feature vector and the second feature vector according to a preset vector fusion rule to obtain the third feature vector, and to input the third feature vector, as the input feature, into the pre-trained property prediction model to obtain the property prediction result of the target drug molecule.
In a specific application scenario, to implement pre-training of the property prediction model, as shown in fig. 4, the apparatus further includes: a second training module 39, a calculation module 310;
the second training module 39 is configured to train a preset property prediction model by using, as a training sample, a sample feature vector that matches a preset property prediction task corresponding to a target drug molecule;
the calculating module 310 may be configured to calculate a loss function of the property prediction model, and determine that training of the property prediction model is completed when the loss function is less than a third predetermined threshold.
It should be noted that, for other descriptions of the functional units of the contrastive-learning-based drug molecular property prediction device provided in this embodiment, reference may be made to the corresponding descriptions of fig. 1 to fig. 2, which are not repeated here.
Based on the methods shown in fig. 1 to 2, this embodiment correspondingly further provides a storage medium, which may be volatile or nonvolatile and on which computer readable instructions are stored; when executed by a processor, the readable instructions implement the contrastive-learning-based drug molecular property prediction method shown in fig. 1 to 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and includes several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to execute the method of each implementation scenario of the present application.
Based on the method shown in fig. 1 to 2 and the virtual device embodiments shown in fig. 3 and 4, in order to achieve the above object, the present embodiment further provides a computer device, where the computer device includes a storage medium and a processor; a storage medium storing a computer program; a processor for executing a computer program to implement the above-described contrast learning-based drug molecule property prediction method as shown in fig. 1 to 2.
Optionally, the computer device may also include a user interface, a network interface, a camera, radio frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, and the like. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), and the like.
Those skilled in the art will appreciate that the computer device structure provided in this embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or arrange the components differently.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the above computer device and supports the running of the information processing program and other software and/or programs. The network communication module is used for realizing communication among the components inside the storage medium, as well as communication with other hardware and software in the physical information processing device.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general hardware platforms, or may be implemented by hardware.
By applying the above technical solution, compared with the prior art, the present application first constructs positive and negative sample pairs, uses them to jointly train the graph neural network model and the convolutional neural network model through dual-view contrastive learning, and then applies the pre-trained graph neural network model and convolutional neural network model to drug molecular property prediction. Specifically, when predicting the properties of a drug molecule, a target molecular graph structure of the target drug molecule is first generated according to its chemical molecular structure, and a target three-dimensional conformation of the target drug molecule is generated; a first feature vector corresponding to the target molecular graph structure is then determined by using the pre-trained graph neural network model, and a second feature vector corresponding to the target three-dimensional conformation is determined by using the pre-trained convolutional neural network model; finally, a third feature vector is constructed from the first feature vector and the second feature vector and input into a pre-trained property prediction model to obtain a property prediction result for the target drug molecule. The technical solution of the present application thus provides a pre-training strategy that jointly trains on 2D molecular graph structure data and 3D conformation data from two views, so that key 2D and 3D structural information can be learned while keeping computation efficient.
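The joint pre-training on graph and conformation views described above can be illustrated with a small numerical sketch of an InfoNCE-style loss (the loss named in the claims). Several points are assumptions made for illustration only: the embeddings are random arrays standing in for the outputs of the graph neural network and the convolutional neural network, negatives are taken from the other molecules in the same batch, similarity is a plain dot product, and the `temperature` parameter and the function name `info_nce` do not come from the source.

```python
import numpy as np


def info_nce(graph_embeddings, conformer_embeddings, temperature=1.0):
    """In-batch InfoNCE loss.

    Row i of both arrays belongs to molecule i (a positive pair); every other
    row of `conformer_embeddings` serves as a negative for row i.
    """
    logits = graph_embeddings @ conformer_embeddings.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_softmax)))


rng = np.random.default_rng(0)
graph_embeddings = rng.normal(size=(8, 16))
# Conformation embeddings that agree with the graph embeddings of the same molecule.
aligned = graph_embeddings + 0.01 * rng.normal(size=(8, 16))
# Conformation embeddings shifted by one row, so every "positive" is mismatched.
mismatched = np.roll(graph_embeddings, shift=1, axis=0)

loss_aligned = info_nce(graph_embeddings, aligned)
loss_mismatched = info_nce(graph_embeddings, mismatched)
```

In this toy setting the loss is lower when the two views of the same molecule produce similar embeddings, which is the behaviour that joint training is meant to encourage.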
Constructing positive and negative sample pairs for pre-training allows information about both the planar structure and the three-dimensional structure of compounds to be learned from large-scale unlabeled data, and a model obtained in this way generally generalizes better. When a specific downstream task needs to be solved, the pre-trained model can be fine-tuned directly, which avoids the insufficient generalization that arises when a deep learning model is trained on drug molecule scenarios lacking labels, improves the efficiency of drug molecular property prediction, and ensures the accuracy of the predicted properties.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of one preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will also appreciate that the modules of an apparatus in an implementation scenario may be distributed in the apparatus as described for that implementation scenario, or may be changed accordingly and located in one or more apparatuses different from that of the implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The serial numbers used in the present application are for description only and do not represent the advantages or disadvantages of the implementation scenarios. The foregoing discloses only a few specific implementation scenarios of the present application; the present application is not limited thereto, and any variation that can be conceived by a person skilled in the art shall fall within the protection scope of the present application.

Claims (9)

1. A method for predicting molecular properties of a drug based on contrast learning, comprising:
generating a target molecular graph structure of a target drug molecule according to the chemical molecular structure, and generating a target three-dimensional conformation of the target drug molecule;
determining a first feature vector corresponding to the target molecular graph structure by utilizing a pre-trained graph neural network model;
determining a second feature vector corresponding to the target three-dimensional conformation by utilizing a pre-trained convolutional neural network model, wherein the graph neural network model and the convolutional neural network model are obtained through joint training by contrastive learning on positive sample pairs and negative sample pairs;
constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule;
wherein, when contrastive learning is performed on the graph neural network model and the convolutional neural network model, during joint training the molecular graph structure of a given positive/negative sample pair is input into the graph neural network model while the three-dimensional conformation of the same positive/negative sample pair is simultaneously input into the convolutional neural network model; the graph neural network model and the convolutional neural network model each output an embedded vector for the positive/negative sample pair; and, by contrastive learning on the two embedded vectors, the state of the joint training of the graph neural network model and the convolutional neural network model is judged according to the calculated embedded vector distance so as to adjust the model parameters of the graph neural network model and the convolutional neural network model;
before determining the first feature vector corresponding to the target molecular graph structure by using the pre-trained graph neural network model, the method further comprises:
obtaining an unlabeled first drug molecule and an unlabeled second drug molecule, wherein the chemical molecular structures corresponding to the first drug molecule and the second drug molecule are different;
generating a first molecular graph structure and a first three-dimensional conformation of the first drug molecule, and generating a second molecular graph structure and a second three-dimensional conformation of the second drug molecule;
constructing a positive sample pair using the first molecular graph structure and the first three-dimensional conformation, and/or using the second molecular graph structure and the second three-dimensional conformation, and/or constructing a negative sample pair using the first molecular graph structure and the second three-dimensional conformation, and/or using the second molecular graph structure and the first three-dimensional conformation;
or, when generating negative sample pairs, with 50% probability randomly selecting a corresponding number of different pieces of graph data from a graph database and forming negative sample pairs with the three-dimensional conformation data of the first drug molecule and/or the second drug molecule; and with the other 50% probability randomly selecting a corresponding number of different pieces of three-dimensional conformation data from a three-dimensional conformation database and forming negative sample pairs with the graph data of the first drug molecule and/or the second drug molecule;
jointly training the graph neural network model and the convolutional neural network model by using the positive sample pair and the negative sample pair, and adjusting model parameters of the graph neural network model and/or the convolutional neural network model so that the embedded vector distance between the graph neural network model and the convolutional neural network model for the positive sample pair is smaller than a first preset threshold and the embedded vector distance for the negative sample pair is larger than a second preset threshold, wherein the second preset threshold is larger than the first preset threshold;
using InfoNCE as a loss function, respectively estimating the distance of the embedded vectors of the positive and negative sample pairs;
wherein, the formula of the InfoNCE loss function is as follows:
where f denotes a graph neural network with trainable parameters, x denotes the raw data, x^+ denotes data similar or identical to x, x_j denotes the j-th constructed negative sample, and N denotes the number of negative samples.
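The formula itself is not reproduced in this text. For orientation only, a commonly used form of the InfoNCE loss that is consistent with the symbols defined above (f, x, x^+, x_j, N) is given below. It is an assumption that the patent uses exactly this form; in particular, whether the original includes a temperature factor or normalizes the embeddings cannot be determined from the surrounding text.

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\,\mathbb{E}\left[
      \log \frac{\exp\!\big(f(x)^{\top} f(x^{+})\big)}
                {\exp\!\big(f(x)^{\top} f(x^{+})\big)
                 + \sum_{j=1}^{N} \exp\!\big(f(x)^{\top} f(x_{j})\big)}
    \right]
```

Under this form, the loss decreases when the embedding of x is close to that of its positive x^+ and far from the embeddings of the N negatives x_j.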
2. The method according to claim 1, wherein the target molecular graph structure carries an adjacency matrix and attribute information, and the attribute information includes a node initial feature vector and an edge initial feature vector, wherein the node initial feature vector and the edge initial feature vector are determined according to a preset vector generation rule;
the determining of the first feature vector corresponding to the target molecular graph structure by using the pre-trained graph neural network model comprises:
inputting the target molecular graph structure, the adjacency matrix and the attribute information into the pre-trained graph neural network model to obtain a node hidden vector for each node in the target molecular graph structure;
and generating the first feature vector of the target drug molecule by using the node hidden vector of each node.
3. The method of claim 2, wherein generating the first feature vector of the target drug molecule using the node hidden vectors of the respective nodes comprises:
calculating the mean of the node hidden vectors, and determining the mean as the first feature vector of the target drug molecule; or,
determining the node hidden vector with the maximum corresponding hidden vector value as the first feature vector.
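The two readout options in claim 3 can be sketched as below. The mean readout follows the claim directly. The max readout is shown as element-wise max pooling over nodes, which is one possible reading of "the node hidden vector with the maximum corresponding hidden vector value" and is an assumption of this example; the function name `readout` and the toy values are likewise illustrative.

```python
import numpy as np


def readout(node_hidden_vectors, mode="mean"):
    """Collapse per-node hidden vectors (shape: nodes x dim) into one vector."""
    if mode == "mean":
        return node_hidden_vectors.mean(axis=0)
    if mode == "max":
        # Element-wise maximum over all nodes (assumed interpretation).
        return node_hidden_vectors.max(axis=0)
    raise ValueError(f"unknown readout mode: {mode}")


# Three nodes with two-dimensional hidden vectors.
node_hidden_vectors = np.array([[1.0, 4.0],
                                [3.0, 2.0],
                                [5.0, 0.0]])
mean_vector = readout(node_hidden_vectors, "mean")  # [3.0, 2.0]
max_vector = readout(node_hidden_vectors, "max")    # [5.0, 4.0]
```

Both options yield a fixed-length vector regardless of how many atoms the molecule has, which is what allows the result to be fused with the conformation features later.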
4. The method of claim 1, wherein the convolutional neural network model comprises a data input layer, a convolutional calculation layer, a pooling layer, a fully-connected layer;
the determining of the second feature vector corresponding to the target three-dimensional conformation by using the pre-trained convolutional neural network model comprises:
inputting the target three-dimensional conformation into the pre-trained convolutional neural network model through the data input layer, and performing iterative convolution and pooling in the convolutional calculation layer and the pooling layer to obtain a feature map;
and flattening the feature map row by row and passing it to the fully-connected layer to obtain the second feature vector corresponding to the target three-dimensional conformation.
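The last step of claim 4, unrolling the feature map row by row and feeding it to a fully-connected layer, can be sketched as follows. The feature map, weights, and bias are toy values chosen so the row-major ordering is visible; the function name `flatten_and_project` is hypothetical, and a real model would learn the weights and typically apply a non-linearity.

```python
import numpy as np


def flatten_and_project(feature_map, weight, bias):
    """Unroll a feature map in row-major order and apply one dense layer."""
    flat = feature_map.reshape(-1)  # NumPy's default C order is row by row
    return weight @ flat + bias


feature_map = np.arange(6, dtype=float).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
weight = np.zeros((2, 6))
weight[0, 2] = 1.0  # picks the last entry of the first row (value 2)
weight[1, 3] = 1.0  # picks the first entry of the second row (value 3)
bias = np.array([10.0, 20.0])

second_feature_vector = flatten_and_project(feature_map, weight, bias)  # [12.0, 23.0]
```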
5. The method according to claim 1, wherein the constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model, to obtain a property prediction result of the target drug molecule, includes:
vector fusion processing is carried out on the first feature vector and the second feature vector according to a preset vector fusion rule, and a third feature vector is obtained;
and inputting the third feature vector, as an input feature, into the pre-trained property prediction model to obtain the property prediction result of the target drug molecule.
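The source does not state what the "preset vector fusion rule" is. As a sketch under that caveat, two simple candidates are concatenation and element-wise summation; both, along with the function name `fuse` and the toy vectors, are assumptions made for illustration.

```python
import numpy as np


def fuse(first_feature_vector, second_feature_vector, rule="concat"):
    """Build the third feature vector from the graph and conformation features."""
    if rule == "concat":
        return np.concatenate([first_feature_vector, second_feature_vector])
    if rule == "sum":
        if first_feature_vector.shape != second_feature_vector.shape:
            raise ValueError("element-wise sum needs vectors of equal length")
        return first_feature_vector + second_feature_vector
    raise ValueError(f"unknown fusion rule: {rule}")


first_feature_vector = np.array([1.0, 2.0])         # stand-in for the graph features
second_feature_vector = np.array([3.0, 4.0, 5.0])   # stand-in for the conformation features
third_feature_vector = fuse(first_feature_vector, second_feature_vector)
```

Concatenation keeps the two views separate and works for vectors of different lengths, while summation requires both encoders to produce the same dimensionality.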
6. The method according to claim 1, wherein the method further comprises:
taking a sample feature vector matched with a preset property prediction task corresponding to the target drug molecule as a training sample, and training a preset property prediction model;
and calculating a loss function of the property prediction model, and determining that the training of the property prediction model is complete when the loss function is smaller than a third preset threshold.
7. A contrast learning-based drug molecule property prediction device, comprising:
the first generation module is used for generating a target molecular graph structure of a target drug molecule according to the chemical molecular structure and generating a target three-dimensional conformation of the target drug molecule;
the first determining module is used for determining a first feature vector corresponding to the target molecular graph structure by utilizing a pre-trained graph neural network model;
the second determining module is used for determining a second feature vector corresponding to the target three-dimensional conformation by utilizing a pre-trained convolutional neural network model, wherein the graph neural network model and the convolutional neural network model are obtained through joint training by contrastive learning on positive sample pairs and negative sample pairs;
the input module is used for constructing a third feature vector according to the first feature vector and the second feature vector, and inputting the third feature vector into a pre-trained property prediction model to obtain a property prediction result of the target drug molecule;
wherein, when contrastive learning is performed on the graph neural network model and the convolutional neural network model, during joint training the molecular graph structure of a given positive/negative sample pair is input into the graph neural network model while the three-dimensional conformation of the same positive/negative sample pair is simultaneously input into the convolutional neural network model; the graph neural network model and the convolutional neural network model each output an embedded vector for the positive/negative sample pair; and, by contrastive learning on the two embedded vectors, the state of the joint training of the graph neural network model and the convolutional neural network model is judged according to the calculated embedded vector distance so as to adjust the model parameters of the graph neural network model and the convolutional neural network model;
the device further comprises: an acquisition module, a second generation module, a construction module and a first training module;
the acquisition module is used for acquiring an unlabeled first drug molecule and an unlabeled second drug molecule, wherein the chemical molecular structures corresponding to the first drug molecule and the second drug molecule are different;
the second generation module is used for generating a first molecular graph structure and a first three-dimensional conformation of the first drug molecule, and generating a second molecular graph structure and a second three-dimensional conformation of the second drug molecule;
the construction module is used for constructing a positive sample pair using the first molecular graph structure and the first three-dimensional conformation, and/or using the second molecular graph structure and the second three-dimensional conformation, and/or constructing a negative sample pair using the first molecular graph structure and the second three-dimensional conformation, and/or using the second molecular graph structure and the first three-dimensional conformation;
or, when generating negative sample pairs, with 50% probability randomly selecting a corresponding number of different pieces of graph data from a graph database and forming negative sample pairs with the three-dimensional conformation data of the first drug molecule and/or the second drug molecule; and with the other 50% probability randomly selecting a corresponding number of different pieces of three-dimensional conformation data from a three-dimensional conformation database and forming negative sample pairs with the graph data of the first drug molecule and/or the second drug molecule;
the first training module is used for jointly training the graph neural network model and the convolutional neural network model by using the positive sample pair and the negative sample pair, and adjusting model parameters of the graph neural network model and/or the convolutional neural network model so that the embedded vector distance between the graph neural network model and the convolutional neural network model for the positive sample pair is smaller than a first preset threshold and the embedded vector distance for the negative sample pair is larger than a second preset threshold, wherein the second preset threshold is larger than the first preset threshold;
using InfoNCE as a loss function, respectively estimating the distance of the embedded vectors of the positive and negative sample pairs;
wherein the formula of the InfoNCE loss function is as follows:
where f denotes a graph neural network with trainable parameters, x denotes the raw data, x^+ denotes data similar or identical to x, x_j denotes the j-th constructed negative sample, and N denotes the number of negative samples.
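The pair construction performed by the construction module can be sketched as follows. This is an illustration under stated assumptions: molecules are represented by placeholder strings rather than real graphs and conformations, the "databases" are simply the other molecules in the list, the coin flip is made once per molecule, and the names `build_pairs` and `negatives_per_molecule` are hypothetical.

```python
import random


def build_pairs(graphs, conformers, negatives_per_molecule=2, seed=0):
    """Build positive and negative (graph, conformation) pairs.

    graphs[i] and conformers[i] describe the same molecule i, and distinct
    indices are assumed to be molecules with different chemical structures.
    """
    rng = random.Random(seed)
    indices = range(len(graphs))
    # Positive pair: graph and conformation of the same molecule.
    positives = [(graphs[i], conformers[i]) for i in indices]
    negatives = []
    for i in indices:
        others = [j for j in indices if j != i]
        chosen = rng.sample(others, min(negatives_per_molecule, len(others)))
        if rng.random() < 0.5:
            # Foreign graphs paired with this molecule's conformation.
            negatives.extend((graphs[j], conformers[i]) for j in chosen)
        else:
            # This molecule's graph paired with foreign conformations.
            negatives.extend((graphs[i], conformers[j]) for j in chosen)
    return positives, negatives


molecule_ids = ["a", "b", "c", "d"]
graphs = [f"graph_{m}" for m in molecule_ids]
conformers = [f"conf_{m}" for m in molecule_ids]
positives, negatives = build_pairs(graphs, conformers)
```

Because every negative pair mixes the graph of one molecule with the conformation of another, no labels are needed to build the training signal, which matches the unlabeled setting described in the claims.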
8. A storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the contrast learning-based drug molecular property prediction method of any one of claims 1 to 6.
9. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor, when executing the program, implements the contrast learning-based drug molecular property prediction method of any one of claims 1 to 6.
CN202210026795.8A 2022-01-11 2022-01-11 Drug molecular property prediction method, device and equipment based on contrast learning Active CN114386694B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210026795.8A CN114386694B (en) 2022-01-11 2022-01-11 Drug molecular property prediction method, device and equipment based on contrast learning
PCT/CN2022/089691 WO2023134063A1 (en) 2022-01-11 2022-04-27 Comparative learning-based method, apparatus, and device for predicting properties of drug molecule

Publications (2)

Publication Number Publication Date
CN114386694A CN114386694A (en) 2022-04-22
CN114386694B true CN114386694B (en) 2024-02-23

Family

ID=81202457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210026795.8A Active CN114386694B (en) 2022-01-11 2022-01-11 Drug molecular property prediction method, device and equipment based on contrast learning

Country Status (2)

Country Link
CN (1) CN114386694B (en)
WO (1) WO2023134063A1 (en)

Also Published As

Publication number Publication date
CN114386694A (en) 2022-04-22
WO2023134063A1 (en) 2023-07-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant