CN114613450A

CN114613450A - Method and device for predicting property of drug molecule, storage medium and computer equipment

Info

Publication number: CN114613450A
Application number: CN202210231663.9A
Authority: CN
Inventors: 王俊; 高鹏; 孙宁; 谢国彤
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2022-03-09
Filing date: 2022-03-09
Publication date: 2022-06-10
Also published as: WO2023168810A1

Abstract

The invention discloses a method and a device for predicting the property of a drug molecule, a storage medium and computer equipment. The method comprises the following steps: acquiring a drug molecule to be predicted, and carrying out modal transformation on the molecular structure of the drug molecule to obtain a multi-modal drug molecule structure, wherein the multi-modal drug molecule structure comprises a drug molecule sequence, a drug molecule graph, a drug molecule image and a drug molecule fingerprint; performing feature extraction on the multi-modal drug molecular structure through a pre-trained multi-modal feature extraction model to obtain multi-modal drug molecular feature vectors; converting the multi-modal drug molecule feature vectors into multi-modal high-dimensional feature vectors, and performing feature fusion on the multi-modal high-dimensional feature vectors to obtain fusion feature vectors of the drug molecules; and inputting the fusion characteristic vector of the drug molecule into a pre-trained drug molecule property prediction model to obtain a property prediction result of the drug molecule. The method can improve the accuracy of the prediction of the molecular properties of the drug.

Description

Method and device for predicting property of drug molecule, storage medium and computer equipment

Technical Field

The invention relates to the technical field of artificial intelligence and digital medical treatment, in particular to a method and a device for predicting the property of a drug molecule, a storage medium and computer equipment.

Background

Drug discovery is a process of identifying new candidate compounds with potential therapeutic effects, wherein prediction of various properties of drug molecules is an essential step in the drug discovery process. Poor pharmacokinetic properties (absorption, distribution, metabolism and excretion, ADME) and toxicity (T) are among the major causes of failure of drug development, and therefore, it is crucial to assess ADMET properties of candidate drug molecules at the early stages of drug research.

In the past, the properties of drug molecules were verified through experiments, but experimental verification is time-consuming and costly, and comprehensive, accurate prediction is particularly difficult to achieve in this way. At present, machine learning methods are used to learn a representation of the data distribution of drug molecules, and the learned representation is then applied to unknown data to predict the properties of drug molecules. However, existing drug prediction models struggle to express the characteristics of drug molecules comprehensively, so their prediction accuracy is low.

Disclosure of Invention

In view of the above, the present application provides a method, an apparatus, a storage medium, and a computer device for predicting a property of a drug molecule, and mainly aims to solve the technical problem of inaccurate property prediction of a drug molecule.

According to a first aspect of the present invention, there is provided a method of predicting a property of a drug molecule, the method comprising:

acquiring a drug molecule to be predicted, and carrying out modal transformation on the molecular structure of the drug molecule to obtain a multi-modal drug molecule structure, wherein the multi-modal drug molecule structure comprises at least two of a drug molecule sequence, a drug molecule graph, a drug molecule image and a drug molecule fingerprint;

performing feature extraction on the multi-modal drug molecular structure through a pre-trained multi-modal feature extraction model to obtain multi-modal drug molecular feature vectors;

converting the multi-modal drug molecule feature vectors into multi-modal high-dimensional feature vectors, and performing feature fusion on the multi-modal high-dimensional feature vectors to obtain fusion feature vectors of the drug molecules;

and inputting the fusion characteristic vector of the drug molecules into a pre-trained drug molecule property prediction model to obtain a property prediction result of the drug molecules.

According to a second aspect of the present invention, there is provided a device for predicting a property of a drug molecule, the device comprising:

a modality conversion module, configured to acquire a drug molecule to be predicted and perform modality conversion on the molecular structure of the drug molecule to obtain a multi-modal drug molecule structure, wherein the multi-modal drug molecule structure comprises at least two of a drug molecule sequence, a drug molecule graph, a drug molecule image and a drug molecule fingerprint;

a feature extraction module, configured to perform feature extraction on the multi-modal drug molecule structure through a pre-trained multi-modal feature extraction model to obtain multi-modal drug molecule feature vectors;

a feature fusion module, configured to convert the multi-modal drug molecule feature vectors into multi-modal high-dimensional feature vectors and perform feature fusion on the multi-modal high-dimensional feature vectors to obtain a fusion feature vector of the drug molecule;

and a property prediction module, configured to input the fusion feature vector of the drug molecule into a pre-trained drug molecule property prediction model to obtain a property prediction result of the drug molecule.

According to a third aspect of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the method of property prediction for a drug molecule as described above.

According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of predicting a property of a drug molecule as described above when executing the program.

According to the method, device, storage medium and computer equipment for predicting the property of a drug molecule provided by the present invention, the molecular structure of the drug molecule is first converted into a multi-modal drug molecule structure; features are then extracted from the drug molecule structure of each modality through a pre-trained multi-modal feature extraction model; the drug molecule feature vectors of the modalities are then fused; and finally the property prediction result of the drug molecule is obtained based on the fusion feature vector of the drug molecule. The method obtains a more comprehensive feature representation of the drug molecule, and can therefore predict its properties more accurately and effectively, improve the speed and success rate of drug research and development, and reduce the cost of drug molecule property prediction.

The foregoing description is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features, and advantages of the present application become more readily apparent, specific embodiments of the present application are described in detail below.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic flow chart illustrating a method for predicting a property of a drug molecule according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating an operational flow of a method for predicting a property of a drug molecule according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a device for predicting the property of a drug molecule according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating an internal structure of a computer device according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

In one embodiment, as shown in FIG. 1 and FIG. 2, a method for predicting a property of a drug molecule is provided. The method is described here as applied to a computer device, and comprises the following steps:

101. Obtaining a drug molecule to be predicted, and performing modality conversion on the molecular structure of the drug molecule to obtain a multi-modal drug molecule structure.

The multi-modal drug molecular structure comprises at least two of a drug molecular sequence, a drug molecular graph, a drug molecular image and a drug molecular fingerprint. In the present embodiment, the drug molecule sequence refers to a drug molecule structure represented by a character string, such as a SMILES expression, similar to a language sequence; the drug molecular graph refers to the structure of a drug molecule represented by a data structure graph; the drug molecule image refers to a drug molecule structure represented by a planar picture; a drug molecule fingerprint refers to the structure of a drug molecule represented by a series of bit strings.

Specifically, the computer device may obtain the drug molecule to be predicted through a data interface or a network, and then convert its molecular structure once per modality, using the conversion method corresponding to that modality, so as to obtain the multi-modal drug molecule structure.
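As an illustration of the "at least two modalities" condition in step 101, the following is a minimal sketch of a container for the converted representations. The class name, field names and data layouts are hypothetical and are not taken from the patent; they only show one way the four representations could be held together.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class MultiModalMolecule:
    """Hypothetical container for the modality-specific forms of one molecule.

    Any field may be left as None when that modality is not used.
    """
    sequence: Optional[str] = None        # character string, e.g. a SMILES expression
    graph: Optional[Tuple[List[str], List[Tuple[int, int, int]]]] = None  # (atoms, bonds)
    image: Optional[List[List[int]]] = None   # two-dimensional pixel grid
    fingerprint: Optional[List[int]] = None   # bit vector

    def available_modalities(self) -> List[str]:
        """Return the names of the modalities that are actually present."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]

    def validate(self) -> None:
        """Enforce the 'at least two modalities' condition described in the text."""
        if len(self.available_modalities()) < 2:
            raise ValueError("at least two modalities are required")


# Illustrative values only; the bit vector is not a real fingerprint of ethanol.
mol = MultiModalMolecule(sequence="CCO", fingerprint=[0, 1, 0, 0, 1, 0, 0, 0])
mol.validate()
print(mol.available_modalities())
```

Keeping every modality optional mirrors the wording of the claims, which require at least two of the four representations rather than all of them.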

102. Performing feature extraction on the multi-modal drug molecule structure through a pre-trained multi-modal feature extraction model to obtain multi-modal drug molecule feature vectors.

For the drug molecule structure of each modality, the feature extraction method corresponding to that modality can be used to extract its features, thereby obtaining the multi-modal drug molecule feature vectors. In this embodiment, after feature extraction, at least two of the feature vector of the drug molecule sequence, the feature vector of the drug molecule graph, the feature vector of the drug molecule image, and the feature vector of the drug molecule fingerprint are obtained.

103. Converting the multi-modal drug molecule feature vectors into multi-modal high-dimensional feature vectors, and performing feature fusion on the multi-modal high-dimensional feature vectors to obtain a fusion feature vector of the drug molecule.

Specifically, after obtaining the multi-modal drug molecule feature vectors, the drug molecule feature vectors of different modalities can be firstly converted into high-dimensional feature expression, and then the high-dimensional feature expression is fused in the middle layer of the model. The intermediate fusion can convert multi-modal drug molecule feature vectors into high-dimensional feature expressions (e.g., 768 dimensions) by using a neural network, and then obtain commonalities of different modal data in a high-dimensional space, so that the multi-modal high-dimensional feature vectors are fused to obtain more complete and sufficient drug molecule fusion feature vectors.

104. Inputting the fusion feature vector of the drug molecule into a pre-trained drug molecule property prediction model to obtain a property prediction result of the drug molecule.

Specifically, after the fusion feature vector of the drug molecule is obtained, the fusion feature vector of the drug molecule may be input into a pre-trained drug molecule property prediction model to obtain a property prediction result of the drug molecule. The drug molecule property prediction model may be obtained by training a machine learning model such as a neural network, and the embodiment is not specifically limited herein.
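Step 104 can be illustrated with a minimal sketch of a prediction head. The patent leaves the model type open, so the single linear layer with a sigmoid used here, the function name `predict_property`, and all numeric values are assumptions chosen only to make the data flow concrete.

```python
import math
from typing import List


def predict_property(fused: List[float], weights: List[float], bias: float) -> float:
    """Return the probability of the positive class for one fused feature vector.

    A single linear layer plus sigmoid is assumed here as a stand-in for the
    pre-trained property prediction model.
    """
    if len(fused) != len(weights):
        raise ValueError("fused vector and weights must have the same length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical fused vector and "pre-trained" parameters.
fused_vector = [0.5, -1.0, 2.0]
weights = [0.8, 0.1, -0.3]
bias = 0.2

p = predict_property(fused_vector, weights, bias)
label = "toxic" if p >= 0.5 else "non-toxic"
print(round(p, 3), label)
```

In a real system the weights would come from the training procedure described later in the document rather than being written by hand.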

In the method for predicting the property of a drug molecule provided in this embodiment, the molecular structure of the drug molecule is first converted into a multi-modal drug molecule structure; features are then extracted from the drug molecule structure of each modality through a pre-trained multi-modal feature extraction model; the drug molecule feature vectors of the modalities are then fused; and finally the property prediction result of the drug molecule is obtained based on the fusion feature vector of the drug molecule. The method obtains a more comprehensive feature representation of the drug molecule, and can therefore predict its properties more accurately and effectively, improve the speed and success rate of drug research and development, and reduce the cost of drug molecule property prediction.

In one embodiment, the modality conversion of the molecular structure of the drug molecule in step 101 can be implemented as follows. First, the molecular structure of the drug molecule is converted into a character string format according to a predetermined molecular structure conversion rule to obtain the drug molecule sequence; for example, the molecular structure may be converted into a SMILES expression according to the SMILES conversion rules. Second, the atoms of the molecular structure are converted into nodes of a drug molecule graph and the chemical bonds into edges, yielding the drug molecule graph; attribute or feature information of the atoms or chemical bonds may be added to the graph to enrich its feature information. Third, the molecular structure can be converted into a two-dimensional image by photographing, taking a screenshot, image conversion or similar means to obtain the drug molecule image; since such image conversion is straightforward, it is not described further here. Finally, structural features of the molecular structure can be extracted and encoded into a bit vector to obtain the drug molecule fingerprint. A drug molecule fingerprint is an abstract representation of a molecule: it encodes the drug molecule as a series of bit strings (i.e., a bit vector), which makes drug molecules easy to compare. It can be understood that modality conversion can be performed in many ways, so the conversion is not limited to the methods above and can be chosen according to the actual situation.
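The atom-to-node and bond-to-edge conversion can be sketched as follows. This is a minimal illustration that assumes the atoms and bonds have already been parsed from the molecular structure; the function name `build_molecular_graph` and the tuple layout `(atom_a, atom_b, bond_order)` are hypothetical.

```python
from typing import Dict, List, Tuple

Bond = Tuple[int, int, int]  # (index of atom a, index of atom b, bond order)


def build_molecular_graph(
    atoms: List[str], bonds: List[Bond]
) -> Tuple[List[Dict[str, object]], Dict[int, List[Tuple[int, int]]]]:
    """Turn atoms into attributed nodes and bonds into undirected edges."""
    # Each node carries attribute information; only the element is stored here.
    nodes = [{"index": i, "element": element} for i, element in enumerate(atoms)]
    adjacency: Dict[int, List[Tuple[int, int]]] = {i: [] for i in range(len(atoms))}
    for a, b, order in bonds:
        if not (0 <= a < len(atoms) and 0 <= b < len(atoms)):
            raise ValueError(f"bond ({a}, {b}) refers to a missing atom")
        # A chemical bond is undirected, so it is recorded on both atoms.
        adjacency[a].append((b, order))
        adjacency[b].append((a, order))
    return nodes, adjacency


# Heavy atoms of ethanol, which corresponds to the SMILES expression "CCO".
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]
nodes, adjacency = build_molecular_graph(atoms, bonds)
print(adjacency)
```

Further atom or bond attributes (charge, aromaticity and so on) could be added to the node dictionaries or edge tuples in the same way.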

In one embodiment, the drug molecule fingerprint may specifically be an extended connectivity fingerprint, in which case the extraction of the drug molecule fingerprint may include the following steps: first, an identifier is assigned to each atom in the molecular structure of the drug molecule, and the hash value of the identifier of each atom is stored in a pre-established identifier set; then, a bond list is created for each atom, and the bond orders and the identifiers of the atoms adjacent to that atom are stored in its bond list; next, the hash value of the bond list of each atom is taken as the updated identifier of that atom, and the updated identifier of each atom is stored in the identifier set; finally, all the identifiers in the identifier set are extracted to obtain the drug molecule fingerprint.

In the above embodiment, the extended connectivity fingerprint is a circular fingerprint. It is defined by setting a radius n (i.e., the number of iterations) and then computing an identifier for each atom, similar to connectivity in a Morgan fingerprint, so that each final identifier is determined by the atom's environment within radius n. The algorithm of the extended connectivity fingerprint is as follows. First, a set S is created to store the identifiers of all atoms. Each atom is then labeled with a 32-bit integer, for example by the Morgan algorithm or the CANGEN algorithm, and the label is hashed and added to S. Next, a "bond list" is created for each atom to store information about its neighboring atoms; the list may be sorted by bond order (single bond, double bond, triple bond, and so on) and then by the size of the neighboring atoms' identifiers. The resulting feature list has the content [n, identifier, bo1, aid1, bo2, aid2, …], where n is the iteration number, starting from 0, bo1 is the bond order of the 1st bond, aid1 is the identifier of the atom connected by the 1st bond, and so on. The hash value of this feature list is then computed as the new identifier of the atom. If the newly computed identifier does not structurally duplicate an identifier already in S, it is added to S, and the iteration continues until the loop ends. In this embodiment, the drug molecule fingerprint serves as a good complement to the drug molecule structures of the other three modalities, so that the complementary strengths of the modalities are exploited more fully and the properties of small-molecule drugs are predicted more effectively and accurately.
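The iterative identifier update described above can be sketched in a few lines. This is a simplified illustration, not the full extended connectivity fingerprint algorithm: the initial identifier is assumed to depend only on the element and the number of neighbors, structural duplicate removal is reduced to set membership, and the final folding of identifiers into a fixed-length bit vector is an added assumption. The names `stable_hash` and `circular_fingerprint` are hypothetical.

```python
import hashlib
from typing import Dict, List, Sequence, Set, Tuple

Bond = Tuple[int, int, int]  # (index of atom a, index of atom b, bond order)


def stable_hash(values: Sequence[object]) -> int:
    """32-bit hash that is identical across runs (unlike the built-in hash())."""
    digest = hashlib.sha256(repr(tuple(values)).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def circular_fingerprint(
    atoms: List[str], bonds: List[Bond], radius: int = 2, n_bits: int = 64
) -> Tuple[Set[int], List[int]]:
    """Return the identifier set S and a bit vector folded from it."""
    neighbours: Dict[int, List[Tuple[int, int]]] = {i: [] for i in range(len(atoms))}
    for a, b, order in bonds:
        neighbours[a].append((order, b))
        neighbours[b].append((order, a))

    # Initial identifiers (assumption: element symbol and neighbor count only).
    identifiers = [stable_hash((el, len(neighbours[i]))) for i, el in enumerate(atoms)]
    collected: Set[int] = set(identifiers)

    for n in range(radius):  # n is the iteration number, starting from 0
        updated = []
        for i in range(len(atoms)):
            # Bond list sorted by bond order, then by neighbor identifier.
            bond_list = sorted((order, identifiers[j]) for order, j in neighbours[i])
            feature: List[int] = [n, identifiers[i]]
            for order, neighbour_id in bond_list:
                feature.extend([order, neighbour_id])  # [n, id, bo1, aid1, bo2, aid2, ...]
            updated.append(stable_hash(feature))
        identifiers = updated
        collected.update(identifiers)  # identifiers already in S are not added twice

    bits = [0] * n_bits
    for identifier in collected:
        bits[identifier % n_bits] = 1  # fold each identifier into the bit vector
    return collected, bits


ids, bits = circular_fingerprint(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
print(len(ids), sum(bits))
```

All new identifiers of one iteration are computed from the identifiers of the previous iteration, so the result does not depend on the order in which atoms are visited.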

In one embodiment, the method for extracting the features of the molecular structure of the drug of each modality in step 102 can be implemented by the following methods: extracting language structure characteristics in the drug molecule sequence through a language model in the multi-modal characteristic extraction model to obtain a characteristic vector of the drug molecule sequence; extracting atomic features and chemical bond features in the drug molecular graph through a graph neural network in the multi-modal feature extraction model to obtain feature vectors of the drug molecular graph; extracting image features in the drug molecule image through a convolutional neural network in the multi-modal feature extraction model to obtain feature vectors of the drug molecule image; and extracting the identifier characteristics in the drug molecule fingerprint through a deep neural network in the multi-modal characteristic extraction model to obtain the characteristic vector of the drug molecule fingerprint.

In the above embodiment, the language model can extract the structural information hidden in the drug molecule sequence and the correlation information between sequences; the extracted information is spliced together and passed through a fully connected layer for dimension reduction, yielding a low-dimensional dense feature vector of the drug molecule sequence. The graph neural network can extract the features of the atom nodes of the drug molecule graph and the chemical bond information of the edges connecting the atoms, thereby extracting molecule-level features of the whole compound. The convolutional neural network can extract image features at different levels of the drug molecule image, progressing layer by layer until the image features of the whole drug molecule image are extracted. The deep neural network can extract deep features from the drug molecule fingerprint; these features serve as a good supplement to the features of the other three modalities, so that the modal features complement one another and the accuracy of drug molecule property prediction is improved.

In one embodiment, the method for performing feature fusion on the drug molecule feature vectors of each modality in step 103 can be implemented by the following method: firstly, multi-modal drug molecule feature vectors are converted into multi-modal high-dimensional feature vectors with the same dimensionality, then the multi-modal high-dimensional feature vectors are input into a pre-trained feature enhancement model to obtain the attention coefficients of the multi-modal high-dimensional feature vectors, and finally the multi-modal high-dimensional feature vectors are subjected to weighted summation according to the attention coefficients of the multi-modal high-dimensional feature vectors to obtain the fusion feature vectors of the drug molecules.
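The first of these steps, bringing feature vectors of different lengths to the same dimensionality, can be sketched with one linear projection per modality. The random initialization, the names `make_projection` and `project`, and the small target dimension are assumptions for illustration; in the described method the projection would be a trained neural network layer and the dimension could be, for example, 768.

```python
import random
from typing import Dict, List


def make_projection(in_dim: int, out_dim: int, rng: random.Random) -> List[List[float]]:
    """Create an (out_dim x in_dim) weight matrix with small random values."""
    scale = 1.0 / in_dim ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights: List[List[float]], vector: List[float]) -> List[float]:
    """Apply the linear map: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


rng = random.Random(0)
COMMON_DIM = 8  # kept small for readability; the text mentions 768 as an example

# Hypothetical modality feature vectors of different lengths.
features: Dict[str, List[float]] = {
    "sequence": [0.2, 0.1, 0.7],
    "graph": [1.0, 0.0, 0.5, 0.3, 0.9],
    "fingerprint": [0.0, 1.0],
}
projections = {name: make_projection(len(vec), COMMON_DIM, rng) for name, vec in features.items()}
projected = {name: project(projections[name], vec) for name, vec in features.items()}
print({name: len(vec) for name, vec in projected.items()})
```

Once every modality has the same dimensionality, the vectors can be weighted and summed element by element.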

In the above embodiment, after the drug molecule feature vectors of the different modalities are obtained, the feature vectors can be integrated by conventional operations, for example by splicing or weighted summation. However, the parameters of such conventional integration operations may not be related to one another. This embodiment therefore lets a network layer perform the fusion of the feature vectors adaptively, and determines the contribution of each modality through the pre-trained feature enhancement model. In this embodiment, the attention coefficient of the feature vector of each modality can be obtained by an attention mechanism, thereby fusing the multi-modal information. Specifically, the high-dimensional feature vector $F^i$ of each modality $i$ can be input into a trained attention network to obtain the attention weight $\beta_i$ of modality $i$. Weighted accumulation then gives the final fused total feature $F_{all}$ used for predicting the property of the drug molecule. The expressions are:

$\beta_i = \mathrm{softmax}(P_i)$

$F_{all} = \sum_i \beta_i F^i$

where $P_i$ is the hidden state of modality $i$, computed from $F^i$ with a weight and an offset, and $\beta_i$ is the normalized weight. In this way, the accuracy with which the fusion feature vector expresses the features can be effectively improved, thereby improving the accuracy of drug molecule property prediction.
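The attention-weighted fusion can be sketched as below. The source only states that the hidden state P_i is computed with a weight and an offset, so the scalar form tanh(w · F^i + b) used here is an assumption, as are the function names and all numeric values.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(
    features: Dict[str, List[float]], weight: List[float], offset: float
) -> Tuple[Dict[str, float], List[float]]:
    """Return the attention weights beta_i and the fused vector F_all."""
    names = list(features)
    # Hidden state P_i per modality (assumed form: tanh of an affine map).
    scores = [
        math.tanh(sum(w * x for w, x in zip(weight, features[name])) + offset)
        for name in names
    ]
    betas = softmax(scores)  # beta_i = softmax(P_i)
    # F_all: element-wise weighted sum of the modality vectors.
    fused = [
        sum(beta * features[name][k] for beta, name in zip(betas, names))
        for k in range(len(weight))
    ]
    return dict(zip(names, betas)), fused


# Hypothetical modality vectors, already projected to the same dimensionality.
features = {
    "sequence": [0.2, 0.8, 0.1],
    "graph": [0.9, 0.1, 0.4],
    "fingerprint": [0.5, 0.5, 0.5],
}
betas, fused = attention_fuse(features, weight=[0.3, -0.2, 0.5], offset=0.1)
print(betas)
print(fused)
```

Because the weights are produced by a softmax, they are positive and sum to one, so the fused vector stays on the same scale as the individual modality vectors.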

In one embodiment, the multi-modal feature extraction model and the drug molecule property prediction model can be trained by the following methods:

201. Obtaining a plurality of drug molecule samples, and performing modality conversion on the molecular structure of each drug molecule sample to obtain the multi-modal drug molecule structure of each drug molecule sample.

The modality conversion of the molecular structure of a drug molecule sample is performed as described above and is not repeated here. In this embodiment, the multi-modal drug molecule structure includes a drug molecule sequence, a drug molecule graph, a drug molecule image and a drug molecule fingerprint, and each drug molecule sample carries a classification label for a predetermined property. For example, if the drug molecule property prediction model is to predict the toxicity of a drug molecule, the predetermined property is toxicity and the classification label is toxic or non-toxic.

202. Constructing a language model, a graph neural network, a convolutional neural network, a deep neural network and a neural network, respectively, according to the multi-modal drug molecule structures of the plurality of drug molecule samples.

Here, the language model is used to extract the features of the drug molecule sequence, the graph neural network is used to extract the features of the drug molecule graph, the convolutional neural network is used to extract the features of the drug molecule image, the deep neural network is used to extract the features of the drug molecule fingerprint, the attention network is used to fuse the high-dimensional features of the modalities, and the neural network is used to classify the fused multi-modal features, that is, to predict the properties of the drug molecule.

203. Inputting the multi-modal drug molecule structures of the plurality of drug molecule samples into the language model, the graph neural network, the convolutional neural network and the deep neural network, respectively, to obtain the multi-modal drug molecule feature vectors of each drug molecule sample.

204. Converting the multi-modal drug molecule feature vectors of each drug molecule sample into multi-modal high-dimensional feature vectors, and performing feature fusion on the multi-modal high-dimensional feature vectors of each drug molecule sample to obtain the fusion feature vector of each drug molecule sample.

205. Performing synchronous iterative training on the language model, the graph neural network, the convolutional neural network, the deep neural network and the neural network, with the fusion feature vector of each drug molecule sample as input and the classification label of each drug molecule sample as output, to obtain the multi-modal feature extraction model and the drug molecule property prediction model.
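The idea of synchronous training in steps 201 to 205, in which the feature extractors and the classifier are updated in the same iteration from one shared loss, can be sketched with a deliberately tiny model. Everything here is a stand-in chosen for illustration: each modality encoder is a single linear map to a scalar, fusion is a plain sum rather than attention, the classifier is one sigmoid unit, and the four samples with their toxic (1) and non-toxic (0) labels are made up.

```python
import math
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, List[float]], int]  # (features per modality, label)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(x: Dict[str, List[float]], encoders: Dict[str, List[float]],
            head_w: float, head_b: float) -> Tuple[float, float]:
    """Encode each modality to a scalar, fuse by summation, then classify."""
    fused = sum(sum(w * v for w, v in zip(encoders[name], vec)) for name, vec in x.items())
    return fused, sigmoid(head_w * fused + head_b)


def mean_loss(data: List[Sample], encoders: Dict[str, List[float]],
              head_w: float, head_b: float) -> float:
    """Mean binary cross-entropy over the data set."""
    total = 0.0
    for x, y in data:
        _, p = forward(x, encoders, head_w, head_b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train(data: List[Sample], encoders: Dict[str, List[float]], head_w: float,
          head_b: float, lr: float = 0.1, epochs: int = 200):
    for _ in range(epochs):
        for x, y in data:
            fused, p = forward(x, encoders, head_w, head_b)
            err = p - y  # gradient of the loss with respect to the logit
            grad_head_w, grad_head_b = err * fused, err
            # Encoder gradients flow back through the classifier weight, so
            # encoders and classifier are updated from the same loss value.
            for name, vec in x.items():
                encoders[name] = [w - lr * err * head_w * v
                                  for w, v in zip(encoders[name], vec)]
            head_w -= lr * grad_head_w
            head_b -= lr * grad_head_b
    return encoders, head_w, head_b


data: List[Sample] = [
    ({"sequence": [1.0, 0.0], "fingerprint": [1.0, 0.0]}, 1),
    ({"sequence": [0.9, 0.1], "fingerprint": [1.0, 0.0]}, 1),
    ({"sequence": [0.0, 1.0], "fingerprint": [0.0, 1.0]}, 0),
    ({"sequence": [0.1, 0.9], "fingerprint": [0.0, 1.0]}, 0),
]
encoders = {"sequence": [0.0, 0.0], "fingerprint": [0.0, 0.0]}
loss_before = mean_loss(data, encoders, 1.0, 0.0)
encoders, head_w, head_b = train(data, encoders, 1.0, 0.0)
loss_after = mean_loss(data, encoders, head_w, head_b)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

With real networks the same pattern would normally be expressed through an automatic differentiation framework, with one optimizer holding the parameters of all sub-models.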

In one embodiment, the model training process may further include the following steps: constructing an attention network, inputting the multi-modal high-dimensional feature vector of each drug molecule sample into the attention network to obtain the attention coefficient of the multi-modal high-dimensional feature vector of each drug molecule sample, performing weighted summation on the multi-modal high-dimensional feature vector of each drug molecule sample according to the attention coefficient of the multi-modal high-dimensional feature vector of each drug molecule sample to obtain a fusion feature vector of each drug molecule sample, and performing iterative training on the attention network by taking the fusion feature vector of each drug molecule sample as input and the classification label of each drug molecule sample as output to obtain a feature enhancement model.

In the embodiment, the multi-modal feature extraction model and the drug molecule property prediction model combine the advantages of multiple models such as a language model, a graph neural network, a convolutional neural network, a deep neural network, an attention network and a neural network, can accurately extract feature information of each mode of a drug molecule, and can accurately fuse and predict feature vectors of each mode, so that the accuracy and the generalization of drug molecule property prediction are effectively improved, the speed and the success rate of drug research and development are improved, and the cost of drug molecule property prediction is reduced.

Further, as a specific implementation of the method shown in FIG. 1 and FIG. 2, the present embodiment provides a device for predicting a property of a drug molecule. As shown in FIG. 3, the device includes a modality conversion module 31, a feature extraction module 32, a feature fusion module 33, and a property prediction module 34, wherein:

the mode conversion module 31 may be configured to obtain a drug molecule to be predicted, and perform mode conversion on a molecular structure of the drug molecule to obtain a multi-modal drug molecule structure, where the multi-modal drug molecule structure includes at least two of a drug molecule sequence, a drug molecule graph, a drug molecule image, and a drug molecule fingerprint;

the feature extraction module 32 is configured to perform feature extraction on the multi-modal drug molecular structure through a pre-trained multi-modal feature extraction model to obtain multi-modal drug molecular feature vectors;

the feature fusion module 33 is configured to convert the multi-modal drug molecule feature vectors into multi-modal high-dimensional feature vectors, and perform feature fusion on the multi-modal high-dimensional feature vectors to obtain fusion feature vectors of the drug molecules;

the property prediction module 34 may be configured to input the fusion feature vector of the drug molecule into a pre-trained drug molecule property prediction model to obtain a property prediction result of the drug molecule.
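The division of the device into four modules can be sketched as a small pipeline class. The class name `PropertyPredictionDevice` and the four toy callables are hypothetical placeholders that only show how the output of each module feeds the next; they do not implement real conversion, extraction, fusion or prediction.

```python
from typing import Callable, Dict, List


class PropertyPredictionDevice:
    """Chains the four modules: conversion, extraction, fusion, prediction."""

    def __init__(
        self,
        convert: Callable[[str], Dict[str, object]],
        extract: Callable[[Dict[str, object]], Dict[str, List[float]]],
        fuse: Callable[[Dict[str, List[float]]], List[float]],
        predict: Callable[[List[float]], float],
    ) -> None:
        self.convert = convert    # modality conversion module 31
        self.extract = extract    # feature extraction module 32
        self.fuse = fuse          # feature fusion module 33
        self.predict = predict    # property prediction module 34

    def run(self, molecule: str) -> float:
        structures = self.convert(molecule)
        features = self.extract(structures)
        fused = self.fuse(features)
        return self.predict(fused)


# Toy stand-ins for the four modules (not real chemistry or real models).
device = PropertyPredictionDevice(
    convert=lambda s: {"sequence": s, "fingerprint": [1 if ch == "O" else 0 for ch in s]},
    extract=lambda m: {"sequence": [float(len(m["sequence"]))],
                       "fingerprint": [float(sum(m["fingerprint"]))]},
    fuse=lambda f: [sum(vals) / len(vals) for vals in zip(*f.values())],
    predict=lambda fused: sum(fused),
)
print(device.run("CCO"))
```

Passing the modules in as callables keeps each one replaceable, which matches the way the document describes them as separate functional units.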

In a specific application scenario, the mode conversion module 31 is specifically configured to convert the molecular structure of the drug molecule into a character string format according to a predetermined molecular structure conversion rule, so as to obtain a drug molecule sequence; converting atoms of the molecular structure of the drug molecules into nodes of a drug molecular graph, and converting chemical bonds of the molecular structure of the drug molecules into edges of the drug molecular graph to obtain the drug molecular graph; converting the molecular structure of the drug molecules into a two-dimensional image to obtain a drug molecule image; extracting the structural characteristics in the molecular structure of the drug molecules, and coding the structural characteristics into bit vectors to obtain the drug molecule fingerprints.

In a specific application scenario, the drug molecule fingerprint is an extended connectivity fingerprint; the modality conversion module 31 may be further configured to assign an identifier to each atom in the molecular structure of the drug molecule, and store the hash value of the identifier of each atom in a pre-established identifier set; create a bond list for each atom, and store the bond orders and the identifiers of the atoms adjacent to that atom in its bond list; take the hash value of the bond list of each atom as the updated identifier of that atom, and store the updated identifier of each atom in the identifier set; and extract all identifiers in the identifier set to obtain the drug molecule fingerprint.

In a specific application scenario, the feature extraction module 32 is specifically configured to extract the language structure features in the drug molecule sequence through the language model in the multi-modal feature extraction model to obtain the feature vector of the drug molecule sequence; extracting atomic features and chemical bond features in the drug molecular graph through a graph neural network in the multi-modal feature extraction model to obtain feature vectors of the drug molecular graph; extracting image features in the drug molecule image through a convolutional neural network in the multi-modal feature extraction model to obtain feature vectors of the drug molecule image; and extracting the identifier characteristics in the drug molecule fingerprint through a deep neural network in the multi-modal characteristic extraction model to obtain the characteristic vector of the drug molecule fingerprint.

In a specific application scenario, the feature fusion module 33 is specifically configured to convert multi-modal drug molecule feature vectors into multi-modal high-dimensional feature vectors with the same dimensionality; inputting the multi-modal high-dimensional feature vector into a pre-trained feature enhancement model to obtain an attention coefficient of the multi-modal high-dimensional feature vector; and according to the attention coefficient of the multi-modal high-dimensional feature vector, carrying out weighted summation on the multi-modal high-dimensional feature vector to obtain the fusion feature vector of the drug molecules.

In a specific application scenario, the device further includes a model training module 35, where the model training module 35 is specifically configured to: obtain a plurality of drug molecule samples, and perform modality conversion on the molecular structure of each drug molecule sample to obtain a multi-modal drug molecule structure of each drug molecule sample, where each drug molecule sample includes a classification label of a predetermined property; respectively construct a language model, a graph neural network, a convolutional neural network, a deep neural network and a neural network according to the multi-modal drug molecule structures of the plurality of drug molecule samples; respectively input the multi-modal drug molecule structures of the plurality of drug molecule samples into the language model, the graph neural network, the convolutional neural network and the deep neural network to obtain multi-modal drug molecule feature vectors of each drug molecule sample; convert the multi-modal drug molecule feature vectors of each drug molecule sample into multi-modal high-dimensional feature vectors, and perform feature fusion on the multi-modal high-dimensional feature vectors of each drug molecule sample to obtain a fusion feature vector of each drug molecule sample; and perform synchronous iterative training on the language model, the graph neural network, the convolutional neural network, the deep neural network and the neural network, taking the fusion feature vector of each drug molecule sample as input and the classification label of each drug molecule sample as output, to obtain the multi-modal feature extraction model and the drug molecule property prediction model.

In a specific application scenario, the model training module 35 may be further configured to: construct an attention network; input the multi-modal high-dimensional feature vectors of each drug molecule sample into the attention network to obtain attention coefficients of the multi-modal high-dimensional feature vectors of each drug molecule sample; according to the attention coefficients, carry out weighted summation on the multi-modal high-dimensional feature vectors of each drug molecule sample to obtain a fusion feature vector of each drug molecule sample; and perform iterative training on the attention network, taking the fusion feature vector of each drug molecule sample as input and the classification label of each drug molecule sample as output, to obtain a feature enhancement model.

It should be noted that other corresponding descriptions of the functional units involved in the device for predicting the property of a drug molecule provided in this embodiment may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not repeated herein.

Based on the methods shown in fig. 1 and fig. 2, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the method for predicting the property of the drug molecule shown in fig. 1 and fig. 2.

Based on such understanding, the technical solution of the present application may be embodied in the form of a software product. The software product may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, or the like), and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the method according to the implementation scenarios of the present application.

Based on the method shown in fig. 1 and fig. 2 and the embodiment of the device for predicting the property of a drug molecule shown in fig. 3, in order to achieve the above object, as shown in fig. 4, the embodiment further provides a computer device for predicting the property of a drug molecule, which may be a personal computer, a server, a smart phone, a tablet computer, a smart watch, or other network devices, and the computer device includes a storage medium and a processor; a storage medium for storing a computer program and an operating system; a processor for executing the computer program to implement the method shown in fig. 1 and fig. 2.

Optionally, the computer device may further include an internal memory, a communication interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, a display, an input device such as a keyboard, and the like. Optionally, the communication interface may further include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), etc.

It will be appreciated by those skilled in the art that the configuration of the computer device for predicting the property of a drug molecule provided in the present embodiment does not constitute a limitation of the computer device, which may include more or fewer components, a combination of some components, or a different arrangement of components.

The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the above-described computer device, and supports the execution of the computer program and other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and communication with other hardware and software in the computer device.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. By applying the technical solution of the present application, the molecular structure of the drug molecule is first converted into a multi-modal drug molecule structure; feature extraction is then carried out on the drug molecule structure of each modality through a pre-trained multi-modal feature extraction model; feature fusion is further carried out on the drug molecule feature vectors of the modalities; and finally the property prediction result of the drug molecule is obtained based on the fusion feature vector of the drug molecule. Compared with the prior art, the method can obtain a more comprehensive feature representation of the drug molecule, thereby predicting the property of the drug molecule more accurately and effectively, accelerating drug research and development, improving its success rate, and reducing the cost of drug molecule property prediction.

Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art can understand that the modules in the device in the implementation scenario may be distributed in the device in the implementation scenario according to the implementation scenario description, and may also be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above discloses only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be conceived by those skilled in the art are intended to fall within the scope of the present application.

Claims

1. A method for predicting a property of a drug molecule, the method comprising:

converting the multi-modal drug molecule feature vectors into multi-modal high-dimensional feature vectors, and performing feature fusion on the multi-modal high-dimensional feature vectors to obtain fusion feature vectors of drug molecules;

2. The method according to claim 1, wherein the performing the modal transformation on the molecular structure of the drug molecule to obtain a multi-modal molecular structure of the drug molecule comprises:

converting the molecular structure of the drug molecule into a character string format according to a preset molecular structure conversion rule to obtain a drug molecule sequence;

converting atoms of the molecular structure of the drug molecules into nodes of a drug molecular graph, and converting chemical bonds of the molecular structure of the drug molecules into edges of the drug molecular graph to obtain the drug molecular graph;

converting the molecular structure of the drug molecule into a two-dimensional image to obtain a drug molecule image;

extracting the structural features in the molecular structure of the drug molecules, and encoding the structural features into bit vectors to obtain the drug molecule fingerprints.

3. The method of claim 2, wherein the drug molecule fingerprint is an extended connectivity fingerprint; extracting the structural features in the molecular structure of the drug molecule, and encoding the structural features into bit vectors to obtain the drug molecule fingerprint, including:

marking each atom in the molecular structure of the drug molecule with an identifier, and storing the hash value of the identifier of each atom in a pre-established set of identifiers;

creating a bond list for each of the atoms, and storing the bond orders and identifiers of the atom's neighbors in the bond list of each of the atoms;

taking the hash value of the bond list of each atom as the updated identifier of the atom, and storing each updated identifier of the atom in the identifier set;

and extracting all identifiers in the identifier set to obtain the drug molecule fingerprint.

4. The method of claim 1, wherein the feature extraction of the multi-modal drug molecular structure through a pre-trained multi-modal feature extraction model to obtain multi-modal drug molecular feature vectors comprises:

extracting language structure features in the drug molecule sequence through a language model in the multi-modal feature extraction model to obtain a feature vector of the drug molecule sequence;

extracting atomic features and chemical bond features in the drug molecular graph through a graph neural network in the multi-modal feature extraction model to obtain feature vectors of the drug molecular graph;

extracting image features in the drug molecule image through a convolutional neural network in the multi-modal feature extraction model to obtain a feature vector of the drug molecule image;

and extracting the identifier features in the drug molecule fingerprint through a deep neural network in the multi-modal feature extraction model to obtain a feature vector of the drug molecule fingerprint.

5. The method of claim 1, wherein the converting the multi-modal drug molecule feature vector into a multi-modal high-dimensional feature vector and performing feature fusion on the multi-modal high-dimensional feature vector to obtain a fused feature vector of the drug molecule comprises:

converting the multi-modal drug molecule feature vectors into multi-modal high-dimensional feature vectors with the same dimensionality;

inputting the multi-modal high-dimensional feature vector into a pre-trained feature enhancement model to obtain an attention coefficient of the multi-modal high-dimensional feature vector;

and according to the attention coefficient of the multi-modal high-dimensional feature vector, carrying out weighted summation on the multi-modal high-dimensional feature vector to obtain the fusion feature vector of the drug molecules.

6. The method of any one of claims 1-5, wherein the method for training the multi-modal feature extraction model and the drug molecule property prediction model comprises:

obtaining a plurality of drug molecule samples, and performing modal transformation on the molecular structure of each drug molecule sample to obtain a multi-modal drug molecule structure of each drug molecule sample, wherein each drug molecule sample comprises a classification label of a predetermined property;

respectively constructing a language model, a graph neural network, a convolutional neural network, a deep neural network and a neural network according to the multi-modal drug molecule structures of the multiple drug molecule samples;

respectively inputting the multi-modal drug molecule structures of the multiple drug molecule samples into the language model, the graph neural network, the convolutional neural network and the deep neural network to obtain multi-modal drug molecule feature vectors of each drug molecule sample;

converting the multi-modal drug molecule feature vector of each drug molecule sample into a multi-modal high-dimensional feature vector, and performing feature fusion on the multi-modal high-dimensional feature vector of each drug molecule sample to obtain a fusion feature vector of each drug molecule sample;

and performing synchronous iterative training on the language model, the graph neural network, the convolutional neural network, the deep neural network and the neural network by taking the fusion feature vector of each drug molecule sample as input and the classification label of each drug molecule sample as output to obtain the multi-modal feature extraction model and the drug molecule property prediction model.

7. The method according to claim 6, wherein the feature fusion of the multi-modal high-dimensional feature vectors of each drug molecule sample to obtain a fused feature vector of each drug molecule sample comprises:

constructing an attention network;

inputting the multi-modal high-dimensional feature vector of each drug molecule sample into the attention network to obtain an attention coefficient of the multi-modal high-dimensional feature vector of each drug molecule sample;

according to the attention coefficient of the multi-modal high-dimensional feature vector of each drug molecule sample, carrying out weighted summation on the multi-modal high-dimensional feature vector of each drug molecule sample to obtain a fusion feature vector of each drug molecule sample;

the method further comprises the following steps:

and performing iterative training on the attention network by taking the fusion feature vector of each drug molecule sample as input and the classification label of each drug molecule sample as output to obtain a feature enhancement model.

8. An apparatus for predicting a property of a drug molecule, the apparatus comprising:

the modality conversion module is used for acquiring a drug molecule to be predicted and performing modal transformation on a molecular structure of the drug molecule to obtain a multi-modal drug molecule structure, wherein the multi-modal drug molecule structure comprises at least two of a drug molecule sequence, a drug molecule graph, a drug molecule image and a drug molecule fingerprint;

9. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, realizing the steps of the method of any one of claims 1 to 7.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by the processor.