CN114334029A

CN114334029A - Compound activity prediction method, network training method, device, medium, and apparatus

Info

Publication number: CN114334029A
Application number: CN202111387109.1A
Authority: CN
Inventors: 黄隆锴; 魏颖
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-11-22
Filing date: 2021-11-22
Publication date: 2022-04-12

Abstract

The application discloses a compound activity prediction method, a network training method, a device, a medium, and equipment, which can be applied to scenarios such as artificial intelligence, machine learning, deep neural networks, meta-learning, drug analysis, and hit compound discovery. The method comprises: obtaining a target protein corresponding to a compound to be tested; determining target features of the target protein according to information on the activity-measured compounds corresponding to the target protein; inputting the target features into a trained functional area determination network for processing so as to determine a target functional area corresponding to each layer of a basic neural network, and determining an activity prediction model of the target protein according to the target functional area corresponding to each layer of the basic neural network; and predicting the compound to be tested according to the activity prediction model of the target protein to obtain an activity prediction result of the compound to be tested on the target protein.

Description

Compound activity prediction method, network training method, device, medium, and apparatus

Technical Field

The application relates to the technical field of information processing, in particular to a compound activity prediction method, a network training method, a device, a medium and equipment.

Background

Drug screening is a step in modern drug development processes for the detection and acquisition of compounds with specific physiological activities, and is a process of selecting compounds with high activity on a specific action target from a large number of compounds or new compounds mainly through standardized experimental means. The process of drug screening is essentially the process of performing pharmacological activity experiments on compounds, and with the development of drug development technology, the physiological activity experiments on new compounds are gradually changed from early verification experiments to screening experiments, namely, so-called drug screening.

In the virtual screening process based on the molecular structure, the activity of the candidate compound against the target protein needs to be detected, and the higher the activity value is, the better the inhibition effect of the candidate compound on the target protein is, and the more likely the candidate compound is to be selected as a drug against the target protein. Currently, the activity detection method generally detects a compound through multiple experimental repetitions to determine the activity value of a candidate compound against a certain target protein. However, the method for detecting the activity of the compound through experiments has a great deal of repetitive work and consumes a great deal of manpower and material resources.

Disclosure of Invention

The embodiments of the present application provide a compound activity prediction method, a network training method, a device, a medium, and equipment, which improve the accuracy of activity prediction for a compound to be tested against the current target protein.

In one aspect, there is provided a compound activity prediction method, the method comprising: obtaining a target protein corresponding to a compound to be tested;

determining target features of the target protein according to information on the activity-measured compounds corresponding to the target protein;

inputting the target features into a trained functional area determination network for processing so as to determine a target functional area corresponding to each layer of the basic neural network, wherein the trained functional area determination network is used for predicting the selection probability of the target functional area;

determining an activity prediction model of the target protein according to the target function region corresponding to each layer of network in the basic neural network;

and predicting the compound to be tested according to the activity prediction model of the target protein to obtain the activity prediction result of the compound to be tested on the target protein.
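The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method as claimed: it assumes that each layer of the basic neural network holds several candidate functional areas, each modelled here as its own weight block, and that the functional area determination network has already produced one score per area per layer. The names `select_regions`, `build_predictor`, `base_network`, and `region_scores` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into selection probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_regions(region_scores_per_layer):
    """Pick, for every layer, the functional area with the highest probability."""
    chosen = []
    for scores in region_scores_per_layer:
        probs = softmax(scores)
        chosen.append(max(range(len(probs)), key=probs.__getitem__))
    return chosen


def dense(weights, bias, x, activate=True):
    """One fully connected layer with an optional ReLU."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if activate else out


def build_predictor(base_network, chosen):
    """Assemble a target-specific model from the chosen area of each layer."""
    layers = [layer[k] for layer, k in zip(base_network, chosen)]

    def predict(fingerprint):
        h = fingerprint
        for i, (w, b) in enumerate(layers):
            h = dense(w, b, h, activate=(i < len(layers) - 1))
        return h[0]

    return predict


random.seed(0)


def random_region(n_in, n_out):
    """Hypothetical weight block standing in for one functional area."""
    w = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


# Assumed layout: 2 layers, 3 candidate functional areas per layer.
base_network = [
    [random_region(4, 3) for _ in range(3)],
    [random_region(3, 1) for _ in range(3)],
]
# Stand-in for the output of the functional area determination network.
region_scores = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]]

chosen = select_regions(region_scores)
predict = build_predictor(base_network, chosen)
activity = predict([1.0, 0.0, 1.0, 1.0])  # toy 4-bit molecular feature vector
```

Choosing the highest-probability area per layer is one possible reading of "target functional area"; sampling from the probabilities would be an equally plausible alternative.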

In another aspect, a method of network training for compound activity prediction is provided, the method comprising:

obtaining training sample data, wherein the training sample data comprises a historical data set of all historical target proteins, and each item of historical data comprises at least one historical target protein and the activity data of historical activity-measured compounds on that historical target protein;

determining the target characteristics of the historical target protein by adopting a target characteristic network according to the molecular structure characteristics of the historical active compound of the historical target protein and the activity data of the historical active compound on the historical target protein;

inputting the target characteristics of the historical target protein into a functional area determination network to determine a historical functional area corresponding to each layer of the network in the basic neural network;

determining network parameters of the basic neural network corresponding to the historical target protein according to the historical functional area corresponding to each layer of the basic neural network;

and training the target feature network, the basic neural network and the functional area determination network by using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data to obtain the trained target feature network, the trained basic neural network and the trained functional area determination network.
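To make the shape of the training sample data concrete, the following Python sketch models one historical data set per target protein and derives a target feature from it. It is an illustration only: the text does not specify the architecture of the target feature network, so simple mean pooling over the concatenated molecular features and activity values is used here as a stand-in. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MeasuredCompound:
    fingerprint: List[float]  # molecular structure features of the compound
    activity: float           # measured activity on the target protein


@dataclass
class TargetDataset:
    target_name: str
    compounds: List[MeasuredCompound]


def target_feature(dataset: TargetDataset) -> List[float]:
    """Mean-pool [fingerprint, activity] vectors over all measured compounds.

    Assumption: this pooling replaces the learned target feature network.
    """
    rows = [c.fingerprint + [c.activity] for c in dataset.compounds]
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


history = [
    TargetDataset("target_A", [
        MeasuredCompound([1.0, 0.0, 1.0], 6.0),
        MeasuredCompound([0.0, 0.0, 1.0], 8.0),
    ]),
    TargetDataset("target_B", [
        MeasuredCompound([1.0, 1.0, 0.0], 5.0),
    ]),
]

features = {d.target_name: target_feature(d) for d in history}
```

Pooling over compounds makes the feature independent of the order and number of measured compounds, which is a convenient property when targets have very different amounts of data.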

In another aspect, there is provided a compound activity prediction device, the device comprising:

an acquisition unit, configured to acquire a target protein corresponding to a compound to be tested;

a determining unit, configured to determine target features of the target protein according to information on the activity-measured compounds corresponding to the target protein, and

to input the target features into a trained functional area determination network for processing so as to determine a target functional area corresponding to each layer of the basic neural network, wherein the trained functional area determination network is used for predicting the selection probability of the target functional area; and

a prediction unit, configured to predict the compound to be tested according to the activity prediction model of the target protein so as to obtain the activity prediction result of the compound to be tested on the target protein.

In another aspect, there is provided a network training apparatus for compound activity prediction, the apparatus comprising:

an acquisition unit, configured to acquire training sample data, wherein the training sample data comprises a historical data set of all historical target proteins, and each item of historical data comprises at least one historical target protein and the activity data of historical activity-measured compounds on that historical target protein;

a determining unit, configured to determine the target features of the historical target protein by using a target feature network according to the molecular structure features of the historical active compounds of the historical target protein and the activity data of the historical active compounds on the historical target protein, and

to input the target features of the historical target protein into a functional area determination network to determine a historical functional area corresponding to each layer of the basic neural network; and

a training unit, configured to train the target feature network, the basic neural network, and the functional area determination network by using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data, so as to obtain the trained target feature network, the trained basic neural network, and the trained functional area determination network.

In another aspect, there is provided a computer readable storage medium having stored thereon a computer program adapted to be loaded by a processor to perform the steps of the compound activity prediction method according to any one of the embodiments above.

In another aspect, a computer device is provided, which includes a processor and a memory, wherein the memory stores a computer program, and the processor is used for executing the steps of the compound activity prediction method according to any one of the above embodiments by calling the computer program stored in the memory.

In another aspect, there is provided a computer program product comprising computer instructions which, when executed by a processor, carry out the steps of the compound activity prediction method according to any one of the embodiments above.

In the embodiments of the present application, a target protein corresponding to a compound to be tested is obtained; target features of the target protein are determined according to information on the activity-measured compounds corresponding to the target protein; the target features are input into a trained functional area determination network for processing so as to determine a target functional area corresponding to each layer of the basic neural network, wherein the trained functional area determination network is used for predicting the selection probability of the target functional area; an activity prediction model of the target protein is determined according to the target functional area corresponding to each layer of the basic neural network; and the compound to be tested is predicted according to the activity prediction model of the target protein to obtain the activity prediction result of the compound to be tested on the target protein. Compared with existing activity prediction methods, which directly use the data of other historical target proteins for training and prediction, the embodiments of the present application provide a functionally regionalized meta-learning algorithm that extracts the target features of the target protein and improves the accuracy of activity prediction for the current target protein. Further, improving the accuracy of activity prediction helps ensure the quality of virtual screening of drug molecules, so that better and more accurate hit compounds can be found for the subsequent discovery and development of lead compounds and candidate compounds.

In the embodiments of the present application, training sample data are obtained, wherein the training sample data comprise a historical data set of all historical target proteins, and each item of historical data comprises at least one historical target protein and the activity data of historical activity-measured compounds on that historical target protein; the target features of the historical target protein are determined by using a target feature network according to the molecular structure features of the historical active compounds of the historical target protein and the activity data of the historical active compounds on the historical target protein; the target features of the historical target protein are input into a functional area determination network to determine a historical functional area corresponding to each layer of the basic neural network; network parameters of the basic neural network corresponding to the historical target protein are determined according to the historical functional area corresponding to each layer of the basic neural network; and the target feature network, the basic neural network, and the functional area determination network are trained by using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data, to obtain the trained target feature network, the trained basic neural network, and the trained functional area determination network. Because the activity prediction model is trained on the measured-compound data of all known historical target proteins, this addresses the problem that, with only a small number of target proteins, the learned meta-initial model tends to overfit and generalizes poorly to new data.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic illustration of a drug discovery process provided in an embodiment of the present application;

FIG. 2 is a schematic flow chart of a compound activity prediction method provided in an embodiment of the present application;

FIG. 3 is a schematic flow chart of another compound activity prediction method provided in an embodiment of the present application;

FIG. 4 is a schematic flow chart of a compound activity prediction method provided in an embodiment of the present application;

FIG. 5a is a schematic flow chart of another compound activity prediction method provided in an embodiment of the present application;

FIG. 5b is a diagram illustrating an application scenario of a compound activity prediction method provided in an embodiment of the present application;

FIG. 6 is a schematic flow chart of a network training method for compound activity prediction provided in an embodiment of the present application;

FIG. 7a is a diagram illustrating an application scenario of a network training method for compound activity prediction provided in an embodiment of the present application;

FIG. 7b is a diagram illustrating an application scenario of a network training method for compound activity prediction provided in an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a compound activity prediction device provided in an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a network training device for compound activity prediction provided in an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a compound activity prediction apparatus provided in an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a network training apparatus for compound activity prediction provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiments of the present application provide a compound activity prediction method, a network training method, a device, a medium, and equipment. Specifically, the method of the embodiments may be executed by a computer device, which may be a terminal, a server, or the like. The embodiments can be applied to scenarios such as artificial intelligence, machine learning, deep neural networks, meta-learning, drug analysis, and hit compound discovery, and are also suitable for predicting other properties of drug molecules, including the absorption, distribution, metabolism, excretion, and toxicity of drugs.

First, some terms or expressions appearing in the course of describing the embodiments of the present application are explained as follows:

Artificial Intelligence (AI): the theory, methods, technologies, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines can perceive, reason, and make decisions.

Machine Learning (ML): a multidisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.

Artificial Neural Network (ANN): a mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. It processes input information through the network parameters of a large number of nodes (also called neurons) and the interconnections among those nodes.

Deep Neural Network (DNN): a neural network with at least one hidden layer. Like a shallow neural network, a deep neural network can model complex nonlinear systems, but the additional layers give the model higher levels of abstraction and thus increase its capacity. In compound activity prediction, an activity prediction model based on machine learning or deep learning learns the association between the molecular structure features of activity-measured compounds and the activity data of those compounds on the target protein, so that an activity prediction result can be generated from the molecular structure features of an input compound to be tested. The activity prediction model here is an improvement on a deep neural network model.

Recurrent Neural Network (RNN): an artificial neural network that has a tree-like hierarchical structure and whose nodes process input information recursively in the order in which they are connected; it is one of the deep learning algorithms. In the embodiments of the present application, the functional area determination network uses a recurrent neural network (RNN) to predict the selection probability of the target functional area: the network of each hidden layer in the trained basic neural network is divided into a plurality of functional areas, the RNN predicts, for each layer, the probability that a functional area is selected, and inputting the target features into the RNN yields the selected functional area, namely the target functional area.
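As a rough illustration of how a recurrent network could emit one selection distribution per layer, the sketch below runs one recurrent step for each layer of the basic neural network and applies a softmax over that layer's functional areas. It assumes a plain tanh recurrent cell with hand-picked weights; the text does not specify the cell type, sizes, or parameters, so `rnn_region_probs`, `W_in`, `W_h`, and `W_out` are all hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_region_probs(target_feature, n_layers, W_in, W_h, W_out):
    """Run one recurrent step per base-network layer.

    Returns a list with one probability distribution over functional
    areas for each layer.
    """
    hidden = [0.0] * len(W_h)
    probs_per_layer = []
    for _ in range(n_layers):
        # New hidden state depends on the target feature and the previous state.
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(W_in[j], target_feature))
                + sum(w * h for w, h in zip(W_h[j], hidden))
            )
            for j in range(len(W_h))
        ]
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_out]
        probs_per_layer.append(softmax(logits))
    return probs_per_layer


# Toy sizes: 2-dimensional target feature, 2 hidden units, 3 areas, 2 layers.
W_in = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.0, 0.4], [0.2, 0.0]]
W_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

probs = rnn_region_probs([1.0, 2.0], n_layers=2, W_in=W_in, W_h=W_h, W_out=W_out)
target_areas = [max(range(len(p)), key=p.__getitem__) for p in probs]
```

Because the hidden state is carried from one layer to the next, the choice made for a later layer can depend on what was computed for earlier layers, which is the main reason to prefer a recurrent model over independent per-layer classifiers.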

Morgan molecular fingerprints (Morgan fingerprints): used to characterize the molecular structure of a substance, in particular the activity-related features of that structure. A Morgan fingerprint is a circular fingerprint of the topological type; as with Extended-Connectivity Fingerprints (ECFPs), each element of the fingerprint represents a specific substructure. In the embodiments of the present application, Morgan fingerprints are used to describe the activity-related molecular structure features of a compound, and the Morgan fingerprint of a compound is obtained by processing its molecular structure with the Morgan algorithm.
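The idea behind a circular fingerprint can be shown with a deliberately simplified Python sketch. It is not the Morgan algorithm as implemented in cheminformatics toolkits (it ignores bond orders, charges, and duplicate-substructure removal); it only illustrates the core loop: give every atom an identifier, repeatedly fold each atom's neighbours into its identifier, and hash all identifiers into a fixed-length bit vector. The function name `morgan_like_fingerprint` and the atom/bond input format are assumptions made for this example.

```python
import hashlib


def _hash(*parts) -> int:
    """Deterministic hash of arbitrary parts (stable across runs)."""
    digest = hashlib.sha256(repr(parts).encode()).hexdigest()
    return int(digest[:8], 16)


def morgan_like_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy circular fingerprint over a molecular graph.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Radius 0: identifier from the atom itself (element and degree).
    ids = [_hash(sym, len(neighbors[i])) for i, sym in enumerate(atoms)]
    seen = set(ids)

    # Each iteration widens the neighbourhood captured by every identifier.
    for _ in range(radius):
        ids = [
            _hash(ids[i], tuple(sorted(ids[j] for j in neighbors[i])))
            for i in range(len(atoms))
        ]
        seen.update(ids)

    # Fold all substructure identifiers into a fixed-length bit vector.
    bits = [0] * n_bits
    for ident in seen:
        bits[ident % n_bits] = 1
    return bits


# Heavy atoms of ethanol, C-C-O, listed in two different atom orders.
ethanol = morgan_like_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_reordered = morgan_like_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
```

Sorting the neighbour identifiers before hashing makes the result independent of atom numbering, so both orderings of ethanol yield the same bit vector.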

Meta-learning: a supervised learning approach can be used to mine the mapping between the state features and the quality parameters of a neural network at each stage of the machine learning framework, and to optimize the performance of the neural network according to the characteristics of a new learning task. The core idea of meta-learning is to learn initial parameters of a neural network from a large number of training tasks, such that a new machine learning task can quickly converge to a good solution from these initial parameters even with few samples. In the embodiments of the present application, a meta-learning method is adopted: the target functional areas determined by the functional area determination network form a basic activity prediction model, the initial parameters of the deep neural network model are learned from the activity data of target proteins with known activities, and, starting from these initial parameters, an activity prediction network model for the target protein is trained using a small amount of activity data for that target protein.
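The core idea, learning an initialization that adapts quickly from few samples, can be illustrated with a toy example. The sketch below uses a Reptile-style first-order update on a one-parameter model y = w * x; this technique is swapped in purely for illustration and is not the algorithm described in this application. The tasks (slopes near 3), learning rates, and function names are all assumptions.

```python
import random


def sgd_adapt(w, samples, lr=0.1, steps=5):
    """A few gradient steps on mean squared error for the model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w


def meta_train(task_slopes, meta_lr=0.5, epochs=200, seed=0):
    """Learn an initial weight that adapts quickly to any of the training tasks."""
    rng = random.Random(seed)
    w_init = 0.0
    for _ in range(epochs):
        slope = rng.choice(task_slopes)  # sample one training task
        xs = [rng.uniform(-1, 1) for _ in range(5)]
        samples = [(x, slope * x) for x in xs]
        w_task = sgd_adapt(w_init, samples)
        # Move the shared initialization toward the task-adapted weight.
        w_init += meta_lr * (w_task - w_init)
    return w_init


w_init = meta_train([2.8, 3.0, 3.2])

# A new task (slope 3.1) with only two measured samples.
new_task = [(0.5, 1.55), (-1.0, -3.1)]
w_new = sgd_adapt(w_init, new_task)        # adapt from the meta-learned start
w_scratch = sgd_adapt(0.0, new_task)       # adapt from an uninformed start
```

With the same five gradient steps, starting from `w_init` lands much closer to the new task's slope than starting from zero, which is the small-sample benefit that meta-learning aims for.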

Blockchain system: a distributed system formed by a client and a plurality of nodes (computing devices of any form in the access network, such as servers and user terminals) connected through network communication. The nodes form a peer-to-peer (P2P) network, and the P2P protocol is an application layer protocol running on top of the Transmission Control Protocol (TCP). In the distributed system, any machine, such as a server or a terminal, can join and become a node, and a node comprises a hardware layer, an intermediate layer, an operating system layer, and an application layer.

Referring to FIG. 1, a drug discovery process is shown, which may include: target identification and validation, hit compound discovery, lead compound discovery and optimization, candidate compound validation and development, and clinical trials. The compound activity prediction method and the network training method can be applied to virtual screening based on molecular structure, can provide services for the pharmaceutical process, and can accelerate hit compound discovery by using artificial intelligence algorithms.

Hit compound discovery includes virtual screening based on molecular structure. Compared with traditional experimental screening, virtual screening by computation does not consume compound samples, which greatly saves manpower and material resources and accelerates the screening process. Ligand-based drug design is one of the common methods of virtual screening: based on the structures of known active small-molecule ligands, it learns the relationship between molecular structure and activity and builds a model for predicting new compounds.

Existing methods for predicting the activity of drug molecules against a target protein fall mainly into three types. The first applies the prediction models trained on each historical target protein to the activity-measured training drug molecules of the current target protein. In this method, the machine learning model for each historical target protein is a random forest and the model for the current target protein is a partial least squares regression, and both models have weak expressive capacity.

The second is multi-task learning: all target proteins use a deep neural network model and share the low-level network parameters to transfer knowledge, while the high-level parameters are specific to each target protein to accommodate the differences between proteins. In this method, each target protein requires its own high-level parameters, which are trained on the measured activity data of that target protein. Because the measured activity data are limited, overfitting readily occurs. In addition, jointly learning weakly related tasks may reduce the stability of the model and even harm its prediction performance.

The third is meta-learning: a common initial model is learned; for each historical (non-target) protein target, the initial model is quickly updated to a prediction model for that target using its training data, and the initial model is then optimized using the test data so that the prediction model obtained after updating performs best for each target. The initial model is finally applied to the target protein, and the final prediction model for the target is obtained by a quick update on its training data. However, because the number of target proteins is small, the learned meta-initial model tends to overfit and generalizes poorly to new data. In addition, knowledge from target proteins with low similarity to the target may be transferred, causing negative effects.

The embodiments of the present application are directed at virtual screening based on molecular structure: they predict the activity data of the compound to be tested, can provide services to pharmaceutical companies, and thereby accelerate hit compound discovery.

The key problem addressed by the embodiments of the present application is that, when meta-learning is used to predict activity values from the structural features of drug molecules, the small number of target proteins makes the learned meta-initial model prone to overfitting, and knowledge from target proteins with low similarity to the target may be transferred, causing negative effects. The embodiments of the present application provide a functionally regionalized meta-learning algorithm for ligand-based drug design, which groups the activity-measured historical target protein data into different functional areas according to similarity, so that the highly relevant portion of the activity data of other historical target proteins is fully used to improve the accuracy of activity prediction for the target protein.

To aid understanding of the technical solution provided by the embodiments of the present application, some application scenarios to which the solution is applicable are briefly described below. It should be noted that these application scenarios only illustrate the embodiments of the present application and are not limiting. As an example, the compound activity prediction method is executed by a computer device, which may be a terminal, a server, or another device.

In the compound activity prediction stage, a user can upload information on a compound to be tested through a client, a browser client, or an instant messaging client installed on the computer device. After acquiring the uploaded compound information, the computer device determines target features of the target protein according to information on the activity-measured compounds corresponding to the target protein; inputs the target features into a trained functional area determination network for processing so as to determine a target functional area corresponding to each layer of the basic neural network, wherein the trained functional area determination network is used for predicting the selection probability of the target functional area; determines an activity prediction model of the target protein according to the target functional area corresponding to each layer of the basic neural network; and predicts the compound to be tested according to the activity prediction model of the target protein to obtain the activity prediction result of the compound to be tested on the target protein.

In the training stage, the computer device acquires training sample data, wherein the training sample data comprises a historical data set of all historical target proteins, and each item of historical data comprises at least one historical target protein and the activity data of historical activity-measured compounds on that historical target protein; determines the target features of the historical target protein by using a target feature network according to the molecular structure features of the historical active compounds of the historical target protein and the activity data of the historical active compounds on the historical target protein; inputs the target features of the historical target protein into a trained functional area determination network to determine a historical functional area; determines an activity prediction model corresponding to the historical target protein according to the historical functional area corresponding to each layer of the basic neural network; and trains the target feature network and the functional area determination network by using the training sample data.

The compound activity prediction process, the activity prediction model training process, and the actual prediction process may be completed in the server or the terminal. When the training process and the actual prediction process of the model are finished in the server and the trained activity prediction model needs to be used, the compound to be tested can be input into the server, and after the actual prediction of the server is finished, the obtained activity value of the compound to be tested is sent to the terminal for displaying. When the training process and the actual prediction process of the model are finished in the terminal and the trained activity prediction model is needed to be used, the compound to be tested can be input into the terminal, and after the actual prediction of the terminal is finished, the terminal displays the activity value of the compound to be tested. When the training process of the model is completed in the server, the actual prediction process of the model is completed in the terminal, and the trained drug analysis model needs to be used, the compound to be tested can be input into the terminal, and after the actual prediction of the terminal is completed, the terminal displays the activity value of the compound to be tested. Optionally, a model file (model file) trained in the server may be transplanted to the terminal, and if the activity value of the compound to be tested needs to be predicted, the compound to be tested is input to the trained model file (model file), and the activity value of the compound to be tested can be obtained through calculation.

The embodiments of the application can be implemented in combination with cloud technology or blockchain network technology. In the compound activity prediction method disclosed in the embodiments of the application, the data involved may be stored on a blockchain. For example, an activity prediction model, a basic neural network, a target feature network, a functional region determination network, a feature extraction network, a fully connected network, a recurrent neural network, a measured compound dataset of a target protein, and a historical dataset of historical target proteins may all be stored on the blockchain.

To facilitate storage and query of the target protein corresponding to the compound to be tested, the target features of the target protein, the trained functional region determination network, the target functional region corresponding to each layer of the trained basic neural network, the activity prediction model of the target protein, and the activity prediction result of the compound to be tested on the target protein, the compound activity prediction method optionally further includes: sending these items to the blockchain network, so that a node of the blockchain network fills them into a new block and, when consensus is reached on the new block, appends the new block to the tail of the blockchain.
According to the embodiments of the application, these items can be stored on the chain, which provides a backup of the records. When the predicted activity value of a compound for a target protein is needed, the corresponding activity prediction result of the compound to be tested on the target protein can be obtained directly and quickly from the blockchain, which improves the efficiency of compound activity prediction.
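The on-chain storage and lookup described above can be illustrated with a minimal sketch. It is not the blockchain network of the application: the majority-vote rule standing in for consensus, the payload fields (`target`, `compound`, `predicted_activity`) and all function names are assumptions made for illustration only.

```python
import hashlib
import json

def make_block(prev_hash, payload):
    """Build a block whose hash covers the previous block's hash and the payload."""
    body = json.dumps({"prev_hash": prev_hash, "payload": payload}, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {"prev_hash": prev_hash, "payload": payload, "hash": digest}

def append_if_consensus(chain, payload, votes):
    """Append a new block to the tail only if a strict majority of nodes approve.

    The majority vote is a toy stand-in for a real consensus protocol.
    """
    if sum(votes) * 2 <= len(votes):
        return False
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    chain.append(make_block(prev_hash, payload))
    return True

def latest_prediction(chain, target, compound):
    """Return the most recent stored prediction for (target, compound), or None."""
    for block in reversed(chain):
        payload = block["payload"]
        if payload["target"] == target and payload["compound"] == compound:
            return payload["predicted_activity"]
    return None

chain = []
accepted = append_if_consensus(
    chain,
    {"target": "T1", "compound": "C1", "predicted_activity": 6.8},  # hypothetical record
    votes=[True, True, False],
)
rejected = append_if_consensus(
    chain,
    {"target": "T1", "compound": "C2", "predicted_activity": 5.2},  # hypothetical record
    votes=[True, False, False],
)
```

Here the first record is appended because two of three nodes approve, while the second is not, so a later query for compound `C1` is answered from the chain without recomputing the prediction.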

Detailed descriptions are given below. It should be noted that the order in which the following embodiments are described is not intended to limit the order of preference of the embodiments.

The embodiments of the present application provide a compound activity prediction method, which may be executed by a terminal, by a server, or jointly by the terminal and the server; the embodiments are described taking execution of the compound activity prediction method by a server as an example.

Referring to fig. 2, fig. 2 is a schematic flow chart of a method for predicting activity of a compound according to an embodiment of the present application, the method comprising:

Step 210, obtaining a target protein corresponding to the compound to be detected.

As described above, the user can input the compound to be tested and the target protein through the terminal device. The terminal device may be a mobile terminal, a personal digital assistant (PDA), a computer, a notebook, or the like. A compound activity prediction client is installed on the terminal device and provides a functional module for predicting the activity of compounds. In the process of developing a new drug, when a user needs to screen hit compounds for a target protein, the user opens the compound activity prediction client on the terminal device and inputs the information of the compound to be tested and the information of the target protein on the corresponding page.

Step 220, determining target characteristics of the target protein according to the information of the tested active compound corresponding to the target protein.

Specifically, for a target protein input by a user, information on a small number of measured active compounds corresponding to the target protein can be obtained. This information may include the molecular structure of each measured active compound, feature information of that molecular structure, the activity data of the measured active compound on the target protein, and the like. The data of the small number of measured active compounds corresponding to the target protein can be stored in a database and retrieved from the database according to the target protein input by the user.

After the information of the measured active compounds corresponding to the target protein is obtained, feature extraction can be performed on it to determine the target features of the target protein.

Optionally, the step of determining the target characteristics of the target protein according to the information of the measured active compound corresponding to the target protein comprises:

inputting the molecular structure characteristics of the tested active compound and the activity data of the tested active compound on the target protein into a trained target characteristic network for processing so as to determine the target characteristics of the target protein, wherein the trained target characteristic network is obtained based on similarity training among the activity data of all historical target proteins.

The molecular structure features of a measured active compound refer to feature information of the molecular structure of the compound, which can be represented by the Morgan molecular fingerprint of the compound. Generating a Morgan molecular fingerprint can include three stages: atom initialization, iterative updating, and feature generation. The molecular structure information of the compound includes the arrangement of its atoms. Atom initialization assigns an integer identifier to each atom, for example by applying a fixed hash function to the features of the atom and of its connections to neighboring atoms and using the output of the hash function as the integer identifier of the atom. Iterative updating takes each atom as a center and merges in the surrounding atoms ring by ring until a specified radius is reached, forming substructures. Feature generation computes over the substructures to produce a feature list, from which the Morgan molecular fingerprint of the compound is obtained.
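The three stages (atom initialization, iterative updating, feature generation) can be illustrated with a simplified Morgan-style sketch on a toy molecular graph. This is not a reference implementation: the atom invariants (element symbol and degree), the CRC32 hash, the folding into 64 bits and the function name `morgan_like_fingerprint` are all assumptions, and in practice a cheminformatics toolkit such as RDKit would normally be used.

```python
import zlib

def _hash(values):
    """Deterministic 32-bit hash; stands in for the fixed hash function."""
    return zlib.crc32(repr(values).encode("utf-8"))

def morgan_like_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Simplified Morgan-style fingerprint of a molecular graph.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # 1) Atom initialization: one integer identifier per atom.
    ids = [_hash((symbol, len(neighbors[i]))) for i, symbol in enumerate(atoms)]
    features = set(ids)

    # 2) Iterative updating: grow each atom-centered substructure by one ring
    #    of neighbors per iteration until the radius is reached.
    for _ in range(radius):
        ids = [
            _hash((ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
            for i in range(len(atoms))
        ]
        features.update(ids)

    # 3) Feature generation: fold the identifier list into a fixed-length bit vector.
    bits = [0] * n_bits
    for identifier in features:
        bits[identifier % n_bits] = 1
    return bits

# Heavy atoms of ethanol, C-C-O, listed in two different atom orders.
ethanol = morgan_like_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_reordered = morgan_like_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
```

Because each identifier depends only on an atom's own identifier and the sorted identifiers of its neighbors, the two atom orderings of the same molecule give the same bit vector.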

The trained target feature network can be trained based on the similarity between activity data of all historical target proteins. Wherein the historical target protein comprises a large amount of active target protein data measured in the existing public data set, and the historical target protein can comprise a data set of target protein and non-target protein.

Optionally, as shown in fig. 3, the step of inputting the molecular structure characteristics of the measured active compound and the activity data of the measured active compound on the target protein into the trained target feature network for processing to determine the target feature of the target protein may be implemented through steps 211 to 213, and specifically includes:

Step 211, performing feature extraction on the molecular structure features of the tested active compound by using a feature extraction network to obtain intermediate features of the tested active compound.

Specifically, all training data $x^s$ of the target protein of interest can be passed through a feature extraction network. The feature extraction network may be composed of a three-layer multi-layer perceptron (MLP), in which the middle two hidden layers each contain a preset number of neurons, such as 500. The feature extraction network performs feature extraction on the molecular structure features of the measured active compounds to obtain the intermediate features of the measured active compounds. The intermediate features characterize the target features of the target protein of interest to some extent.

Step 212, the intermediate characteristics of the tested active compounds and the activity data of the tested active compounds on the target proteins are concatenated.

Step 213, inputting the serially connected data into a fully-connected network containing a preset number of neurons, and averaging the output results of the fully-connected layers corresponding to the molecular structures of all the tested active compounds to obtain the target characteristics of the target protein.

Specifically, after the intermediate features are obtained, the intermediate features of the measured active compounds are concatenated with the activity data $y^s$ of the measured active compounds on the target protein of interest. Here, concatenation means joining the intermediate features and the activity value into a new data vector; for example, if the intermediate features have 1024 dimensions and the activity value has 1 dimension, the concatenated vector has 1025 dimensions.

The concatenated data is input into a fully connected network containing a preset number of neurons; for example, the preset number may be 64. That is, the concatenated data passes through a fully connected layer MF containing 64 neurons, from which the target features are obtained. This can be expressed as formula (1):

$$\mathrm{MF}\big([\,\mathrm{E}(x^s);\ y^s\,]\big) \tag{1}$$

where $[\,\mathrm{E}(x^s);\ y^s\,]$ is the intermediate feature output by the feature extraction network, written here as $\mathrm{E}(\cdot)$, concatenated with the activity data $y^s$ on the target protein, and $\mathrm{MF}(\cdot)$ represents the output of the fully connected network.

Further, the outputs of the fully connected layer corresponding to the molecular structures of all the measured active compounds are averaged to obtain the target feature of the target protein. The target feature network can be expressed as formula (2):

$$\tau = \frac{1}{k}\sum_{j=1}^{k}\mathrm{MF}\big([\,\mathrm{E}(x^s_j);\ y^s_j\,]\big) \tag{2}$$

where $\tau$ denotes the target feature, k is the number of drug molecules among the measured compounds, $\mathrm{E}(\cdot)$ denotes the feature extraction network, and $(x^s_j, y^s_j)$ are the molecular structure feature and the activity value of the j-th molecule. Averaging can reduce the effect of differing drug molecules when the drug molecules measured on different target proteins differ; if they are all the same, averaging may be omitted. $w_a$ denotes the parameters of the target feature network, comprising the parameters of the feature extraction network and of the fully connected network MF.

Therefore, the target features of a target protein are represented by combining the intermediate features of the molecular structure features of the measured compounds with the corresponding activity values. This handles the case in which the same molecular structure features have different activity values on different targets, and improves the accuracy with which the target features represent the target protein.
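Steps 211 to 213 can be sketched in plain Python as follows. The sketch is illustrative only: the layer sizes are shrunk from the 500 and 64 neurons mentioned in the text so that it runs instantly, the ReLU activation and random initialization are assumptions, and the names `target_feature`, `extractor` and `fc` are hypothetical.

```python
import random

def linear(x, weights, bias):
    """Compute W x + b for one input vector (weights is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(rng, n_in, n_out):
    """Randomly initialized (weights, bias) pair; initialization is an assumption."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def target_feature(fingerprints, activities, extractor, fc):
    """Average the fully connected outputs over all measured compounds."""
    outputs = []
    for x, y in zip(fingerprints, activities):
        h = x
        for weights, bias in extractor:          # Step 211: feature extraction
            h = relu(linear(h, weights, bias))
        concatenated = h + [y]                   # Step 212: append the activity value
        outputs.append(linear(concatenated, *fc))  # Step 213: fully connected layer
    k = len(outputs)
    return [sum(column) / k for column in zip(*outputs)]  # Step 213: average

rng = random.Random(0)
N_BITS, HIDDEN, INTERMEDIATE, TARGET_DIM = 16, 8, 6, 4   # toy sizes, not 500/64
extractor = [
    init_layer(rng, N_BITS, HIDDEN),
    init_layer(rng, HIDDEN, HIDDEN),
    init_layer(rng, HIDDEN, INTERMEDIATE),
]
fc = init_layer(rng, INTERMEDIATE + 1, TARGET_DIM)        # +1 for the activity value

fingerprints = [[float(rng.randint(0, 1)) for _ in range(N_BITS)] for _ in range(3)]
activities = [6.2, 7.5, 5.1]                              # hypothetical activity values
tau = target_feature(fingerprints, activities, extractor, fc)
```

Since the per-compound outputs are averaged, the resulting vector does not depend on the order in which the measured compounds are supplied.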

Step 230, inputting the target features into the trained functional region determination network for processing to determine a target functional region corresponding to each layer of the trained basic neural network, where the trained functional region determination network is used for predicting the probability that each functional region is selected.

In particular, to determine the target functional region of each layer of the activity prediction model, the present application proposes a functional region determination network to predict the probability of each functional region being selected. The functional region determination network can use a Recurrent Neural Network (RNN) to perform prediction, and divides the network of each hidden layer in the trained basic neural network into a plurality of functional regions. Optionally, each layer of the deep neural network is decomposed into 3 pending functional regions, and when a user gives a new target protein, target features are input into the functional region determination network to output the most relevant target functional region of each layer. Therefore, the data of other target protein with measured activity can be summarized into different functional regions according to the similarity, so that the data with high correlation in the activity data of other target protein can be fully utilized to improve the accuracy of activity prediction in the current target protein.

The trained functional region determination network may be obtained by training on existing data of historical target proteins, or may be a functional region determination network with given initial parameter values that is established during the meta-learning process of prediction.

The trained basic neural network can be a multi-layer deep neural network comprising an input layer, hidden layers, and an output layer. The basic neural network may be obtained by training on existing data of historical target proteins, or may be a basic neural network with given initial parameter values that is established during the meta-learning process of prediction. For example, the basic neural network may be constructed as a deep neural network with a preset number of layers, such as four layers, in which the middle three hidden layers each contain a preset number of neurons, such as 500. The number of layers of the basic neural network can be selected according to actual conditions. It should be noted that too few layers may cause under-fitting and too many layers may cause over-fitting, either of which degrades the prediction. Optionally, the application constructs a four-layer deep neural network.

Optionally, as shown in fig. 4, each layer network in the trained basic neural network includes a plurality of undetermined functional regions, and step 230 may be implemented by steps 231 to 233, which specifically include:

Step 231, determining concatenated features according to the target features and the selected probabilities corresponding to all the pending functional regions of the previous layer in the basic neural network.

Step 232, determining the hidden features of the current layer according to the concatenated features and the hidden features of the previous layer in the basic neural network.

Step 233, processing the hidden features of the current layer by using the trained functional region determination network to determine the target functional region of the current layer from the plurality of pending functional regions.

Specifically, to determine the target functional region of each layer of the activity prediction model, the trained functional region determination network may be used to predict the probability that each pending functional region is selected. A recurrent neural network (RNN), which processes sequential data and outputs predictions, may be used for this. Because the functional regions are determined layer by layer in sequence, an RNN is preferably used as the functional region determination network.

The basic neural network can be constructed as a deep neural network with a preset number of layers, and each layer of the basic neural network is divided into a preset number of pending functional regions.

Optionally, each layer of the basic neural network is divided into 3 regions to be determined, and the basic neural network has a 4-layer network structure.

For example, suppose each layer of the basic neural network has c pending functional regions, and let $p^l$ denote the probabilities that the c pending functional regions of the l-th layer are selected. The input to the functional region determination network is the target feature, written here as $\tau$, the probabilities $p^{l-1}$ that the pending functional regions of the previous layer are selected, and the hidden feature $h^{l-1}$ of the previous layer; the output is the hidden feature $h^l$ of the current layer. The hidden feature can be expressed as formula (3):

$$h^{l} = \mathrm{RNN}\big([\,\tau;\ p^{l-1}\,],\ h^{l-1};\ w_b\big) \tag{3}$$

where $w_b$ denotes the network parameters of the RNN.

Optionally, the step of processing the hidden feature of the current layer network by using the trained functional area determination network to determine a target functional area of the current layer network from a plurality of functional areas to be determined includes:

the hidden characteristics of all undetermined functional areas of each layer network in the basic neural network are normalized through the trained functional area determination network to obtain the selected probability corresponding to all the undetermined functional areas;

and determining the target function area of the current layer network according to the undetermined function area corresponding to the maximum value in the selected probabilities corresponding to all the undetermined function areas.

For example, the selected probabilities $p^l$ of the c pending functional regions of the l-th layer can be obtained by normalizing the hidden feature $h^l$ with the softmax function, which can be expressed as formula (4):

$$p^{l} = \mathrm{softmax}\big(h^{l}\big) \tag{4}$$

Further, the largest probability value in $p^l$, $\max_{j} p^l_j$, is taken, and the pending functional region corresponding to this maximum, with index $\arg\max_{j} p^l_j$, is selected as the target functional region of the l-th layer.
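The layer-by-layer selection in steps 231 to 233 can be sketched as below. This is a minimal illustration rather than the network of the application: a plain tanh recurrent cell stands in for the RNN, the hidden feature is given one entry per pending functional region so that softmax yields c probabilities, and the uniform initial probabilities, zero initial hidden feature and random weights are all assumptions.

```python
import math
import random

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(inputs, hidden, w_in, w_hid, bias):
    """One step of a plain tanh recurrent cell (an assumed stand-in for the RNN)."""
    return [
        math.tanh(
            sum(w * v for w, v in zip(w_in[j], inputs))
            + sum(w * v for w, v in zip(w_hid[j], hidden))
            + bias[j]
        )
        for j in range(len(bias))
    ]

def select_regions(tau, n_layers, n_regions, params):
    """Return the selected region index and probabilities for every layer."""
    w_in, w_hid, bias = params
    p_prev = [1.0 / n_regions] * n_regions   # assumed initial probabilities
    h_prev = [0.0] * n_regions               # assumed initial hidden feature
    chosen, probabilities = [], []
    for _ in range(n_layers):
        concatenated = tau + p_prev                            # Step 231
        h = rnn_step(concatenated, h_prev, w_in, w_hid, bias)  # Step 232
        p = softmax(h)                                         # Step 233: normalize
        chosen.append(max(range(n_regions), key=p.__getitem__))  # most probable region
        probabilities.append(p)
        p_prev, h_prev = p, h
    return chosen, probabilities

rng = random.Random(0)
tau = [0.3, -0.1, 0.8, 0.5]                  # hypothetical target feature
N_LAYERS, N_REGIONS = 4, 3                   # 4 layers, 3 pending regions per layer
w_in = [[rng.uniform(-1, 1) for _ in range(len(tau) + N_REGIONS)] for _ in range(N_REGIONS)]
w_hid = [[rng.uniform(-1, 1) for _ in range(N_REGIONS)] for _ in range(N_REGIONS)]
bias = [0.0] * N_REGIONS

chosen, probabilities = select_regions(tau, N_LAYERS, N_REGIONS, (w_in, w_hid, bias))
```

The returned `chosen` list holds one region index per layer, which is the information needed to assemble a target-specific model from the basic neural network.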

Step 240, determining an activity prediction model of the target protein according to the target functional region corresponding to each layer of the trained basic neural network.

And after the target function region of each layer network is determined, connecting the target function regions of each layer network according to the weight or sequentially connecting the target function regions of each layer network to determine the activity prediction model of the target protein. The activity prediction model can be used for predicting a compound to be detected input by a user to obtain an activity prediction result of the compound to be detected on the target protein. Therefore, based on the correlation among a large number of historical target proteins and by utilizing the information of a small number of tested compounds of the target protein, a specific activity prediction model for the target protein is generated, and the accuracy of activity prediction in the current target protein can be effectively improved.

Optionally, referring to fig. 5a, the step of determining an activity prediction model of the target protein according to the target functional region corresponding to each layer network in the trained basic neural network includes:

Step 241, sequentially connecting the target functional regions corresponding to each layer of the basic neural network to obtain a basic activity prediction model of the target protein.

After the target functional region of each layer in the basic neural network is determined, the application connects the layers in sequence to obtain the basic activity prediction model of the target protein. For example, referring to FIG. 5b, the target functional regions predicted from the target features are connected in order to obtain the basic activity prediction model shown in FIG. 5b.
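A minimal sketch of step 241 follows, assuming that each pending functional region is simply an alternative set of weights for its layer. The region bank, the layer sizes, the ReLU activation on hidden layers and the names `assemble_model` and `forward` are hypothetical and chosen only to make the example runnable.

```python
import random

def make_region(rng, n_in, n_out):
    """One pending functional region: a weight matrix and bias for one layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def assemble_model(region_bank, selected):
    """Pick the selected region of every layer and connect them in order."""
    return [region_bank[layer][index] for layer, index in enumerate(selected)]

def forward(model, x):
    """Run an input through the connected regions; returns a scalar activity."""
    h = x
    for depth, (weights, bias) in enumerate(model):
        h = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, bias)]
        if depth < len(model) - 1:
            h = [max(0.0, v) for v in h]     # ReLU on hidden layers (assumed)
    return h[0]

rng = random.Random(0)
sizes = [16, 8, 8, 8, 1]                     # toy sizes for a four-layer network
N_REGIONS = 3
region_bank = [
    [make_region(rng, sizes[layer], sizes[layer + 1]) for _ in range(N_REGIONS)]
    for layer in range(len(sizes) - 1)
]

selected = [2, 0, 1, 0]                      # hypothetical output of region selection
model = assemble_model(region_bank, selected)
prediction = forward(model, [1.0] * sizes[0])
```

Different targets share the same region bank but may select different regions per layer, so each target ends up with its own path through the basic neural network.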

Step 242, adjusting the basic activity prediction model of the target protein at least once by using a gradient descent method.

Specifically, the gradients used by the gradient descent method can be obtained by automatic differentiation in a deep learning framework such as PyTorch. For example, the adjustment of the basic activity prediction model of the target protein of interest by the gradient descent method can be expressed as formula (5):

$$\phi_i = \phi_{i-1} - \alpha \nabla_{\phi_{i-1}} L\big(f_{\phi_{i-1}}(x^s),\ y^s\big) \tag{5}$$

where $\phi$ denotes the network parameters of the basic activity prediction model of the target protein obtained in step 241; $\alpha$ is the learning rate of the gradient optimization, and optionally $\alpha$ may be set to 0.01; $(x^s, y^s)$ is the training dataset of measured compounds of the target protein of interest; $L$ is the objective function; $f_{\phi_{i-1}}(x^s)$ is the activity on the target protein of interest that the basic activity prediction model before adjustment predicts from the molecular structure features $x^s$ of the measured active compounds; $\phi_{i-1}$ denotes the network parameters of the basic activity prediction model before adjustment; and $\phi_i$ denotes the network parameters of the basic activity prediction model after adjustment.

Assuming that the initial value of the network parameters in the gradient descent method is $\theta$, during model adjustment the network parameters of the activity prediction model corresponding to each target protein are obtained by performing several steps of gradient optimization starting from $\theta$.

The objective function is differentiated with respect to $\phi$, and gradient descent is performed along this derivative so that the objective function converges, thereby rapidly updating the network parameters $\phi$ of the basic activity prediction model.

For example, in the rapid update process, optimization starts from the initial value $\theta$ of the network; after one gradient step, $\phi_1$ is obtained; $\phi_1$ then replaces $\theta$ as the starting point of the second gradient descent step, and so on.

Here, the objective function $L$ can be expressed as formula (6):

$$L\big(f_{\phi_i}(x^s),\ y^s\big) = \frac{1}{k}\sum_{j=1}^{k}\big(f_{\phi_i}(x^s_j) - y^s_j\big)^2 \tag{6}$$

where k is the number of drug molecules among the measured compounds, $x^s_j$ denotes the molecular structure feature of the j-th measured active compound, $y^s_j$ denotes the measured activity value of $x^s_j$, and $f_{\phi_i}(x^s_j)$ denotes the activity value predicted by the adjusted basic activity prediction model $\phi_i$.
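The adjustment in step 242 can be illustrated with a few hand-written gradient steps. To stay self-contained, the sketch replaces the basic activity prediction model with a linear model whose gradient can be written analytically, and it assumes a mean squared error objective; in a real implementation the gradients would come from automatic differentiation. The data values and function names are hypothetical.

```python
def predict(phi, x):
    """Linear stand-in for the model f_phi: weights phi[:-1], bias phi[-1]."""
    return sum(w * v for w, v in zip(phi[:-1], x)) + phi[-1]

def mse_loss(phi, xs, ys):
    """Mean squared error between predicted and measured activity (assumed loss)."""
    return sum((predict(phi, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(phi, xs, ys):
    """Analytic gradient of mse_loss with respect to phi."""
    grad = [0.0] * len(phi)
    for x, y in zip(xs, ys):
        error = 2.0 * (predict(phi, x) - y) / len(xs)
        for j, v in enumerate(x):
            grad[j] += error * v
        grad[-1] += error
    return grad

def adapt(theta, xs, ys, alpha=0.01, steps=5):
    """Start from theta and apply phi_i = phi_{i-1} - alpha * gradient, `steps` times."""
    phi = list(theta)                        # phi_0 = theta, theta itself is untouched
    for _ in range(steps):
        grad = mse_gradient(phi, xs, ys)
        phi = [p - alpha * g for p, g in zip(phi, grad)]
    return phi

theta = [0.0, 0.0, 0.0]                      # shared initial value
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hypothetical molecular features
ys = [6.0, 5.0, 7.0]                         # hypothetical measured activities
phi = adapt(theta, xs, ys)
loss_before = mse_loss(theta, xs, ys)
loss_after = mse_loss(phi, xs, ys)
```

Each target protein starts from the same `theta` and ends with its own `phi`, which is why only a small number of measured compounds and gradient steps are needed per target.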

Step 243, determining an activity prediction model of the target protein according to the adjusted basic activity prediction model.

Specifically, the basic activity prediction model $\phi_i$ adjusted by the gradient descent method is determined as the activity prediction model of the target protein.

Step 250, predicting the compound to be tested according to the activity prediction model of the target protein to obtain the activity prediction result of the compound to be tested on the target protein.

Specifically, the compound to be tested $x^q$ is input into the activity prediction model, which has been adjusted with the data of the small number of measured compounds of the target protein, to obtain the activity prediction result $f_\phi(x^q)$ on the target protein.

All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and are not described herein again.

In the embodiments of the application, a target protein corresponding to a compound to be tested is obtained; target features of the target protein are determined according to the information of the measured active compounds corresponding to the target protein; the target features are input into a trained functional region determination network for processing to determine a target functional region corresponding to each layer of the basic neural network, where the trained functional region determination network is used for predicting the probability that each functional region is selected; an activity prediction model of the target protein is determined according to the target functional regions corresponding to each layer of the basic neural network; and the compound to be tested is predicted with the activity prediction model of the target protein to obtain the activity prediction result of the compound to be tested on the target protein. Compared with existing activity prediction methods, which directly use the data of other historical target proteins for training and prediction, the embodiments of the application provide a functional-region meta-learning algorithm: the target features of the target protein are extracted, and the data with measured activity is grouped into different functional regions according to similarity, so that the highly correlated portion of the activity data of other target proteins is fully used and the accuracy of activity prediction for the current target protein is improved. Further, improving the accuracy of activity prediction helps ensure the quality of virtual screening of drug molecules and helps find better and more accurate hit compounds for the subsequent discovery and development of lead compounds and candidate compounds.

Embodiments of the present application also provide a network training method for compound activity prediction, which may be performed by a terminal or a server, or may be performed by both the terminal and the server; the embodiment of the present application is described as an example in which the training method is executed by a server.

Referring to fig. 6, fig. 6 is a first flowchart of a network training method for compound activity prediction according to an embodiment of the present application, the method including:

Step 610, obtaining training sample data, where the training sample data includes historical datasets of all historical target proteins, and each historical dataset includes at least one historical target protein and the activity data of historical measured active compounds on the historical target protein.

Specifically, the training sample data comprises the historical datasets of all historical target proteins. The historical target proteins comprise known target proteins that have been published or are contained in a database. Each historical dataset comprises at least one historical target protein and the activity data of historical measured active compounds on that historical target protein.

Step 620, determining the target features of the historical target proteins by using a target feature network according to the molecular structure features of the historical measured active compounds of the historical target proteins and the activity data of the historical measured active compounds on the historical target proteins.

The molecular structure features of the historical measured active compounds are the molecular structure features of the measured active compounds corresponding to a given historical target protein. The molecular structure features refer to feature information of the molecular structure of an active compound, which can be represented by the Morgan molecular fingerprint of the compound.

Optionally, the target feature network includes a feature extraction network and a full-connection network, and the step of determining the target feature of the historical target protein according to the molecular structure feature of the historical measured active compound of the historical target protein and the activity data of the historical measured active compound on the historical target protein by using the target feature network includes:

extracting the characteristics of the molecular structure characteristics of the historical measured active compound by adopting a characteristic extraction network to obtain the intermediate characteristics of the historical measured active compound;

the intermediate characteristics of the historical measured active compounds and the activity data of the historical measured active compounds on the historical target proteins are connected in series;

and inputting the data after series connection into a full-connection network containing a preset number of neurons, and averaging output results of full-connection layers corresponding to the molecular structures of all historical tested active compounds to obtain the target characteristics of the historical target proteins.

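All N historical target proteins in the training sample data are processed to obtain the target feature of each corresponding historical target protein. The target feature network can refer to formula (2) in the prediction method:

$$\tau = \frac{1}{k}\sum_{j=1}^{k}\mathrm{MF}\big([\,\mathrm{E}(x^s_j);\ y^s_j\,]\big)$$

where $\tau$ denotes the target feature, k is the number of measured compounds, $\mathrm{E}(\cdot)$ denotes the feature extraction network, and the parameters $w_a$ of the target feature network comprise the parameters of the feature extraction network and the parameters of the fully connected network MF.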

Step 630, inputting the target characteristics of the historical target protein into the trained functional region determination network to determine the historical functional region.

Optionally, each layer of the network in the basic neural network includes a plurality of undetermined functional regions, and the step of inputting the target characteristics of the historical target protein into the trained functional region determination network to determine the historical functional region includes:

and determining the series connection characteristics according to the target point characteristics and the selected probability corresponding to all the undetermined functional areas of the previous layer of network in the basic neural network.

And determining the hidden characteristics of the current layer network according to the series characteristics and the hidden characteristics of the previous layer network in the basic neural network.

And processing the hidden characteristics of the current layer network by adopting the trained functional area determination network so as to determine a target functional area of the current layer network from a plurality of functional areas to be determined.

Optionally, each layer of the basic neural network may be divided into 3 regions to be determined, and the basic neural network has a 4-layer network structure.

The specific implementation manner can refer to the corresponding embodiment of the compound activity prediction method, and details are not repeated here.

Step 640, determining network parameters of the basic neural network corresponding to the historical target protein according to the historical functional region corresponding to each layer of the basic neural network.

The method for determining the network parameters of the basic neural network corresponding to the historical target protein according to the historical functional region corresponding to each layer of the basic neural network is the same as the determination method in the prediction method, and the determined network parameters are denoted $\phi$.

Optionally, the target functional regions corresponding to each layer of the basic neural network are sequentially connected, and at least one adjustment is performed by the gradient descent method to determine the network parameters of the basic neural network corresponding to the historical target protein. The gradient descent method can refer to formula (5) in the prediction method:

$$\phi_i = \phi_{i-1} - \alpha \nabla_{\phi_{i-1}} L\big(f_{\phi_{i-1}}(x^s),\ y^s\big)$$

where $\phi$ denotes the network parameters of the basic neural network corresponding to the historical target protein; $\alpha$ is the learning rate of the gradient optimization, and optionally $\alpha$ may be set to 0.01; $(x^s, y^s)$ is the training dataset of measured compounds of the historical target protein; $L$ is the objective function; $f_{\phi_{i-1}}(x^s)$ is the activity on the historical target protein that the basic neural network corresponding to the historical target protein predicts from the molecular structure features $x^s$ of the measured active compounds; $\phi_{i-1}$ denotes the network parameters of the basic neural network corresponding to the historical target protein after the (i-1)-th round of parameter adjustment; and $\phi_i$ denotes the network parameters after the i-th round of parameter adjustment.

Assuming that the network initial value in the gradient descent method is θ, then during model training the network parameters of the basic neural network corresponding to each target protein are obtained by performing several steps of gradient optimization starting from the value θ.

That is, the objective function is differentiated with respect to φ, and gradient descent is performed on the resulting derivative so that the objective function converges, thereby rapidly updating φ.
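The adjustment from the shared initial value θ can be illustrated with a minimal sketch. It assumes a linear model and a mean squared error loss, neither of which is specified by the present application; the function names (`mse_loss`, `mse_grad`, `adapt_parameters`) and the toy data are hypothetical.

```python
# Minimal sketch of the update phi_i = phi_{i-1} - alpha * gradient.
# Assumptions (not from the source): a linear model y = phi . x and a
# mean squared error loss; all names and data here are hypothetical.

def mse_loss(phi, xs, ys):
    """Mean squared error of the linear model phi . x on the data."""
    preds = [sum(p * v for p, v in zip(phi, x)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def mse_grad(phi, xs, ys):
    """Gradient of mse_loss with respect to phi."""
    grad = [0.0] * len(phi)
    for x, y in zip(xs, ys):
        err = sum(p * v for p, v in zip(phi, x)) - y
        for d, v in enumerate(x):
            grad[d] += 2.0 * err * v / len(ys)
    return grad


def adapt_parameters(theta, xs, ys, alpha=0.01, steps=5):
    """Start from the shared initial value theta and take a few gradient steps."""
    phi = list(theta)  # copy, so theta itself is left unchanged
    for _ in range(steps):
        grad = mse_grad(phi, xs, ys)
        phi = [p - alpha * g for p, g in zip(phi, grad)]
    return phi


theta = [0.0, 0.0]
support_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
support_y = [2.0, -1.0, 1.0]
phi = adapt_parameters(theta, support_x, support_y)
```

In this sketch θ plays the role of the starting point shared by all target proteins, and `phi` is the target-specific result after five steps with α = 0.01.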

Step 650: training the target feature network, the basic neural network and the functional area determination network by using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data, to obtain the trained target feature network, the trained basic neural network and the trained functional area determination network.

Optionally, the step of training the target feature network, the basic neural network, and the functional area determination network by using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data to obtain a trained target feature network, a trained basic neural network, and a trained functional area determination network includes:

inputting the molecular structure characteristics of the historical measured active compounds of the historical target proteins into a basic neural network corresponding to the historical target proteins with network parameters to obtain the historical predicted activity of the historical measured active compounds on the historical target proteins;

determining a first objective function based on historical predicted activity of the historical measured active compound against the historical target protein and the activity data of the historical measured active compound against the historical target protein.

Specifically, the molecular structure characteristic x_k of a historical measured active compound is input into the basic neural network corresponding to the historical target protein, to obtain the historical predicted activity on the historical target protein that the activity prediction network model to be trained outputs for that compound.

For example, the first objective function may be expressed as equation (6):

where k ranges over the drug molecules among the tested compounds. For each drug molecule, the objective involves the measured activity value of x_k and the historical predicted activity obtained by inputting the molecular structure characteristic x_k of the historical measured active compound into the basic neural network corresponding to the historical target protein. φ_i denotes the network parameters of the basic neural network corresponding to the historical target protein after adjustment by the gradient descent method.

And adjusting the network parameters of the basic neural network according to the determined first objective function until the training end condition is met.

Specifically, the training end condition may be that the loss value converges to a preset target value. In other embodiments, the training end condition may be that a preset number of training iterations is reached.

Optionally, the step of training the target feature network, the basic neural network, and the functional area determination network by using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data to obtain a trained target feature network, a trained basic neural network, and a trained functional area determination network further includes:

(1) Randomly dividing the historical data set into a plurality of disjoint historical data subsets, and extracting features from the plurality of historical data subsets by using the target feature network to obtain the target features of the plurality of historical data subsets.

For example, all the historical data sets are collected and randomly divided into n disjoint subsets, and the target feature network then extracts features from the n disjoint subsets to obtain the target features of the plurality of historical data subsets.
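The random division into n disjoint subsets can be sketched as follows. This is only an illustration: the helper name `split_into_disjoint_subsets`, the `seed` argument, and the record values are hypothetical and not part of the present application.

```python
import random


def split_into_disjoint_subsets(records, n, seed=None):
    """Shuffle the records and deal them into n disjoint subsets.

    Every record lands in exactly one subset, so the subsets are
    disjoint and together cover the whole data set.
    """
    rng = random.Random(seed)
    shuffled = list(records)  # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]


# Hypothetical records standing in for (compound, activity) pairs.
records = list(range(10))
subsets = split_into_disjoint_subsets(records, n=3, seed=0)
```

Dealing the shuffled records round-robin keeps the subset sizes within one record of each other, which is one reasonable choice when each subset is later reduced to a single target feature.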

(2) Determining a second objective function according to the similarity between target features of the same historical target protein and the similarity between target features of different historical target proteins among the target features of the plurality of historical data subsets, and optimizing the second objective function to train the target feature network. For example, the second objective function may use a contrastive learning objective and be expressed as equation (7):

In equation (7), the first pair of terms represents two subsets of the same historical target protein i, and the second pair represents the target features corresponding to those two subsets. The third pair represents the target features corresponding to two subsets taken from historical target protein i and historical target protein j, respectively. N indicates that there are N historical target proteins.

By optimizing, for example minimizing, the second objective function, the similarity measure between features from different subsets of the same historical target protein is made larger, that is, the target features of different subsets of the same historical target protein become more similar. At the same time, the similarity measure between features of subsets of different historical target proteins is made smaller.
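One way to picture such a contrastive objective is the sketch below. It is not equation (7): the cosine similarity, the softmax-style normalization, the `temperature` value, and all names are assumptions chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(features_a, features_b, temperature=0.5):
    """Illustrative contrastive loss over N target proteins.

    features_a[i] and features_b[i] are target features computed from two
    disjoint subsets of the same target protein i. The loss is small when
    features of the same target are similar and features of different
    targets are dissimilar.
    """
    n = len(features_a)
    total = 0.0
    for i in range(n):
        logits = [
            cosine_similarity(features_a[i], features_b[j]) / temperature
            for j in range(n)
        ]
        log_denominator = math.log(sum(math.exp(z) for z in logits))
        total += log_denominator - logits[i]  # -log softmax of the matching pair
    return total / n


# Hypothetical target features for two target proteins.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Here `aligned` is lower than `swapped`, reflecting the intended behaviour: minimizing the loss pulls together features of the same target protein and pushes apart features of different target proteins.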

determining a third objective function according to the first objective function and the second objective function;

and optimizing a third objective function to train the basic neural network, the target point characteristic network and the functional area determination network.

For example, the third objective function may be expressed as equation (8):

where θ is the network initial value in the gradient descent method, w_a is the network parameter of the target feature network, and w_b is the network parameter of the functional area determination network. x^s and y^s represent the molecular structure characteristics of the tested compounds corresponding to the historical target protein in the historical data set and the activity values of those tested compounds. φ_i represents the network parameters of the basic neural network corresponding to the historical target protein after the i-th round of parameter adjustment; as described above, φ_i involves the network initial value θ and the network parameter w_b of the functional area determination network.

The predicted-activity term denotes the predicted activity value that is output after x^s is input into the basic neural network corresponding to the adjusted historical target protein. N represents N historical target proteins.

The first objective function may be expressed by equation (6) above.

The second objective function is the objective function of the target feature network, involves the network parameter w_a of the target feature network, and may be expressed by equation (7) above.

After the target functional areas corresponding to each layer of the basic neural network are connected in sequence and the network parameter φ_i for the historical target protein is obtained, the first objective function and the second objective function are calculated, and the third objective function is finally obtained.

The network parameters of the basic neural network, the network parameters of the target feature network and the network parameters of the functional area determination network are trained by optimizing, for example minimizing, the third objective function, to obtain the trained target feature network, the trained basic neural network and the trained functional area determination network.

Further, the third objective function is adjusted at least once by adopting a gradient descent method so as to update the network parameters of the basic neural network, the network parameters of the target point characteristic network and the network parameters of the functional area determination network.

The network parameters of the basic neural network include the network initial value θ shared globally by all historical target proteins; the network parameters of the target feature network are w_a, and the network parameters of the functional area determination network are w_b.

For example, the gradient descent training may be represented as equations (9) to (11):

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \mathcal{L}(\theta, w_a, w_b) \tag{9}$$

$$w_a \leftarrow w_a - \beta \nabla_{w_a} \mathcal{L}(\theta, w_a, w_b) \tag{10}$$

$$w_b \leftarrow w_b - \beta \nabla_{w_b} \mathcal{L}(\theta, w_a, w_b) \tag{11}$$

where θ on the left side of equation (9) is the network initial value after the gradient descent optimization, and θ on the right side of equations (9) to (11) is the network initial value before the optimization; w_a is the network parameter of the target feature network, w_b is the network parameter of the functional area determination network, and β is the learning rate of the gradient optimization; optionally, β may be set to 0.001. The gradient terms in equations (9), (10) and (11) are the gradients of the objective function with respect to θ, w_a and w_b, respectively. The objective function here is the third objective function; see equation (8) above.

In particular, when the network parameters are updated by this gradient descent method, a limited number of optimization steps may be performed, for example a 1-step update or a 5-step update.

Optionally, in the training process, the equations (1) - (11) may be cycled for multiple times to update the network parameters, so as to obtain a trained target feature network, a trained basic neural network, and a trained functional area determination network. For example, 1000 cycles may be performed.
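The repeated update of θ, w_a and w_b with learning rate β can be sketched as follows. The third objective function is not reproduced here; a toy sum-of-squares objective stands in for it, and the names `gradient_step`, `toy_gradients` and `train`, as well as the initial values, are hypothetical.

```python
def gradient_step(values, grads, beta):
    """One gradient descent step: value <- value - beta * gradient."""
    return [v - beta * g for v, g in zip(values, grads)]


def toy_gradients(params):
    """Gradients of a stand-in objective, the sum of squares of all parameters.

    This toy objective only illustrates the update rule; it is not the
    third objective function of the present application.
    """
    return {name: [2.0 * v for v in values] for name, values in params.items()}


def train(params, beta=0.001, cycles=1000):
    """Update theta, w_a and w_b together for a fixed number of cycles."""
    for _ in range(cycles):
        grads = toy_gradients(params)
        params = {
            name: gradient_step(values, grads[name], beta)
            for name, values in params.items()
        }
    return params


initial = {"theta": [1.0, -2.0], "w_a": [0.5], "w_b": [-1.5]}
trained = train(initial)
```

All three parameter groups are updated from gradients evaluated at the same pre-update values, which matches the description that θ on the right side of the update is the value before optimization.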

Referring to FIG. 7a, the training sample data is passed through the target feature network to obtain the target feature t_i. The target feature t_i is input into the functional area determination network, and the target functional area is determined according to the maximum selected probability: the functional area to be determined that corresponds to the maximum probability value is taken as the target functional area of the l-th layer. After the target functional area of each layer is obtained, the target functional areas of the layers are connected in sequence to obtain the network parameters of the basic neural network corresponding to the target protein. The network parameters are then quickly updated by gradient descent. The corresponding term can be calculated by formula (1) above.
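The maximum-probability selection can be illustrated with a small sketch. It assumes softmax normalization of per-region scores and, for simplicity, ignores the dependence of each layer on the previous layer's hidden characteristics; the scores and function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_target_regions(layer_scores):
    """Pick, for each layer, the index of the region with the highest probability."""
    selected = []
    for scores in layer_scores:
        probs = softmax(scores)
        selected.append(max(range(len(probs)), key=probs.__getitem__))
    return selected


# Hypothetical scores: 4 layers, each with 3 functional areas to be determined.
layer_scores = [
    [0.2, 1.5, -0.3],
    [2.0, 0.1, 0.4],
    [0.0, 0.3, 0.9],
    [1.1, 1.0, -2.0],
]
regions = select_target_regions(layer_scores)
```

The resulting list holds one selected region index per layer; connecting the selected regions in layer order corresponds to assembling the target-specific network described above.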

Through the training process, the optimized θ, w_a and w_b are obtained. When the activity value of a compound to be tested is actually predicted, an activity prediction model for the target protein is established using the optimized θ, w_a and w_b, and the compound to be tested is predicted according to the activity prediction model of the target protein to obtain the activity prediction result of the compound to be tested on the target protein.

In addition, different functional areas may differ in complexity and therefore in the network width they require, and different functional areas may share part of their parameters. In the above embodiment, whole functional areas are selected, and each area is a small network module. In some other embodiments, selection may instead be made per dimension: each dimension of all the functional areas to be determined of the l-th layer is a candidate, and the output of the functional area determination network, which above is the probability that each functional area to be determined is selected, becomes the predicted probability that each dimension is selected. Multiple dimensions can therefore be selected, and a dimension is selected when its corresponding probability exceeds a preset threshold.
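The threshold-based variant can be sketched in a few lines. The threshold value of 0.5 and the probabilities are assumptions for illustration only; the present application states only that the threshold is preset.

```python
def select_dimensions(probabilities, threshold=0.5):
    """Return the indices of all dimensions whose probability exceeds the threshold."""
    return [i for i, p in enumerate(probabilities) if p > threshold]


# Hypothetical per-dimension selection probabilities for one layer.
chosen = select_dimensions([0.9, 0.2, 0.55, 0.5])
```

Unlike the maximum-probability rule, this rule can select several dimensions at once, or none when no probability exceeds the threshold.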

To show more clearly the difference between the activity prediction method of the present application and the prior art, refer to FIG. 7b, which shows three activity value prediction methods: the two on the left are existing prediction methods, and the one on the right is an example of an embodiment of the present application. In the leftmost method, a single prediction model is trained using every historical target protein together with the activity-measured training drug molecules of the current target protein, and that model is used for prediction. In the middle method, a common initial model is learned: for each of historical target proteins 1 to n, which do not include the target protein currently of interest, the initial model is rapidly updated into a prediction model for that target using the training data of that target, and the initial model is then optimized using the test data so that the updated prediction model of each target performs best. The prediction models of all targets have the same structure, and each target corresponds to different model network parameters.

The rightmost method is an example of an embodiment of the present application: the target features of historical target proteins 1 to n are extracted respectively, the functional area determination network determines the target functional area corresponding to each historical target protein, and the target functional areas of each layer of the network are connected. Further, the target feature network, the basic neural network and the functional area determination network are trained by using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data, to obtain the trained target feature network, the trained basic neural network and the trained functional area determination network. When a user inputs a compound to be tested and a target protein, an activity prediction model can be generated from the trained target feature network, the trained basic neural network and the trained functional area determination network using only a small amount of tested-compound information for the target protein, so as to predict the activity value of the compound to be tested on the target protein.

In the embodiments of the present application, training sample data is obtained, where the training sample data includes the historical data sets of all historical target proteins, and each item of historical data includes at least one historical target protein and the activity data of a historical active compound on that historical target protein; the target features of the historical target protein are determined by the target feature network according to the molecular structure characteristics of the historical active compounds of the historical target protein and the activity data of those compounds on the historical target protein; the target features of the historical target protein are input into the functional area determination network to determine the historical functional area corresponding to each layer of the network in the basic neural network; the network parameters of the basic neural network corresponding to the historical target protein are determined according to the historical functional area corresponding to each layer of the basic neural network; and the target feature network, the basic neural network and the functional area determination network are trained by using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data, to obtain the trained target feature network, the trained basic neural network and the trained functional area determination network.

First, the activity prediction model is trained by making full use of the tested-compound data of all known historical target proteins, which addresses the problem that, when the number of target proteins is small, the learned meta-initial model tends to overfit and generalizes poorly to new data.

Second, in the prior art, because different types of target proteins differ considerably in how they interact with small molecules and different target proteins have different pharmacophores, transferring knowledge from target proteins with low similarity to the target has a negative effect, and jointly learning weakly related tasks lowers the stability of the model and can even impair its prediction performance. The present application provides a functional-regionalized meta-learning algorithm for ligand-based drug design, which groups the measured-activity data of other target proteins into different functional regions according to similarity, so that the highly correlated data among the activity data of other target proteins is fully used and the accuracy of activity prediction for the current target protein is improved.

Third, a fully connected network and a recurrent neural network are used to build the target feature network and the functional area determination network, respectively, and the three sets of network parameters are optimized in training to improve the prediction accuracy of the activity prediction model formed for the target protein, so that the model has strong expressive capacity.

In order to better implement the compound activity prediction method of the embodiments of the present application, the embodiments of the present application also provide a compound activity prediction apparatus. Referring to fig. 8, fig. 8 is a schematic structural diagram of a device for predicting activity of a compound according to an embodiment of the present disclosure. The compound activity prediction apparatus 800 may include:

an obtaining unit 810, configured to obtain a target protein corresponding to a compound to be detected;

a determining unit 820, configured to determine a target feature of the target protein according to the information of the measured active compound corresponding to the target protein; and

inputting the target point characteristics into a trained functional area determination network for processing so as to determine a target functional area corresponding to each layer of the network in the basic neural network, wherein the trained functional area determination network is used for predicting the selected probability of the target functional area; and

determining an activity prediction model of target protein according to target function regions corresponding to each layer of network in the basic neural network;

the predicting unit 830 is configured to predict the compound to be detected according to the activity prediction model of the target protein, so as to obtain a prediction result of the activity of the compound to be detected on the target protein.

Optionally, the determining unit 820 may be configured to input the molecular structure characteristics of the measured active compound and the activity data of the measured active compound on the target protein into a trained target feature network for processing, so as to determine the target feature of the target protein, where the trained target feature network is trained based on similarities between the activity data of all historical target proteins.

Optionally, the determining unit 820 may be further configured to perform feature extraction on the molecular structure feature of the detected active compound by using a feature extraction network, so as to obtain an intermediate feature of the detected active compound; connecting the intermediate characteristics of the tested active compound and the activity data of the tested active compound on the target protein in series; and inputting the data after series connection into a full-connection network containing a preset number of neurons, and averaging output results of full-connection layers corresponding to the molecular structures of all the tested active compounds to obtain the target characteristics of the target protein.

Optionally, the determining unit 820 may be further configured to determine the series connection characteristic according to the target point characteristic and the selected probabilities corresponding to all the undetermined functional regions of the previous layer of network in the basic neural network; determining the hidden characteristics of the current layer network according to the series characteristics and the hidden characteristics of the previous layer network in the basic neural network; and processing the hidden characteristics of the current layer network by adopting the trained functional area determination network so as to determine a target functional area of the current layer network from a plurality of functional areas to be determined.

Optionally, the determining unit 820 may be further configured to perform normalization processing on the hidden features of all to-be-determined functional regions of each layer in the basic neural network through the trained functional region determination network to obtain selected probabilities corresponding to all to-be-determined functional regions; and determining a target function area of the current layer network according to the undetermined function area corresponding to the maximum value in the selected probabilities corresponding to all the undetermined function areas.

Optionally, the determining unit 820 may be further configured to sequentially connect the target functional regions of each layer in the basic neural network to obtain a basic activity prediction model of the target protein; adjusting the basic activity prediction model of the target protein at least once by adopting a gradient descent method; and determining an activity prediction model of the target protein according to the adjusted basic activity prediction model.
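The target feature computation that the determining unit 820 may perform (concatenating each compound's intermediate feature with its activity, applying a fully connected layer, and averaging over all tested compounds) can be sketched as follows. The ReLU activation, the weights, and the function names are assumptions for illustration; the present application does not specify them.

```python
def dense_layer(inputs, weights, biases):
    """Fully connected layer with a ReLU activation (the activation is an assumption)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def target_feature(intermediate_features, activities, weights, biases):
    """Average the dense-layer outputs over all tested compounds of one target."""
    outputs = []
    for feature, activity in zip(intermediate_features, activities):
        concatenated = list(feature) + [activity]  # series connection of feature and activity
        outputs.append(dense_layer(concatenated, weights, biases))
    count = len(outputs)
    return [sum(column) / count for column in zip(*outputs)]


# Hypothetical data: two tested compounds with 2-dimensional intermediate features.
features = [[1.0, 2.0], [3.0, 4.0]]
activities = [0.5, 1.5]
weights = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 2 neurons, 3 inputs each
biases = [0.0, 0.0]
t = target_feature(features, activities, weights, biases)
```

Averaging over compounds makes the target feature independent of the order and number of tested compounds, which suits target proteins that have only a few measured activities.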

The embodiment of the application also provides a network training device for predicting the activity of the compound. Referring to fig. 9, fig. 9 is a schematic structural diagram of a network training device for compound activity prediction according to an embodiment of the present application. The network training apparatus 900 for predicting the activity of the compound may include:

an obtaining unit 910, configured to obtain training sample data, where the training sample data includes a historical data set of all historical target proteins, and each historical data includes at least one historical target protein and activity data of a historical measured active compound on the historical target protein;

a determining unit 920, configured to determine, by using a target feature network, a target feature of a historical target protein according to a molecular structure feature of a historical measured active compound of the historical target protein and activity data of the historical measured active compound on the historical target protein; and

a training unit 930, configured to train the target feature network, the basic neural network, and the functional area determination network by using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data, so as to obtain a trained target feature network, a trained basic neural network, and a trained functional area determination network.

Optionally, the training unit 930 may be configured to input the molecular structure characteristic of the historical measured active compound of the historical target protein into the basic neural network corresponding to the historical target protein with the network parameter, so as to obtain the historical predicted activity of the historical measured active compound on the historical target protein; determining a first objective function according to the historical predicted activity of the historical measured active compound on the historical target protein and the activity data of the historical measured active compound on the historical target protein; and adjusting the network parameters of the basic neural network according to the determined first objective function until a training end condition is met.

Optionally, the training unit 930 may be further configured to randomly divide the historical data set into a plurality of disjoint historical data subsets; performing feature extraction on the plurality of historical data subsets by adopting the target point feature network to obtain target point features of the plurality of historical data subsets; determining a second objective function according to the similarity of the target features of the same historical target proteins in the target features of the plurality of historical data subsets and the similarity of the target features of different historical target proteins; optimizing the second objective function to train the target feature network.

Optionally, the training unit 930 may be further configured to determine a third objective function according to the first objective function and the second objective function; optimizing the third objective function to train the basic neural network, the target feature network, and the functional area determination network.

It should be noted that, for the functions of each module in the compound activity prediction apparatus 800 and the network training apparatus 900 in the embodiments of the present application, reference may be made to the specific implementation manner of any embodiment in the foregoing method embodiments, and details are not described here again.

The respective units in the compound activity prediction apparatus 800 and the network training apparatus 900 described above may be implemented in whole or in part by software, hardware, or a combination thereof. The units may be embedded, in hardware form, in a processor of the computer device or be independent of it, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the units.

The compound activity prediction device 800 and the network training device 900 may, for example, be integrated in a terminal or server that has a memory and a processor installed and has computing capability, or the compound activity prediction device 800 and the network training device 900 may themselves be the terminal or server. The terminal can be a smart phone, a tablet computer, a notebook computer, a smart television, a smart speaker, a wearable smart device, a Personal Computer (PC), and the like, and the terminal can further include a client, which can be a video client, a browser client, an instant messaging client, and the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.

Fig. 10 is a schematic structural diagram of a compound activity prediction apparatus 800 provided in an embodiment of the present application, where the compound activity prediction apparatus 800 may include: a communication interface 801, a memory 802, a processor 803, and a communication bus 804. The communication interface 801, the memory 802, and the processor 803 communicate with each other via a communication bus 804. The communication interface 801 is used for data communication between the apparatus 800 and an external device. The memory 802 may be used to store software programs and modules, and the processor 803 may operate the software programs and modules stored in the memory 802, such as the software programs of the corresponding operations in the foregoing method embodiments.

Optionally, the processor 803 may invoke the software programs and modules stored in the memory 802 to perform the following operations:

obtaining target point protein corresponding to the compound to be detected;

inputting the target point characteristics into a trained functional area determination network for processing so as to determine a target functional area corresponding to each layer of the network in the trained basic neural network, wherein the trained functional area determination network is used for predicting the selected probability of the target functional area;

determining an activity prediction model of the target protein according to a target function region corresponding to each layer of network in the trained basic neural network;

Optionally, the compound activity prediction apparatus 800 may, for example, be integrated in a terminal or a server that has a memory and a processor installed and has computing capability, or the compound activity prediction apparatus 800 may itself be the terminal or the server. The terminal can be a smart phone, a tablet computer, a notebook computer, a smart television, a smart speaker, a wearable smart device, a personal computer and the like. The server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and can also be a cloud server for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, big data and artificial intelligence platform and the like.

Fig. 11 is a schematic structural diagram of a network training apparatus 900 according to an embodiment of the present application. As shown in fig. 11, the network training apparatus 900 may include: a communication interface 901, a memory 902, a processor 903, and a communication bus 904. The communication interface 901, the memory 902 and the processor 903 communicate with each other through the communication bus 904. The communication interface 901 is used for data communication between the apparatus 900 and an external device. The memory 902 may be used for storing software programs and modules, and the processor 903 may run the software programs and modules stored in the memory 902, for example, the software programs of the corresponding operations in the foregoing method embodiments.

Optionally, the processor 903 may invoke the software programs and modules stored in the memory 902 to perform the following operations:

Optionally, the network training device 900 may be integrated in a terminal or a server having a memory and a processor, or the network training device 900 may itself be the terminal or the server. The terminal can be a smart phone, a tablet computer, a notebook computer, a smart television, a smart speaker, a wearable smart device, a personal computer and the like. The server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and can also be a cloud server for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, big data and artificial intelligence platform and the like.

Optionally, the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps in the foregoing method embodiments when executing the computer program.

The present application also provides a computer-readable storage medium for storing a computer program. The computer readable storage medium can be applied to a computer device, and the computer program enables the computer device to execute the corresponding procedures in the compound activity prediction method in the embodiments of the present application, which are not described herein again for brevity.

The present application also provides a computer program product comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the corresponding procedures in the compound activity prediction method in the embodiment of the present application, which are not described herein again for brevity.

The present application also provides a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the corresponding procedures in the compound activity prediction method in the embodiment of the present application, which are not described herein again for brevity.

It should be understood that the processor of the embodiments of the present application may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in a memory, and the processor reads information from the memory and completes the steps of the method in combination with its hardware.

It will be appreciated that the memory in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to include, without being limited to, these and any other suitable types of memory.

It should be understood that the above memories are illustrative and not limiting. For example, the memory in the embodiments of the present application may also be Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Direct Rambus RAM (DR RAM), and the like. That is, the memory in the embodiments of the present application is intended to include, without being limited to, these and any other suitable types of memory.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer or a server) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for predicting the activity of a compound, said method comprising:

obtaining a target protein corresponding to a compound to be tested;

inputting the target feature into a trained functional area determination network for processing, so as to determine a target functional area corresponding to each layer of a trained basic neural network, wherein the trained functional area determination network is used for predicting the selected probability of the target functional area;

2. The method of claim 1, wherein determining the target feature of the target protein according to the information of the tested active compound corresponding to the target protein comprises:

inputting the molecular structure features of the tested active compound and the activity data of the tested active compound on the target protein into a trained target feature network for processing, so as to determine the target feature of the target protein, wherein the trained target feature network is obtained by training based on similarities among the activity data of all historical target proteins.

3. The method of claim 2, wherein the target feature network comprises a feature extraction network and a fully connected network, and wherein inputting the molecular structure features of the tested active compound and the activity data of the tested active compound on the target protein into the target feature network for processing to determine the target feature of the target protein comprises:

performing feature extraction on the molecular structure features of the tested active compound by using the feature extraction network to obtain intermediate features of the tested active compound;

concatenating the intermediate features of the tested active compound and the activity data of the tested active compound on the target protein;

and inputting the concatenated data into the fully connected network containing a preset number of neurons, and averaging the fully connected layer outputs corresponding to the molecular structures of all tested active compounds to obtain the target feature of the target protein.
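The three steps of claim 3 (feature extraction, concatenation with the activity value, and averaging of the fully connected outputs) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the application: the single dense layer with ReLU standing in for the feature extraction network, the random untrained weights, the layer sizes, and the helper names `linear`, `init_layer`, and `target_feature`.

```python
import random

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Random, untrained weights (illustrative only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def target_feature(compounds, activities, extractor, fc):
    """Average the per-compound fully connected outputs into one target feature."""
    outputs = []
    for mol, act in zip(compounds, activities):
        intermediate = relu(linear(mol, *extractor))  # feature extraction
        joined = intermediate + [act]                 # concatenate with activity data
        outputs.append(linear(joined, *fc))           # fully connected layer
    n = len(outputs)
    return [sum(col) / n for col in zip(*outputs)]

rng = random.Random(0)
extractor = init_layer(4, 3, rng)  # 4-dim molecular features -> 3-dim intermediate
fc = init_layer(3 + 1, 2, rng)     # assumed preset number of neurons: 2
compounds = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 1.0]]
activities = [6.2, 7.5, 5.1]       # made-up activity values
feature = target_feature(compounds, activities, extractor, fc)
```

Because the outputs are averaged, the resulting feature does not depend on the order in which the tested active compounds are presented, and its dimension is fixed by the fully connected layer regardless of how many compounds are available.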

4. The method of claim 1, wherein each layer of the trained basic neural network comprises a plurality of pending functional areas, and wherein inputting the target feature into the trained functional area determination network for processing to determine the target functional area corresponding to each layer of the trained basic neural network comprises:

determining a concatenated feature according to the target feature and the selected probabilities corresponding to all pending functional areas of the previous layer in the trained basic neural network;

determining the hidden features of the current layer according to the concatenated feature and the hidden features of the previous layer in the trained basic neural network;

and processing the hidden features of the current layer using the trained functional area determination network, so as to determine the target functional area of the current layer from the plurality of pending functional areas.

5. The method of claim 4, wherein processing the hidden features of the current layer using the trained functional area determination network to determine the target functional area of the current layer from the plurality of pending functional areas comprises:

performing normalization processing on the hidden features of all pending functional areas of each layer of the trained basic neural network through the trained functional area determination network, to obtain the selected probabilities corresponding to all the pending functional areas;
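The layer-by-layer selection in claims 4 and 5 can be sketched as a small recurrent loop: the target feature is concatenated with the previous layer's selected probabilities, combined with the previous hidden features, and normalized into probabilities over the pending functional areas. The sketch below is an illustration under stated assumptions, not the application's network: the tanh recurrent cell, the softmax normalization, the uniform starting probabilities, the random untrained weights, and in particular the argmax selection rule are all assumed, since the text reproduced here does not state how an area is picked from the probabilities. All function names are hypothetical.

```python
import math
import random

def softmax(scores):
    """Normalize scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def select_regions(target_feat, n_layers, n_regions, hidden_dim, rng):
    """Return the chosen area index and the probabilities for every layer."""
    w_in = rand_matrix(hidden_dim, len(target_feat) + n_regions, rng)
    w_hid = rand_matrix(hidden_dim, hidden_dim, rng)
    w_out = rand_matrix(n_regions, hidden_dim, rng)
    probs = [1.0 / n_regions] * n_regions  # assumed uniform start before layer 1
    hidden = [0.0] * hidden_dim
    chosen, all_probs = [], []
    for _ in range(n_layers):
        series = target_feat + probs       # concatenated (series) feature
        hidden = [math.tanh(a + b)
                  for a, b in zip(matvec(w_in, series), matvec(w_hid, hidden))]
        probs = softmax(matvec(w_out, hidden))  # normalization step
        # Assumed selection rule: take the most probable pending area.
        chosen.append(max(range(n_regions), key=probs.__getitem__))
        all_probs.append(probs)
    return chosen, all_probs

chosen, all_probs = select_regions(
    [0.3, -1.2], n_layers=3, n_regions=4, hidden_dim=5, rng=random.Random(0)
)
```

Feeding the previous layer's probabilities back into the next step is what lets the choice at one layer influence the choice at the next; a sampling rule could replace the argmax without changing the rest of the loop.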

6. The method according to any one of claims 1 to 5, wherein determining the activity prediction model of the target protein according to the target functional area corresponding to each layer of the trained basic neural network comprises:

sequentially connecting the target functional areas of each layer of the trained basic neural network to obtain a basic activity prediction model of the target protein;

adjusting the basic activity prediction model of the target protein at least once by adopting a gradient descent method;

and determining an activity prediction model of the target protein according to the adjusted basic activity prediction model.
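Claim 6 can be read as two stages: assemble a basic model by chaining the selected functional area of each layer, then adjust it with at least one gradient descent step on the target protein's measured data. The sketch below illustrates that flow under several assumptions that are not from the application: each functional area is reduced to a scalar (weight, bias) pair, the loss is mean squared error, and central finite differences stand in for backpropagation. `REGION_BANK` and all function names are hypothetical.

```python
import math

# Hypothetical bank of pending functional areas: each layer offers two
# candidate (weight, bias) pairs. All values are illustrative.
REGION_BANK = [
    [(0.5, 0.0), (1.5, 0.1)],
    [(1.0, 0.0), (-0.5, 0.2)],
    [(2.0, 0.0), (0.3, 0.5)],
]

def assemble(chosen):
    """Connect the selected area of each layer in order; returns flat params."""
    params = []
    for layer, idx in zip(REGION_BANK, chosen):
        params.extend(layer[idx])
    return params

def predict(params, x):
    """Chain of scalar affine layers with tanh on all but the last layer."""
    h = x
    n_layers = len(params) // 2
    for i in range(n_layers):
        h = params[2 * i] * h + params[2 * i + 1]
        if i < n_layers - 1:
            h = math.tanh(h)
    return h

def mse(params, xs, ys):
    return sum((predict(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def numeric_grad(loss_fn, params, eps=1e-6):
    """Central finite differences (a stand-in for backpropagation)."""
    grads = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        down = params[:i] + [params[i] - eps] + params[i + 1:]
        grads.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grads

def fine_tune(params, xs, ys, lr=0.05, steps=1):
    """Adjust the basic model with one or more gradient descent steps."""
    for _ in range(steps):
        grads = numeric_grad(lambda p: mse(p, xs, ys), params)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

xs, ys = [0.0, 0.5, 1.0], [0.2, 0.9, 1.1]  # made-up measured data
base = assemble([1, 0, 0])                 # basic activity prediction model
tuned = fine_tune(base, xs, ys, steps=3)   # adjusted model
```

Starting from an assembled model rather than from random weights is what makes a handful of steps meaningful when only a few measured compounds exist for the target protein.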

7. A method of network training for prediction of compound activity, the method comprising:

8. The method according to claim 7, wherein training the target feature network, the basic neural network, and the functional area determination network using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data comprises:

inputting the molecular structure features of the historical tested active compounds of the historical target protein into the basic neural network that corresponds to the historical target protein and has the network parameters, to obtain the historical predicted activity of the historical tested active compounds on the historical target protein;

determining a first objective function according to the historical predicted activity of the historical tested active compounds on the historical target protein and the activity data of the historical tested active compounds on the historical target protein;

and adjusting the network parameters of the basic neural network according to the determined first objective function until a training end condition is met.
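The loop described in claim 8 (predict, compare with measured activity, adjust parameters until an end condition holds) can be sketched as follows. The claim does not fix the form of the first objective function or the end condition, so this sketch assumes a mean squared error objective, a linear model standing in for the basic neural network, and a stop rule of "loss below a tolerance or a step limit reached". The data and names are hypothetical.

```python
def predict(weights, bias, features):
    """Linear stand-in for the basic neural network."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def first_objective(weights, bias, data):
    """Assumed form: mean squared error between predicted and measured activity."""
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in data) / len(data)

def train(data, lr=0.1, tol=1e-6, max_steps=5000):
    n_features = len(data[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(max_steps):
        # Assumed training end condition: objective below tolerance.
        if first_objective(weights, bias, data) < tol:
            break
        errors = [predict(weights, bias, x) - y for x, y in data]
        scale = 2.0 / len(data)
        grad_w = [scale * sum(e * x[j] for e, (x, _) in zip(errors, data))
                  for j in range(n_features)]
        grad_b = scale * sum(errors)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias, first_objective(weights, bias, data)

# Made-up historical records: (molecular structure features, measured activity).
history = [([1.0, 0.0], 5.0), ([0.0, 1.0], 6.0), ([1.0, 1.0], 7.0)]
weights, bias, final_loss = train(history)
```

With these three records the data are fit exactly by weights near 1 and 2 and a bias near 4, so the loop ends through the tolerance check rather than the step limit.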

9. The method according to claim 8, wherein training the target feature network, the basic neural network, and the functional area determination network using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data further comprises:

randomly partitioning the historical data set into a plurality of disjoint historical data subsets;

performing feature extraction on the plurality of historical data subsets using the target feature network to obtain target features of the plurality of historical data subsets;

determining a second objective function according to the similarity between target features of the same historical target protein and the similarity between target features of different historical target proteins, among the target features of the plurality of historical data subsets;

optimizing the second objective function to train the target feature network.
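The idea in claim 9 is that features computed from two disjoint subsets of the same historical target protein should be similar, while features from different target proteins should be dissimilar. One common way to express that is a contrastive objective. The sketch below uses an InfoNCE-style loss with cosine similarity; that choice, the temperature value, the per-target split, the averaging stand-in for the target feature network, and every name and data value are assumptions for illustration, since the claim does not specify the form of the second objective function.

```python
import math
import random

def random_disjoint_split(records, n_subsets, rng):
    """Shuffle, then deal records round-robin into disjoint subsets."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]

def mean_feature(subset):
    """Stand-in for the target feature network: average the record vectors."""
    n = len(subset)
    return [sum(col) / n for col in zip(*subset)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def second_objective(feats_a, feats_b, temperature=0.5):
    """InfoNCE-style loss: row i of feats_a should match row i of feats_b."""
    total = 0.0
    for i, anchor in enumerate(feats_a):
        logits = [cosine(anchor, other) / temperature for other in feats_b]
        log_denominator = math.log(sum(math.exp(v) for v in logits))
        total += log_denominator - logits[i]
    return total / len(feats_a)

rng = random.Random(0)
# Made-up records per historical target: [feature, feature, activity].
history = {
    "target_A": [[1.0, 0.1, 6.0], [0.9, 0.0, 6.5], [1.0, 0.2, 5.9], [0.8, 0.1, 6.2]],
    "target_B": [[0.0, 1.0, 2.0], [0.1, 0.9, 2.5], [0.2, 1.0, 1.8], [0.0, 0.8, 2.2]],
}
splits = {name: random_disjoint_split(recs, 2, rng) for name, recs in history.items()}
feats_a = [mean_feature(splits[name][0]) for name in history]
feats_b = [mean_feature(splits[name][1]) for name in history]
loss = second_objective(feats_a, feats_b)
```

The loss is small when each target's two subset features point in the same direction and grows when a feature is closer to another target's subset, which is the behavior the claim asks the second objective function to capture.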

10. The method according to claim 9, wherein training the target feature network, the basic neural network, and the functional area determination network using the network parameters of the basic neural network corresponding to the historical target protein and the training sample data further comprises:

optimizing the third objective function to train the basic neural network, the target feature network, and the functional area determination network.

11. The method of claim 10, wherein the training process further comprises:

and adjusting the third objective function at least once using a gradient descent method, so as to update the network parameters of the basic neural network, the network parameters of the target feature network, and the network parameters of the functional area determination network.

12. A compound activity prediction device, the device comprising:

13. A network training apparatus for compound activity prediction, the apparatus comprising:

14. A computer-readable storage medium, characterized in that it stores a computer program adapted to be loaded by a processor for performing the steps of the method according to any one of claims 1-11.

15. A computer device, characterized in that the computer device comprises a processor and a memory, the memory storing a computer program, and the processor being adapted to carry out the steps of the method according to any one of claims 1-11 by calling the computer program stored in the memory.

16. A computer program product comprising computer instructions, characterized in that said computer instructions, when executed by a processor, implement the steps in the method of any of claims 1-11.