CN113782114B

CN113782114B - Automatic excavating method of oligopeptide medicine lead based on machine learning

Info

Publication number: CN113782114B
Application number: CN202111094052.6A
Authority: CN
Inventors: 张永彪; 肖百川; 王晓刚; 马超
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2021-09-17
Filing date: 2021-09-17
Publication date: 2024-02-09
Anticipated expiration: 2041-09-17
Also published as: CN113782114A

Abstract

The invention discloses an automatic mining method for oligopeptide drug leads based on machine learning, comprising the following steps: acquiring a functional protein set and extracting the intrinsically disordered regions (IDRs) of the functional protein set; constructing an N-Gram model based on a deep neural network; learning the semantic distribution pattern of the IDRs with the N-Gram model to obtain context probability vectors for the amino acids of potentially druggable oligopeptides; simulating the de novo elongation of oligopeptides with a Monte Carlo method according to the amino acid context probability vectors to obtain candidate oligopeptides; and scoring and ranking the candidate oligopeptides and selecting several top-ranked candidates for functional verification. By combining an N-Gram model with a Monte Carlo method, the invention mines potentially druggable functional oligopeptides from a set of functional proteins positively associated with the treatment of a given disease, and is generally applicable.

Description

Automatic mining method for oligopeptide drug leads based on machine learning

Technical Field

The invention relates to the technical field of computer-aided drug design, and in particular to an automatic mining method for oligopeptide drug leads based on machine learning.

Background

Polypeptide drugs are highly selective and potent, and are safe and well tolerated. However, traditional polypeptide drug design relies heavily on accurate protein structure and function annotation, which makes drug development costly and slow. To reduce the cost and duration of drug development, various machine learning and statistical analysis methods have been applied to assist it, with good progress.

In recent years, almost all commonly used machine learning methods, including deep neural networks, support vector machines, k-nearest neighbors (KNN), random forests, gradient boosting machines (GBMs), logistic regression, discriminant analysis, and hidden Markov models, have been used in AI-assisted drug development. In terms of application, this work has concentrated on fields with mature data resources: antimicrobial peptides (AMPs), anticancer peptides (ACPs), and tumor neoantigens.

Based on the features used, these algorithms fall into two categories. The first is deep learning methods, which achieve high accuracy without manually designed features but are data-hungry and have opaque decision processes. The second is traditional machine learning methods based on feature engineering, which have lower model capacity than deep learning but can give more accurate results from high-quality manual features when data are scarce. Common manual features themselves fall into two classes. One class is based on the composition of the primary sequence, for example: the number of amino acid residues at the N- and C-termini or in the whole peptide; the pseudo amino acid composition (PseAAC) method; sequence-order-based methods; and Evolutionary Feature Construction (EFC) methods, which are based on non-local correlations between motifs. The other class is based on the physicochemical properties of the natural amino acids, using the average of physicochemical indices over the entire polypeptide sequence or over the amino acids at its termini. Taking antimicrobial peptides as an example, 56 primary-sequence-based physicochemical indices are in common use, comprising 47 peptide-fragment features and 9 global features, and including known structure-activity indices such as t-scale and u-polarity.

However, these methods, which work well for polypeptide drug development, are difficult to apply to oligopeptide drugs. On the one hand, far less data is available for oligopeptide drugs than for polypeptide drugs such as ACPs and AMPs. To date there are 28 FDA-approved oligopeptide drugs and 55 in experimental stages, most of which are modifications or derivatives of the same oligopeptides, which severely limits the use of supervised learning methods such as deep learning. On the other hand, because oligopeptide drugs contain few amino acid residues, manual features developed for polypeptide drugs have little discriminative power on oligopeptides, making feature transfer difficult. The lack of prior information and the short length of oligopeptides make designing dedicated manual features for them both important and difficult.

Therefore, an automatic, machine-learning-based design method for oligopeptide drugs is urgently needed.

Disclosure of Invention

In view of the above, the invention provides an automatic mining method for oligopeptide drug leads based on machine learning, which combines an N-Gram model with a Monte Carlo method to mine potentially druggable functional oligopeptides from a set of functional proteins positively associated with the treatment of a given disease, and which is generally applicable.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

An automatic mining method for oligopeptide drug leads based on machine learning, comprising the following steps:

S1, acquiring a functional protein set, and extracting the IDRs of the functional protein set;

S2, constructing an N-Gram model based on a deep neural network;

S3, learning the semantic distribution pattern of the IDRs with the N-Gram model to obtain context probability vectors for the amino acids of potentially druggable oligopeptides;

S4, simulating the de novo elongation of oligopeptides with a Monte Carlo method according to the amino acid context probability vectors to obtain candidate oligopeptides;

S5, scoring and ranking the candidate oligopeptides, and selecting several top-ranked candidate oligopeptides for functional verification.

Preferably, in the above automatic mining method for oligopeptide drug leads based on machine learning, the expression of the N-Gram model is:

wherein F denotes the deep neural network; θ denotes the parameters of F to be learned; the index term denotes the sequence number of the k-th word ω_k; and v(context(ω_k)) denotes the word vector of the context, context(ω_k), of the word ω_k.
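Under the symbol definitions above, and assuming the standard form of a neural probabilistic language model, the expression can be sketched as follows. The symbol i_{ω_k} for the word index and the order of the arguments of F are assumptions made for illustration, not text recovered from the original.

```latex
p\bigl(\omega_k \mid \mathrm{context}(\omega_k)\bigr)
  \approx F\bigl(i_{\omega_k},\; v(\mathrm{context}(\omega_k));\; \theta\bigr)
```

Read this way, F maps the word vector of the context to the conditional probability of each candidate next word, and θ is fitted on the IDR sequences.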

Preferably, in the above automatic mining method for oligopeptide drug leads based on machine learning, S4 comprises the following steps:

S41, selecting any amino acid as the initial amino acid;

S42, inferring the context probability vector of the linking amino acid of the oligopeptide to be elongated using the N-Gram model;

S43, generating the linking amino acid by Monte Carlo simulation according to the context probability vector inferred in S42;

S44, joining the linking amino acid to the current oligopeptide to be elongated to obtain a new oligopeptide to be elongated;

S45, cyclically executing S42-S44, elongating by one amino acid in each round, until a preset end condition is met, so as to obtain the candidate oligopeptide.

Preferably, in the above automatic mining method for oligopeptide drug leads based on machine learning, the preset end conditions in S45 are: the elongated oligopeptide reaches a length of 10, and the probabilities of all potential linking amino acids of the current oligopeptide are smaller than the random probability.

Preferably, in the above automatic mining method for oligopeptide drug leads based on machine learning, S5 comprises:

grouping the candidate oligopeptides by length and clustering them within each group;

scoring the recommendation degree of the oligopeptides in each cluster of each group, and selecting several top-scoring candidate oligopeptides for functional verification.

Preferably, in the above automatic mining method for oligopeptide drug leads based on machine learning, in S5, if the functional verification results of one or more of the selected top-ranked candidate oligopeptides meet a preset requirement, functional verification is continued on the remaining candidate oligopeptides in the clusters containing those oligopeptides.

Preferably, in the above automatic mining method for oligopeptide drug leads based on machine learning, in S5, the product of the context probabilities of the linking amino acids over all rounds of oligopeptide elongation is used as the recommendation score of a candidate oligopeptide, and the candidate oligopeptides are ranked by recommendation score.

Preferably, in the above automatic mining method for oligopeptide drug leads based on machine learning, the deep neural network architecture of the N-Gram model consists of an input layer, a projection layer, a hidden layer, an output layer, and a SoftMax layer.

Compared with the prior art, the invention discloses an automatic mining method for oligopeptide drug leads based on machine learning. Because protein IDRs are the structural basis of protein phase transition, and phase transition is strongly correlated with the occurrence of disease, the invention uses IDRs as feature regions. This circumvents the shortage of data sets to some extent and improves the success rate of developing oligopeptide drugs from small samples.

The invention also takes into account the difficulty of manually designing oligopeptide descriptors, and adopts a deep learning method to avoid manual feature design. Because oligopeptides have no long-range semantic patterns and the functional protein set (that is, the model training set) is usually small, the most basic natural language processing model, the N-Gram model, is used to mine the semantic patterns of IDRs and thereby learn the amino acid distribution patterns of potentially druggable oligopeptides. The N-Gram model is essentially a conditional probability model. It functions similarly to a conventional naive Bayes model but computes the conditional probabilities between words with a deep neural network, so its model capacity is larger than that of traditional machine learning models and no manual features need to be designed. The model is simple in principle, does not depend on large amounts of training data, and exposes the decision probability of every step, which makes it well suited to oligopeptide drug development. In addition, the invention simulates the de novo elongation of oligopeptides with a Monte Carlo method, so that the de novo design of oligopeptide drugs more closely resembles the natural process.

In general, the invention fills a research gap in the field by using machine learning for fully automatic mining of oligopeptide drug leads. The invention is also highly general: for any application scenario (indication), potentially druggable functional oligopeptides can be extracted simply by providing a set of functional proteins positively associated with the treatment of the disease.

Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of the automatic mining method for oligopeptide drug leads based on machine learning;

FIG. 2 is an overall flow chart of mining therapeutic oligopeptides from a functional protein set according to the present invention;

FIG. 3 is a flow chart of obtaining candidate oligopeptides by combining the N-Gram model and the Monte Carlo method;

FIG. 4(A)-(E) show the candidate oligopeptides provided by the invention and the experimental verification results.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without inventive effort fall within the scope of protection of the invention.

As shown in FIG. 1, an embodiment of the invention discloses an automatic mining method for oligopeptide drug leads based on machine learning, comprising the following steps:

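S1, acquiring a functional protein set, and extracting the IDRs of the functional protein set;

S2, constructing an N-Gram model based on a deep neural network;

S3, learning the semantic distribution pattern of the IDRs with the N-Gram model to obtain context probability vectors for the amino acids of potentially druggable oligopeptides;

S4, simulating the de novo elongation of oligopeptides with a Monte Carlo method according to the amino acid context probability vectors to obtain candidate oligopeptides;

S5, scoring and ranking the candidate oligopeptides, and selecting several top-ranked candidate oligopeptides for functional verification.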

The above steps are further described below.

S1, obtaining a functional protein set, and extracting IDRs of the functional protein set.

Within proteins there are hotspot regions called intrinsically disordered regions (IDRs). These regions generally interact with the domains of other proteins through peptide motifs (conserved linear peptide fragments less than 10 residues long) within the region, thereby producing allosteric events. According to existing studies, the phase separation caused by such allosteric events is strongly correlated with the occurrence of disease, so protein IDRs are targets of great interest in drug development.

As shown in FIG. 2, the invention extracts IDRs from the functional protein set as feature regions, which circumvents the shortage of data sets to some extent and improves the success rate of developing oligopeptide drugs from small samples.
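The extraction in S1 can be sketched as follows, assuming a disorder predictor has already produced one score per residue in the range 0 to 1. The function name `extract_idrs`, the 0.5 threshold, and the minimum region length are assumptions made for illustration; the text does not state which cutoff or length filter was used.

```python
from typing import List, Tuple


def extract_idrs(sequence: str, scores: List[float],
                 threshold: float = 0.5,
                 min_length: int = 10) -> List[Tuple[int, str]]:
    """Return (start_index, subsequence) for every run of residues whose
    disorder score is >= threshold and whose length is >= min_length."""
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    regions = []
    start = None  # start index of the run currently being scanned, if any
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, sequence[start:i]))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(sequence) - start >= min_length:
        regions.append((start, sequence[start:]))
    return regions


# Toy input (hypothetical scores, not predictor output).
seq = "MKTAYSSPGSEEAQLL"
scores = [0.1] * 5 + [0.8] * 7 + [0.2] * 4
idrs = extract_idrs(seq, scores, min_length=3)
```

The returned subsequences are the text that the N-Gram model is trained on in S3.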

S2, constructing an N-Gram model based on the deep neural network.

In S2, an N-Gram model based on a deep neural network is constructed. As an unsupervised deep learning model, it can learn the semantic pattern of the functional protein IDRs. The N-Gram model is expressed as follows:

Specifically, the deep neural network architecture of the N-Gram model consists of an input layer, a projection layer, a hidden layer, an output layer, and a SoftMax layer, wherein:

1) Input layer: each residue is mapped to a word vector of length m. The word vectors are randomly initialized before training and updated iteratively during training.

2) Projection layer: the word vectors are mapped into a higher-dimensional space to increase the representational capacity of the model.

3) Hidden layer: a tanh activation is applied to extract deep features.

4) Output layer: the output of the hidden layer is mapped to a low-dimensional feature vector whose dimension equals the number of possible results.

5) SoftMax layer: the output-layer results are normalized to obtain the probability of each result.
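The five layers listed above can be sketched as a single forward pass. This is a minimal, untrained NumPy illustration rather than the implementation used in the invention: the class name `NGramNet`, the context length of 2, the layer sizes, and the random initialization are all assumptions chosen for the example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


class NGramNet:
    """Input -> projection -> tanh hidden -> output -> softmax (untrained)."""

    def __init__(self, n_context=2, m=8, proj_dim=32, hidden_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        vocab = len(AMINO_ACIDS)
        self.n_context = n_context
        # Input layer: one randomly initialized word vector of length m per residue.
        self.embed = rng.normal(0.0, 0.1, (vocab, m))
        # Projection layer: concatenated context vectors -> higher-dimensional space.
        self.w_proj = rng.normal(0.0, 0.1, (n_context * m, proj_dim))
        # Hidden layer weights (tanh is applied in the forward pass).
        self.w_hid = rng.normal(0.0, 0.1, (proj_dim, hidden_dim))
        self.b_hid = np.zeros(hidden_dim)
        # Output layer: one logit per possible next residue.
        self.w_out = rng.normal(0.0, 0.1, (hidden_dim, vocab))
        self.b_out = np.zeros(vocab)

    def context_probabilities(self, context: str) -> np.ndarray:
        """Return the probability of each residue following `context`."""
        if len(context) != self.n_context:
            raise ValueError(f"context must have {self.n_context} residues")
        idx = [AA_INDEX[aa] for aa in context]
        x = self.embed[idx].reshape(-1)                # input layer
        proj = x @ self.w_proj                         # projection layer
        hidden = np.tanh(proj @ self.w_hid + self.b_hid)  # hidden layer
        logits = hidden @ self.w_out + self.b_out      # output layer
        exp = np.exp(logits - logits.max())            # numerically stable softmax
        return exp / exp.sum()


probs = NGramNet().context_probabilities("AG")
```

Because the weights are random, the probabilities here are close to uniform; training on the IDR sequences is what would make them informative.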

S3, learning the semantic distribution pattern of the IDRs with the N-Gram model to obtain context probability vectors for the amino acids of potentially druggable oligopeptides.

The invention uses the N-Gram model to learn the semantic distribution pattern (context probability vectors) of the IDRs of the functional protein set. The semantic distribution pattern refers to the relative positional relationships between characters in a text or sentence. It is represented concretely by a context probability vector, which gives the probability of occurrence of each possible character in a particular context.
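To make the notion of a context probability vector concrete, the sketch below estimates one from raw counts over a sliding window. This count-based estimate is a deliberately simple stand-in for the neural N-Gram model, not the method of the invention; the function names and the context length of 2 are assumptions.

```python
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def context_pairs(idrs, n_context=2):
    """Yield (context, next_residue) pairs from a sliding window over each IDR."""
    for seq in idrs:
        for i in range(n_context, len(seq)):
            yield seq[i - n_context:i], seq[i]


def empirical_context_probabilities(idrs, n_context=2):
    """Map each observed context to a 20-entry probability vector, ordered
    as AMINO_ACIDS, giving the relative frequency of each next residue."""
    counts = defaultdict(Counter)
    for context, nxt in context_pairs(idrs, n_context):
        counts[context][nxt] += 1
    table = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        table[context] = [counter[aa] / total for aa in AMINO_ACIDS]
    return table


# Toy IDR fragments (hypothetical, for illustration only).
table = empirical_context_probabilities(["SSPGSS", "SSPES"])
```

The same (context, next residue) pairs are the kind of training examples a neural model would be fitted on; the network replaces the count table with a learned function that can generalize to contexts it has seen rarely.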

S4, simulating the de novo elongation of oligopeptides with a Monte Carlo method according to the amino acid context probability vectors to obtain candidate oligopeptides.

After the context probability vectors of the amino acids of potentially druggable oligopeptides are obtained from the N-Gram model, the invention introduces Monte Carlo simulation to model the natural elongation of an oligopeptide. The Monte Carlo method uses the probability vector obtained from the SoftMax layer as the probability distribution of the simulator (similar to a random seed) to simulate the de novo elongation of the oligopeptide.

Overall, starting from one amino acid residue, the context probability vector for that character (called character 1) is first computed with the N-Gram model, and the next candidate character (called character 2) is then generated by Monte Carlo simulation. Character 1 and character 2 are concatenated to form the input of the next round (that is, character 1 of the next round). This procedure is repeated until the final output length is reached (by the definition of an oligopeptide, iteration terminates when the length reaches 10).

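Specifically, as shown in FIG. 3, the procedure for generating candidate oligopeptides by combining the N-Gram model and the Monte Carlo method is as follows:

S41, selecting any amino acid as the initial amino acid; in this embodiment, the 10 amino acids occurring most frequently in the functional protein IDRs are selected as initial amino acids;

S42, inferring the context probability vector of the linking amino acid of the oligopeptide to be elongated using the N-Gram model;

S43, generating the linking amino acid by Monte Carlo simulation according to the context probability vector inferred in S42;

S44, joining the linking amino acid to the current oligopeptide to be elongated to obtain a new oligopeptide to be elongated;

S45, cyclically executing S42-S44, elongating by one amino acid in each round, until a preset end condition is met, so as to obtain the candidate oligopeptide.

There are two preset end conditions: condition one, the elongated oligopeptide reaches a length of 10; condition two, the probability of the linking amino acid of the current oligopeptide is less than the random probability, namely 1/20.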
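The elongation loop and its two end conditions can be sketched as follows. The function `elongate` and the stand-in `toy_prob_fn` are hypothetical names; in practice the probability function would be the trained N-Gram model. One assumption deserves emphasis: a vector normalized over exactly 20 residues always has a largest entry of at least 1/20, so the sketch treats condition two as reachable only if the model can place some probability mass outside the 20 residues (for example on an end symbol), which the text does not state.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
RANDOM_PROB = 1.0 / len(AMINO_ACIDS)  # 1/20
MAX_LENGTH = 10


def elongate(start, prob_fn, rng, max_length=MAX_LENGTH):
    """Grow a peptide from `start`, one residue per round.

    prob_fn(peptide) must return 20 probabilities ordered as AMINO_ACIDS.
    Returns (peptide, probabilities of the residues actually sampled).
    """
    peptide, step_probs = start, []
    while len(peptide) < max_length:          # end condition one: length 10
        probs = prob_fn(peptide)              # S42: context probability vector
        if max(probs) < RANDOM_PROB:          # end condition two: all below 1/20
            break
        # S43: Monte Carlo draw of the linking amino acid.
        nxt = rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]
        step_probs.append(probs[AMINO_ACIDS.index(nxt)])
        peptide += nxt                        # S44: join to the current peptide
    return peptide, step_probs


def toy_prob_fn(peptide):
    """Hypothetical stand-in for the trained model: always proposes serine,
    then drops every residue below 1/20 once the peptide has 4 residues."""
    if len(peptide) >= 4:
        return [0.01] * len(AMINO_ACIDS)
    probs = [0.0] * len(AMINO_ACIDS)
    probs[AMINO_ACIDS.index("S")] = 1.0
    return probs


peptide, step_probs = elongate("G", toy_prob_fn, random.Random(0))
```

Keeping the per-step probabilities alongside the peptide is a design choice that makes the later recommendation score a simple product over the returned list.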

After the candidate oligopeptides are obtained, they are grouped by length and clustered within each group, and the oligopeptides in each cluster of each group are then given a recommendation score, which is the product of the context probabilities of the linking amino acids over all rounds of elongation. Finally, functional verification is performed on several top-scoring candidate oligopeptides, and then continued on the remaining candidate oligopeptides in the clusters containing oligopeptides whose verification results meet the requirement.
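The scoring and ranking can be sketched as below. The sketch groups candidates by length only; clustering within each length group is omitted because the text does not name a clustering algorithm. The function names and the default of 3 candidates per group are assumptions for illustration.

```python
import math
from collections import defaultdict


def recommendation_score(step_probs):
    """Product of the context probabilities of the residues added in each round."""
    return math.prod(step_probs)


def top_candidates_by_length(candidates, k=3):
    """candidates: iterable of (peptide, step_probs).

    Returns {length: [(peptide, score), ...]} holding the k highest-scoring
    peptides of each length, best first.
    """
    groups = defaultdict(list)
    for peptide, step_probs in candidates:
        groups[len(peptide)].append((peptide, recommendation_score(step_probs)))
    return {
        length: sorted(items, key=lambda item: item[1], reverse=True)[:k]
        for length, items in groups.items()
    }


# Toy candidates with hypothetical per-step probabilities.
candidates = [
    ("GSS", [0.5, 0.4]),
    ("GSP", [0.9, 0.5]),
    ("GSPE", [0.9, 0.5, 0.5]),
]
ranked = top_candidates_by_length(candidates, k=1)
```

Because a product of probabilities shrinks with every added residue, scores are only comparable between peptides of the same length, which is consistent with ranking inside length groups rather than across them.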

The method of the invention is verified in a specific example as follows:

In practical application, the invention is implemented in three parts. First, the UniProt website (https://www.uniprot.org/) is searched for a set of functional proteins positively associated with the treatment of a given disease; next, the IDRs of these functional proteins are extracted with IUPred2A (https://iupred2a.elte.hu/); finally, the IDRs are input into a deep learning model implementing the N-Gram model and the Monte Carlo method to obtain the desired candidate oligopeptides. This example illustrates mining oligopeptides for the treatment of osteoporosis (promoting osteogenesis):

1. On UniProt, 171 related functional protein sequences were retrieved using four keywords: "ossification", "osseogenesis", "osteoblast development", and "osteoblast differentiation".

2. The IDRs of these functional proteins were predicted with IUPred2A.

3. The protein IDR sequences were input into the deep learning model to obtain candidate oligopeptides.

4. The candidate oligopeptides were grouped by length and clustered; from each cluster obtained, the 3 highest-scoring oligopeptides were selected for functional verification, and if the top 3 oligopeptides of a cluster showed good experimental effect, the remaining oligopeptides in that cluster were also verified in cell experiments.

As shown in FIG. 4(A), several oligopeptides with good osteogenesis-promoting effect were finally obtained. Cell experiments were performed on 28 oligopeptides generated by the algorithm, and Alizarin Red S (ARS) staining values were obtained; the darker the color, the stronger the osteogenic function. Animal experiments were then performed on the oligopeptide with the best cell-experiment effect (named AIB5P). FIG. 4(B) shows calcein and xylenol orange double labeling of the femur from the animal experiments; scale bar, 100 μm. FIG. 4(C) shows von Kossa staining of the femur; scale bar, 200 μm. FIG. 4(D) shows anti-DMP1 immunohistochemical staining. FIG. 4(E) shows representative micro-CT images of mouse femurs: the upper part is a mid-axis scan in longitudinal section (scale bar, 1 mm), and the lower part shows the trabecular bone beneath the growth plate (scale bar, 500 μm). The oligopeptide was found to have a good bone-formation-promoting effect.

In this specification, the embodiments are described progressively; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points may be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An automatic mining method for oligopeptide drug leads based on machine learning, characterized by comprising the following steps:

S2, constructing an N-Gram model based on a deep neural network;

S4, simulating the de novo elongation of oligopeptides with a Monte Carlo method according to the amino acid context probability vectors to obtain candidate oligopeptides, specifically comprising the following steps:

S41, selecting any amino acid as the initial amino acid;

S45, cyclically executing S42-S44, elongating by one amino acid in each round, until a preset end condition is met, so as to obtain the candidate oligopeptides;

2. The automatic mining method for oligopeptide drug leads based on machine learning according to claim 1, wherein the expression of the N-Gram model is:

3. The automatic mining method for oligopeptide drug leads based on machine learning according to claim 1, wherein the preset end conditions in S45 are: the elongated oligopeptide reaches a length of 10, and the probabilities of all potential linking amino acids of the current oligopeptide are smaller than the random probability.

4. The automatic mining method for oligopeptide drug leads based on machine learning according to claim 1, wherein S5 comprises:

5. The automatic mining method for oligopeptide drug leads based on machine learning according to claim 4, wherein in S5, if the functional verification results of one or more of the selected top-ranked candidate oligopeptides meet a preset requirement, functional verification is continued on the remaining candidate oligopeptides in the clusters containing those oligopeptides.

6. The automatic mining method for oligopeptide drug leads based on machine learning according to claim 1, wherein in S5, the product of the context probabilities of the linking amino acids over all rounds of oligopeptide elongation is used as the recommendation score of a candidate oligopeptide, and the candidate oligopeptides are ranked by recommendation score.

7. The automatic mining method for oligopeptide drug leads based on machine learning according to claim 1, wherein the deep neural network architecture of the N-Gram model consists of an input layer, a projection layer, a hidden layer, an output layer, and a SoftMax layer.