CN113409883A

CN113409883A - Information prediction and information prediction model training method, device, equipment and medium

Info

Publication number: CN113409883A
Application number: CN202110738491.XA
Authority: CN
Inventors: 向颖飞; 林炜
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-09-17
Anticipated expiration: 2041-06-30
Also published as: CN113409883B

Abstract

The disclosure provides an information prediction and information prediction model training method, device, equipment and medium, and relates to the technical field of artificial intelligence such as machine learning and natural language processing. The specific implementation scheme is as follows: acquiring parameter information of the target protein based on the original sequence of the target protein; acquiring parameter information of the candidate drug based on the original sequence of the candidate drug; and predicting the affinity of the target protein and the candidate drug according to the parameter information of the target protein, the parameter information of the candidate drug and a pre-trained information prediction model. In addition, a training method of the related information prediction model is also provided. The present disclosure provides a more accurate information prediction model, implementing a more accurate information prediction scheme.

Description

Information prediction and information prediction model training method, device, equipment and medium

Technical Field

The present disclosure relates to the field of computer technologies, in particular to artificial intelligence technologies such as machine learning and natural language processing, and more specifically to a method, an apparatus, a device, and a medium for information prediction and for training an information prediction model.

Background

Drug-Target Interaction (DTI) is an important part of new drug development and refers to the process of mutual recognition and interaction between a drug compound and a target protein. The interaction between a drug and a target protein, i.e., the binding affinity of the drug for the target protein, can generally be measured using physicochemical indices such as the dissociation constant Kd and the inhibition constant Ki.

In recent years, more and more deep-learning-based Artificial Intelligence (AI) methods have been applied in the field of drug-target protein interaction. For example, the original sequence of the drug compound and the original sequence of the target protein may each be represented and learned using, for example, a Convolutional Neural Network (CNN) or a Graph Neural Network (GNN); the learned representations are then made to interact, and the affinity between the drug and the target protein is predicted from the result of that interaction.

Disclosure of Invention

The disclosure provides an information prediction and information prediction model training method, device, equipment and medium.

According to an aspect of the present disclosure, there is provided an information prediction method, wherein the method includes:

acquiring parameter information of the target protein based on the original sequence of the target protein;

acquiring parameter information of the candidate drug based on the original sequence of the candidate drug;

and predicting the affinity of the target protein and the candidate drug according to the parameter information of the target protein, the parameter information of the candidate drug and a pre-trained information prediction model.

According to another aspect of the present disclosure, there is provided a training method of an information prediction model, wherein the method includes:

collecting several groups of training samples; each group of training samples comprises an original sequence of a training drug, an original sequence of a training target protein, and the real affinity between the training drug and the training target protein;

acquiring parameter information of the training drugs based on the original sequences of the training drugs in each group of training samples;

acquiring parameter information of the training target protein based on the original sequence of the training target protein in each group of the training samples;

and training an information prediction model according to the parameter information of the training drug, the parameter information of the training target protein, and the real affinity between the training drug and the training target protein in each group of the training samples.

According to still another aspect of the present disclosure, there is provided an information prediction apparatus, wherein the apparatus includes:

the target information acquisition module is used for acquiring the parameter information of the target protein based on the original sequence of the target protein;

the drug information acquisition module is used for acquiring the parameter information of the candidate drug based on the original sequence of the candidate drug;

and the prediction module is used for predicting the affinity of the target protein and the candidate drug according to the parameter information of the target protein, the parameter information of the candidate drug and a pre-trained information prediction model.

According to still another aspect of the present disclosure, there is provided an apparatus for training an information prediction model, wherein the apparatus includes:

the acquisition module is used for collecting several groups of training samples; each group of training samples comprises an original sequence of a training drug, an original sequence of a training target protein, and the real affinity between the training drug and the training target protein;

the drug information acquisition module is used for acquiring the parameter information of the training drugs based on the original sequences of the training drugs in each group of training samples;

the target information acquisition module is used for acquiring parameter information of the training target protein based on the original sequence of the training target protein in each group of the training samples;

and the training module is used for training an information prediction model according to the parameter information of the training drug, the parameter information of the training target protein, and the real affinity between the training drug and the training target protein in each group of the training samples.

According to still another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of the aspects and any possible implementation described above.

According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the above aspect and any possible implementation.

According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the aspect and any possible implementation as described above.

According to the technology disclosed by the invention, a more accurate information prediction model is provided, and a more accurate information prediction scheme is realized.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;

FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;

FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;

FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;

FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;

FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;

FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;

FIG. 8 is a schematic diagram according to an eighth embodiment of the present disclosure;

FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It is to be understood that the described embodiments are only a few, and not all, of the disclosed embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

It should be noted that the terminal device involved in the embodiments of the present disclosure may include, but is not limited to, a mobile phone, a Personal Digital Assistant (PDA), a wireless handheld device, a tablet computer, and other intelligent devices; the display device may include, but is not limited to, a personal computer, a television, and other devices having a display function.

In addition, the term "and/or" herein merely describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.

FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure; as shown in fig. 1, the present embodiment provides an information prediction method, which may specifically include the following steps:

S101, acquiring parameter information of the target protein based on the original sequence of the target protein;

S102, acquiring parameter information of the candidate drug based on the original sequence of the candidate drug;

S103, predicting the affinity of the target protein and the candidate drug according to the parameter information of the target protein, the parameter information of the candidate drug and a pre-trained information prediction model.

The information prediction method of this embodiment is executed by an information prediction device, which may be an electronic entity or a software-integrated application. The information prediction device of this embodiment is used to predict information such as the affinity of a target protein and a candidate drug.

In this embodiment, the original sequence of the target protein may be the FASTA sequence of the target protein. Parameter information of the target protein is acquired based on this original sequence and may comprise one kind of information or several kinds. The parameter information of the target protein in this embodiment is derived from the original sequence but differs from the FASTA sequence itself; it can be regarded as more detailed information obtained on the basis of the FASTA sequence. For example, the parameter information of the target protein may include substructure-level information, which characterizes the target protein more richly and accurately. Of course, it may further include other information about the target protein, such as at least one of 3D structure information of the conformational level of the target protein, contact map information and distance map information.

Optionally, the original sequence of the candidate drug may be the Simplified Molecular Input Line Entry System (SMILES) sequence of the candidate drug. Similarly, the parameter information of the candidate drug is obtained based on the SMILES sequence of the candidate drug but differs from the SMILES sequence itself; it can be regarded as more detailed information obtained on the basis of the SMILES sequence. For example, the parameter information of the candidate drug may include substructure-level information, which characterizes the candidate drug more richly and accurately. Of course, it may further include other information about the candidate drug, such as at least one of 3D structure information and graph information of the functional groups.

Then, the parameter information of the target protein and the parameter information of the candidate drug can be input into a pre-trained information prediction model, and the information prediction model can predict and output the affinity of the target protein and the candidate drug based on the parameter information of the target protein and the parameter information of the candidate drug.

It should be noted that steps S101 and S102 in this embodiment are not limited to any particular order.

According to the information prediction method, the affinity of the target protein and the candidate drug is predicted according to the parameter information of the target protein, the parameter information of the candidate drug and the pre-trained information prediction model, and the parameter information of the target protein and the parameter information of the candidate drug are different from the information of the original sequence of the target protein and the original sequence of the candidate drug respectively, so that the target protein and the candidate drug can be more accurately represented, and the affinity of the target protein and the candidate drug can be more accurately predicted.
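The three steps of this embodiment can be read as a simple composition of two feature-extraction steps and one model call. The following is a minimal sketch under stated assumptions: the names `predict_affinity`, `toy_protein_featurizer`, `toy_drug_featurizer` and `toy_model` are hypothetical and do not come from the disclosure, and the toy stand-ins only illustrate the data flow, not a real featurizer or a trained model.

```python
from typing import Any, Callable, Dict

Features = Dict[str, Any]


def predict_affinity(
    protein_fasta: str,
    drug_smiles: str,
    protein_featurizer: Callable[[str], Features],
    drug_featurizer: Callable[[str], Features],
    model: Callable[[Features, Features], float],
) -> float:
    """Chain S101, S102 and S103; all three callables are supplied by the caller."""
    target_params = protein_featurizer(protein_fasta)  # S101
    drug_params = drug_featurizer(drug_smiles)         # S102 (order relative to S101 is free)
    return model(target_params, drug_params)           # S103


# Toy stand-ins (hypothetical): they only record sequence lengths.
def toy_protein_featurizer(fasta: str) -> Features:
    return {"length": len(fasta)}


def toy_drug_featurizer(smiles: str) -> Features:
    return {"length": len(smiles)}


def toy_model(target: Features, drug: Features) -> float:
    # Placeholder score, not a trained model.
    return float(target["length"] + drug["length"])


score = predict_affinity("MKT", "CCO", toy_protein_featurizer, toy_drug_featurizer, toy_model)
```

Passing the featurizers and the model in as arguments keeps the sketch agnostic about which kinds of parameter information are actually used.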

FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure; the information prediction method provided in this embodiment further describes the technical solution of the present disclosure in more detail on the basis of the technical solution of the embodiment shown in fig. 1. As shown in fig. 2, the information prediction method provided in this embodiment may specifically include the following steps:

S201, segmenting an original sequence of the target protein to obtain a first unit sequence consisting of a plurality of subunits;

Specifically, the FASTA sequence of the target protein is segmented according to the smallest segmentation unit, so that a first unit sequence composed of a plurality of subunits is obtained.

S202, combining a plurality of subunits in the first unit sequence according to each known protein molecule compound and the corresponding occurrence frequency in a pre-established target protein information table to obtain the substructure information of the target protein;

Each subunit in the first unit sequence is a smallest unit. Some smallest units cannot by themselves constitute protein molecule information and carry little meaning on their own, whereas fragments formed together with other subunits are more valuable. Therefore, in this embodiment, a merging process is also performed on the subunits in the first unit sequence. Specifically, the merging process mainly merges two, three or more consecutive subunits.

In the merging process, this embodiment may use a pre-established target protein information table. The table may be built from an existing target protein database, such as UniProt or BFD, and may include a number of known protein molecule compounds together with the occurrence frequency of each protein molecule compound in the database. During merging, the subunits may be selected one at a time, from front to back, as the object to be analyzed. The first subunit has no preceding subunit to merge with, so the analysis can start directly from the second subunit. For the second subunit, one subunit is taken forward: the first and second subunits are combined into one protein molecule compound, and the target protein information table is searched for the occurrence frequency of that compound. If the compound does not exist in the table, its occurrence frequency is 0. If it exists, the first and second subunits can be combined, and the occurrence frequency of the resulting compound is recorded. The analysis then continues with the third subunit, for which two merging strategies exist. One is to take one subunit forward, that is, to combine the second and third subunits into one protein molecule compound and look up its occurrence frequency in the target protein information table. The other is to take two subunits forward, that is, to combine the first, second and third subunits into one protein molecule compound and look up its occurrence frequency in the table.

The occurrence frequencies of the compound formed by the first and second subunits, the compound formed by the second and third subunits, and the compound formed by the first, second and third subunits are then compared to determine how these subunits are merged. The analysis continues with the subsequent subunits until all subunits in the FASTA sequence of the target protein have been analyzed, at which point the merging process is complete and the substructure information of the target protein is obtained. That is, the substructure information of the target protein is obtained by merging some of the consecutive subunits in the first unit sequence. It follows that, in the merging process, each subunit is merged with its adjacent subunits into the protein molecule compound having the highest occurrence frequency. A subunit that does not appear in any protein molecule compound in the target protein information table remains on its own and is not merged.

The merging process in this embodiment is not limited to the above procedure. Whatever procedure is used, the occurrence frequency of each compound fragment formed by a subunit and its adjacent subunits must be obtained, where the adjacent subunits include not only the directly adjacent subunits but also, once a compound fragment includes a directly adjacent subunit, the subunits adjacent to that one, and so on. Finally, the subunits are merged into the compound fragments with the highest occurrence frequency. For example, if the first unit sequence of the target protein is ABCDE, the compound fragments to be analyzed may include AB, BC, ABC, CD, BCD, ABCD, DE, CDE, BCDE and ABCDE. If BC has the highest occurrence frequency, DE the second highest and BCD the third highest, then B and C are merged and D and E are merged, and the resulting substructure information of the target protein can be expressed as A-BC-DE. Of course, in practical applications merging is not limited to two adjacent subunits: if a compound fragment composed of more subunits has a higher occurrence frequency, those subunits may be merged into one fragment. Substructure information obtained in this way characterizes the target protein more accurately at multiple levels of granularity.

Steps S201 to S202 of this embodiment are a specific implementation manner of step S101 of the embodiment shown in fig. 1.
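Steps S201 and S202 can be illustrated with a small greedy merge. This is a sketch under stated assumptions, not the procedure of the disclosure: the function names, the `max_len` limit, the tie-breaking rule and the toy frequency table are all hypothetical, and the minimal subunit is assumed to be a single FASTA letter. It enumerates every contiguous fragment, looks up its occurrence frequency (0 if absent), and claims fragments from the most frequent downward without overlap.

```python
from typing import Dict, List


def segment(sequence: str) -> List[str]:
    """S201 (assumed form): split a FASTA residue string into one-letter subunits."""
    return list(sequence)


def merge_subunits(units: List[str], freq_table: Dict[str, int], max_len: int = 4) -> List[str]:
    """S202 (assumed form): greedily merge adjacent subunits into the most frequent fragments."""
    n = len(units)
    # Collect every contiguous fragment of 2..max_len subunits that the table knows.
    candidates = []
    for start in range(n):
        for end in range(start + 2, min(n, start + max_len) + 1):
            fragment = "".join(units[start:end])
            freq = freq_table.get(fragment, 0)  # unknown fragments have frequency 0
            if freq > 0:
                candidates.append((freq, start, end))
    # Highest frequency first; ties go to the longer fragment, then the earlier position.
    candidates.sort(key=lambda c: (-c[0], -(c[2] - c[1]), c[1]))

    owner = [None] * n  # owner[i] = (start, end) of the fragment that claimed subunit i
    for _, start, end in candidates:
        if all(owner[i] is None for i in range(start, end)):
            for i in range(start, end):
                owner[i] = (start, end)

    merged, i = [], 0
    while i < n:
        if owner[i] is None:
            merged.append(units[i])  # never merged: stays a single subunit
            i += 1
        else:
            start, end = owner[i]
            merged.append("".join(units[start:end]))
            i = end
    return merged


# Toy table matching the ABCDE example: BC most frequent, then DE, then BCD.
table = {"BC": 50, "DE": 30, "BCD": 20, "AB": 5}
print("-".join(merge_subunits(segment("ABCDE"), table)))  # A-BC-DE
```

With the toy table, BC and DE are claimed first, which blocks BCD and AB, so A stays a single subunit, as in the A-BC-DE example.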

S203, segmenting the original sequence of the candidate drug to obtain a second unit sequence consisting of a plurality of subunits;

In the same way as the first unit sequence is generated, the SMILES sequence of the candidate drug is segmented according to the smallest segmentation unit to obtain a second unit sequence consisting of a plurality of subunits.

S204, merging a plurality of subunits in the second unit sequence according to each known drug compound and the corresponding occurrence frequency in a pre-established drug compound information table to obtain the substructure information of the candidate drug;

The drug compound information table may be established based on existing drug databases such as ChEMBL, BindingDB, DrugBank, and the like. The drug compound information table may include a number of known drug compounds and the occurrence frequency of each drug compound in the database.

The acquiring process of the substructure information of the candidate drug is similar to the acquiring process of the substructure information of the target protein in principle, and the process of combining a plurality of subunits in the second unit sequence may refer to the implementation process of step S202 in detail, and is not described herein again.

Steps S203 to S204 of this embodiment are a specific implementation manner of step S102 of the embodiment shown in fig. 1.
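The drug-side steps can be illustrated by tokenizing a SMILES string into minimal units and counting fragment occurrences over a small corpus. This is only a sketch: the disclosure does not specify the minimal segmentation unit for SMILES or how the information table is populated, so the regular expression, the function names, the `max_len` limit and the three-molecule corpus are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed minimal units: bracket atoms, two-letter halogens, %NN ring closures,
# and otherwise any single character (atom, bond, branch or ring-closure digit).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles: str) -> List[str]:
    """S203 (assumed form): split a SMILES string into minimal subunits."""
    return SMILES_TOKEN.findall(smiles)


def build_frequency_table(corpus: Iterable[str], max_len: int = 3) -> Counter:
    """Count every contiguous fragment of 2..max_len subunits across a corpus."""
    table: Counter = Counter()
    for smiles in corpus:
        tokens = tokenize_smiles(smiles)
        for size in range(2, max_len + 1):
            for start in range(len(tokens) - size + 1):
                table["".join(tokens[start:start + size])] += 1
    return table


print(tokenize_smiles("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']

# Hypothetical three-molecule corpus standing in for a drug database.
freq = build_frequency_table(["CCO", "CCN", "CC(=O)O"])
print(freq["CC"])  # 3
```

A table built this way could then drive the same frequency-based merging that is described for step S202.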

In this embodiment, the substructure information of the candidate drug and the substructure information of the target protein obtained after merging may include both fragments obtained by merging at least two subunits and subunits that could not be merged. The substructure information of the candidate drug and of the target protein therefore includes information at different levels of granularity, and the candidate drug and the target protein can be characterized more accurately.

S205, predicting the affinity of the target protein and the candidate drug according to the substructure information of the target protein, the substructure information of the candidate drug and a pre-trained information prediction model.

The information prediction model of the embodiment may adopt a Transformer network model of a single tower structure, or other network models of a single tower structure.

In use, each substructure in the substructure information of the target protein and in the substructure information of the candidate drug is first embedded. The embeddings of all substructures of the target protein are then fused (concatenated) by splicing them together to obtain the embedding of the target protein, and the embeddings of all substructures of the candidate drug are fused in the same way to obtain the embedding of the candidate drug. A dot product operation is performed on the embedding of the candidate drug and the embedding of the target protein in the Embedding Layer of the network model: Femb = Demb · Temb, where Femb is the embedding after the dot product operation, Demb is the fused embedding of the candidate drug, and Temb is the fused embedding of the target protein. Because Femb is obtained by a dot product of the fused embedding of the candidate drug and the fused embedding of the target protein, Femb contains the interaction information of the target protein and the candidate drug. Finally, Femb is input into the information prediction model, which can predict and output the affinity between the candidate drug and the target protein based on the interaction information contained in Femb.
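The embedding and fusion step can be sketched in plain Python. Several things here are assumptions rather than details from the disclosure: `embed` is a deterministic pseudo-random stand-in for a learned embedding lookup, `EMB_DIM` and `MAX_SUBSTRUCTS` are hypothetical sizes, the fused vectors are zero-padded to a common length so that the product is defined, and the "dot product operation" is shown as an element-wise product (with its sum giving the scalar dot product), since the text does not say which form is meant.

```python
import random
from typing import List

EMB_DIM = 4          # hypothetical size of one substructure embedding
MAX_SUBSTRUCTS = 3   # hypothetical fixed number of substructure slots


def embed(substructure: str) -> List[float]:
    """Stand-in for a learned embedding lookup (assumption, not the patented layer)."""
    rng = random.Random(substructure)  # a string seed gives the same vector on every run
    return [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]


def fuse(substructures: List[str]) -> List[float]:
    """Concatenate substructure embeddings, truncating or zero-padding to a fixed length."""
    fused: List[float] = []
    for sub in substructures[:MAX_SUBSTRUCTS]:
        fused.extend(embed(sub))
    fused.extend([0.0] * (EMB_DIM * MAX_SUBSTRUCTS - len(fused)))
    return fused


def interact(d_emb: List[float], t_emb: List[float]) -> List[float]:
    """Element-wise product of the fused drug and protein embeddings."""
    return [d * t for d, t in zip(d_emb, t_emb)]


t_emb = fuse(["A", "BC", "DE"])   # Temb: fused embedding of the target protein
d_emb = fuse(["CC", "(=O)"])      # Demb: fused embedding of the candidate drug
f_emb = interact(d_emb, t_emb)    # Femb: fed to the prediction model
scalar_dot = sum(f_emb)           # scalar dot product, if a single number is wanted
print(len(f_emb))  # 12
```

Keeping the element-wise vector rather than only the scalar leaves more interaction information available to a downstream Transformer, which is why the sketch exposes both.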

According to the information prediction method of this embodiment, the affinity of the target protein and the candidate drug is predicted based on the substructure information of the target protein, the substructure information of the candidate drug and the pre-trained information prediction model. Because the substructure information characterizes the target protein and the candidate drug more richly and accurately, the predicted affinity of the target protein and the candidate drug is more accurate.

FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure; the information prediction method provided in this embodiment further describes the technical solution of the present disclosure in more detail on the basis of the technical solution of the embodiment shown in fig. 2. As shown in fig. 3, the information prediction method provided in this embodiment may specifically include the following steps:

S301, segmenting an original sequence of the target protein to obtain a first unit sequence consisting of a plurality of subunits;

S302, merging a plurality of subunits in the first unit sequence according to each known protein molecule compound and the corresponding occurrence frequency in a pre-established target protein information table to obtain the substructure information of the target protein;

S303, acquiring at least one of 3D structure information, contact map information and distance map information of the conformational level of the target protein based on the original sequence of the target protein;

For example, the HHblits method can be used to extract Multiple Sequence Alignment (MSA) information from the FASTA sequence of the target protein, and the conformational-level 3D structure information, contact map information and distance map information can then be obtained based on the MSA information.

Steps S301 to S303 of this embodiment are a specific implementation of step S101 of the embodiment shown in fig. 1. Compared with the embodiment shown in fig. 2, the parameter information of the target protein in this embodiment further includes at least one of 3D structure information, contact map information and distance map information of the conformational level of the target protein, which adds information about the target protein at further granularities and enriches its parameter information.
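To make the relationship between the three kinds of conformational-level information concrete, the sketch below derives a distance map and a contact map from per-residue 3D coordinates. It does not reproduce the MSA-based derivation mentioned above; the coordinates are made up, the function names are hypothetical, and the 8 Å cutoff is a commonly used convention rather than a value taken from the disclosure.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def distance_map(coords: Sequence[Coord]) -> List[List[float]]:
    """Pairwise Euclidean distances between residue positions (one point per residue)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(dmap: List[List[float]], threshold: float = 8.0) -> List[List[int]]:
    """Binary map: 1 where two residues are closer than the threshold, else 0."""
    return [[1 if d < threshold else 0 for d in row] for row in dmap]


# Hypothetical coordinates: four residues spaced 3.8 units apart on a line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
dmap = distance_map(coords)
cmap = contact_map(dmap)
print(cmap[0])  # [1, 1, 1, 0]
```

The distance map keeps the continuous geometry, while the contact map is its thresholded form, so the two can be embedded as separate inputs.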

S304, segmenting the original sequence of the candidate drug to obtain a second unit sequence consisting of a plurality of subunits;

S305, merging a plurality of subunits in the second unit sequence according to each known drug compound and the corresponding occurrence frequency in a pre-established drug compound information table to obtain the substructure information of the candidate drug;

S306, acquiring at least one of 3D structure information of the candidate drug and graph information of the functional groups based on the original sequence of the candidate drug;

For example, at least one of the 3D structure information of the candidate drug and the graph information of the functional groups may be calculated with RDKit based on the SMILES sequence of the candidate drug.

Steps S304-S306 of this embodiment are a specific implementation manner of step S102 of the embodiment shown in fig. 1. On the basis of the embodiment shown in fig. 2, the parameter information of the drug candidate of the present embodiment further includes at least one of 3D structure information of the drug candidate and graph information of the functional group, which further increases information of multiple granularities of the drug candidate and enriches the parameter information of the drug candidate.

S307, predicting the affinity of the target protein and the candidate drug according to the substructure information of the target protein, the substructure information of the candidate drug and a pre-trained information prediction model, in combination with at least one of the 3D structure information, the contact map information and the distance map information of the conformational level of the target protein, and with reference to at least one of the 3D structure information of the candidate drug and the graph information of the functional groups; and predicting information of the functional group at which the candidate drug binds and/or pocket information of the target protein.

This step is similar to step S205 of the embodiment shown in fig. 2. Take as an example the case in which the 3D structure information, the contact map information and the distance map information of the conformational level of the target protein are all combined, and the 3D structure information of the candidate drug and the graph information of the functional groups are both referenced. In use, the substructure information of the target protein and the 3D structure information, contact map information and distance map information of its conformational level are each embedded, and the resulting embeddings are fused (concatenated) by splicing them together to form the embedding of the target protein. Likewise, the substructure information of the candidate drug, the 3D structure information of the candidate drug and the graph information of the functional groups are each embedded, and the resulting embeddings are fused (concatenated) by splicing them together to form the embedding of the candidate drug. As in step S205, a dot product is computed between the embedding of the candidate drug and the embedding of the target protein to obtain the corresponding Femb, which contains all the interaction information of the candidate drug and the target protein. Finally, Femb is input into the information prediction model, which can predict and output the affinity between the candidate drug and the target protein based on all the interaction information contained in Femb.

Further, the information prediction model of this embodiment is more capable: based on the input fused embedding, it may also predict information of the functional group at which the candidate drug binds and pocket information of the target protein, where the pocket information of the target protein includes the pocket structure and the pocket position (binding pocket site) of the target protein.

Optionally, in the embodiments shown in fig. 1 and fig. 2, the information of the functional group to which the candidate drug binds and/or the pocket information of the target protein may be predicted according to the parameter information of the target protein and the parameter information of the candidate drug, and a pre-trained information prediction model.

In the information prediction method of this embodiment, the affinity between the target protein and the candidate drug is predicted according to the substructure information of the target protein, the substructure information of the candidate drug and the pre-trained information prediction model, in combination with at least one of the 3D structure information, the contact map information and the distance map information of the conformational level of the target protein, and with at least one of the 3D structure information and the functional group graph information of the candidate drug.

In addition, this embodiment can further predict the information of the functional group at which the candidate drug binds and the pocket information of the target protein, which enriches the predicted information and meets user needs.

FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure; the training method of the information prediction model provided in this embodiment may specifically include the following steps:

S401, collecting several groups of training samples; each group of training samples comprises an original sequence of a training drug, an original sequence of a training target protein, and the real affinity between the training drug and the training target protein;

S402, acquiring parameter information of the training drug based on the original sequence of the training drug in each group of training samples;

S403, acquiring parameter information of the training target protein based on the original sequence of the training target protein in each group of training samples;

Steps S402 and S403 may be performed in any order. For the parameter information of the training drug and the parameter information of the training target protein, reference may be made to the description of the parameter information of the candidate drug and the parameter information of the target protein in the embodiments shown in fig. 1 to 3.

S404, training the information prediction model according to the parameter information of the training drug, the parameter information of the training target protein, and the real affinity between the training drug and the training target protein of each group of training samples.

During training, for any group of training samples, the parameter information of the training drug and the parameter information of the training target protein are input into the information prediction model, which predicts and outputs the predicted affinity between the training target protein and the training drug. A loss function can then be constructed based on the predicted affinity and the real affinity, and the parameters of the information prediction model are adjusted based on the loss function so that the loss function tends to converge. In this manner, the information prediction model is continuously trained with multiple groups of training samples, so that it learns to predict the affinity between a drug and a target protein based on the parameter information of the drug and the parameter information of the target protein.
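As a minimal sketch of this training procedure, the following example fits a toy model by gradient descent until the loss converges. The disclosure does not specify the model structure, the loss function or the optimizer; a linear model, a mean squared error loss and plain gradient descent are assumptions made only for illustration, and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

# One training sample: (fused feature vector, real affinity).
Sample = Tuple[List[float], float]


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> float:
    """Toy stand-in for the information prediction model (linear, assumed)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mse(weights: Sequence[float], bias: float, samples: Sequence[Sample]) -> float:
    """Loss built from predicted affinity and real affinity."""
    return sum((predict(weights, bias, f) - y) ** 2 for f, y in samples) / len(samples)


def train(samples: Sequence[Sample], lr: float = 0.1,
          max_epochs: int = 5000, tol: float = 1e-12) -> Tuple[List[float], float]:
    """Adjust parameters until the loss stops changing or max_epochs is reached."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    prev = mse(weights, bias, samples)
    for _ in range(max_epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for features, y in samples:
            err = predict(weights, bias, features) - y
            for i, x in enumerate(features):
                grad_w[i] += 2 * err * x / len(samples)
            grad_b += 2 * err / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
        loss = mse(weights, bias, samples)
        if abs(prev - loss) < tol:  # loss has converged
            break
        prev = loss
    return weights, bias


# Hypothetical samples generated from affinity = 2*x0 + 1*x1 + 0.5.
samples = [([1.0, 0.0], 2.5), ([0.0, 1.0], 1.5), ([1.0, 1.0], 3.5), ([2.0, 1.0], 5.5)]
weights, bias = train(samples)
final_loss = mse(weights, bias, samples)
print(weights, bias, final_loss)
```

The stopping rule combines a convergence check with an epoch limit, mirroring the two termination conditions the description gives for training.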

When the parameter information of the training drug and the parameter information of the training target protein are input into the information prediction model, the parameter information of the training drug and the parameter information of the training target protein may be embedded, fused, and input into the information prediction model, with reference to the relevant description of the embodiment shown in fig. 2.

In the training method of the information prediction model of this embodiment, the information prediction model is trained according to the parameter information of the training drug, the parameter information of the training target protein, and the real affinity between the training drug and the training target protein of each group of training samples. Because the parameter information used in training represents the training drug and the training target protein more accurately, the trained information prediction model can predict affinity more accurately.

Similar to the implementation principle of the embodiment shown in fig. 2, the parameter information of the training drug of this embodiment may specifically include substructure information of the training drug. Correspondingly, in this case, in step S402, based on the original sequence of the training drug in each group of training samples, parameter information of the training drug is obtained, which may specifically include:

acquiring the substructure information of the training drug based on the original sequence of the training drug in each group of training samples. For the process of acquiring the substructure information of the training drug, reference may be made to the related description of step S204.

Correspondingly, in step S403, based on the original sequence of the training target protein in each set of training samples, obtaining parameter information of the training target protein may specifically include:

and acquiring the substructure information of the training target protein based on the original sequence of the training target protein in each group of training samples. The process of obtaining the parameter information of the training target protein can refer to the related description of step S202.

Similar to the implementation principle of the embodiment shown in fig. 3, obtaining the parameter information of the training drug in step S402 based on the original sequence of the training drug in each group of training samples may further include: acquiring at least one of the 3D structure information and the functional group graph information of the training drug based on the original sequence of the training drug in each group of training samples. In this case, the parameter information of the training drug includes information at different granularity levels, so that the training drug molecule is represented more richly and accurately and the learning ability of the information prediction model is further enhanced.

In addition, obtaining the parameter information of the training target protein in step S403 based on the original sequence of the training target protein in each group of training samples may further include: acquiring at least one of the 3D structure information, the contact map information and the distance map information of the conformational level of the training target protein based on the original sequence of the training target protein in each group of training samples. In this case, the parameter information of the training target protein includes information at different granularity levels, so that the training target protein molecule is represented more richly and accurately and the learning ability of the information prediction model is further enhanced.
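To illustrate what distance map information and contact map information can look like, the following sketch derives both from 3D coordinates. The disclosure does not state how these maps are computed; the sketch assumes that one 3D coordinate per residue is already available (for example from a structure prediction step), and that two residues are in contact when they lie within 8 angstroms of each other, which is a common convention rather than a value taken from the disclosure.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def distance_map(coords: Sequence[Coord]) -> List[List[float]]:
    """Pairwise Euclidean distances between residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(dist: Sequence[Sequence[float]], threshold: float = 8.0) -> List[List[int]]:
    """Binary map: 1 where two residues are within the threshold (assumed 8 angstroms)."""
    return [[1 if d <= threshold else 0 for d in row] for row in dist]


# Hypothetical coordinates for three residues.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
dmap = distance_map(coords)
cmap = contact_map(dmap)
print(cmap)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```

The distance map keeps the full geometric detail, while the contact map is a coarser view of the same structure; feeding both gives the model information at two granularity levels.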

It should be noted that, in addition to the affinity between the training drug and the training target protein, the information prediction model of this embodiment may also predict the functional group information of the binding of the training drug and/or the pocket information of the training target protein. The information prediction model therefore also needs to learn these functions during training. In this case, each training sample further needs to be labeled with the real functional group information of the binding of the training drug and/or the real pocket information of the training target protein;

correspondingly, during training, the information prediction model may be trained according to the parameter information of the training drug, the parameter information of the training target protein, and the real affinity between the training drug and the training target protein of each group of training samples, while also referring to the real functional group information of the binding of the training drug and/or the real pocket information of the training target protein.

Specifically, during training, for any group of training samples, all the parameter information of the training drug is embedded, and all the parameter information of the training target protein is likewise embedded. The two are then spliced and fused and input into the information prediction model. The information prediction model outputs the predicted affinity, the predicted functional group information of the binding of the training drug, and the predicted pocket information of the training target protein. A loss function is then constructed from these predictions together with the labeled real affinity between the training drug and the training target protein, the real functional group information of the binding of the training drug, and the real pocket information of the training target protein. The parameters of the information prediction model are adjusted based on the loss function so that the loss function tends to converge. The information prediction model is continuously trained in this way with multiple groups of training samples until the loss function remains converged over a preset number of consecutive training rounds, or until the number of training iterations reaches a preset threshold, at which point training ends. At this time, the information prediction model is considered to have learned the ability to predict the affinity between a drug and a target protein, as well as the ability to predict the functional group information of the binding of the drug and the pocket information of the target protein.
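One possible way to construct such a combined loss function is sketched below. The disclosure does not specify the individual loss terms, the label encoding or the task weights; the sketch assumes a squared error for affinity, binary labels with a binary cross-entropy term for the functional group and pocket predictions (for example one label per drug atom and one per protein residue), and a weighted sum of the three terms. All names and values are hypothetical.

```python
import math
from typing import Sequence


def squared_error(pred: float, target: float) -> float:
    """Affinity term: squared difference of predicted and real affinity."""
    return (pred - target) ** 2


def bce(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def multitask_loss(pred_aff: float, true_aff: float,
                   fg_probs: Sequence[float], fg_labels: Sequence[int],
                   pocket_probs: Sequence[float], pocket_labels: Sequence[int],
                   w_aff: float = 1.0, w_fg: float = 1.0, w_pocket: float = 1.0) -> float:
    """Weighted sum of the three task losses (the weights are assumptions)."""
    return (w_aff * squared_error(pred_aff, true_aff)
            + w_fg * bce(fg_probs, fg_labels)
            + w_pocket * bce(pocket_probs, pocket_labels))


# Hypothetical predictions and labels for one training sample.
loss = multitask_loss(
    pred_aff=7.0, true_aff=6.0,
    fg_probs=[0.5, 0.5], fg_labels=[1, 0],
    pocket_probs=[0.5, 0.5, 0.5], pocket_labels=[0, 1, 0],
)
print(loss)  # 1 + 2*ln(2)
```

Because all three terms share the same fused input, minimizing their sum lets the affinity task benefit from the structural supervision provided by the functional group and pocket labels.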

FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure; as shown in fig. 5, an application architecture diagram of an information prediction model is provided. With reference to the above embodiments, whether the information prediction model is applied as in the embodiment shown in fig. 3 or trained as in the training embodiment above, its application architecture is as shown in fig. 5. In use, parameter information of different granularities, such as the substructure information of the target protein and the 3D structure information, the contact map information and the distance map information of its conformational level, is first obtained based on the original sequence of the target protein. Parameter information of different granularities, such as the substructure information, the 3D structure information and the functional group graph information of the drug, is obtained based on the original sequence of the drug. For the detailed acquisition process, reference may be made to the related embodiments. At the embedding layer, the parameter information of the respective granularities is fused by splicing to obtain the embedding of the target protein and the embedding of the drug, and a dot product calculation is performed on the two embeddings. The result of the dot product calculation is then input into the information prediction model. Based on the input information, the information prediction model can produce multi-task outputs such as the functional group information of the drug, the affinity between the drug and the target protein, and the pocket information of the target protein.

From a macroscopic perspective, research on the interaction between a drug and a target protein aims to determine whether the drug molecule and the target protein macromolecule can bind tightly and activate related physicochemical reactions, such as an antigen-antibody immune reaction. From a microscopic perspective, the interaction between the drug and the target protein actually involves binding and reaction only at a certain pocket position of the protein and a certain part of the molecular structure of the drug. Therefore, if information at different granularity levels can be combined and molecular structure and position information can be learned and used at the same time, the binding affinity between the drug and the target protein can be calculated more accurately. From this perspective, the present disclosure provides drug-target docking multi-task learning based on multi-modal molecular characterization.

In other words, this embodiment makes full use of the molecular information of the drug compound and the target protein at different granularity levels, so that the information prediction model can learn richer molecular representations. Compared with existing methods, the finally predicted affinity of the interaction between the drug and the target protein is improved to a certain extent: the mean square error (MSE) is reduced by about 0.04, and the concordance index (CI) is increased by about 0.03. At the same time, the defects of the prior art can be overcome, and the results of multiple subtasks can be output simultaneously. Therefore, the technical solution of this embodiment can greatly improve performance and meet different user needs, such as predicting the molecular substructure at which the drug compound binds and the position and structure of the target protein pocket, which are of interest to users.
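The two metrics mentioned above can be computed as in the following sketch. The disclosure reports the changes in MSE and CI but does not define the metrics; the definitions used here are the ones commonly used for affinity regression (for CI, the fraction of comparable pairs whose predicted order matches the true order, with prediction ties counted as one half) and are an assumption. The sample values are hypothetical.

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between real and predicted affinity."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of pairs with different true affinity that are ranked correctly."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # pairs with equal true affinity are not comparable
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5  # tie in prediction counts as half
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical real and predicted affinities.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.0, 7.5, 8.0]
mse_value = mean_squared_error(y_true, y_pred)
ci_value = concordance_index(y_true, y_pred)
print(mse_value, ci_value)  # 0.375 and 5/6
```

A lower MSE means the predicted values are closer to the real ones, while a higher CI means the model ranks drug-target pairs more consistently with their real affinities.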

FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure; as shown in fig. 6, the present embodiment provides an information prediction apparatus 600 including:

a target information obtaining module 601, configured to obtain parameter information of a target protein based on an original sequence of the target protein;

a drug information obtaining module 602, configured to obtain parameter information of a candidate drug based on an original sequence of the candidate drug;

a prediction module 603, configured to predict the affinity between the target protein and the candidate drug according to the parameter information of the target protein, the parameter information of the candidate drug, and a pre-trained information prediction model.

The implementation principle and technical effect of the information prediction apparatus 600 of this embodiment, which implements information prediction by using the above modules, are the same as those of the related method embodiments above; for details, reference may be made to the description of the related method embodiments, which is not repeated here.

FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure; as shown in fig. 7, the information prediction apparatus 600 provided in this embodiment further describes the technical solution of the present disclosure in more detail on the basis of the technical solution of the embodiment shown in fig. 6.

As shown in fig. 7, in the information prediction apparatus 600 of the present embodiment, the target information obtaining module 601 includes:

a first segmentation unit 6011, configured to segment the original sequence of the target protein to obtain a first unit sequence formed by multiple subunits;

a first merging unit 6012, configured to merge multiple subunits in the first unit sequence according to each known protein molecule compound and the corresponding occurrence frequency in a pre-established target protein information table, so as to obtain the substructure information of the target protein.
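The segmentation and merging performed by these two units can be sketched as follows. The disclosure does not give the merging algorithm in this passage; the sketch assumes a greedy procedure, similar in spirit to byte pair encoding, that repeatedly merges the adjacent pair of subunits whose concatenation has the highest occurrence frequency in the information table. The table contents and the sequence are hypothetical.

```python
from typing import Dict, List, Sequence


def segment(sequence: str) -> List[str]:
    """Split the original sequence into single-character subunits."""
    return list(sequence)


def merge_by_frequency(units: Sequence[str], freq_table: Dict[str, int]) -> List[str]:
    """Greedily merge adjacent subunits whose concatenation is in the table.

    At each step the adjacent pair with the highest table frequency is merged;
    merging stops when no adjacent pair appears in the table.
    """
    merged = list(units)
    while True:
        best_idx, best_freq = -1, 0
        for i in range(len(merged) - 1):
            freq = freq_table.get(merged[i] + merged[i + 1], 0)
            if freq > best_freq:
                best_idx, best_freq = i, freq
        if best_idx < 0:
            return merged
        merged[best_idx:best_idx + 2] = [merged[best_idx] + merged[best_idx + 1]]


# Hypothetical information table: known substructure -> occurrence frequency.
freq_table = {"MK": 50, "KT": 30, "MKT": 20, "AY": 40, "IA": 10, "AK": 25}
units = segment("MKTAYIAK")
substructures = merge_by_frequency(units, freq_table)
print(substructures)  # ['MKT', 'AY', 'I', 'AK']
```

The same two functions could be reused for a candidate drug by passing its original sequence and a drug compound information table instead.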

Further optionally, the target information obtaining module 601 further includes:

a first obtaining unit 6013, configured to obtain at least one of 3D structure information, contact map information, and distance map information of the conformational level of the target protein based on the original sequence of the target protein.

Further optionally, as shown in fig. 7, in the information prediction apparatus 600 of the present embodiment, the drug information obtaining module 602 includes:

a second segmentation unit 6021, configured to segment the original sequence of the candidate drug to obtain a second unit sequence composed of a plurality of subunits;

a second merging unit 6022, configured to merge the multiple subunits in the second unit sequence according to each known drug compound and the corresponding occurrence frequency in the pre-established drug compound information table, so as to obtain the substructure information of the candidate drug.

Further optionally, the drug information obtaining module 602 further includes:

a second obtaining unit 6023, configured to obtain at least one of the 3D structure information and the functional group graph information of the candidate drug based on the original sequence of the candidate drug.

Further optionally, the prediction module 603 is further configured to:

and predicting the information of the functional group combined by the candidate drug and/or the pocket information of the target protein according to the parameter information of the target protein, the parameter information of the candidate drug and the information prediction model.

FIG. 8 is a schematic diagram according to an eighth embodiment of the present disclosure; as shown in fig. 8, the present embodiment provides an information prediction model training apparatus 800, including:

an acquisition module 801, configured to acquire several groups of training samples; each group of training samples comprises an original sequence of a training drug, an original sequence of a training target protein, and the real affinity between the training drug and the training target protein;

a drug information obtaining module 802, configured to obtain parameter information of a training drug based on an original sequence of the training drug in each group of training samples;

a target information obtaining module 803, configured to obtain parameter information of the training target proteins based on the original sequences of the training target proteins in each group of training samples;

a training module 804, configured to train the information prediction model according to the parameter information of the training drug, the parameter information of the training target protein, and the real affinity between the training drug and the training target protein of each group of training samples.

The implementation principle and technical effect of the training apparatus 800 of the information prediction model of this embodiment, which implements the training of the information prediction model by using the above modules, are the same as those of the related method embodiments above; for details, reference may be made to the description of the related method embodiments, which is not repeated here.

Further optionally, the drug information obtaining module 802 is configured to:

acquiring the substructure information of the training drug based on the original sequence of the training drug in each group of training samples.

Further optionally, the drug information obtaining module 802 is further configured to:

and acquiring at least one of the 3D structure information of the training drug and the graph information of the functional group based on the original sequence of the training drug in each group of training samples.

Further optionally, the target information obtaining module 803 is configured to:

acquiring the substructure information of the training target protein based on the original sequence of the training target protein in each group of training samples.

Further optionally, the target information obtaining module 803 is further configured to:

acquiring at least one of 3D structure information, contact map information and distance map information of the conformational level of the training target protein based on the original sequence of the training target protein in each group of training samples.

Further optionally, each group of training samples is further labeled with real functional group information of the binding of the training drug and/or real pocket information of the training target protein;

further, the training module 804 is further configured to:

train the information prediction model according to the parameter information of the training drug, the parameter information of the training target protein, and the real affinity between the training drug and the training target protein of each group of training samples, while also referring to the real functional group information of the binding of the training drug and/or the real pocket information of the training target protein.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 9, the electronic apparatus 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM)902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The calculation unit 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.

A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 901 performs the respective methods and processes described above, such as an information prediction method or a training method of an information prediction model. For example, in some embodiments, the information prediction method or the training method of the information prediction model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described information prediction method or training method of the information prediction model may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform an information prediction method or a training method of an information prediction model.

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server can be a cloud Server, also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service ("Virtual Private Server", or simply "VPS"). The server may also be a server of a distributed system, or a server that incorporates a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. An information prediction method, wherein the method comprises:

2. The method of claim 1, wherein obtaining the parameter information of the target protein based on the original sequence of the target protein comprises:

segmenting the original sequence of the target protein to obtain a first unit sequence consisting of a plurality of subunits;

and combining a plurality of subunits in the first unit sequence according to each known protein molecule compound and the corresponding occurrence frequency in a pre-established target protein information table to obtain the substructure information of the target protein.

3. The method of claim 2, wherein obtaining the parameter information of the target protein based on the original sequence of the target protein further comprises:

and acquiring at least one of 3D structure information, contact map information and distance map information of the conformational level of the target protein based on the original sequence of the target protein.

4. The method of any one of claims 1-3, wherein obtaining the parameter information of the candidate drug based on the original sequence of the candidate drug comprises:

segmenting the original sequence of the candidate drug to obtain a second unit sequence consisting of a plurality of subunits;

and combining a plurality of subunits in the second unit sequence according to each known drug compound and the corresponding occurrence frequency in a pre-established drug compound information table to obtain the substructure information of the candidate drug.

5. The method of claim 4, wherein obtaining the parameter information of the candidate drug based on the original sequence of the candidate drug further comprises:

acquiring at least one of 3D structure information and functional group graph information of the candidate drug based on the original sequence of the candidate drug.

6. The method of any of claims 1-5, wherein the method further comprises:

predicting, according to the parameter information of the target protein, the parameter information of the candidate drug and the information prediction model, information on the functional group of the candidate drug at which binding occurs and/or pocket information of the target protein.

7. A method of training an information prediction model, wherein the method comprises:

8. The method of claim 7, wherein obtaining parameter information of the training drug based on the original sequence of the training drug in each group of the training samples comprises:

9. The method of claim 8, wherein obtaining the parameter information of the training drug based on the original sequence of the training drug in each group of the training samples further comprises:

acquiring at least one of 3D structure information of the training drug and graph information of a functional group of the training drug based on the original sequence of the training drug in each group of the training samples.

10. The method according to any one of claims 7-9, wherein obtaining the parameter information of the training target protein based on the original sequence of the training target protein in each group of the training samples comprises:

acquiring the substructure information of the training target protein based on the original sequence of the training target protein in each group of the training samples.

11. The method of claim 10, wherein obtaining the parameter information of the training target protein based on the original sequence of the training target protein in each group of the training samples further comprises:

acquiring at least one of 3D structure information, contact map information and distance map information of a conformational level of the training target protein based on the original sequence of the training target protein in each group of the training samples.

12. The method according to any one of claims 7 to 11, wherein each group of the training samples is further labeled with real functional group information of the training drug at which binding occurs and/or real pocket information of the training target protein;

and the method further comprises:

training the information prediction model according to the parameter information of the training drug, the parameter information of the training target protein, and the real affinity between the training drug and the training target protein in each group of the training samples, while also referring to the real functional group information of the training drug at which binding occurs and/or the real pocket information of the training target protein.
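
One way to read the joint training in claim 12 is as a combined objective: an affinity error term plus an auxiliary term for the labeled binding sites. The sketch below is only an assumption about how such an objective might look. The disclosure does not specify a loss function, and the squared error, the binary cross-entropy, the per-position 0/1 labels, and the names `joint_loss` and `aux_weight` are all hypothetical.

```python
import math


def squared_error(predicted, target):
    """Squared error between predicted and real affinity."""
    return (predicted - target) ** 2


def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Mean binary cross-entropy over per-position binding-site labels."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def joint_loss(pred_affinity, real_affinity, pred_sites, real_sites, aux_weight=0.5):
    """Affinity loss plus a weighted auxiliary binding-site loss.

    pred_sites holds predicted probabilities that each position belongs to the
    binding functional group or pocket; real_sites holds the 0/1 labels.
    aux_weight is an assumed hyperparameter.
    """
    main = squared_error(pred_affinity, real_affinity)
    auxiliary = binary_cross_entropy(pred_sites, real_sites)
    return main + aux_weight * auxiliary


# Toy values, invented for illustration.
example_loss = joint_loss(6.0, 7.0, [0.9, 0.2], [1, 0])
```

In this reading, setting `aux_weight` to zero would reduce the objective to affinity-only training, as in claim 7.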

13. An information prediction apparatus, wherein the apparatus comprises:

14. The apparatus of claim 13, wherein the target information acquisition module comprises:

a first segmentation unit, configured to segment the original sequence of the target protein to obtain a first unit sequence consisting of a plurality of subunits; and

a first merging unit, configured to merge a plurality of subunits in the first unit sequence according to each known protein molecule compound and the corresponding occurrence frequency in a pre-established target protein information table to obtain the substructure information of the target protein.

15. The apparatus of claim 14, wherein the target information acquisition module further comprises:

a first obtaining unit, configured to obtain at least one of 3D structure information, contact map information, and distance map information of a conformational level of the target protein based on the original sequence of the target protein.

16. The apparatus of any one of claims 13-15, wherein the drug information acquisition module comprises:

a second segmentation unit, configured to segment the original sequence of the candidate drug to obtain a second unit sequence consisting of a plurality of subunits; and

a second merging unit, configured to merge a plurality of subunits in the second unit sequence according to each known drug compound and the corresponding occurrence frequency in a pre-established drug compound information table to obtain the substructure information of the candidate drug.

17. The apparatus of claim 16, wherein the drug information acquisition module further comprises:

a second obtaining unit, configured to obtain at least one of 3D structure information of the candidate drug and graph information of a functional group of the candidate drug based on the original sequence of the candidate drug.

18. The apparatus of any of claims 13-17, wherein the prediction module is further configured to:

19. An apparatus for training an information prediction model, wherein the apparatus comprises:

20. The apparatus of claim 19, wherein the drug information acquisition module is configured to:

21. The apparatus of claim 20, wherein the drug information acquisition module is further configured to:

22. The apparatus according to any one of claims 19-21, wherein the target information acquisition module is configured to:

23. The apparatus of claim 22, wherein the target information acquisition module is further configured to:

24. The apparatus according to any one of claims 19-23, wherein each group of the training samples is further labeled with real functional group information of the training drug at which binding occurs and/or real pocket information of the training target protein;

and the training module is further configured to:

25. An electronic device, comprising:

at least one processor; and

a memory, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6 or 7-12.

26. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6 or 7-12.

27. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-6 or 7-12.