CN114023396B - Protein kinase inhibitor prediction method, model construction method and device

Info

Publication number: CN114023396B
Application number: CN202210003673.7A
Authority: CN
Inventors: 石方骏; 李远鹏; 张博文; 王纵虎
Original assignee: Beijing Jingtai Technology Co ltd
Current assignee: Beijing Jingtai Technology Co ltd
Priority date: 2022-01-05
Filing date: 2022-01-05
Publication date: 2022-05-17
Anticipated expiration: 2042-01-05
Also published as: CN114023396A

Abstract

The application relates to a method for predicting an inhibitor of protein kinase, a method for constructing a model, and an apparatus therefor. The prediction method comprises the following steps: inputting target molecules into a prediction model, and respectively outputting inhibition probability values of the target molecules corresponding to a plurality of target kinases selected from the prediction model; wherein the prediction model is obtained by training according to the inhibition result of sample molecules on a plurality of target kinases; and determining the credibility of the inhibition probability value according to the application domain of the preset type. The scheme provided by the application can help to screen out small molecules possibly having an inhibiting effect on the target kinase in batches to serve as candidate inhibitors, so that the reliability of the prediction result is improved, and the prediction efficiency is improved.

Description

Protein kinase inhibitor prediction method, model construction method and device

Technical Field

The application relates to the technical field of protein kinases, in particular to a protein kinase inhibitor prediction method, a model construction method and a device thereof.

Background

The protein kinase family is one of the largest enzyme families. The human kinome comprises more than 500 protein kinases, which account for approximately 1.7% of human genes. Dysregulation of protein kinases plays an important role in many human diseases, including cancer, inflammatory diseases, central nervous system diseases, cardiovascular diseases and diabetic complications. Currently only about 80 protein kinases have been successfully targeted by drugs, and many kinase inhibitor drugs are directed against the same targets in oncology.

However, many "untargeted" protein kinases and their related diseases still need further investigation. In addition, many kinase inhibitor drugs are promiscuous because the ATP (adenosine triphosphate) binding site is well conserved across the protein kinase family: many inhibitor drugs directed at the ATP binding site inhibit not only the specific targeted kinase but also other non-targeted kinases, so that adverse reactions occur after treatment with these drugs.

Therefore, how to screen out molecules which can effectively inhibit protein kinase is a problem to be solved at present.

Disclosure of Invention

In order to solve or partially solve the problems in the related art, the application provides an inhibitor prediction method, a model construction method and a device of protein kinase, which can help to screen reasonable small molecules in batches as effective inhibitors and help to detect adverse reactions of drug small molecules.

In a first aspect, the present application provides a method for predicting an inhibitor of a protein kinase, comprising:

inputting target molecules into a prediction model, and respectively outputting inhibition probability values of the target molecules corresponding to a plurality of target kinases selected from the prediction model; wherein the prediction model is obtained by training according to the inhibition result of sample molecules on a plurality of target kinases;

and determining the credibility of the inhibition probability value of the target molecule acting on all target kinases according to the application domain of a preset type.

In one embodiment, before the prediction model is obtained by training according to the inhibition results of sample molecules on a plurality of target kinases, the method further comprises:

marking the inhibition activity label corresponding to each sample molecule relative to the corresponding protein kinase according to the inhibition result of each sample molecule on each protein kinase;

and screening the protein kinase meeting the preset conditions as the target kinase of the prediction model according to the inhibitory activity label of the sample molecule corresponding to the protein kinase.

In one embodiment, the screening of protein kinases meeting preset conditions as target kinases of the prediction model according to the inhibition activity label of the protein kinase corresponding to the sample molecule comprises:

randomly dividing the sample molecules for multiple times, correspondingly obtaining a preset number of training molecules respectively, and taking the rest number of molecules as test molecules;

for each random division, deleting from the training molecules and from the test molecules those protein kinases whose number of inhibitory activity labels is not more than a set value, or which have only a single kind of inhibitory activity label;

and selecting, as the target kinases, the protein kinases remaining in the random division in which the fewest protein kinases were deleted.
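The screening steps above can be sketched as follows. This is a minimal illustration and not the applicant's implementation: the data layout (a dictionary from molecule to its non-empty kinase labels), the function names, and the rule that a kinase must survive in both the training part and the test part of a division are assumptions made for the example.

```python
import random


def kept_kinases(labels, molecule_ids, min_labels):
    """Kinases that survive screening within one subset of molecules.

    labels: {molecule_id: {kinase: 0 or 1}}; a missing kinase is an empty label.
    A kinase is kept only if it has more than min_labels labels in the subset
    and both label kinds (0 and 1) occur.
    """
    observed = {}
    for mol in molecule_ids:
        for kinase, label in labels[mol].items():
            observed.setdefault(kinase, []).append(label)
    return {
        kinase
        for kinase, values in observed.items()
        if len(values) > min_labels and len(set(values)) > 1
    }


def select_target_kinases(labels, n_train, n_splits, min_labels, seed=0):
    """Repeat random divisions and keep the one that deletes the fewest kinases."""
    rng = random.Random(seed)
    molecules = sorted(labels)
    best = None
    for _ in range(n_splits):
        shuffled = molecules[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        # Assumption: a kinase must pass the screen in both parts of the division.
        kept = kept_kinases(labels, train, min_labels) & kept_kinases(labels, test, min_labels)
        # Fewest deleted kinases is the same as most remaining kinases.
        if best is None or len(kept) > len(best[0]):
            best = (kept, train, test)
    return best
```

Here `min_labels` plays the role of the set value, and `n_train` the preset number of training molecules.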

In one embodiment, the inputting the target molecule into a prediction model and respectively outputting the inhibition probability values of the target molecule corresponding to a plurality of target kinases selected from the prediction model comprises:

characterizing the target molecules to obtain corresponding two-dimensional chemical structure information;

and inputting the two-dimensional chemical structure information of the target molecules into the prediction model, and outputting the inhibition probability value corresponding to each target kinase acted by each target molecule.

In one embodiment, the determining the confidence level of the inhibition probability value of the target molecule acting on all target kinases according to the application domain of the preset type comprises:

respectively acquiring the MORGAN fingerprints of the target molecules and the MORGAN fingerprints of the sample molecules; wherein the sample molecules comprise a training molecule and a test molecule;

determining a first maximum fingerprint similarity in the fingerprint similarities of the target molecule and each training molecule according to the corresponding fingerprint similarity of the MORGAN fingerprint of the target molecule and each training molecule; determining the second maximum fingerprint similarity in the fingerprint similarities of each test molecule and each training molecule according to the corresponding fingerprint similarity of the MORGAN fingerprint of each test molecule and each training molecule;

screening and determining a test molecule corresponding to a second maximum fingerprint similarity closest to the first maximum fingerprint similarity as a reference molecule;

and determining the credibility corresponding to the inhibition probability value of the target molecule acting on all target kinases according to the balanced accuracy corresponding to the reference molecule.
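The reference-molecule lookup described above can be sketched as follows. This is an illustrative sketch under stated assumptions: fingerprints are represented as plain sets of on-bit indices standing in for Morgan fingerprints, and Tanimoto similarity is assumed as the fingerprint similarity, since the metric is not named in this passage. All function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def max_similarity(fp, training_fps):
    """Largest similarity between one fingerprint and any training fingerprint."""
    return max(tanimoto(fp, train_fp) for train_fp in training_fps)


def reference_molecule(target_fp, test_fps, training_fps):
    """Pick the test molecule whose maximum similarity to the training set
    is closest to the target molecule's maximum similarity.

    test_fps: {test_molecule_name: fingerprint}
    Returns (reference name, first maximum, second maximum of the reference).
    """
    first_max = max_similarity(target_fp, training_fps)
    second_max = {name: max_similarity(fp, training_fps) for name, fp in test_fps.items()}
    reference = min(second_max, key=lambda name: abs(second_max[name] - first_max))
    return reference, first_max, second_max[reference]
```

The balanced accuracy associated with the returned reference molecule would then be reported as the credibility of the target molecule's whole group of inhibition probability values.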

In one embodiment, before determining the confidence level corresponding to the inhibition probability value of the target molecule acting on all target kinases according to the balanced accuracy corresponding to the reference molecule, the method further comprises:

respectively acquiring, for each test molecule, the maximum fingerprint similarity among the similarities between the MORGAN fingerprint of the test molecule and the MORGAN fingerprints of the training molecules; arranging the maximum fingerprint similarity values of all the test molecules in descending order; dividing all the test molecules into N batches of test molecules according to their serial numbers in this arrangement, a preset step length P and a preset quantity Q, wherein N is greater than or equal to 1;

for each test molecule, determining the number of true positive labels, the number of false positive labels, the number of true negative labels and the number of false negative labels over all target kinases for which the test molecule has a non-empty label, according to the true inhibitory activity labels of the test molecule for those target kinases and the inhibitory activity labels predicted for them by the prediction model;

calculating the balanced accuracy corresponding to each test molecule through a preset formula according to the number of true positive labels, the number of false positive labels, the number of true negative labels and the number of false negative labels; and taking the average value of the balanced accuracy of the test molecules in the batch containing the reference molecule as the balanced accuracy corresponding to the reference molecule.
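A minimal sketch of these steps is given below. The preset formula is not stated in this passage, so the usual definition of balanced accuracy (the mean of the true positive rate and the true negative rate) is assumed. The batching rule (a window of Q molecules that advances by P positions) and the choice of the first batch containing the reference molecule are also assumptions, as are all function names.

```python
def balanced_accuracy(true_labels, predicted_labels):
    """Balanced accuracy of one test molecule over its non-empty labels.

    Labels are 1 (active), 0 (inactive) or None (empty true label, skipped).
    Assumption: if only one class is present, its rate alone is returned.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth is None:
            continue
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # remaining case: truth is 1 and the prediction is 0
            fn += 1
    rates = []
    if tp + fn:
        rates.append(tp / (tp + fn))  # true positive rate
    if tn + fp:
        rates.append(tn / (tn + fp))  # true negative rate
    return sum(rates) / len(rates) if rates else None


def make_batches(ranked_molecules, step, size):
    """Windows of `size` molecules, each starting `step` positions after the last."""
    return [ranked_molecules[s:s + size] for s in range(0, len(ranked_molecules), step)]


def reference_accuracy(reference, max_similarity, accuracy, step, size):
    """Mean balanced accuracy of the first batch that contains the reference."""
    ranked = sorted(max_similarity, key=max_similarity.get, reverse=True)
    for batch in make_batches(ranked, step, size):
        if reference in batch:
            scores = [accuracy[m] for m in batch if accuracy[m] is not None]
            return sum(scores) / len(scores) if scores else None
    return None
```

With `step` equal to `size` the batches do not overlap; a smaller `step` gives overlapping batches, in which case a rule for choosing among several candidate batches is needed, and the first match is used here purely for illustration.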

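In one embodiment, the determining the confidence level of the inhibition probability value of the target molecule acting on all target kinases according to the application domain of the preset type comprises:

respectively obtaining corresponding numerical values of a plurality of different descriptors of the target molecule;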

respectively judging whether the value of each descriptor of the target molecule is in the corresponding application domain;

and determining the credibility corresponding to the inhibition probability value of the target molecule acting on all target kinases according to the number of descriptors whose values fall within the corresponding application domain.
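The descriptor-based check can be sketched as follows. This is a sketch under assumptions: each application domain is taken to be the minimum-to-maximum range of that descriptor over the training molecules, and the descriptor names in the usage example are hypothetical placeholders, since the descriptors are not named in this passage.

```python
def descriptor_ranges(training_descriptors):
    """Per-descriptor (min, max) range over the training molecules.

    training_descriptors: list of {descriptor_name: value}, one dict per molecule.
    Assumption: the application domain of a descriptor is its training range.
    """
    names = training_descriptors[0].keys()
    return {
        name: (
            min(d[name] for d in training_descriptors),
            max(d[name] for d in training_descriptors),
        )
        for name in names
    }


def count_in_domain(target_descriptors, ranges):
    """Number of descriptors of the target molecule that lie inside their range."""
    return sum(
        low <= target_descriptors[name] <= high
        for name, (low, high) in ranges.items()
    )


# Usage with hypothetical descriptor names and values.
ranges = descriptor_ranges([
    {"mol_weight": 300.0, "logp": 2.0},
    {"mol_weight": 500.0, "logp": 4.0},
])
in_domain = count_in_domain({"mol_weight": 450.0, "logp": 5.0}, ranges)
```

The more descriptors fall inside their ranges, the higher the credibility assigned to the group of inhibition probability values.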

In one embodiment, the determining the confidence level of the inhibition probability values of the target molecule acting on all target kinases according to a preset type of application domain comprises:

determining the confidence level of the inhibition probability value according to a confidence level calculated by a similarity-based application domain; and/or

determining the confidence level of the inhibition probability value according to the number of descriptors whose values fall within a descriptor-based application domain.

In a second aspect, the present application provides a method for constructing a protein kinase inhibitor prediction model, which includes:

obtaining a plurality of sample molecules and a plurality of protein kinases;

screening the protein kinase meeting preset conditions as a target kinase according to the inhibitory activity label of the protein kinase corresponding to the sample molecule;

training a model constructed based on a deep learning framework according to selected training molecules in the sample molecules and the target kinase; and testing the constructed model according to the selected test molecules in the sample molecules and the target kinase to obtain a prediction model.

In one embodiment, the inhibitory activity label includes two labels, "active" and "inactive";

the marking of the inhibitory activity label corresponding to each sample molecule relative to the corresponding protein kinase according to the inhibition result of each sample molecule on each protein kinase comprises:

when the inhibition index of the sample molecule against a single protein kinase is greater than or equal to a preset value, the inhibitory activity label corresponding to the inhibition result is the "active" label;

and when the inhibition index of the sample molecule against the single protein kinase is smaller than the preset value, the inhibitory activity label corresponding to the inhibition result is the "inactive" label.

In a third aspect, the present application provides a device for predicting an inhibitor of a protein kinase, comprising:

the prediction module is used for inputting target molecules into a prediction model and respectively outputting inhibition probability values of the target molecules corresponding to a plurality of target kinases selected from the prediction model; wherein the prediction model is obtained by training according to the inhibition result of sample molecules on a plurality of target kinases;

and the reliability confirming module is used for confirming the reliability of the inhibition probability value of the target molecule acting on all target kinases according to the application domain of a preset type.

A fourth aspect of the present application provides an apparatus for constructing a protein kinase inhibitor prediction model, including:

an obtaining module for obtaining a plurality of sample molecules and a plurality of protein kinases;

the classification module is used for marking the inhibitory activity label corresponding to each sample molecule relative to the corresponding protein kinase according to the inhibitory result of each sample molecule on each protein kinase;

the screening module is used for screening the protein kinase which meets the preset conditions as a target kinase according to the inhibitory activity label of the protein kinase corresponding to the sample molecule;

the training construction module is used for training a model constructed on the basis of a deep learning framework according to the selected training molecules in the sample molecules and the target kinase; and testing the constructed model according to the selected test molecules in the sample molecules and the target kinase to obtain a prediction model.

A fifth aspect of the present application provides an electronic device, comprising:

a processor; and

a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described above.

A sixth aspect of the present application provides a computer-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to perform the method as described above.

The technical scheme provided by the application can comprise the following beneficial effects:

according to the technical scheme, the activity inhibition results of the target molecules against each target kinase can be predicted in batches by the trained prediction model and quantified as inhibition probability values, so that target molecules likely to be effective drugs against a target kinase can be screened out conveniently and rapidly, ineffective molecules can be filtered out, and unnecessary experiments can be avoided. In addition, the credibility of the inhibition probability value is determined according to the application domain of the preset type, so that inaccurate prediction results of the prediction model can be excluded, which helps research and development personnel to select target molecules with higher credibility as candidates for drug inhibitor research, saves research and development cost and time, and improves drug research and development efficiency. Furthermore, because the prediction model outputs, for a single target molecule, the inhibition probability values corresponding to all target kinases, the kinase types on which the target molecule may act can be estimated, which helps research and development personnel to anticipate possible adverse reactions of the target molecule when it is used as a drug inhibitor.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.

FIG. 1 is a schematic flow diagram of a method for predicting an inhibitor of a protein kinase, as shown in the examples herein;

FIG. 2 is a schematic flow chart of a method for constructing a model for predicting an inhibitor of a protein kinase in the examples of the present application;

FIG. 3 is another schematic flow diagram of a method for predicting an inhibitor of a protein kinase, as shown in the examples herein;

FIG. 4 is a graph based on the mean of the maximum fingerprint similarity between test molecules and training molecules and the balanced accuracy, as shown in the examples of the present application;

FIG. 5 shows the range of each application domain determined from the values of 5 descriptors of the training molecules, as shown in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an inhibitor prediction device for protein kinase shown in the embodiment of the present application;

FIG. 7 is another schematic structural diagram of an inhibitor prediction device for protein kinase shown in the embodiment of the present application;

FIG. 8 is a schematic structural diagram of a device for constructing an inhibitor prediction model of protein kinase according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.

Detailed Description

Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are illustrated in the accompanying drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.

In the related art, only a small fraction of the several hundred protein kinases have been successfully targeted by inhibitor drugs, and many untargeted protein kinases remain to be studied. Moreover, because the ATP (adenosine triphosphate) binding site is well conserved across the protein kinase family, many inhibitor drugs directed at the ATP binding site inhibit not only the specific targeted kinase but also other non-targeted kinases, so that adverse reactions occur after treatment with these inhibitor drugs.

In view of the above problems, embodiments of the present application provide a method for predicting an inhibitor of a protein kinase, which can help to screen out small molecules that may have an inhibitory effect on a target kinase in a batch as candidate inhibitors, improve the reliability of a prediction result, and improve the prediction efficiency.

The technical solutions of the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic flow chart of a method for predicting an inhibitor of a protein kinase, shown in an example of the present application.

Referring to fig. 1, a method for predicting an inhibitor of a protein kinase, shown in the examples of the present application, comprises:

S110, inputting target molecules into a prediction model, and respectively outputting inhibition probability values of the target molecules corresponding to a plurality of target kinases selected from the prediction model; the prediction model is obtained by training according to the inhibition result of the sample molecules on a plurality of target kinases.

The prediction model may be a neural network model based on deep learning, such as a multitask deep neural network (multitask classifier) model. To obtain the desired prediction model, the known effects of sample molecules on the target kinases may be collected in advance and used as training data. For example, data provided by KinomeX, a web application for predicting the kinome-wide polypharmacology (activity and inhibition) of small molecules, can be selected as training data. The data provided by KinomeX include the inhibition results of 32056 small molecules against the activity of 391 protein kinases; for each molecule and each single protein kinase, the inhibition result indicates that the molecule has an inhibitory effect on the activity of the kinase, has no inhibitory effect, or that the effect is unknown. It can be understood that, as the KinomeX data are updated, the data used for training the prediction model can be updated accordingly, so as to improve the accuracy of the prediction results of the prediction model. Of course, in other embodiments, relevant data in other databases may also be used as the training data of the prediction model, which is not limited herein.

Further, in order to improve the accuracy of the prediction result, screening may be performed on the data provided by KinomeX, that is, among a plurality of protein kinases, a protein kinase meeting a preset condition is obtained as a target kinase of the prediction model. That is, considering that the number of kinases studied in relevant literature is less than 500 out of over 500 currently known protein kinases, and the number of kinases having reliable experimental data is smaller, it is necessary to screen a certain number of protein kinases satisfying predetermined conditions among a plurality of protein kinases as target kinases in view of accuracy of prediction results. It can be understood that according to the known activity inhibition results of the small molecules corresponding to each protein kinase, the structural properties of a plurality of small molecules are synthesized through deep learning of a prediction model, and the probability value that the activity of each target kinase corresponding to a single target molecule has an inhibition effect can be output, namely the inhibition probability value. When the number of the target molecules is multiple, the prediction model respectively outputs the inhibition probability value of the activity of each target molecule corresponding to a single target kinase. It will be appreciated that, since the number of target kinases is more than one, there may be hundreds of different target kinases in the predictive model. Through the prediction model, the inhibition probability values of the activity of a plurality of target kinases corresponding to each target molecule can be output, so that research personnel can be helped to evaluate the kinase range of the target molecule with the activity inhibition function according to the batch inhibition probability values.

S120, determining the reliability of the inhibition probability values of the target molecules acting on all target kinases according to the application domains of the preset types.

It will be appreciated that the predictive model may output for a single target molecule a set of inhibition probability values for all target kinases, the number of inhibition probability values in each set being the same as the number of target kinases, and the number of confidence levels output being the same as the number of target molecules. In the step, corresponding credibility is output aiming at each group of inhibition probability values, so that the credibility of the whole group of inhibition probability values of the target molecule acting on all target kinases can be determined according to the credibility. And when a plurality of target molecules are available, the inhibition probability values correspondingly output by the prediction model are a plurality of groups, and each group of inhibition probability values respectively has corresponding credibility.

If the confidence level corresponding to a certain group of inhibition probability values is high, the possibility that the target molecule inhibits the activity of each corresponding target kinase can be further evaluated according to each specific probability value in the group. If an inhibition probability value is less than a preset threshold value, the probability that the target molecule inhibits the activity of that target kinase is low, and the target molecule is not suitable for use as a drug inhibitor of that target kinase. In the process of drug development, researchers then do not need to test experimentally the inhibition effect of the target molecule on that target kinase, so that development cost and time are saved and drug development efficiency is improved. When the inhibition probability value of the target molecule acting on the corresponding target kinase is greater than or equal to the preset threshold value, the target molecule possibly has an activity inhibition effect on the target kinase. It can be understood that, when different target molecules have different inhibition probability values for the same target kinase, the target molecule with the higher inhibition probability value can be preferentially selected, so that research and development cost and time can be saved, and drug research and development efficiency can be improved.

Where the same target molecule corresponds to a plurality of different target kinases, the inhibition effect of the target molecule on those target kinases can be evaluated by analyzing the values of the whole group of inhibition probability values, and the adverse reactions that may arise when the target molecule acts on non-targeted kinases can then be evaluated; through such screening, small molecules likely to cause adverse reactions can be excluded as drug inhibitors, and drug molecules that better meet clinical requirements can be obtained.

When the confidence level corresponding to a group of inhibition probability values is low, the inhibition probability values output by the prediction model for the target molecule against all target kinases are not credible, and the group of inhibition probability values can be ignored rather than used by research and development personnel as reference data for drug inhibitors.
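The decision logic of the preceding paragraphs can be sketched as follows. This is only an illustration: the function name and both threshold values are hypothetical, since no concrete threshold is given in this passage.

```python
def candidate_kinases(probabilities, confidence,
                      probability_threshold=0.5, confidence_threshold=0.7):
    """Triage one target molecule's group of inhibition probability values.

    probabilities: {kinase: inhibition probability} for all target kinases.
    confidence: the single confidence level attached to the whole group.
    Returns None when the group is not credible and should be ignored;
    otherwise the kinases at or above the probability threshold,
    ordered from the highest probability to the lowest.
    Both thresholds are hypothetical placeholders.
    """
    if confidence < confidence_threshold:
        return None
    hits = [k for k, p in probabilities.items() if p >= probability_threshold]
    return sorted(hits, key=lambda k: probabilities[k], reverse=True)
```

A long list of returned kinases for one molecule suggests broad activity across kinases, which is the signal the text associates with possible adverse reactions on non-targeted kinases.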

Further, in order to ensure the reliability of the inhibition probability value, the inhibition probability value can be assessed according to the application domain of the preset type, which helps researchers decide whether it is necessary to put the target molecule into experiments as a potential drug inhibitor. The application domain represents the valid range (i.e. the credible range) of the prediction results output by the prediction model. Because the application domain expresses the credibility of a prediction result quantitatively, the credibility of the prediction result, namely the credibility of the inhibition probability value, can be judged intuitively and clearly. The application domains of the preset type may include one or more types, and a different indicator is set for each type of application domain to evaluate the reliability of the prediction results output by the prediction model. If the corresponding indicator data of the target molecule lie outside the application domain, the corresponding prediction result, i.e. the entire group of inhibition probability values, may not be credible. If the corresponding indicator data of the target molecule lie within the application domain, the entire group of inhibition probability values has a certain credibility. In one embodiment, the confidence level of the inhibition probability value is determined according to a confidence level calculated by a similarity-based application domain; and/or the confidence level of the inhibition probability value is determined according to the number of descriptors whose values fall within a descriptor-based application domain.

It can be understood that, when the application domains include multiple preset types, the credibility of the same prediction result can be determined independently through each type of application domain, or determined through the different types of application domain in sequence, so that the credibility of the prediction result can be verified repeatedly in multiple dimensions, which helps research and development personnel to judge the prediction result more accurately according to the credibility.

As can be seen from this example, the protein kinase inhibitor prediction method can predict in batches, through the trained prediction model, the activity inhibition results of the target molecules against each target kinase and quantify them as inhibition probability values, so that target molecules likely to be effective drugs against a target kinase can be screened out rapidly, ineffective molecules can be filtered out, and unnecessary experiments can be avoided. In addition, the credibility of the inhibition probability value is determined according to the application domain of the preset type, so that inaccurate prediction results of the prediction model can be excluded, which helps research and development personnel to select target molecules with higher credibility as candidates for drug inhibitor research, saves research and development cost and time, and improves drug research and development efficiency. Furthermore, because the prediction model outputs, for a single target molecule, the inhibition probability values corresponding to all target kinases, the kinase types on which the target molecule may act can be estimated, which helps research and development personnel to anticipate possible adverse reactions of the target molecule when it is used as a drug inhibitor.

Fig. 2 is a schematic flowchart of a method for constructing a prediction model of an inhibitor of protein kinase in the present embodiment.

Referring to fig. 2, in order to obtain a prediction model meeting the requirement, the construction method of the prediction model of the protein kinase inhibitor of the present application will be further described below.

S210, marking the inhibition activity label corresponding to each sample molecule relative to the corresponding protein kinase according to the inhibition result of each sample molecule on each protein kinase.

It is understood that a plurality of sample molecules and a plurality of protein kinases may be obtained prior to performing step S210. For example, KinomeX provides data covering 32056 small molecules and 391 protein kinases, and these small molecules can be used as sample molecules. As the KinomeX data are updated, the sample molecules and protein kinases can be updated accordingly, thereby enriching the data of the prediction model and expanding the range of protein kinases it can predict. Before training, the inhibition result of each small molecule acting on each protein kinase is obtained, and the inhibitory activity label of each small molecule relative to the corresponding protein kinase is marked according to the inhibition result. The inhibition result is determined according to real experimental data. For example, according to the real experimental data, a single small molecule has different inhibition results for the activity of the 391 protein kinases; that is, a single small molecule may have an inhibitory effect on some protein kinases but not on others, or it may be impossible to determine whether the small molecule has an inhibitory effect on some protein kinases because experimental data are lacking.

Further, the inhibitory activity label may be one of two labels, "active" and "inactive". In one embodiment, when the inhibition index (such as the pKi, pKd or pIC50 value) of a sample molecule for a single protein kinase is greater than or equal to a preset value (e.g., 6), the inhibitory activity label corresponding to the inhibition result is "active"; when the pKi, pKd or pIC50 value of the sample molecule for the single protein kinase is smaller than the preset value (e.g., 6), the label is "inactive". Specifically, pKi is the negative base-10 logarithm of the Ki value, where Ki is the inhibition constant; the smaller the Ki value, the stronger the inhibition capability. Ki is the concentration of free small molecules at which 50% of the protein kinase is bound by the small molecule (Ki is in some cases equivalent to Kd). pKd is the negative base-10 logarithm of the Kd value, where Kd is the dissociation constant, which reflects the affinity of the small molecule for the protein kinase target; the smaller the Kd value, the stronger the affinity. Kd is the concentration of free small molecules at which 50% of the protein kinase is bound by the small molecule. pIC50 is the negative base-10 logarithm of the IC50 value, where IC50 is the half-maximal inhibitory concentration.
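As a worked illustration of the preset value, the three indexes and one numerical case can be written out as follows. This assumes the constants are expressed in mol/L, which the text does not state explicitly.

```latex
\mathrm{p}K_i = -\log_{10} K_i, \qquad
\mathrm{p}K_d = -\log_{10} K_d, \qquad
\mathrm{pIC}_{50} = -\log_{10} \mathrm{IC}_{50}

K_i = 1\,\mu\mathrm{M} = 10^{-6}\,\mathrm{M}
\;\Longrightarrow\;
\mathrm{p}K_i = -\log_{10}\!\left(10^{-6}\right) = 6
```

Under that assumption, a preset value of 6 corresponds to a concentration of 1 μM: a molecule with Ki = 100 nM (pKi = 7) would be labeled "active", and one with Ki = 10 μM (pKi = 5) would be labeled "inactive".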

It is understood that when a small molecule is labeled "active" for a protein kinase, the small molecule has an inhibitory effect on the activity of that protein kinase. When the small molecule is labeled "inactive" for a protein kinase, the small molecule does not inhibit the activity of that protein kinase. When no experimental data establish the inhibition result of the small molecule for a given protein kinase, no label is marked, and the label is regarded as an empty label. For ease of viewing, in one embodiment, the inhibitory activity labels corresponding to the inhibition results of all small molecules acting on all protein kinases can be recorded together in a two-dimensional table, which can be regarded as the table of true inhibition results.

Further, when a small molecule is labeled "active" for a protein kinase, the inhibition result of the small molecule relative to that protein kinase can be regarded as a positive sample. When it is labeled "inactive", the inhibition result can be regarded as a negative sample. When no label is marked for the small molecule relative to a protein kinase, the inhibition result can be regarded as an empty sample. It should be noted that, because experimental data are limited, the source data provided by KinomeX do not contain an inhibition result for every small molecule acting on every protein kinase. Where experimental data are absent, the inhibitory effect of the small molecule on the protein kinase is unknown, and therefore no label is marked.
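The labeling rule of step S210 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name, the dictionary layout and the example measurements are hypothetical, and `None` is used here to stand for a missing experimental value (an empty label).

```python
from typing import Optional


def label_inhibition(p_value: Optional[float], threshold: float = 6.0) -> Optional[str]:
    """Map a pKi / pKd / pIC50 value to an inhibitory activity label.

    Returns None (empty label) when no experimental value is available.
    """
    if p_value is None:
        return None
    return "active" if p_value >= threshold else "inactive"


# Hypothetical measurements: one row per molecule, one column per kinase.
measurements = {
    "mol_1": {"kinase_A": 7.2, "kinase_B": 4.9, "kinase_C": None},
    "mol_2": {"kinase_A": 6.0, "kinase_B": None, "kinase_C": 5.5},
}

# Two-dimensional table of true inhibition results.
label_table = {
    mol: {kinase: label_inhibition(value) for kinase, value in row.items()}
    for mol, row in measurements.items()
}
```

In this sketch, "active" entries correspond to positive samples, "inactive" entries to negative samples, and `None` entries to empty samples.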

S220, screening the protein kinase which meets the preset conditions as the target kinase of the prediction model according to the inhibition activity label of the protein kinase corresponding to the sample molecule.

In order to facilitate training and testing of the prediction model, in one embodiment, a predetermined number of sample molecules are selected as training molecules, and the remaining molecules are selected as test molecules. For example, when the total number of sample molecules is 32056, a predetermined number of sample molecules is selected, for example, 80% of the total number, i.e., 25645 molecules, are training molecules, and the remaining 20% of the total number, i.e., 6411 molecules, are test molecules. Of course, the predetermined number is not limited thereto.

To avoid introducing too much sample data that lack inhibitory activity labels, and to reduce fluctuation of the prediction results caused by a lack of real experimental data, kinases supported by more experimental data need to be screened as target kinases. In one embodiment, the sample molecules are randomly divided a plurality of times, each division yielding the preset number of training molecules, with the remaining molecules serving as test molecules; for the training molecules and the test molecules obtained in each random division, protein kinases whose number of inhibitory activity labels is not more than a set value (which can be set according to actual requirements or empirical values, for example 2, 4, 6 or another value), or whose inhibitory activity labels are all of a single type, are deleted; and the protein kinases remaining in the random division that deletes the fewest protein kinases are selected as target kinases.

To facilitate understanding, for example, the 32056 sample molecules are randomly divided 1000 times, each division yielding 25645 training molecules and 6411 test molecules. Taking the training molecules and test molecules obtained in one random division as an example, for a given protein kinase A among the 391 protein kinases, it is first determined whether the number of inhibitory activity labels of protein kinase A across the 25645 training molecules is greater than the set value, for example 2; if the number is less than or equal to 2, protein kinase A is deleted. If the number is greater than 2, it is further determined whether those labels are of a single type; if they are all "active" or all "inactive", protein kinase A is deleted. If the labels are not of a single type, that is, there is at least one "active" label and at least one "inactive" label among the training molecules for protein kinase A, the 6411 test molecules are checked in the same way: if the number of inhibitory activity labels of protein kinase A across the test molecules is less than or equal to 2, protein kinase A is deleted; otherwise it is determined whether the existing labels are of a single type. If there is at least one "active" label and at least one "inactive" label among the test molecules for protein kinase A, protein kinase A is retained; otherwise it is deleted.
That is, in one random division, a protein kinase is deleted if the number of its inhibitory activity labels across the training molecules is less than or equal to 2, or those labels are of a single type, or the number of its inhibitory activity labels across the test molecules is less than or equal to 2, or those labels are of a single type.

It will be appreciated that, after each random division, the number of protein kinases finally deleted from the 391 protein kinases is determined, and these numbers are compared across the random divisions. Finally, the protein kinases remaining in the random division that deleted the fewest protein kinases are taken as target kinases, and the corresponding training molecules and test molecules are those of that random division. After one thousand random divisions, the division with the fewest deletions deleted 88 of the 391 protein kinases, and the remaining 303 protein kinases are used as target kinases. In this embodiment, pre-screening yields a sufficient number of target kinases that are supported by experimental data and that benefit the accuracy of the prediction model, so that a near "broad-spectrum" effect is achieved and the activity inhibition results of one target molecule against a plurality of target kinases can be predicted simultaneously in actual prediction.
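The screening of step S220 can be sketched as follows. This is a hedged illustration under stated assumptions: the function names, the dictionary-based label table (with `None` for an empty label), the fixed random seed and the toy data are all hypothetical, and ties between equally good divisions are resolved by keeping the first one, which the text does not specify.

```python
import random


def kinase_passes(labels, set_value=2):
    """A kinase passes if it has more than set_value labels and both label types."""
    present = [lab for lab in labels if lab is not None]
    return len(present) > set_value and len(set(present)) > 1


def screen_target_kinases(label_table, kinases, n_splits=1000,
                          train_fraction=0.8, set_value=2, seed=0):
    """Return (target_kinases, training_molecules, test_molecules) for the
    random division that deletes the fewest kinases."""
    rng = random.Random(seed)
    molecules = sorted(label_table)
    n_train = round(len(molecules) * train_fraction)
    best = None
    for _ in range(n_splits):
        shuffled = molecules[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        # Keep a kinase only if it passes on both the training and the test side.
        kept = [
            k for k in kinases
            if kinase_passes([label_table[m].get(k) for m in train], set_value)
            and kinase_passes([label_table[m].get(k) for m in test], set_value)
        ]
        if best is None or len(kept) > len(best[0]):
            best = (kept, train, test)
    return best


# Toy data: kinase_A has mixed labels, kinase_B has none, kinase_C has one type only.
toy_table = {
    f"mol_{i:02d}": {
        "kinase_A": "active" if i % 2 == 0 else "inactive",
        "kinase_B": None,
        "kinase_C": "active",
    }
    for i in range(20)
}
targets, train_molecules, test_molecules = screen_target_kinases(
    toy_table, ["kinase_A", "kinase_B", "kinase_C"], n_splits=50)
```

With the figures of the example, `round(32056 * 0.8)` gives the 25645 training molecules, leaving 6411 test molecules.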

S230, training the model constructed based on the deep learning framework according to the selected training molecules and the target kinases, and testing the constructed model according to the selected test molecules and the target kinases to obtain a prediction model.

It is to be understood that, because the inhibitory activity labels are imbalanced, in order to improve the accuracy of the prediction results of the model, in one embodiment the sample weights of the positive samples and the negative samples are preset. The positive sample weight is given by a formula [formula not reproduced in this text], where Ni is the number of negative samples and Na is the number of positive samples; the negative sample weight is given by a second formula [formula not reproduced in this text]. It can be appreciated that an empty sample has a weight of 0. The preset sample weights are applied to each sample of the training molecules and the test molecules.

In this embodiment, the multi-task deep neural network classifier (MultitaskClassifier) provided by the open-source deep learning framework DeepChem 2.4.0 is used to construct the prediction model. Specifically, each training molecule is featurized using the ECFP4 molecular fingerprint (Extended Connectivity Fingerprint; the 4 denotes the diameter, a parameter of the fingerprint algorithm) provided by DeepChem, so as to obtain two-dimensional chemical structure information. Featurization here means molecular fingerprint calculation: the SMILES string (Simplified Molecular Input Line Entry System, a character-string format for describing a molecular structure) of each training molecule is input, and a binary sequence of length 1024 is obtained. The two-dimensional chemical structure information is then fed to the input layer of the model, so that the constructed model can predict the inhibition probability values from the two-dimensional chemical structure information alone. The specific model parameters used when modeling with the multi-task classifier of DeepChem are as follows:

layer_sizes=[1024], learning_rate=5e-5, batch_size=128, epoch=10000, activation_fn=tf.nn.relu, weight_decay_penalty=0.002, weight_decay_penalty_type='l2'.

The loss function used during training is softmax cross-entropy, and the optimizer may be Adam (Adaptive Moment Estimation).

It is understood that when the test molecules are used to test the prediction model, each test molecule is featurized and then input into the constructed prediction model. The prediction result output by the prediction model is the probability that a single molecule inhibits the activity of each protein kinase, referred to as the inhibition probability value for short. When the output inhibition probability value is greater than or equal to a preset threshold (for example, 0.5), the molecule is predicted to inhibit the activity of that protein kinase, that is, the molecule is active against the protein kinase. Of course, the value of the preset threshold is not limited herein.

Compared with the traditional method for constructing a plurality of models by using a single task network, the method has the advantages that fewer computing resources are used, the loading speed is higher, the waiting time is reduced, the prediction period is shortened, more prediction results are obtained at the same time, high-throughput screening is realized, and the prediction efficiency is improved.

Fig. 3 is a schematic flow chart of a method for predicting an inhibitor of a protein kinase, shown in an example of the present application.

Referring to fig. 3, a method for predicting an inhibitor of a protein kinase, shown in the examples of the present application, comprises:

S310, inputting the target molecules into a pre-trained prediction model, and respectively outputting the inhibition probability values of the target molecules corresponding to a plurality of target kinases selected from the prediction model.

In this embodiment, the prediction model is obtained by training according to the above embodiments, which is not described herein again. Wherein a plurality of different target molecules can be simultaneously input into a pre-trained predictive model. Specifically, in one embodiment, the target molecule is characterized to obtain corresponding two-dimensional chemical structure information; inputting the two-dimensional chemical structure information of the target molecules into a prediction model, and outputting the inhibition probability value corresponding to each target kinase acted by each target molecule. That is, each target molecule is characterized to obtain corresponding two-dimensional chemical structure information, namely a binary sequence with the length of 1024; and inputting the information of each two-dimensional chemical structure into a prediction model, and outputting a corresponding inhibition probability value. It is understood that the target kinase in this embodiment is a protein kinase that meets the predetermined conditions after being screened in the above embodiments.

S320, determining the credibility of the inhibition probability value according to the confidence calculated from the similarity-based application domain.

In order to determine the credibility of the inhibition probability value output by the prediction model, in this step a confidence is calculated using the similarity-based application domain. It should be clear that the similarity-based application domain is a relative concept for which no definite numerical range can be given directly; instead, a relative confidence is obtained by calculating the similarity between the target molecule and the training molecules, and this confidence serves as the credibility.

In a specific embodiment, the Morgan fingerprints of the target molecule and the Morgan fingerprints of the sample molecules are obtained separately, wherein the sample molecules comprise the training molecules and the test molecules; a first maximum fingerprint similarity is determined as the largest of the fingerprint similarities between the Morgan fingerprint of the target molecule and that of each training molecule; for each test molecule, a second maximum fingerprint similarity is determined as the largest of the fingerprint similarities between the Morgan fingerprint of that test molecule and that of each training molecule; the test molecule whose second maximum fingerprint similarity is closest to the first maximum fingerprint similarity is selected as a reference molecule; and the credibility corresponding to the inhibition probability values of the target molecule acting on all target kinases is determined according to the balanced accuracy corresponding to the reference molecule.

That is, the Morgan fingerprint of each target molecule, training molecule and test molecule is obtained. For a single target molecule, the similarity between its Morgan fingerprint and the Morgan fingerprint of each training molecule is calculated, and the maximum of these similarities is taken as the similarity of the target molecule to the set of all training molecules, i.e., the first maximum fingerprint similarity (SIM_MORGAN_MAX). Similarly, for each test molecule, the maximum of the similarities between its Morgan fingerprint and those of all training molecules, i.e., the second maximum fingerprint similarity, is obtained in advance. In addition, the balanced accuracy (BA) associated with the maximum fingerprint similarity of each test molecule relative to the training molecules is calculated in advance. Then the first maximum fingerprint similarity of the target molecule is compared with the second maximum fingerprint similarities of all test molecules, the value closest to the first maximum fingerprint similarity is found among them, and the test molecule corresponding to that value is taken as the reference molecule; the balanced accuracy of the reference molecule is then used as the confidence of the inhibition probability values of the target molecule acting on all target kinases, and this confidence serves as the credibility of those inhibition probability values. To help research and development personnel judge, from the credibility, how likely the target molecule is to inhibit the corresponding target kinases, target molecules whose confidence exceeds a preset value (for example, 0.6) can be screened preferentially, with larger values preferred.
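The reference-molecule lookup can be sketched as follows. This is only an illustration under assumptions the text does not confirm: fingerprints are represented as sets of on-bit indices, Tanimoto similarity is assumed as the fingerprint similarity measure, and the function names and toy values are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def max_similarity(fp, training_fps):
    """Largest similarity between one fingerprint and all training fingerprints."""
    return max(tanimoto(fp, train_fp) for train_fp in training_fps)


def confidence_for_target(target_fp, training_fps, test_max_sims, test_ba):
    """Return (SIM_MORGAN_MAX, confidence) for one target molecule.

    test_max_sims[i]: precomputed second maximum similarity of test molecule i.
    test_ba[i]:       precomputed balanced accuracy assigned to test molecule i.
    """
    sim_morgan_max = max_similarity(target_fp, training_fps)
    # Reference molecule: test molecule whose maximum similarity is closest.
    ref = min(range(len(test_max_sims)),
              key=lambda i: abs(test_max_sims[i] - sim_morgan_max))
    return sim_morgan_max, test_ba[ref]


# Toy values (hypothetical).
training_fps = [{1, 2, 3, 4}, {10, 11}]
result = confidence_for_target(
    target_fp={1, 2, 3},
    training_fps=training_fps,
    test_max_sims=[0.9, 0.7, 0.4],
    test_ba=[0.85, 0.82, 0.71],
)
```

Because the second maximum similarities and the balanced accuracies depend only on the test and training molecules, they can be computed once and reused for every target molecule.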

Further, the balanced accuracy associated with the maximum fingerprint similarity of each test molecule relative to the training molecules can be calculated according to a preset rule. In a specific embodiment, the maximum fingerprint similarity of each test molecule, i.e., the largest of the similarities between its Morgan fingerprint and the Morgan fingerprints of all training molecules, is obtained; all test molecules are arranged in descending order of maximum fingerprint similarity; according to their sequence numbers, all test molecules are divided into N batches using a preset step length P and a preset quantity Q, where N is greater than or equal to 1; for each test molecule, the number of true positive labels, false positive labels, true negative labels and false negative labels over all target kinases with non-empty labels is determined from the true inhibitory activity labels and from the predicted inhibitory activity labels given by the prediction model; the balanced accuracy of each test molecule is calculated from these four numbers by a preset formula; and the average of the balanced accuracies of the test molecules in the batch containing the reference molecule is taken as the balanced accuracy corresponding to the reference molecule. The preset step length P can be regarded as the number of molecules by which two adjacent batches are offset, that is, the number of molecules at the start of one batch that are not in the next; the preset quantity Q is the number of molecules in each batch.

In one embodiment, the balanced accuracy is calculated according to the following equation:

$$\mathrm{BA} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$

where the balanced accuracy (BA) is the balanced accuracy of a single test molecule over all target kinases with non-empty labels, TP is the number of true positive labels, FP is the number of false positive labels, TN is the number of true negative labels, and FN is the number of false negative labels. A true positive label means that, for the same test molecule and the same target kinase, the true inhibitory activity label is "active" and the inhibitory activity label predicted by the prediction model is also "active" (that is, the inhibition probability value is greater than or equal to the preset threshold). A false positive label means that the true inhibitory activity label is "inactive" and the predicted label is "active". A true negative label means that the true inhibitory activity label is "inactive" and the predicted label is also "inactive" (that is, the inhibition probability value is smaller than the preset threshold). A false negative label means that the true inhibitory activity label is "active" and the predicted label is "inactive".
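A per-molecule calculation can be sketched as follows, using the standard definition of balanced accuracy (the mean of the true positive rate and the true negative rate). The function name and the example values are hypothetical. The text does not say how a molecule with only one label type is handled; this sketch assumes that only the rate of the class that is present is used, and returns `None` when every label is empty.

```python
def balanced_accuracy(true_labels, probabilities, threshold=0.5):
    """Balanced accuracy of one test molecule over its non-empty-label kinases."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(true_labels, probabilities):
        if truth is None:  # empty label: this kinase is skipped
            continue
        predicted_active = prob >= threshold
        if truth == "active":
            if predicted_active:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_active:
                fp += 1
            else:
                tn += 1
    rates = []
    if tp + fn:  # true positive rate, defined only if active labels exist
        rates.append(tp / (tp + fn))
    if tn + fp:  # true negative rate, defined only if inactive labels exist
        rates.append(tn / (tn + fp))
    # Assumption: average over the rates that are defined.
    return sum(rates) / len(rates) if rates else None


ba_example = balanced_accuracy(
    ["active", "active", "inactive", "inactive", None],
    [0.9, 0.4, 0.2, 0.6, 0.99],
)
```

In the example, TP = FN = TN = FP = 1, so both rates are 0.5; the fifth kinase is ignored because its true label is empty.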

For ease of understanding, take 6411 test molecules, 25645 training molecules and 303 target kinases as an example. For each test molecule, the similarity between its Morgan fingerprint and the Morgan fingerprint of each training molecule is obtained, and the maximum of these similarities, i.e., the maximum fingerprint similarity of the test molecule relative to the training molecules, is found. All test molecules are then arranged in descending order of maximum fingerprint similarity and numbered No. 1 to No. 6411. Next, the 6411 test molecules are divided into N batches according to the preset step length P and the preset quantity Q. For example, when P is 50 and Q is 800, test molecules No. 1 to No. 800 form batch 0, test molecules No. 51 to No. 850 form batch 1, test molecules No. 101 to No. 900 form batch 2, and so on, giving more than one hundred batches in total. The average of the balanced accuracies of the molecules in each batch is then calculated, giving more than one hundred average balanced accuracy values in total.

Taking the average balanced accuracy of batch 0 as an example, the calculation is as follows. The true inhibitory activity labels of test molecules No. 1 to No. 800 for the 303 target kinases are obtained from the data provided by KinomeX in step S210. For each test molecule, the target kinases with empty labels are removed and those with non-empty labels are retained; the number of target kinases with non-empty labels for a test molecule is M, where M is greater than 0 and less than or equal to 303. The value of M depends on the available data and may be the same or different for different test molecules. Then the inhibition probability values of each test molecule for its M target kinases are obtained from the prediction model; a prediction whose inhibition probability value is greater than or equal to a preset threshold (for example, 0.5) is labeled "active", and a prediction whose inhibition probability value is smaller than the threshold is labeled "inactive". Next, the number of true positive labels (TP), false positive labels (FP), true negative labels (TN) and false negative labels (FN) of the test molecule over the M target kinases is obtained, where TP + TN + FP + FN = M. The balanced accuracy BA of each test molecule over its M target kinases is calculated according to the above formula, and the average of the 800 BA values is the balanced accuracy corresponding to batch 0.
By analogy, the balanced accuracies corresponding to the more than one hundred batches are obtained. As shown in fig. 4, the abscissa is the batch number and the ordinate is the balanced accuracy (BA) of the corresponding batch. As the figure shows, the balanced accuracy is not linearly related to the average maximum fingerprint similarity of a batch; that is, a higher similarity does not necessarily mean a higher balanced accuracy.
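The batching with step length P and quantity Q can be sketched as a sliding window over the test molecules sorted by descending maximum fingerprint similarity. The function names are hypothetical, indices are 0-based with an exclusive end, and the sketch assumes that only full batches of Q molecules are formed; the text does not say how molecules after the last full batch are treated.

```python
def sliding_batches(n_items, step=50, size=800):
    """Return (start, end) pairs of overlapping batches (0-based, end exclusive)."""
    return [(start, start + size)
            for start in range(0, n_items - size + 1, step)]


def batch_mean_ba(ba_sorted, step=50, size=800):
    """Average balanced accuracy per batch.

    ba_sorted: per-molecule BA values, already ordered by descending
    maximum fingerprint similarity.
    """
    return [sum(ba_sorted[start:end]) / (end - start)
            for start, end in sliding_batches(len(ba_sorted), step, size)]


batches = sliding_batches(6411, step=50, size=800)
toy_means = batch_mean_ba([1.0] * 4 + [0.0] * 4, step=2, size=4)
```

With the figures of the example, the first three windows are (0, 800), (50, 850) and (100, 900), matching molecules No. 1 to 800, No. 51 to 850 and No. 101 to 900, and 113 full batches result under the stated assumption.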

When the credibility of the inhibition probability values of a given target molecule acting on all 303 target kinases is determined, the reference molecule of the target molecule is found among the 6411 test molecules according to the above scheme, and the batch containing the reference molecule is identified; the balanced accuracy of that batch is taken as the balanced accuracy of the reference molecule and hence as the confidence of the inhibition probability values of the target molecule acting on all target kinases, and this confidence serves as the credibility of those inhibition probability values. It should be noted that the above numerical values are for illustration only and are not limiting.

Further, the first maximum fingerprint similarity (SIM_MORGAN_MAX) and the corresponding confidence of several target molecules (given in SMILES format) relative to all training molecules are shown in Table 1 below.

TABLE 1

SMILES	SIM_MORGAN_MAX	confidence
O=C1C=C(OC2=C1C=CC=C2C1=CC=CC=C1)N1CCNCC1	0.75555557	0.8226
O=c1cc(N2CCNCC2)oc2c(-c3ccccc3)cccc12	0.75555557	0.8226
CC(C)(C)NC(=O)c1cccc(Oc2ccc(Nc3ncnc4ccn(CCO)c34)cc2Cl)c1	0.46391752	0.7142

For ease of understanding, take the target molecule in the first row of the above table: its first maximum fingerprint similarity is 0.75555557, and the balanced accuracy of the batch containing the reference molecule, i.e., the test molecule whose second maximum fingerprint similarity is closest to that value, is 0.8226. The confidence of the whole set of inhibition probability values of this target molecule for all target kinases is therefore 0.8226, that is, the credibility is 0.8226.

It should be clear that, in this embodiment, the subsequent step S330 is optional: step S330 may be executed simultaneously with step S320 or in either order relative to it, or step S330 may be skipped when the calculated confidence is greater than the preset value. The confidence from step S320 is preferably used as the primary reference for determining the credibility of the inhibition probability value. If the confidence calculated in step S320 is smaller than the preset value (for example, 0.6), the result of step S330 may additionally be consulted, so that research and development personnel can judge the credibility of the inhibition probability value more comprehensively.

S330, determining the credibility of the inhibition probability value according to the number of descriptors whose values fall within the descriptor-based application domain.

In this step, an application domain different from that of S320 is used to assess the credibility of the inhibition probability value quantitatively. In a specific embodiment, the values of a plurality of different descriptors of the target molecule are obtained; it is determined whether the value of each descriptor of the target molecule is within the corresponding application domain; and the credibility corresponding to the inhibition probability values of the target molecule acting on all target kinases is determined according to the number of descriptors whose values are within the corresponding application domain. In this embodiment, the plurality of different descriptors may include at least 5 descriptors: the number of atoms (AtomCount), the molecular weight (MolWt), the lipid-water partition coefficient (ALogP), the number of hydrogen bond acceptors (NumHAcceptors) and the number of hydrogen bond donors (NumHDonors). That is, these five values are obtained for each target molecule. Since each descriptor has a threshold range for its application domain, the five values of the target molecule are compared with the corresponding threshold ranges to determine whether each is within the corresponding application domain. For three of the descriptors, namely the number of atoms, the number of hydrogen bond acceptors and the number of hydrogen bond donors, the threshold range of the application domain is greater than 0.
For the other two descriptors, the molecular weight and the lipid-water partition coefficient, in one embodiment the descriptor values of all training molecules are obtained and the training molecules are arranged in ascending order of descriptor value. The value ranked at 85% counting from the start of this order serves as an upper threshold, and the value ranked at 85% counting from the end serves as a lower threshold. If the descriptor value of the target molecule is higher than the upper threshold and lower than the maximum value among the training molecules, or lower than the lower threshold and higher than the minimum value among the training molecules, it is considered to lie in the edge range of the application domain; if it is higher than the maximum or lower than the minimum of the descriptor values of all training molecules, it is considered to be outside the application domain; if it is lower than the upper threshold and higher than the lower threshold, it is considered to be within the application domain. Of course, the value of 85% is set empirically and is not limiting. Further, after it has been determined whether each descriptor value of the target molecule is within the application domain, the overall credibility of the inhibition probability values of a single target molecule acting on all corresponding target kinases can be determined according to the number of descriptors whose values are within the corresponding application domain.

Fig. 5 shows the range of each application domain determined from the 5 descriptor values of the training molecules. To illustrate the threshold ranges for the two descriptors molecular weight and lipid-water partition coefficient, take molecular weight as an example. The molecular weight of each training molecule is first calculated, and all training molecules are then numbered in ascending order of molecular weight, for example No. 1 to No. 25645, where training molecule No. 25645 has the maximum molecular weight, 3740, and training molecule No. 1 has the minimum molecular weight, 96. If the molecular weight of the target molecule is greater than 3740 or less than 96, this descriptor value of the target molecule is outside the application domain. Counting upward from training molecule No. 1, the largest molecular weight among the first 15% of the ranking is 309; counting downward from training molecule No. 25645, the smallest molecular weight among the last 15% of the ranking is 508.6. If the molecular weight of the target molecule is greater than 508.6 and less than 3740, or greater than 96 and less than 309, the molecular weight descriptor value lies at the edge of the application domain; if it is 309 or more and 508.6 or less, the descriptor value lies within the application domain.
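The derivation of the thresholds and the three-way classification for a continuous descriptor can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, the size of each 15% tail is obtained by rounding, and a value exactly equal to the training minimum or maximum is treated as lying at the edge, a case the text leaves open.

```python
def domain_bounds(training_values, fraction=0.85):
    """Return (minimum, lower, upper, maximum) from the training descriptor values.

    lower: largest value among the first (1 - fraction) share of the ascending order.
    upper: smallest value among the last (1 - fraction) share of the ascending order.
    """
    ordered = sorted(training_values)
    n = len(ordered)
    tail = max(1, round((1 - fraction) * n))  # number of molecules in each tail
    return ordered[0], ordered[tail - 1], ordered[n - tail], ordered[-1]


def classify(value, minimum, lower, upper, maximum):
    """Classify a descriptor value as 'inside', 'edge' or 'outside' the domain."""
    if value < minimum or value > maximum:
        return "outside"
    if lower <= value <= upper:
        return "inside"
    return "edge"


# Molecular weight bounds quoted in the example: 96, 309, 508.6, 3740.
mw_bounds = (96, 309, 508.6, 3740)
statuses = [classify(mw, *mw_bounds) for mw in (400, 600, 200, 4000, 50)]
toy_bounds = domain_bounds(range(1, 21))
```

For the molecular weights 400, 600, 200, 4000 and 50, the sketch reproduces the rules of the example: within the domain, edge, edge, outside and outside, respectively.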

Referring to fig. 5 again, similarly, for the lipid-water partition coefficient descriptor (ALogP), the maximum value among the sorted training molecules is 25.5, the minimum value is -13.1, the largest value among the lowest-ranked 15% is 2.45, and the smallest value among the highest-ranked 15% is 5.43. If the lipid-water partition coefficient of the target molecule is greater than 25.5 or less than -13.1, the value is outside the application domain; if it is greater than 2.45 and less than 5.43, the value is within the application domain; if it is greater than -13.1 and less than 2.45, or greater than 5.43 and less than 25.5, the value is at the edge of the application domain.

For the number-of-atoms descriptor (AtomCount), the number of atoms in the sorted training molecules is at most 268 and at least 36, and the theoretical minimum number of atoms is 0. If the number of atoms of the target molecule is greater than 268, the value is outside the application domain; if it is greater than 36 and less than 268, the value is at the edge of the application domain; if it is greater than 0 and less than 36, the value is within the application domain.

For the number-of-hydrogen-bond-acceptors descriptor (NumHAcceptors), the number of hydrogen bond acceptors in the sorted training molecules is at most 104 and at least 8, and the theoretical minimum number of hydrogen bond acceptors is 0. If the number of hydrogen bond acceptors of the target molecule is greater than 104, the value is outside the application domain; if it is greater than 8 and less than 104, the value is at the edge of the application domain; if it is greater than 0 and less than 8, the value is within the application domain.

For the number-of-hydrogen-bond-donors descriptor (NumHDonors), the number of hydrogen bond donors in the sorted training molecules is at most 57 and at least 3, and the theoretical minimum number of hydrogen bond donors is 0. If the number of hydrogen bond donors of the target molecule is greater than 57, the value is outside the application domain; if it is greater than 3 and less than 57, the value is at the edge of the application domain; if it is greater than 0 and less than 3, the value is within the application domain.

The above values are obtained only from the training molecules selected for the current prediction model and are not limiting. It is to be understood that, in one embodiment, descriptor values located at the edge may also be counted as being within the application domain. Alternatively, the number of descriptors at the edge of the application domain may be listed separately, so that developers can refer to it when evaluating credibility.

It can be understood that, in the present embodiment, there are 5 descriptors in total, so the number of descriptor values within the corresponding application domains has an upper limit of 5 and a lower limit of 0. That is, in this step the highest credibility of the inhibition probability values of all target kinases corresponding to a single target molecule is 5 and the lowest is 0, and the higher the number, the higher the credibility. Table 2 below shows the credibility data determined in this step by the descriptor-based application domain for the 3 target molecules in table 1. In table 2, out_domain_count is the number of descriptors outside the application domain, warning_count is the number of descriptors at the edge of the application domain, and in_domain_count is the number of descriptors within the application domain; in_domain_count is the overall credibility of the inhibition probability values for all target kinases calculated by the prediction model for the corresponding target molecule.

TABLE 2

SMILES	AtomCount	MolWt	ALogP	NumHAcceptors	NumHDonors	in_domain_count	warning_count	out_domain_count
O=C1C=C(OC2=C1C=CC=C2C1=CC=CC=C1)N1CCNCC1	23	306.3650000000001	2.869600000000001	4	1	4	1	0
O=c1cc(N2CCNCC2)oc2c(-c3ccccc3)cccc12	23	306.36500000000007	2.869600000000001	4	1	4	1	0
CC(C)(C)NC(=O)c1cccc(Oc2ccc(Nc3ncnc4ccn(CCO)c34)cc2Cl)c1	34	479.9680000000002	5.141200000000004	7	3	5	0	0
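As a check on how the three counts in table 2 arise, the following sketch applies the threshold values quoted in the text to the descriptor values of the table rows. The dictionary layout, the function name `count_domains`, and the inclusive treatment of boundary values are assumptions made for the illustration; the thresholds themselves are the ones stated above for the current training set.

```python
# (minimum, lower_cut, upper_cut, maximum) per descriptor, as quoted in the text.
# For the three count descriptors the lower cut coincides with the theoretical minimum 0.
THRESHOLDS = {
    "AtomCount": (0, 0, 36, 268),
    "MolWt": (96, 309, 508.6, 3740),
    "ALogP": (-13.1, 2.45, 5.43, 25.5),
    "NumHAcceptors": (0, 0, 8, 104),
    "NumHDonors": (0, 0, 3, 57),
}

def count_domains(descriptors):
    """Count descriptors that fall inside, at the edge of, or outside the domain."""
    counts = {"in_domain_count": 0, "warning_count": 0, "out_domain_count": 0}
    for name, value in descriptors.items():
        minimum, lower_cut, upper_cut, maximum = THRESHOLDS[name]
        if value < minimum or value > maximum:
            counts["out_domain_count"] += 1
        elif lower_cut <= value <= upper_cut:  # boundaries assumed inclusive
            counts["in_domain_count"] += 1
        else:
            counts["warning_count"] += 1
    return counts

# First and third rows of table 2.
row_1 = {"AtomCount": 23, "MolWt": 306.365, "ALogP": 2.8696,
         "NumHAcceptors": 4, "NumHDonors": 1}
row_3 = {"AtomCount": 34, "MolWt": 479.968, "ALogP": 5.1412,
         "NumHAcceptors": 7, "NumHDonors": 3}
print(count_domains(row_1))
print(count_domains(row_3))
```

For the first row the molecular weight (about 306.4) falls just below the lower cut-off of 309, which accounts for the single edge descriptor.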

It can be understood that, when the credibility calculated in step S320 for the corresponding inhibition probability values is low, the number of descriptors of the same target molecule that lie within the application domain can be further checked in step S330. If this number is high, for example greater than or equal to 3, the whole set of inhibition probability values of all target kinases corresponding to the target molecule may still have a certain credibility, and research and development personnel can screen selectively as required.
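The two-stage check can be written as a small decision function. This is only a sketch: the function name, the `confidence_cutoff` value of 0.7, and the idea of returning a boolean are hypothetical; only the example threshold of 3 in-domain descriptors comes from the text.

```python
def has_usable_confidence(similarity_confidence, in_domain_count,
                          confidence_cutoff=0.7, min_in_domain=3):
    """Combine the similarity-based and descriptor-based application domains.

    similarity_confidence: credibility from step S320 (similarity-based domain).
    in_domain_count: number of descriptors inside the domain from step S330.
    confidence_cutoff: hypothetical preset value below which S320 counts as low.
    """
    if similarity_confidence >= confidence_cutoff:
        return True
    # Low similarity-based credibility: fall back to the descriptor-based check.
    return in_domain_count >= min_in_domain
```

A target molecule rejected by the similarity check alone is thus given a second chance before being discarded.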

As can be seen from this example, the protein kinase inhibitor prediction method can use a single prediction model to predict the inhibition of multiple target kinases by a single target molecule at the same time and obtain an inhibition probability value for each target kinase, so that more prediction results are obtained with fewer computing resources and prediction efficiency is improved. Meanwhile, the credibility of the whole set of inhibition probability values of the same target molecule is confirmed using different types of application domains. In addition, when the result of the similarity-based application domain is lower than a preset value, the result of the descriptor-based application domain can be selectively checked, so that the credibility of the whole set of inhibition probability values is verified from different angles and target molecules are not overlooked. Because the inhibition potential of the same target molecule against the whole set of target kinases is considered together, different molecules can be screened as candidate drug inhibitors for a given target kinase from among the target molecules whose inhibition probability values have high credibility, while molecules with high inhibition probability values for many kinases can be avoided, which helps reduce adverse drug reactions. Research personnel can design novel chemical modulators or reposition known drugs according to the inhibition probability values and the corresponding credibility, which is of high practical value for exploring target kinases with little research data.

Corresponding to the foregoing method embodiments, the present application also provides a protein kinase inhibitor prediction device, a device for constructing a protein kinase inhibitor prediction model, an electronic device, and corresponding embodiments.

Fig. 6 is a schematic structural diagram of an inhibitor prediction device for protein kinase shown in the embodiment of the present application.

Referring to fig. 6, an inhibitor prediction apparatus for protein kinase includes a prediction module 610 and a reliability confirmation module 620.

The prediction module 610 is configured to input the target molecules into a prediction model, and output inhibition probability values of the target molecules corresponding to a plurality of target kinases selected from the prediction model respectively; the prediction model is obtained by training according to the inhibition result of the sample molecules on a plurality of target kinases.

The reliability confirmation module 620 is used for determining the credibility of the inhibition probability values of the target molecule for all target kinases according to an application domain of a preset type.

Further, the prediction module 610 is configured to characterize the target molecule and obtain the corresponding two-dimensional chemical structure information, to input the two-dimensional chemical structure information of the target molecules into the prediction model, and to output the inhibition probability value of each target molecule for each target kinase. The reliability confirmation module 620 is used for determining the credibility of the inhibition probability values according to the credibility calculated from the similarity-based application domain, and/or for determining the credibility of the inhibition probability values according to the number of descriptors within the application domain calculated from the descriptor-based application domain.

Fig. 7 is another schematic structural diagram of an inhibitor prediction device for protein kinase shown in the embodiment of the present application.

Referring to fig. 7, the apparatus of the present application further comprises a classification module 630 and a screening module 640, wherein:

the classification module 630 is configured to label each sample molecule with an inhibitory activity label for each protein kinase according to the inhibition result of that sample molecule on that protein kinase; the screening module 640 is used for screening protein kinases that meet preset conditions as the target kinases of the prediction model according to the inhibitory activity labels of the sample molecules for the protein kinases. Specifically, the screening module 640 is configured to randomly divide the sample molecules multiple times, each time obtaining a preset number of training molecules and taking the remaining molecules as test molecules; for each random division, to delete from the training molecules and the test molecules those protein kinases whose number of inhibitory activity labels does not exceed a set value or which have only a single type of inhibitory activity label; and to select, as the target kinases, the protein kinases remaining in the random division in which the fewest protein kinases were deleted.
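The screening procedure can be illustrated with the following sketch. It rests on several assumptions that the text does not settle: labels are stored as a nested dictionary with `None` for an empty label, "single label" is read as "only one class (all active or all inactive) is present", a kinase is deleted if it fails the check in either the training or the test subset, and the names `screen_target_kinases`, `min_labels`, and `n_splits` are hypothetical.

```python
import random

def _usable(labels, subset, kinase, min_labels):
    """A kinase is kept only if it has more than min_labels labels and both classes."""
    observed = [labels[m].get(kinase) for m in subset]
    observed = [v for v in observed if v is not None]  # drop empty labels
    return len(observed) > min_labels and len(set(observed)) > 1

def screen_target_kinases(labels, n_train, min_labels, n_splits=5, seed=0):
    """labels: {molecule: {kinase: 1 (active) | 0 (inactive) | None (empty)}}.

    Returns (target_kinases, training_molecules, test_molecules) for the random
    division in which the fewest kinases had to be deleted.
    """
    rng = random.Random(seed)
    molecules = sorted(labels)
    kinases = sorted({k for row in labels.values() for k in row})
    best = None
    for _ in range(n_splits):
        shuffled = molecules[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        kept = [k for k in kinases
                if all(_usable(labels, subset, k, min_labels)
                       for subset in (train, test))]
        # Fewest deleted kinases is the same as most kept kinases.
        if best is None or len(kept) > len(best[0]):
            best = (kept, train, test)
    return best
```

Choosing the division with the fewest deletions keeps as many kinases as possible as prediction tasks while still guaranteeing that each has usable labels in both subsets.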

The reliability confirmation module 620 includes a first confirmation module 621 and/or a second confirmation module 622. The first confirmation module 621 is configured to obtain the Morgan fingerprint of the target molecule and the Morgan fingerprints of the sample molecules, where the sample molecules comprise training molecules and test molecules; to determine, from the fingerprint similarities between the Morgan fingerprint of the target molecule and that of each training molecule, a first maximum fingerprint similarity; to determine, from the fingerprint similarities between the Morgan fingerprint of each test molecule and that of each training molecule, a second maximum fingerprint similarity for each test molecule; to select, as a reference molecule, the test molecule whose second maximum fingerprint similarity is closest to the first maximum fingerprint similarity; and to determine the credibility of the inhibition probability values of the target molecule for all target kinases according to the balanced accuracy corresponding to the reference molecule.

Further, the first confirmation module 621 is configured to obtain, for each test molecule, the maximum fingerprint similarity among the similarities between its Morgan fingerprint and the Morgan fingerprints of the training molecules; to sort the maximum fingerprint similarities of all test molecules in descending order; to divide all test molecules into N batches according to their rank, a preset step length P and a preset quantity Q, where N is greater than or equal to 1; to determine, for each test molecule over all target kinases with non-empty labels, the number of true positive labels, false positive labels, true negative labels and false negative labels, from the true inhibitory activity labels of the test molecule and the inhibitory activity labels predicted for it by the prediction model; to calculate the balanced accuracy of each test molecule by a preset formula from the numbers of true positive, false positive, true negative and false negative labels; and to take the average of the balanced accuracies of the test molecules in the batch containing the reference molecule as the balanced accuracy corresponding to the reference molecule.
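The similarity-based credibility calculation can be sketched as follows. Many details here are assumptions rather than statements of the patent: fingerprints are represented as Python sets of on-bit indices, similarity is taken to be the Tanimoto coefficient, the "preset formula" is taken to be the usual balanced accuracy (the mean of sensitivity and specificity, using whichever of the two is defined when a molecule has only one class of true label), batches are windows of Q molecules starting every P ranks, and the first batch containing the reference molecule is used. All function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def max_similarity(fp, train_fps):
    return max(tanimoto(fp, t) for t in train_fps)

def balanced_accuracy(true_labels, predicted_labels):
    """Mean of sensitivity and specificity over kinases with non-empty labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t is None:          # empty label: kinase not measured for this molecule
            continue
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    rates = []
    if tp + fn:
        rates.append(tp / (tp + fn))
    if tn + fp:
        rates.append(tn / (tn + fp))
    return sum(rates) / len(rates) if rates else 0.0

def confidence_for_target(target_fp, train_fps, test_records, step, size):
    """test_records: list of (fingerprint, true_labels, predicted_labels)."""
    if step < 1 or step > size:
        raise ValueError("this sketch assumes 1 <= step <= size")
    # Rank test molecules by their maximum similarity to the training set.
    scored = sorted(((max_similarity(f, train_fps), balanced_accuracy(t, p))
                     for f, t, p in test_records),
                    key=lambda r: r[0], reverse=True)
    target_sim = max_similarity(target_fp, train_fps)
    # Reference molecule: test molecule whose maximum similarity is closest.
    ref = min(range(len(scored)), key=lambda i: abs(scored[i][0] - target_sim))
    for start in range(0, len(scored), step):
        batch = scored[start:start + size]
        if start <= ref < start + len(batch):
            return sum(acc for _, acc in batch) / len(batch)
```

The returned value is the mean balanced accuracy of test molecules that are about as similar to the training set as the target molecule is, which is what serves as the credibility of the target molecule's predictions.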

The second confirmation module 622 is configured to obtain the values of a plurality of different descriptors of the target molecule; to judge, for each descriptor, whether the value for the target molecule is within the corresponding application domain; and to determine the credibility of the inhibition probability values of the target molecule for all target kinases according to the number of descriptors whose values lie within the corresponding application domain.

With the protein kinase inhibitor prediction device, the training molecules, test molecules and target kinases can be determined in advance by the screening module, so that the prediction model is trained in advance. Based on the prediction model, the prediction module can predict in batches the inhibition of each target kinase by the target molecules and quantify it as an inhibition probability value, so that potentially effective drug target molecules can be screened out quickly for a single target kinase, ineffective molecules are filtered out, and unnecessary experiments are avoided. In addition, the reliability confirmation module determines the credibility of the inhibition probability values according to an application domain of a preset type, which helps exclude cases in which the prediction result of the prediction model is inaccurate and helps research and development personnel select target molecules with higher credibility as candidate drug inhibitors, thereby saving research and development cost and time and improving the efficiency of drug research and development.

Fig. 8 is a schematic structural diagram of an apparatus for constructing a protein kinase inhibitor prediction model according to an embodiment of the present application.

Referring to fig. 8, the apparatus for constructing a prediction model of an inhibitor of a protein kinase, shown in the embodiment of the present application, includes an obtaining module 810, a classifying module 820, a screening module 830, and a training and constructing module 840. Wherein:

the obtaining module 810 is configured to obtain a plurality of sample molecules and a plurality of protein kinases.

The classification module 820 is used for labeling the inhibitory activity label corresponding to each sample molecule relative to the corresponding protein kinase according to the inhibitory result of each sample molecule on each protein kinase.

The screening module 830 is configured to screen a protein kinase meeting a preset condition as a target kinase according to an inhibitory activity label of the protein kinase corresponding to the sample molecule.

The training construction module 840 is used for training the model constructed based on the deep learning framework according to the selected training molecules in the sample molecules and the target kinase; and testing the constructed model according to the selected test molecules in the sample molecules and the target kinase to obtain a prediction model.

Further, the obtaining module 810 can obtain known small molecules as sample molecules from data provided by KinomeX, and obtain protein kinases that have experimental data for use in the prediction model, so as to improve the accuracy of the prediction results of the constructed model. The classification module 820 in the device is used for determining that the inhibitory activity label corresponding to an inhibition result is an "active" label when the inhibition index of a sample molecule for a single protein kinase is greater than or equal to a preset value, and that the inhibitory activity label is an "inactive" label when the inhibition index of the sample molecule for the single protein kinase is smaller than the preset value. For example, the inhibition index may be a pKi value, a pKd value or a pIC50 value.
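The labeling rule can be expressed as a one-line threshold test. In this sketch the function name `activity_label` and the default threshold of 6.0 are hypothetical placeholders for the unspecified preset value, and `None` is used to represent an empty label for a molecule-kinase pair without experimental data.

```python
def activity_label(inhibition_index, threshold=6.0):
    """Map an inhibition index (e.g. pKi, pKd or pIC50) to an activity label.

    threshold is a hypothetical stand-in for the preset value in the text.
    Returns None (an empty label) when no measurement is available.
    """
    if inhibition_index is None:
        return None
    return "active" if inhibition_index >= threshold else "inactive"

# One sample molecule measured against three kinases (illustrative values only).
measurements = {"kinase_1": 7.2, "kinase_2": 4.9, "kinase_3": None}
labels = {kinase: activity_label(value) for kinase, value in measurements.items()}
print(labels)
```

Keeping the empty label distinct from "inactive" matters for the later steps, which count only non-empty labels.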

For the screening module 830, refer to the description of the screening module 640 in the device above; the details are not repeated here. The training construction module 840 may construct the prediction model from the Multi-task Deep Neural Network Classifier provided by the open-source deep learning framework Deep 2.4.0.

With the device for constructing the protein kinase inhibitor prediction model, a prediction model with highly accurate prediction results can be obtained. The prediction model predicts the inhibition probability values of each target molecule for a plurality of target kinases at the same time in a multi-task manner, so that high-throughput computation is completed with fewer computing resources and prediction efficiency is improved.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Referring to fig. 9, the electronic device 1000 includes a memory 1010 and a processor 1020.

The processor 1020 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.

The memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions needed by the processor 1020 or other modules of the computer. The persistent storage may be a read-write storage device, and may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, the persistent storage is a mass storage device (e.g., a magnetic or optical disk, or flash memory). In other embodiments, the persistent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory. The system memory may store some or all of the instructions and data that the processor requires at runtime. Further, the memory 1010 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks and/or optical disks. In some embodiments, the memory 1010 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card), a magnetic floppy disk, or the like. Computer-readable storage media do not include carrier waves or transitory electronic signals transmitted wirelessly or by wire.

The memory 1010 has stored thereon executable code that, when processed by the processor 1020, may cause the processor 1020 to perform some or all of the methods described above.

Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing some or all of the steps of the above-described method of the present application.

Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium or machine-readable storage medium) having executable code (or a computer program or computer instruction code) stored thereon, which, when executed by a processor of an electronic device (or server, etc.), causes the processor to perform part or all of the various steps of the above-described method according to the present application.

Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method for predicting an inhibitor of a protein kinase, comprising:

labeling each sample molecule with an inhibitory activity label for each protein kinase according to the inhibition result of that sample molecule on that protein kinase, the inhibitory activity labels including "active" labels, "inactive" labels, and empty labels; screening protein kinases that meet preset conditions as target kinases of a prediction model according to the inhibitory activity labels of the sample molecules for the protein kinases; randomly dividing the sample molecules multiple times, each time obtaining a preset number of training molecules and taking the remaining molecules as test molecules; for each random division, deleting from the training molecules and the test molecules those protein kinases whose number of inhibitory activity labels does not exceed a set value or which have only a single type of inhibitory activity label; and selecting, as the target kinases, the protein kinases remaining in the random division in which the fewest protein kinases were deleted;

2. The method of claim 1, wherein the inputting of the target molecule into a prediction model and the outputting of the inhibition probability values of the target molecule for a plurality of target kinases selected for the prediction model respectively comprises:

3. The method of claim 1, wherein the determining the confidence level of the inhibition probability value of the target molecule on all target kinases according to the application domain of the preset type comprises:

4. The method of claim 3, wherein before determining the confidence level corresponding to the inhibition probability value of the target molecule on all target kinases according to the equilibrium accuracy corresponding to the reference molecule, the method further comprises:

5. The method of claim 1, wherein the determining the confidence level of the inhibition probability value of the target molecule on all target kinases according to the application domain of the preset type comprises:

6. The method according to any one of claims 1 to 5, wherein the determining the confidence level of the inhibition probability value of the target molecule on all target kinases according to the application domain of the preset type comprises:

7. A method for constructing a model for predicting an inhibitor of a protein kinase, comprising:

obtaining a plurality of sample molecules and a plurality of protein kinases;

marking the inhibition activity label corresponding to each sample molecule relative to the corresponding protein kinase according to the inhibition result of each sample molecule on each protein kinase; the inhibitory activity tags include "active" tags, "inactive" tags, and empty tags;

screening protein kinases that meet preset conditions as target kinases according to the inhibitory activity labels of the sample molecules for the protein kinases; randomly dividing the sample molecules multiple times, each time obtaining a preset number of training molecules and taking the remaining molecules as test molecules; for each random division, deleting from the training molecules and the test molecules those protein kinases whose number of inhibitory activity labels does not exceed a set value or which have only a single type of inhibitory activity label; and selecting, as the target kinases, the protein kinases remaining in the random division in which the fewest protein kinases were deleted;

8. The method according to claim 7, wherein labeling each sample molecule with a corresponding inhibitory activity label relative to the corresponding protein kinase according to the inhibition result of each sample molecule on each protein kinase comprises:

9. An inhibitor prediction device for a protein kinase, comprising:

the classification module is used for marking the inhibitory activity label corresponding to each sample molecule relative to the corresponding protein kinase according to the inhibitory result of each sample molecule on each protein kinase; the inhibitory activity tags include "active" tags, "inactive" tags, and empty tags;

the screening module is used for screening protein kinases that meet preset conditions as the target kinases of the prediction model according to the inhibitory activity labels of the sample molecules for the protein kinases; randomly dividing the sample molecules multiple times, each time obtaining a preset number of training molecules and taking the remaining molecules as test molecules; for each random division, deleting from the training molecules and the test molecules those protein kinases whose number of inhibitory activity labels does not exceed a set value or which have only a single type of inhibitory activity label; and selecting, as the target kinases, the protein kinases remaining in the random division in which the fewest protein kinases were deleted;

10. An apparatus for constructing a model for predicting an inhibitor of a protein kinase, comprising:

the screening module is used for screening protein kinases that meet preset conditions as target kinases according to the inhibitory activity labels of the sample molecules for the protein kinases; randomly dividing the sample molecules multiple times, each time obtaining a preset number of training molecules and taking the remaining molecules as test molecules; for each random division, deleting from the training molecules and the test molecules those protein kinases whose number of inhibitory activity labels does not exceed a set value or which have only a single type of inhibitory activity label; and selecting, as the target kinases, the protein kinases remaining in the random division in which the fewest protein kinases were deleted;

11. An electronic device, comprising:

a processor; and

a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-8.

12. A computer-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-8.