CN108647489B

CN108647489B - Method and system for screening disease drug target and target combination

Info

Publication number: CN108647489B
Application number: CN201810461277.2A
Authority: CN
Inventors: 陈玲玲; 常继伟; 丁毓端; 高俊祥
Original assignee: Huazhong Agricultural University
Current assignee: Huazhong Agricultural University
Priority date: 2018-05-15
Filing date: 2018-05-15
Publication date: 2020-06-30
Anticipated expiration: 2038-05-15
Also published as: CN108647489A

Abstract

The invention discloses a method and a system for screening disease drug targets and drug target combinations, wherein the method comprises the following steps: constructing an automatic encoder according to the differential expression data of the protein between the disease cell line and the normal tissue; calculating the knockout effect of the gene according to the automatic encoder, and constructing a knockout network; predicting a disease-associated protein from the knockout network; the related protein is a drug target; and predicting the combination of the related proteins of the disease according to the knockout network, wherein the combination of the related proteins is a drug target combination. The present method or system allows for the simultaneous prediction of disease-associated proteins and the combined effects of proteins.

Description

Method and system for screening disease drug target and target combination

Technical Field

The invention relates to the field of deep neural networks, in particular to a method and a system for screening disease drug targets and target combinations.

Background

With advances in biological measurement technology, high-throughput data related to diseases and drugs is constantly accumulating, and the understanding of some diseases and of disease-associated genes/proteins is steadily improving. At present, targeted drug therapy is considered superior to traditional drug therapy in terms of safety and adverse drug reactions (ADR), so targeted drugs are gradually becoming the main direction of disease treatment and drug development. In such drug development efforts, the most critical step is the determination of drug targets, which are preferably disease-associated proteins.

Many bioinformatics methods for drug design are currently available for screening disease-related genes/proteins from various types of data, such as protein-protein interactions (PPIs), genomic mutations, gene/protein expression and functional annotations; among these, some methods that use biological networks perform better. Some methods use the disease-related biological process information contained in a protein interaction network to predict disease-related genes/proteins; other methods combine protein interaction networks with other omics data, such as gene/protein expression profiles and genomic mutation information, to predict new related genes; still others screen by network topology. These biological network approaches generally follow the "guilt-by-association (GBA)" principle, that is, genes/proteins or phenotypes closely related to known disease genes/proteins are also more likely to be associated with the disease, and such predictions are likely to introduce some bias. Some methods integrate multi-sample data to construct networks and thereby ignore the tissue specificity and condition specificity present in the networks.

Existing methods for predicting disease targets based on protein interaction networks generally comprise the following steps: first, collecting a large amount of protein interaction data and organizing it into a non-redundant set to remove erroneous connections; second, collecting gene expression profiles of normal tissues and disease tissues and calculating the differential expression value between the two; third, calculating, for each selected protein, the sum of the differential expression values of all proteins interacting with it, and using this sum as the criterion for prioritizing candidate genes.
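The three prior-art steps above can be illustrated with a minimal Python sketch. It assumes the differential expression value is the plain difference between disease and normal expression (a log fold change would be handled the same way); the function name `rank_candidates` and the dictionary-based inputs are hypothetical and chosen only for illustration.

```python
def rank_candidates(interactions, normal, disease):
    """Rank proteins by the summed differential expression of their partners."""
    # Step 2: differential expression per gene (assumed: disease minus normal).
    diff = {gene: disease[gene] - normal[gene] for gene in disease}

    # Step 1: reduce the interaction list to a non-redundant, undirected set
    # and drop self-interactions.
    pairs = {tuple(sorted(pair)) for pair in interactions if pair[0] != pair[1]}

    # Step 3: for each protein, sum the differential expression of its partners.
    scores = {}
    for a, b in pairs:
        scores[a] = scores.get(a, 0.0) + diff.get(b, 0.0)
        scores[b] = scores.get(b, 0.0) + diff.get(a, 0.0)

    # Highest-scoring proteins come first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_candidates(
    [("A", "B"), ("B", "A"), ("B", "C")],
    normal={"A": 1.0, "B": 1.0, "C": 1.0},
    disease={"A": 3.0, "B": 1.5, "C": 0.5},
)
print(ranking[0])  # ('B', 1.5)
```

Because the score depends only on direct neighbours, such a ranking inherits the guilt-by-association bias described above.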

The neural network has strong nonlinear fitting capability, is convenient for computer realization, has strong robustness, memory capability, nonlinear mapping capability and strong self-learning capability, is an important means of deep learning at present, and has a great application prospect. Here we propose a deep neural network model constructed based on auto-encoder to learn the specificity of protein interactions in disease tissues and use the trained network for screening disease-related proteins and protein combinations.

Disclosure of Invention

The invention aims to provide a method and a system for screening disease drug targets and drug target combinations. It provides a deep learning method based on an automatic encoder that can fully learn the specificity of protein interactions from cancer multi-omics data, so that the network obtained after deep learning training can effectively screen cancer-related drug targets and target combinations.

In order to achieve the purpose, the invention provides the following scheme:

a method of screening for disease drug targets and drug target combinations comprising:

constructing an automatic encoder according to the differential expression data of the protein between the disease cell line and the normal tissue;

calculating the knockout effect of the gene according to the automatic encoder, and constructing a knockout network;

predicting a disease-associated protein from the knockout network; the related protein is a drug target;

and predicting the combination of the related proteins of the disease according to the knockout network, wherein the combination of the related proteins is a drug target combination.

Optionally, the step of constructing a knockout network according to the knockout effect of the gene calculated by the automatic encoder specifically includes:

constructing a deep learning network model according to an automatic encoder;

giving a differential expression spectrum, inputting the differential expression spectrum into the deep learning network model to obtain a differential expression value, and recording the differential expression value as background output B;

setting a difference value threshold value, selecting genes with difference values larger than the difference value threshold value in the difference expression profile, and marking as high-expression genes;

sorting all the high-expression genes in descending order of difference value, assigning to the high-expression gene with the largest difference value the smallest difference value in the differential expression profile, and assigning new difference values to all the high-expression genes in turn;

constructing a new differential expression profile according to the high expression genes with new differential values and the rest genes in the differential expression profile after removing all the high expression genes;

inputting the new difference expression spectrum into the deep learning network model to obtain a second output K;

setting a comparison threshold;

calculating the difference value between the second output K of all the high expression genes and the background output B of the high expression genes to obtain a comparison difference value;

recording all the high-expression genes with the comparison difference values larger than the comparison threshold value as knockout genes;

and constructing a knockout network according to all the knockout genes.
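The knockout simulation described in the steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: `model` stands for any callable that maps a differential expression profile to an output of the same length (for example, a trained autoencoder), the new values are assigned by rank reversal (the i-th largest high-expression gene receives the i-th smallest value in the profile), and the comparison difference K - B is compared with the threshold by magnitude. The text only says "larger than", so the magnitude comparison and the function name `simulate_knockout` are assumptions.

```python
import numpy as np


def simulate_knockout(model, profile, diff_threshold, cmp_threshold):
    """Return {gene index: comparison difference} for simulated knockout genes."""
    profile = np.asarray(profile, dtype=float)

    # Background output B from the unmodified differential expression profile.
    background = np.asarray(model(profile), dtype=float)

    # High-expression genes: difference value above the difference threshold.
    high = np.flatnonzero(profile > diff_threshold)

    # Sort them from largest to smallest difference value.
    order = high[np.argsort(-profile[high])]

    # Assumed rank reversal: the largest gene gets the smallest value in the
    # profile, the second largest gets the second smallest, and so on.
    lowest = np.sort(profile)[: len(order)]
    perturbed = profile.copy()
    perturbed[order] = lowest

    # Second output K from the new differential expression profile.
    knocked = np.asarray(model(perturbed), dtype=float)

    # Comparison difference K - B for every high-expression gene.
    diff = knocked[order] - background[order]

    # Assumed magnitude comparison against the comparison threshold.
    return {int(g): float(d) for g, d in zip(order, diff) if abs(d) > cmp_threshold}


# Toy usage with an identity "model" standing in for the trained network.
print(simulate_knockout(lambda x: x, [3.0, 0.5, -2.0, 2.5, -1.0], 2.0, 4.0))
# {0: -5.0}
```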

Optionally, the constructing a knockout network according to the knockout gene specifically comprises:

using the knockout gene as a source point of the knockout network;

a gene affected by the knockout gene as an edge of the source point;

the comparison difference is used as the weight of the edge.
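A knockout network of this kind can be held in a nested dictionary, as in the hedged sketch below. It assumes that, for each knockout gene, a vector of comparison differences over all genes is already available, and that a gene counts as "affected" when the magnitude of its comparison difference exceeds a threshold. The function name `build_knockout_network` and the input layout are hypothetical.

```python
def build_knockout_network(diffs, gene_names, cmp_threshold):
    """Build {source gene: {affected gene: edge weight}} from comparison differences.

    diffs maps each knockout gene to a list of comparison differences,
    one entry per gene in gene_names (an assumed input layout).
    """
    network = {}
    for source, diff in diffs.items():
        for target, value in zip(gene_names, diff):
            # Assumed rule: an edge exists when the effect magnitude
            # exceeds the comparison threshold; self-loops are skipped.
            if target != source and abs(value) > cmp_threshold:
                network.setdefault(source, {})[target] = float(value)
    return network


network = build_knockout_network(
    {"A": [0.0, 0.8, -0.1], "B": [-0.6, 0.0, 0.05]},
    gene_names=["A", "B", "C"],
    cmp_threshold=0.5,
)
print(network)  # {'A': {'B': 0.8}, 'B': {'A': -0.6}}
```

Keeping the sign of each weight preserves the positive-value and negative-value edges that the later scoring steps refer to.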

Optionally, the predicting a disease-associated protein according to the knockout network specifically includes:

setting known drug targets as marker genes, and setting the proteins to be detected and a correlation threshold;

obtaining a target point protein and a source point protein connected with the protein to be detected according to the knockout network;

distinguishing a target protein with an inhibitory effect from a target protein with an activating effect according to the target protein and the marker gene;

calculating the weight sum of the edges connecting the target protein with the inhibitory effect and the protein to be detected, and recording as a first weight sum;

calculating the sum of absolute values of weights of edges, connected with the target protein with the activation effect and the protein to be detected, and recording the sum as a first absolute value sum;

calculating the sum of the weights of all positive-value edges of the source point protein, and recording the sum as a second weight sum, and recording the sum of the absolute values of all negative-value weights as a second absolute value sum;

calculating a correlation score of the protein to be detected according to the first weight sum, the first absolute value sum, the second weight sum and the second absolute value sum;

and selecting the protein to be detected with the correlation score higher than the correlation threshold value of all the proteins to be detected, namely the protein related to diseases.
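The four sums named in the steps above can be computed from a knockout network stored as `{source: {target: weight}}`, as the sketch below shows. Two points are assumptions, because this section does not specify them: the sets of inhibited and activated target proteins are passed in ready-made (the rule that derives them from the marker genes is not given here), and the final score is a plain sum of the four terms used only as a placeholder. All function names are hypothetical.

```python
def correlation_terms(network, protein, inhibited, activated):
    """Return (first weight sum, first absolute value sum,
    second weight sum, second absolute value sum) for one protein."""
    outgoing = network.get(protein, {})

    # Edges from the protein to target proteins with an inhibitory effect.
    first_weight_sum = sum(w for t, w in outgoing.items() if t in inhibited)

    # Edges from the protein to target proteins with an activating effect.
    first_abs_sum = sum(abs(w) for t, w in outgoing.items() if t in activated)

    # Edges arriving at the protein from its source proteins.
    incoming = [edges[protein] for edges in network.values() if protein in edges]
    second_weight_sum = sum(w for w in incoming if w > 0)
    second_abs_sum = sum(-w for w in incoming if w < 0)

    return first_weight_sum, first_abs_sum, second_weight_sum, second_abs_sum


def correlation_score(terms):
    # Placeholder combination (assumption): the source does not state
    # how the four terms are combined in this section.
    return sum(terms)


network = {
    "P": {"T1": 0.5, "T2": -0.25, "T3": 1.0},
    "S1": {"P": 0.75},
    "S2": {"P": -0.5},
}
terms = correlation_terms(network, "P", inhibited={"T1"}, activated={"T2"})
print(terms)                     # (0.5, 0.25, 0.75, 0.5)
print(correlation_score(terms))  # 2.0
```

Proteins whose score exceeds the correlation threshold would then be reported as disease-related.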

Optionally, the predicting a combination of disease-associated proteins according to the knockout network specifically includes:

collecting combinations of proteins known to have lethal and combinatorial effects as positive samples, and randomly generating from them negative samples numbering 10 times the positive samples;

selecting any target protein in the knockout network, and screening all target proteins and source proteins directly connected with the target protein;

judging whether a target protein and a source protein of the target protein in the knockout network exist in the positive sample or the negative sample;

if the target protein in the knockout network, the target proteins of the target protein and the source proteins exist in the positive sample, summing the weights of the edges connecting the target protein with those target proteins and source proteins and taking the absolute value, to obtain a first combined weight sum absolute value;

adding the first combined weight sum absolute values of all the target proteins in the positive sample to obtain a positive combined weight sum absolute value;

if the target protein in the knockout network, the target proteins of the target protein and the source proteins exist in the negative sample, summing the weights of the edges connecting the target protein with those target proteins and source proteins and taking the absolute value, to obtain a second combined weight sum absolute value;

adding the second combined weight sum absolute values of all the target proteins in the negative sample to obtain a negative combined weight sum absolute value;

assigning the target protein to a value of 1, -1 or 0 according to the absolute value of the first combined weight sum and the absolute value of the second combined weight sum to obtain a target protein assignment;

selecting a first protein to be detected and a second protein to be detected;

setting a first detection threshold and a second detection threshold;

calculating the proportion of the proteins which are affected by the first protein to be detected and the second protein to be detected together, and recording as the proportion of the proteins which are affected together;

calculating, according to the target protein assignment, the proportion of the evaluated proteins that are affected by both the first protein to be detected and the second protein to be detected, and recording it as the proportion of the evaluated proteins which are affected together, wherein the evaluated proteins are the proteins with an assigned value of 1 or -1;

determining whether the common effect protein ratio is greater than the first detection threshold while the common effect assessed protein ratio is greater than the second detection threshold;

if so, the combination of the first test protein and the second test protein is the combination of the disease-associated proteins.
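The final two-threshold test above can be sketched as follows. The denominators are assumptions, since the text does not say what the proportions are taken relative to: here each proportion is the number of proteins affected by both candidates divided by the number affected by either one. The 1/-1/0 assignment is taken as a ready-made input dictionary, and the function names are hypothetical.

```python
def shared_ratios(network, first, second, assignment):
    """Return (shared protein ratio, shared evaluated protein ratio)."""
    targets_first = set(network.get(first, {}))
    targets_second = set(network.get(second, {}))

    union = targets_first | targets_second
    shared = targets_first & targets_second

    # Evaluated proteins are those assigned 1 or -1 (not 0).
    evaluated_union = {p for p in union if assignment.get(p, 0) != 0}
    evaluated_shared = shared & evaluated_union

    # Assumed denominators: proteins affected by either candidate.
    ratio_all = len(shared) / len(union) if union else 0.0
    ratio_eval = (
        len(evaluated_shared) / len(evaluated_union) if evaluated_union else 0.0
    )
    return ratio_all, ratio_eval


def is_target_combination(network, first, second, assignment,
                          first_threshold, second_threshold):
    """Both ratios must exceed their detection thresholds."""
    ratio_all, ratio_eval = shared_ratios(network, first, second, assignment)
    return ratio_all > first_threshold and ratio_eval > second_threshold


network = {
    "A": {"X": 1.0, "Y": -0.5, "Z": 0.2},
    "B": {"Y": 0.3, "Z": 0.4, "W": 0.1},
}
assignment = {"X": 1, "Y": -1, "Z": 0, "W": 0}
print(shared_ratios(network, "A", "B", assignment))                    # (0.5, 0.5)
print(is_target_combination(network, "A", "B", assignment, 0.4, 0.4))  # True
```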

A system for screening disease drug targets and drug target combinations, comprising:

the automatic coding module is used for constructing an automatic coder according to the differential expression data of the protein between the disease cell line and the normal tissue;

the knockout network construction module is used for constructing a knockout network according to the knockout effect of the gene calculated by the automatic encoder;

a related protein prediction module for predicting a disease-related protein according to the knockout network; the related protein is a drug target;

and the protein combination prediction module is used for predicting the combination of the related proteins of the disease according to the knockout network, and the combination of the related proteins is the drug target combination.

Optionally, the knockout network construction module specifically includes:

the network model building unit is used for building a deep learning network model according to the automatic encoder;

the background output calculation unit is used for giving a difference expression spectrum, inputting the difference expression spectrum into the deep learning network model to obtain a difference expression value, and recording the difference expression value as background output B;

a high expression gene obtaining unit, configured to set a difference value threshold, select a gene in the difference expression profile whose difference value is greater than the difference value threshold, and mark the gene as a high expression gene;

the high expression gene assignment unit is used for sorting all the high expression genes in descending order of difference value, assigning to the high expression gene with the largest difference value the smallest difference value in the differential expression profile, and assigning new difference values to all the high expression genes in turn;

a new differential expression profile constructing unit for constructing a new differential expression profile based on the high expression genes having the new differential values and the remaining genes in the differential expression profile after removing all the high expression genes;

the second output calculation unit is used for inputting the new difference expression spectrum into the deep learning network model to obtain a second output K;

a comparison threshold setting unit for setting a comparison threshold;

the comparison difference value calculation unit is used for calculating the difference value between the second output K of all the high expression genes and the background output B of the high expression genes to obtain a comparison difference value;

a knocked-out gene obtaining unit, configured to mark all the high-expression genes whose comparison difference values are greater than the comparison threshold as knocked-out genes;

and the network construction unit is used for constructing a knockout network according to all the knockout genes.

Optionally, the network constructing unit specifically includes:

using the knockout gene as a source point of the knockout network;

a gene affected by the knockout gene as an edge of the source point;

the comparison difference is used as the weight of the edge.

Optionally, the relevant protein prediction module specifically includes:

the gene setting unit is used for setting known drug targets as marker genes, and for setting the proteins to be detected and a correlation threshold;

a target point source point protein obtaining unit, configured to obtain a target point protein and a source point protein connected to the protein to be detected according to the knockout network;

a target protein distinguishing unit for distinguishing a target protein with an inhibitory effect from a target protein with an activating effect according to the target protein and the marker gene;

the inhibition target protein calculation unit is used for calculating the weight sum of the edges of the inhibition effect target protein connected with the protein to be detected, and recording the weight sum as a first weight sum;

the activation target protein calculation unit is used for calculating the sum of absolute values of weights of edges, connected with the target protein with the activation effect and the protein to be detected, and recording the sum as a first absolute value sum;

a source point protein calculation unit, configured to calculate a sum of weights of all positive-valued edges of the source point protein, which is recorded as a second weight sum, and a sum of absolute values of all negative-valued weights, which is recorded as a second absolute value sum;

a correlation score calculation unit for calculating a correlation score of the protein to be measured based on the first weight sum, the first absolute value sum, the second weight sum, and the second absolute value sum;

and the disease-related protein determining unit is used for selecting the protein to be detected, of which the correlation score is higher than the correlation threshold value, of all the proteins to be detected, namely the disease-related protein.

Optionally, the protein combination prediction module specifically includes:

a sample collection unit for collecting combinations of proteins known to have lethal and combinatorial effects as positive samples, and randomly generating from them negative samples numbering 10 times the positive samples;

a target protein screening unit for selecting any target protein in the knockout network and screening all target proteins and source proteins directly connected with the target protein;

a first judging unit, configured to judge whether a target protein and a target protein of the target protein in the knockout network and a source protein are present in the positive sample or the negative sample;

a first combination weight calculation unit, configured to, if the target protein in the knockout network, the target proteins of the target protein and the source proteins exist in the positive sample, sum the weights of the edges connecting the target protein with those target proteins and source proteins and take the absolute value, to obtain a first combined weight sum absolute value;

a positive combination weight calculation unit, configured to add the first combined weight sum absolute values of all the target proteins in the positive sample to obtain a positive combined weight sum absolute value;

a second combination weight calculation unit, configured to, if the target protein in the knockout network, the target proteins of the target protein and the source proteins exist in the negative sample, sum the weights of the edges connecting the target protein with those target proteins and source proteins and take the absolute value, to obtain a second combined weight sum absolute value;

a negative combination weight calculation unit, configured to add the second combined weight sum absolute values of all the target proteins in the negative sample to obtain a negative combined weight sum absolute value;

the target protein assignment unit is used for assigning a value of 1, -1 or 0 to the target protein according to the absolute value of the first combined weight sum and the absolute value of the second combined weight sum to obtain an assignment target protein;

a proteome selection unit for selecting a first protein to be detected and a second protein to be detected as a proteome to be detected;

a threshold setting unit for setting a first detection threshold and a second detection threshold;

the first protein proportion calculation unit is used for calculating the proportion of the proteins which are influenced by the first protein to be detected and the second protein to be detected together, and recording the proportion as the proportion of the proteins which are influenced together;

a second protein proportion calculation unit, which is used for calculating the proportion of the evaluated protein which is influenced by the first protein to be detected and the second protein to be detected together according to the assignment target protein, and recording the proportion as the proportion of the evaluated protein which is influenced together;

a second determination unit configured to determine whether the common-influence protein ratio is greater than the first detection threshold, and whether the common-influence evaluated protein ratio is greater than the second detection threshold;

a protein combination determination unit, configured to determine that the combination of the first test protein and the second test protein is a combination of disease-related proteins if yes; otherwise, it is not a combination of disease-associated proteins.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

according to the invention, an automatic encoder is constructed according to the differential expression data of protein between a disease cell line and a normal tissue; calculating the knockout effect of the gene according to the automatic encoder, and constructing a knockout network; predicting a disease-associated protein from the knockout network; the related protein is a drug target; and predicting the combination of the related proteins of the disease according to the knockout network, wherein the combination of the related proteins is a drug target combination. A method for simulating a knockout effect based on a deep neural network is provided: in the depth model, the input values for each protein are varied and differences in the output production are observed to assess the effect of the protein on the disease. The invention can capture the characteristic structure and the internal rule hidden in the complex data by only using one network model, can simultaneously predict the protein related to diseases and the combined effect of the protein, and unifies the two types of prediction problems in theory and realization.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a method for screening disease drug targets and drug target combinations according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for predicting a disease-associated protein according to an embodiment of the present invention;

FIG. 3 is a block diagram of a system for screening disease drug targets and drug target combinations according to an embodiment of the present invention;

FIG. 4 is a block diagram of a related protein prediction module according to an embodiment of the present invention;

FIG. 5 is a flow chart of predicting disease-associated proteins and protein combinations using a deep neural network model according to an embodiment of the present invention;

FIG. 6 is a network of cancer-associated protein combinations predicted by embodiments of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

FIG. 1 is a flow chart of a method for screening disease drug targets and drug target combinations according to an embodiment of the present invention. Referring to fig. 1, a method of screening for disease drug targets and drug target combinations comprising:

step 101: constructing an automatic encoder according to the differential expression data of the protein between the disease cell line and the normal tissue;

step 102: calculating the knockout effect of the gene according to the automatic encoder, and constructing a knockout network;

step 103: predicting a disease-associated protein from the knockout network; the related protein is a drug target;

step 104: and predicting the combination of the related proteins of the disease according to the knockout network, wherein the combination of the related proteins is a drug target combination.

By adopting the method, the characteristic structure and the internal rule implicit in the complex data can be captured, the combined effect of the protein and the protein related to the disease can be predicted at the same time, and the two types of prediction problems are unified in theory and implementation.

The basic unit of the deep neural network model is an automatic encoder whose internal structure corresponds to the protein interaction network. Because the training process uses differential expression data, the model can learn the specificity of protein interactions from the differences between tissues. In addition, combining multiple automatic encoders allows two proteins that are far apart in the interaction network to be linked.

Wherein, the step of constructing a knockout network according to the knockout effect of the gene calculated by the automatic encoder specifically comprises the following steps:

constructing a deep learning network model according to an automatic encoder;

setting a comparison threshold;

and constructing a knockout network according to all the knockout genes.

Wherein, the step of constructing a knockout network according to the knockout gene specifically comprises the following steps:

using the knockout gene as a source point of the knockout network;

a gene affected by the knockout gene as an edge of the source point;

the comparison difference is used as the weight of the edge.

FIG. 2 is a flowchart of a method for predicting a disease-associated protein according to an embodiment of the present invention. Referring to fig. 2, the predicting a disease-associated protein according to the knockout network specifically includes:

step 201: setting known drug targets as marker genes, and setting the proteins to be detected and a correlation threshold;

step 202: obtaining a target point protein and a source point protein connected with the protein to be detected according to the knockout network;

step 203: distinguishing a target protein with an inhibitory effect from a target protein with an activating effect according to the target protein and the marker gene;

step 204: calculating the weight sum of the edges connecting the target protein with the inhibitory effect and the protein to be detected, and recording as a first weight sum;

step 205: calculating the sum of absolute values of weights of edges, connected with the target protein with the activation effect and the protein to be detected, and recording the sum as a first absolute value sum;

step 206: calculating the sum of the weights of all positive-value edges of the source point protein, and recording the sum as a second weight sum, and recording the sum of the absolute values of all negative-value weights as a second absolute value sum;

step 207: calculating a correlation score of the protein to be detected according to the first weight sum, the first absolute value sum, the second weight sum and the second absolute value sum;

step 208: and selecting the protein to be detected with the correlation score higher than the correlation threshold value of all the proteins to be detected, namely the protein related to diseases.

Wherein the predicting a combination of disease-associated proteins according to the knockout network specifically comprises:

selecting a first protein to be detected and a second protein to be detected;

setting a first detection threshold and a second detection threshold;

FIG. 3 is a block diagram of a system for screening disease drug targets and drug target combinations according to an embodiment of the present invention. Referring to fig. 3, a system for screening disease drug targets and drug target combinations, comprising:

an automatic coding module 301, configured to construct an automatic encoder according to the differential expression data of the protein between the disease cell line and the normal tissue;

a knockout network construction module 302, configured to construct a knockout network according to the knockout effect of the gene calculated by the automatic encoder;

a related protein prediction module 303 for predicting a disease related protein from the knockout network; the related protein is a drug target;

a protein combination prediction module 304, configured to predict a combination of disease-associated proteins according to the knockout network, where the combination of associated proteins is a drug target combination.

The knockout network construction module specifically comprises:

a comparison threshold setting unit for setting a comparison threshold;

Wherein, the network construction unit specifically is:

using the knockout gene as a source point of the knockout network;

a gene affected by the knockout gene as an edge of the source point;

the comparison difference is used as the weight of the edge.

FIG. 4 is a block diagram of a related protein prediction module according to an embodiment of the present invention. Referring to fig. 4, the related protein prediction module specifically includes:

a marker gene setting unit 401, configured to set a known drug target as a marker gene, a protein to be detected, and a correlation threshold;

a target point source protein obtaining unit 402, configured to obtain a target point protein and a source point protein connected to the protein to be detected according to the knockout network;

a target protein distinguishing unit 403 for distinguishing a target protein having an inhibitory effect from a target protein having an activating effect based on the target protein and the marker gene;

an inhibition target protein calculation unit 404, configured to calculate a weight sum of edges where the target protein with the inhibition effect is connected to the protein to be detected, and record the weight sum as a first weight sum;

an activation target protein calculation unit 405, configured to calculate a sum of absolute values of weights of edges where the target protein with the activation effect is connected to the protein to be detected, and record the sum as a first absolute value sum;

a source point protein calculation unit 406, configured to calculate a sum of weights of all positive-valued edges of the source point protein, which is recorded as a second weight sum, and a sum of absolute values of all negative-valued weights, which is recorded as a second absolute value sum;

a correlation score calculation unit 407, configured to calculate a correlation score of the protein to be detected according to the first weight sum, the first absolute value sum, the second weight sum, and the second absolute value sum;

a disease-related protein determining unit 408, configured to select a protein to be detected whose correlation score is higher than the correlation threshold value from all the proteins to be detected, that is, a disease-related protein.

Wherein the protein combination prediction module specifically comprises:

and a protein combination determination unit, configured to determine that the combination of the first test protein and the second test protein is a combination of disease-related proteins if yes.

FIG. 5 is a flow chart of the method for predicting disease-related proteins and protein combinations using a deep neural network model according to an embodiment of the present invention, and referring to FIG. 5, the method of the present invention will be described in detail below:

Neural network model design and training (see part a, model training, in FIG. 5)

The present study adapts the standard auto-encoder into an ultra-sparse model suited to learning disease-related features. In the model, each input unit represents the differential expression value of a gene between diseased and normal tissue, and each hidden neuron represents a protein interaction. Because a differential expression value can be positive or negative, a hidden neuron can see three patterns: + +, - -, and + -. The patterns have different biological meanings and therefore need to be distinguished. Neurons of the same pattern are grouped into one auto-encoder, so there are three auto-encoders. The number of input units of each auto-encoder is the number of proteins, and the number of hidden neurons is the number of interactions. The three auto-encoders are trained separately and then combined; the combined auto-encoder keeps the same number of input units and has three times as many hidden neurons. The activation function of the auto-encoder is an integrated linear activation function (equation 4), the weights are updated with the back-propagation algorithm, the learning rate is set to 0.005, the momentum is set to 0.5, and L2 regularization is applied. Each training cycle uses 6 samples, and the number of cycles is fixed according to the size of the training set.

P₁＝W₁|i₁| (1)

P₂＝W₂|i₂| (2)

In the formulas, i₁ and i₂ represent the input values of the input units, W₁ and W₂ represent the corresponding weights, W represents the average weight in each cycle, P_i (i = 1, 2) represents the input value multiplied by the weight, and P represents the input value of the hidden neuron.
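
The following is a minimal sketch of how the interactions of one sample might be assigned to the three sign patterns, and of the two weighted inputs of a hidden neuron given by equations 1 and 2. The function names, the treatment of a value of exactly 0 as positive, the protein names, and the toy numbers are illustrative assumptions and are not part of the disclosure.

```python
def interaction_pattern(de_a, de_b):
    """Return '++', '--' or '+-' for the differential expression values
    of two interacting proteins.

    Assumption: a value of exactly 0 is treated as positive.
    """
    pos_a, pos_b = de_a >= 0, de_b >= 0
    if pos_a and pos_b:
        return "++"
    if not pos_a and not pos_b:
        return "--"
    return "+-"


def weighted_inputs(i1, i2, w1, w2):
    """Equations 1 and 2: P1 = W1 * |i1| and P2 = W2 * |i2|."""
    return w1 * abs(i1), w2 * abs(i2)


# Toy differential expression values (hypothetical protein names).
de = {"A": 1.2, "B": -0.4, "C": 0.7}
interactions = [("A", "B"), ("A", "C")]

# One group per auto-encoder.
groups = {"++": [], "--": [], "+-": []}
for a, b in interactions:
    groups[interaction_pattern(de[a], de[b])].append((a, b))

p1, p2 = weighted_inputs(de["A"], de["B"], w1=0.5, w2=2.0)
```

In this toy case the interaction A-B falls into the + - group and A-C into the + + group.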

Calculation of the knockout effect of a gene (see part b, knockout simulation, in FIG. 5)

To calculate the knockout effect of a given gene/protein, we construct a deep learning network from the trained auto-encoder. First, the deep learning network is built from the auto-encoder. Next, a differential expression profile (computed from experimental data downloaded from a public database) is used as input, and the model output is recorded as the background output. Finally, the input value of a given gene is changed and the output is recalculated; the genes whose output values differ from the background output are regarded as the genes affected by knocking out the given gene. In practice, the changed value of the given gene is negative, and its magnitude is chosen from the distribution of all differential expression values so that it is close to the minimum of that distribution. The computational model consists of five trained auto-encoders connected in series. The output of each layer combines two parts, the output of the auto-encoder at the current layer and the input of the auto-encoder at the current layer, which are given the weights α and (1-α) respectively, where α is set to 0.25 (equation 5). Calculating the knockout effect comprises the following steps:

1: Given a differential expression profile, take it as input, calculate the output of the neural network model, and record it as the background output B; this output is the differential expression value calculated by the neural network.

2: Calculate the knockout effect of the up-regulated genes in the differential expression profile. Set a threshold, such as 0.5, and select the genes whose values in the differential expression profile are greater than 0.5. For each selected gene in turn, assign it a smaller value while keeping the values of the other genes unchanged (the smaller value is chosen from the data distribution as the minimum value in the whole data). Input the altered expression profile into the same model and record its output as K.

3: For the output K of each highly expressed gene in step 2, calculate the difference from the background output B. Since the output units correspond to all genes, the genes with large differences are the genes affected by knocking out the given gene.

4: Construct the knockout network. The result of step 3 can be expressed as a network: the knocked-out gene is a source point, each edge points to an affected gene, and the weight of the edge is the difference between K and B for the affected gene (D in equation 6), where K is the output obtained after a given highly expressed gene is changed and B is the output obtained for the unchanged input. Each differential expression profile yields a corresponding knockout network.

l_n = αe_n + (1-α)l_(n-1) (5)

D = B - K (6)

In the formulas, l_n represents the output of layer n, e_n represents the output of the auto-encoder at layer n, and α is a weight, which is set to 0.25. Vectors K and B represent the output after knocking out a given gene and the background output, respectively, and D represents the difference between the two outputs.
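
Steps 1 to 4 and equations 5 and 6 can be sketched as follows. This is only an illustration under stated assumptions: `toy_autoencoder` is a hypothetical stand-in (a fixed linear map) for the trained auto-encoder, the "smaller value" is taken to be the minimum of the input profile, self-edges are skipped, and the edge threshold of 0.000001 is borrowed from the embodiment described later in the text.

```python
import numpy as np

ALPHA = 0.25    # weight of the auto-encoder output (equation 5)
N_LAYERS = 5    # five auto-encoders connected in series


def toy_autoencoder(x, w):
    """Hypothetical stand-in for the trained auto-encoder: a linear map."""
    return w @ x


def forward(x, w):
    """Apply l_n = alpha * e_n + (1 - alpha) * l_(n-1) for N_LAYERS layers."""
    layer = x
    for _ in range(N_LAYERS):
        layer = ALPHA * toy_autoencoder(layer, w) + (1 - ALPHA) * layer
    return layer


def knockout_network(profile, w, de_threshold=0.5, edge_threshold=1e-6):
    """Return {(source, target): weight}, with weights from D = B - K."""
    background = forward(profile, w)             # step 1: background output B
    low_value = profile.min()                    # assumed "smaller value"
    edges = {}
    for g in np.flatnonzero(profile > de_threshold):   # step 2: up-regulated genes
        knocked = profile.copy()
        knocked[g] = low_value
        diff = background - forward(knocked, w)  # step 3: D = B - K
        for t in np.flatnonzero(np.abs(diff) > edge_threshold):
            if t != g:                           # assumption: no self-edges
                edges[(int(g), int(t))] = float(diff[t])   # step 4
    return edges


# Toy example: gene 0 feeds into gene 1; gene 2 is independent.
w = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
profile = np.array([2.0, 0.1, -1.0])
edges = knockout_network(profile, w)
```

In the toy example only gene 0 exceeds the threshold of 0.5, and knocking it out lowers the output of gene 1, so the network contains a single edge from gene 0 to gene 1 with a positive weight.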

Prediction of disease-associated proteins (see part c, preferred proteins, in FIG. 5)

To evaluate the relevance of each gene to the disease, each gene is scored according to its connections in the knockout network; the higher the score, the higher the relevance to the disease. Marker genes are required for the score calculation, and known drug targets are used as markers here. To calculate the score of a given protein P, the proteins directly connected to it in the knockout network are considered. Because the edges of the knockout network are directed, these proteins fall into two classes, source proteins and target proteins, both of which are used to calculate the score of P. Known drug targets can be divided into activation-effect targets and inhibition-effect targets. An edge with a positive weight in the knockout network means that knocking out the source protein impairs the function of the target protein. Therefore, if a target protein of P is a known drug target with an inhibitory effect and the edge has a positive weight, the weights of all such edges connected to P are summed and the sum is labeled S_pwt. If a target protein of P is a drug target with an activating effect and the edge has a negative weight, the absolute values of the weights of all such edges connected to P are summed and the sum is labeled S_nwt. For each source protein i linked to P, the sum of all its positive weights and the sum of the absolute values of all its negative weights are recorded first, and the S_pwt and S_nwt of protein i are recorded as well. The final calculation of the score of protein P is given by equation 7.

In the formula, ω_i represents the weight of the edge from source protein i to target protein P.
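
The quantities that enter the score can be sketched as follows. The sketch only collects the sums named in the text (S_pwt, S_nwt, and the per-source-protein sums); it does not reproduce equation 7, so no final score is computed. The function names, the dictionary layout of the knockout network, and the toy proteins are illustrative assumptions.

```python
def target_sums(edges, protein, inhibit_targets, activate_targets):
    """Return (S_pwt, S_nwt) over the outgoing edges of `protein`."""
    s_pwt = 0.0
    s_nwt = 0.0
    for (src, tgt), w in edges.items():
        if src != protein:
            continue
        if tgt in inhibit_targets and w > 0:
            s_pwt += w            # inhibitory-effect target, positive edge
        elif tgt in activate_targets and w < 0:
            s_nwt += abs(w)       # activating-effect target, negative edge
    return s_pwt, s_nwt


def source_terms(edges, protein, inhibit_targets, activate_targets):
    """Collect, for each source protein i with an edge i -> protein,
    the quantities the text names as inputs to the score."""
    terms = {}
    for (src, tgt), w in edges.items():
        if tgt != protein:
            continue
        out = [v for (s, _), v in edges.items() if s == src]
        terms[src] = {
            "omega": w,                                   # weight of edge i -> P
            "pos_sum": sum(v for v in out if v > 0),      # sum of positive weights
            "neg_abs_sum": sum(-v for v in out if v < 0), # sum of |negative weights|
            "s_pwt_s_nwt": target_sums(edges, src, inhibit_targets, activate_targets),
        }
    return terms


# Toy knockout network: {(source, target): weight}; names are hypothetical.
edges = {("P", "T1"): 0.8, ("P", "T2"): -0.3, ("P", "T3"): 0.5,
         ("S", "P"): 0.4, ("S", "T1"): -0.2}
inhibit = {"T1"}     # known targets with an inhibitory effect
activate = {"T2"}    # known targets with an activating effect

s_pwt, s_nwt = target_sums(edges, "P", inhibit, activate)
terms = source_terms(edges, "P", inhibit, activate)
```

Here T3 contributes nothing because it is not a marker, and the negative edge from S to the inhibitory target T1 is ignored when computing the S_pwt of S.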

Prediction of disease-associated protein combinations (see part d, preferred protein combinations, in FIG. 5)

We also used the knock-out network to predict combinations of proteins with lethal effects. The basic assumption is that a combination of proteins with a lethal effect will affect or can be affected by the same group of other proteins, and if these proteins are efficiently identified, they can be used to predict a new combination of proteins with a lethal effect. Combinations of proteins known to have lethal and combinatorial effects are first collected from the database as positive samples, and then negative samples 10 times the number of positive samples are randomly generated. The method comprises the following specific steps:

1: For each protein in a knockout network, find all target and source proteins directly connected to it. If a combination of two of these proteins is present in the positive samples, add the weights of the edges connecting the two proteins of this combination to the selected protein and take the absolute value. Perform this operation for all protein combinations present in the positive samples, sum the values, and label the sum LWpos. Perform the same calculation for the negative samples and label the result LWneg. Note that source and target proteins are calculated separately.

2: Using the results of the previous step, assign a value of 1, -1, or 0 to each protein according to equation 10. The threshold T in the equation was set to 2.3, at which the numbers of proteins assigned -1 and 1 were found to be close.

3: Select proteins x and y and determine whether they have a lethal or combined effect. First find all proteins associated with x and with y, label the two sets X and Y, and label their intersection XY. Then calculate CR_xy and CRA_xy by formula 9 and formula 10, where CR_xy is the ratio of proteins affected by both x and y and CRA_xy is the evaluated ratio of the co-affected proteins. In the formulas, the weight term represents the weight of the edge that joins a given protein x to protein p. Finally, if CR_xy > 0.3 and CRA_xy > 0.03, the combination of proteins is identified as having a combined or lethal effect.
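
The set construction and the final decision rule of step 3 can be sketched as follows. Formulas 9 and 10 are not reproduced here, so the sketch takes CR_xy and CRA_xy as already-computed inputs; the function names, the dictionary layout of the knockout network, and the toy values are illustrative assumptions.

```python
CR_THRESHOLD = 0.3
CRA_THRESHOLD = 0.03


def associated(edges, protein):
    """Proteins directly connected to `protein`, as source or as target."""
    linked = set()
    for src, tgt in edges:
        if src == protein:
            linked.add(tgt)
        elif tgt == protein:
            linked.add(src)
    return linked


def is_predicted_combination(cr_xy, cra_xy):
    """Decision rule from the text: both ratios must exceed their thresholds."""
    return cr_xy > CR_THRESHOLD and cra_xy > CRA_THRESHOLD


# Toy knockout network: {(source, target): weight}; names are hypothetical.
edges = {("x", "a"): 0.5, ("x", "b"): -0.2, ("c", "x"): 0.1,
         ("y", "a"): 0.3, ("c", "y"): 0.4, ("y", "d"): 0.2}

X = associated(edges, "x")
Y = associated(edges, "y")
XY = X & Y    # proteins linked to both x and y
```

With these toy edges, X is {a, b, c}, Y is {a, c, d}, and XY is {a, c}; a pair with, for example, CR_xy = 0.35 and CRA_xy = 0.05 would pass the rule, whereas CRA_xy = 0.02 would not.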

The following operations are also required in the embodiment of the invention:

1 collecting data set of known drug target

This study obtained 913 known drug target proteins from the Therapeutic Target Database (TTD) and divided them into two categories: inhibitory (inhibitors) and agonistic (agonists). They are used for screening new cancer-related genes.

2 Collection of Gene and protein expression datasets

Gene and protein expression data are derived from 3 sources: (1) protein expression data downloaded from the ProteomicsDB database, comprising 98 cell line samples; (2) RNA expression data downloaded from the BioXpress database, including 18 cancer types and samples from 660 patients. These gene and protein expression data are used mainly in 2 parts of the study. Training the neural network model uses the differential expression values from comparisons between cell lines and tissues in ProteomicsDB. Prediction of cancer-associated genes uses the data in ProteomicsDB and BioXpress.

3 Collection of protein interaction data sets

The protein-protein interaction (PPI) data set integrates data from five public protein databases, namely the Biological General Repository for Interaction Datasets (BioGRID), the Human Protein Reference Database (HPRD), IntAct, the Database of Interacting Proteins (DIP), and the Molecular INTeraction database (MINT), into a non-redundant protein interaction data set, from which 224,988 interaction pairs among 14,759 proteins with protein expression information in ProteomicsDB were selected. It is used for constructing the neural network.
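
The integration into a non-redundant interaction set can be sketched as follows. The sketch assumes that interactions are undirected, that a pair is kept only when both proteins have expression information, and that self-interactions are dropped; the function name and the toy pairs are illustrative and are not taken from the databases named above.

```python
def merge_interactions(sources, expressed):
    """Union undirected pairs from several sources, keeping only pairs
    whose two proteins both have expression information."""
    merged = set()
    for pairs in sources:
        for a, b in pairs:
            # Assumption: self-interactions are dropped.
            if a != b and a in expressed and b in expressed:
                merged.add(frozenset((a, b)))   # (a, b) and (b, a) collapse
    return merged


# Toy input: two sources with an overlapping pair in opposite order.
source_1 = [("TP53", "MDM2"), ("EGFR", "GRB2")]
source_2 = [("MDM2", "TP53"), ("KRAS", "RAF1")]
expressed = {"TP53", "MDM2", "EGFR", "GRB2", "KRAS"}

ppi = merge_interactions([source_1, source_2], expressed)
proteins = set().union(*ppi)
```

In the toy input the TP53-MDM2 pair appears in both sources and is counted once, and the KRAS-RAF1 pair is discarded because RAF1 has no expression information, leaving 2 pairs among 4 proteins.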

4 Collection of protein combinatorial datasets

The study constructed a protein combination dataset containing known drug target combination and synthetic lethality information. 1,272 pairs of drug target combinations were obtained from the Drug Combination Database V2.0 (DCDB2), after removing the drug target combinations in the database that do not apply to cancer. In addition, synthetic lethal protein combinations were obtained from the SynLethDB database (13,171 experimentally validated pairs and 5,489 computationally predicted pairs) and from the result data of a research paper (182 computationally predicted pairs). The protein combination dataset merges the above information into a network comprising 20,062 pairs of protein combinations. It is used for predicting novel protein combinations.

FIG. 6 is a network of cancer-associated protein combinations predicted by an embodiment of the present invention. Referring to FIG. 6, the present study adapts the standard auto-encoder into an ultra-sparse model suitable for learning cancer-related features. The differential expression data of proteins between cancer cell lines and normal tissues in ProteomicsDB are used as input, and the connections between hidden units and input units are constrained by the protein interaction network, so that each hidden unit represents one interaction. The 3 auto-encoders are trained separately and then combined into a single auto-encoder. The trained network can capture which interaction patterns are more important to the cancer process. The derivative term of the activation function is limited to 1 by multiplying by a corresponding coefficient. Training ran for 2,940 iterations, each involving 6 comparisons of one cell line, so each cell line was trained 30 times. The threshold of the activation function is set to 1 and multiplied by the average weight. The learning rate is set to 0.005, the momentum coefficient is set to 0.5, and the decay coefficient is set to 1.

The trained auto-encoders are assembled into a five-layer recurrent neural network in which every layer uses the auto-encoder with the same weights. At each layer the previous output is added to the input, so the final output amplifies the differential values of the proteins learned to be important. Using the differential expression data in BioXpress and ProteomicsDB as input to the deep neural network model, the study defines the knockout effect as the difference in output values after the input value of each highly differentially expressed protein (DE value > 0.5) is changed to a negative value (modified value = -7.5). All interacting proteins whose output difference has an absolute value > 0.000001 are used to construct a knockdown (KD) network.

The result of the protein evaluation takes the form of a KD network. A change in the input value of one protein can affect the output of many other proteins; only the effects of knocking out up-regulated proteins/genes are considered here, and the effect on another protein can be positive or negative. To obtain a single value representing the importance of a protein, the known cancer target proteins in the TTD database are used as markers in the KD network, and the KD score is calculated as an index of the importance of the protein in cancer (the KD score strategy takes into account both direct effect values and indirect effect factors in the network). We evaluated all proteins with a DE value > 0.5, and the final KD score is the average over the cell lines or disease samples.

In this study, proteins with a high KD score (ProteomicsDB: KD score > -0.1; BioXpress: KD score > -0.02) were pre-selected as cancer-associated proteins. Integrating the results of the two datasets, ProteomicsDB and BioXpress, gave a preliminary prediction of 4,862 cancer-associated proteins; the intersection of the two datasets accounts for 87% and 85% of their respective totals. The predicted proteins include 386 known drug targets, covering 86.35% of the known drug targets that can be evaluated by the present method (this study used 913 known drug targets from the TTD database, of which 447 proteins have the PPI and expression information needed for evaluation with the method). Among the top 500 pre-selected cancer-related proteins (those with the largest mean KD score across cell lines), 211 are known cancer drug targets. Gene Ontology (GO) enrichment analysis of the remaining 289 proteins showed that they are mainly enriched in DNA replication (GO:0006270; GO:0006260; GO:0006268), metabolic pathways (GO:0009058; GO:0044267; GO:0051246; GO:0009894; GO:0019538), chromatin structure (GO:0051276; GO:0098813; GO:0007059; GO:0051983), and cell division and the cell cycle (GO:0007049; GO:0000278; GO:0022402; GO:1903047; GO:0007346), among others. Among the top 500 pre-selected cancer-associated proteins, some that have been extensively studied, such as cellular tumor antigen p53 (TP53), epidermal growth factor receptor (EGFR), GTPase HRas (HRAS), GTPase NRas (NRAS), and GTPase KRas (KRAS), all have high KD scores. In addition, some novel cancer-related proteins, such as amyloid-beta A4 protein (APP), neural cell adhesion molecule L1 (L1CAM), thymidine kinase 1 (TK1), DNA replication licensing factor MCM2 (MCM2), and MCM4, have high KD scores and can serve as potential drug targets for subsequent research.
We focus on a newly discovered cancer-associated target, APP protein. Previous studies have shown that variation in the APP protein is a major cause of Alzheimer's Disease (AD), deletion of the APP protein can block the cell division process, and addition of the C-terminus of the APP protein can restart the division process. The results of this study showed that APP protein has higher KD score in 18 cancers and 77 cell lines, suggesting that APP protein may also be involved in very important functions in cancer.
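
The pre-selection by KD score threshold and the coverage figure quoted above can be sketched as follows. The thresholds (-0.1 and -0.02) and the counts (386 and 447) come from the text; the function name and the toy scores, including the placeholder proteins X1 and X2, are illustrative assumptions and are not results of the study.

```python
def preselect(kd_scores, threshold):
    """Proteins whose mean KD score exceeds the threshold."""
    return {p for p, s in kd_scores.items() if s > threshold}


# Toy mean KD scores per dataset (made-up values for illustration only).
proteomicsdb_scores = {"TP53": 0.40, "EGFR": 0.25, "APP": 0.10, "X1": -0.50}
bioxpress_scores = {"TP53": 0.05, "EGFR": 0.01, "APP": -0.01, "X2": -0.30}

selected_p = preselect(proteomicsdb_scores, -0.1)    # ProteomicsDB threshold
selected_b = preselect(bioxpress_scores, -0.02)      # BioXpress threshold
shared = selected_p & selected_b                     # predicted in both datasets

# Coverage of known targets: 386 recovered out of 447 that could be evaluated.
coverage_percent = round(386 / 447 * 100, 2)
```

The last line reproduces the 86.35% coverage stated in the text from the two counts.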

The present study also uses the neural network model to predict combinations of cancer-associated proteins. In recent years, combination drugs have proven to be an effective means of cancer treatment, and synthetic lethal effects are of great help to research on precise, individualized therapy. Because individuals and cancer types differ and the search space of target combinations is huge, it is difficult to search all drug target combinations and synthetic lethal genes by conventional experimental means. Pre-screening protein combinations with bioinformatic methods and then verifying them experimentally is therefore a viable strategy. The method is based on the assumption that a synthetic lethal pair or an effective target combination consists of proteins that can affect, or be affected by, the same other protein(s), and that such proteins can be identified from known synthetic lethal pairs and target combinations. This study constructed a protein combination data set (containing 20,062 pairs of protein combinations) with known drug target combination and synthetic lethality information, and represented the interactions between proteins with the KD networks obtained by feeding the expression data in ProteomicsDB and BioXpress into the deep neural network model.

This study focused on combinations of known cancer targets and on combinations predicted among PPI protein pairs. Combinations whose frequency of appearance in the cell lines was higher than 0.25 were retained, and, to obtain more reliable protein combinations, a protein combination covering 10% of the cell lines was predicted to be a cancer-associated protein combination. Two networks were constructed, one from the combinations predicted among PPI pairs and one from the combinations of known targets (FIG. 6). The network of cancer-associated protein combinations from PPIs contains 2,439 pairs of protein combinations, including proteins with higher connectivity (degree) such as EGFR, TP53, cyclin-dependent kinase 2 (CDK2), KRAS, and RAC-alpha serine/threonine-protein kinase (AKT1) (see part a of FIG. 6). The network of cancer-associated protein combinations of known targets contains 2,543 pairs of protein combinations, including proteins with higher connectivity such as NRAS, KRAS, EGFR, thymidylate synthase (TYMS), ribonucleoside-diphosphate reductase large subunit (RRM1), DNA topoisomerase 2-alpha (TOP2A), and serine/threonine-protein kinase Chk1 (CHEK1) (see part b of FIG. 6). These combinations can subsequently be investigated as potential cancer-associated protein combinations.

The basic unit of the deep neural network model is the auto-encoder. Its internal structure corresponds to the protein interaction network, and its training uses differential expression data, so the specificity of protein interactions can be learned from the differences between tissues. In addition, combining multiple auto-encoders allows two proteins that are far apart in the interaction network to be linked. These two points ensure that the model's function goes beyond that of existing methods.

The present invention can predict disease-associated proteins. In essence, the informatics approach to predicting disease-associated proteins/genes predicts the effect of knocking out a protein/gene on cellular activity. Accurately calculating the effects of gene activation/knockout involves very complex interactions among intracellular components; complete and precise interaction relationships are difficult to determine, and kinetic parameters are even harder to measure. For predicting disease-related proteins, the model provides a method of simulating the knockout effect based on a deep neural network: in the deep model, the input value of each protein is changed and the resulting differences in the output are observed to assess the effect of the protein on the disease.

The present invention allows prediction of combinations of disease-associated proteins. In recent years, combination drugs have proven to be an effective means of treating diseases, and synthetic lethal effects are of great help to research on precise, individualized treatment. However, because individuals and disease types differ and the search space of target combinations is huge, it is difficult to search all drug target combinations and synthetic lethal genes by conventional experimental means. Using the model's calculations to pre-screen protein combinations before experimental verification therefore becomes a feasible strategy. The method is based on the assumption that a synthetic lethal pair or an effective target combination consists of proteins that can affect, or be affected by, the same other protein(s), and that such proteins can be identified from known synthetic lethal pairs and target combinations. The interactions between proteins are represented by the KD network obtained by inputting expression data into the deep neural network model.

The model predicts disease-related genes better than simple differential gene analysis and network-coupled differential gene analysis. The model has important application value in computer-aided drug target screening.

The model is also suitable as an effective method for general gene/protein expression data analysis. Because the model is trained on gene/protein expression data, it can learn the interaction specificity under different backgrounds from different input data. A well-trained model can be obtained in advance from large-scale expression data, and the trained model can then rapidly analyze the differential characteristics of new expression data from the same platform, which shows the advantage of big-data analysis (the method suits new expression data of any scale; even for data with a small sample size, the method can integrate the large-scale expression information of the same platform to analyze the specificity of the data). Therefore, the method of screening important related genes with the model is also suitable for single-cell sequencing data. Since single-cell sequencing data are derived from a single cancer type, the model needs to be pre-trained on the given data set. Here, transcriptome data of circulating tumor cells (CTCs) from prostate cancer patients are used as input, and a prostate cancer-specific neural network model is trained by a procedure similar to the one above, for screening prostate cancer-related proteins, drugs, and protein combinations.

Single-cell sequencing data downloaded from NCBI (accession number GSE67980) were used, containing 122 prostate circulating tumor cell (CTC) samples from 12 patients, 12 prostate tumor samples, and 3 normal prostate tissue samples. The expression differences between the cancer cell line and the normal tissue are used as input data to train a prostate cancer-specific model. The training process is basically the same as the previous one, with 2,144 rounds in total and 6 comparison calculations per round (the first 5 use differential expression values from randomly selected comparisons among the 536 groups of samples, and the last uses differential expression values from the 536 groups of samples selected in sequence, repeated 4 times). The learning rate is set to 0.005, the momentum coefficient is set to 0.5, and the decay coefficient is set to 1. In the 5-layer recurrent neural network, the activation function threshold is set to 0.01 and multiplied by the average weight. For all proteins with a DE value > 0, the modified value is set to -4.
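
The round structure described above (536 groups repeated 4 times, giving 2,144 rounds of 6 comparisons each) can be sketched as follows. The sketch assumes that the 5 random comparisons are drawn with replacement and that the sequential comparison of round r is group r modulo 536; the function name and the fixed random seed are illustrative choices.

```python
import random


def training_schedule(n_groups=536, repeats=4, per_round=6, seed=0):
    """Return a list of rounds; each round lists the group indices compared.

    The first per_round - 1 indices are random (assumption: drawn with
    replacement); the last index walks through the groups in order.
    """
    rng = random.Random(seed)
    rounds = []
    for r in range(n_groups * repeats):
        randoms = [rng.randrange(n_groups) for _ in range(per_round - 1)]
        rounds.append(randoms + [r % n_groups])
    return rounds


schedule = training_schedule()
n_rounds = len(schedule)
```

With the default arguments, `n_rounds` is 536 × 4 = 2,144, which matches the number of rounds stated in the text, and every group is visited sequentially exactly 4 times.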

The results show that the model can readily find important proteins related to prostate cancer, such as the androgen receptor (AR), kallikrein-2 (KLK2), and KLK3. The androgen receptor is an important known prostate cancer-related gene. The results also show that AR is not significantly expressed in some CTC cells, in which proteins with important functions, such as disabled homolog 2 (DAB2), chromatin modification-related protein MEAF6 (MEAF6), tyrosine-protein kinase JAK1 (JAK1), interleukin-2 receptor subunit beta (IL2RB), mitogen-activated protein kinase kinase kinase 2 (MAP3K2), integrin-linked protein kinase (ILK), and cyclin-dependent kinase inhibitor 1 (CDKN1A), have higher scKD values. This suggests that these genes play important roles in prostate cancer CTC cells that do not express AR and may replace some AR functions.

In addition, the study predicts important protein combinations associated with prostate cancer. These protein combinations differ among the individual CTCs of prostate cancer. Proteins with higher coverage were selected, and the network formed by the prostate cancer-related protein combinations includes several functionally important proteins, such as signal transducer and activator of transcription 3 (STAT3), exportin-1 (XPO1), cyclin-dependent kinase 4 (CDK4), microtubule-associated protein 4 (MAP4), and Bcl-2-like protein 1 (BCL2L1).

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A method of screening for disease drug targets and drug target combinations comprising:

predicting a combination of disease-related proteins according to the knockout network, wherein the combination of the related proteins is a drug target combination;

the step of constructing a knockout network according to the knockout effect of the gene calculated by the automatic encoder specifically comprises the following steps:

constructing a deep learning network model according to an automatic encoder;

setting a comparison threshold;

and constructing a knockout network according to all the knockout genes.

2. The method of screening a combination of a disease drug target and a drug target according to claim 1, wherein the constructing a knockout network according to the knockout gene specifically comprises:

using the knockout gene as a source point of the knockout network;

a gene affected by the knockout gene as an edge of the source point;

the comparison difference is used as the weight of the edge.

3. The method of screening for disease drug targets and drug target combinations according to claim 1, wherein the predicting a disease-associated protein from the knockout network specifically comprises:

setting a known drug target as a marker gene, setting a protein to be detected and a correlation threshold;

4. The method of screening a combination of a disease drug target and a drug target according to claim 1, wherein the predicting a combination of disease-associated proteins according to the knockout network specifically comprises:

selecting a first protein to be detected and a second protein to be detected;

setting a first detection threshold and a second detection threshold;

5. A system for screening disease drug targets and drug target combinations, comprising:

a protein combination prediction module for predicting a combination of disease-related proteins according to the knockout network, wherein the combination of related proteins is a drug target combination;

the knockout network construction module specifically comprises:

a comparison threshold setting unit for setting a comparison threshold;

6. The system for screening a combination of a disease drug target and a drug target according to claim 5, wherein the network building unit is specifically:

using the knockout gene as a source point of the knockout network;

a gene affected by the knockout gene as an edge of the source point;

the comparison difference is used as the weight of the edge.

7. The system for screening a combination of a disease drug target and a drug target of claim 5, wherein the relevant protein prediction module specifically comprises:

the marker gene setting unit is used for setting known drug targets as marker genes, proteins to be detected and correlation threshold values;

8. The system for screening a combination of a disease drug target and a drug target according to claim 5, wherein the protein combination prediction module specifically comprises: