CN111785319A

CN111785319A - Drug relocation method based on differential expression data

Info

Publication number: CN111785319A
Application number: CN202010594317.8A
Authority: CN
Inventors: 鱼亮; 何丹
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2020-10-16
Anticipated expiration: 2040-06-28
Also published as: CN111785319B

Abstract

The invention provides a drug relocation method based on differential expression data, which mainly solves the problems that differential genes are not screened and the accuracy of drug prediction is low in the prior art, and the scheme is as follows: acquiring gene data, calculating a gene expression change value and a significant value thereof, and screening out a gene set with significant differential expression; constructing two inquiry gene sets of diseases by using the gene set and the pathogenic genes; respectively acquiring correlation values of the two gene sets and the medicines and corresponding medicine prediction accuracy; comparing the accuracy of the predicted drugs of the two gene sets; calculating a difference value between the screened gene set and the standard normal distribution, and constructing a prediction model of the optimal disease query gene set according to the difference value and the calculated expression threshold value; the model is used for predicting gene sets of other diseases, calculating the accuracy of the predicted drugs and screening out candidate drugs with potential treatment effects. The prediction accuracy is high, and the method can be used for predicting candidate drugs with potential treatment effects on diseases.

Description

Drug relocation method based on differential expression data

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to a drug relocation method which can be used for predicting candidate drugs with good treatment effects.

Background

It is well known in the art that drug development takes a long time, and it usually takes 11.4-13.5 years to develop new drugs from beginning to end, with costs of each drug being $ 1800 and 1800 billion. Despite the significant time and money invested in drug development, over 90% of drugs end up failing. For example, from 1949 to 2014 the U.S. food and drug administration FDA co-approved 150 drugs for cancer treatment, making the available drugs insufficient due to the existence of cancer subtypes and drug resistance, and it is therefore of great importance to explore new drug development strategies. Drug relocation is a new strategy for old drugs, saving a lot of time and cost required for drug development. Traditional drug development typically involves five stages: drug discovery and preclinical testing, drug safety review, clinical study of the drug, FDA review approval, and post-marketing FDA safety monitoring. While drug relocation has only four steps: compound identification, compound acquisition, drug development, and FDA after-market safety monitoring. The existing drug is used as a potential candidate drug to evaluate the treatment effect on the disease. This has the advantage of making the process of presuming a drug faster and less costly, taking advantage of several important drug properties that have been established including efficacy, pharmacokinetics, pharmacodynamics, and toxicity.

The current methods for drug relocation main stream can be divided into: methods based on drug information, genomics and transcriptomics data.

A method for drug relocation based on drug information.

Drug information-based approaches are primarily directed to predicting new indications for drugs by using four classes of information for drugs to explore common features among compounds. The first category is the use of chemical structure information of drugs to relocate drugs, and this category of methods uses the structural and chemical properties of drug molecules to calculate similarities between drugs based on known quantitative relationships between chemical structures and biochemical activities. Its activity is significantly increased or lost due to a slight change in the structure of the molecule. Therefore, such methods have a high false positive in application. The second method is to use the spatial structure information of the drug to reposition the drug, and this method mainly utilizes the spatial structure modeling of the drug molecule and the protein to simulate their direct physical interaction, which depends on the analysis of the drug molecule and the protein structure, however, some of the protein structures analyzed at present have errors and many important protein structures are not completely decomposed, which makes the modeling of the interaction between the compound and the protein incomplete, and therefore, this method also has high false positive. The third category is to use the side effect information of the drug to relocate the drug, the side effect information of the drug reflects the physiological consequences and the phenotypic expression of the drug, most of the methods are performed on the assumption that the side effect is associated with the disease when the drugs related to the disease share a plurality of side effects, and the accuracy of the prediction of the drug depends on whether the characteristics of the side effect of the drug have definite definitions. It is well known in the art that although strict preclinical assessments of side effects of drugs are performed, years of clinical use and post-market monitoring are required before complete identification of the side effects of a newly approved drug is possible. In addition, data redundancy is also a concern in the side effect field. The fourth category is the use of drug target information to relocate drugs, and such methods generally use drug targets and disease-related genes as drug and disease mediators to predict new drug-disease relationships. However, due to the small number of targets for current drugs, such methods often require the addition of additional information, introducing new noise. In summary, many drug relocation methods based on drug information face the problem of high false positives due to insufficient information.

Secondly, a drug relocation method based on genomics.

Genomics is the study of the structure, function and inheritance of a genome, including the interactions between genes and the human environment. Its main task is to determine the molecular sequences that make up the deoxyribonucleotide DNAs of an organism's genome. Current methods of drug relocation based on genomics can be divided into mathematical statistics and network methods. The mathematical statistics method is mainly used for judging the relationship between the drugs and the diseases by performing enrichment analysis on the gene sets corresponding to the drugs and the diseases. This approach is simple, easy to understand and implement, but depending on the disease or drug related gene set, the accuracy of the gene set may affect the accuracy of the results. The method using the network mainly constructs a genetic-related network or a disease network by using various data such as whole genome association study data, gene interaction data, metabolic data, and electronic medical records. The dependence of the method on the original data is large, the reliability of the result can be directly influenced by the incompleteness of the data set, for example, the existing protein network is incomplete, about 88 percent of the single nucleotide mutations related to diseases are positioned in intergenic and intronic regions, and the genetic effect of the parts can be hardly explained by the existing gene interaction data.

And thirdly, a drug relocation method based on transcriptomics.

Transcriptomes are the necessary links of proteomes that link genomic genetic information with biological functions, and are the main means for studying gene expression. The transcriptomics-based drug relocation method mainly aims at establishing a connection relation between differentially expressed genes or non-coding RNA in diseases and differentially expressed genes under drug perturbation to predict the influence of drugs on the diseases. For example, Chen et al, in the context of up-and down-regulation of genes, relocate the drug by dividing the genes into two parts, up-and down-regulating UCDB in disease and DCUB in disease but up-and down-regulating DCUB in drug. The method comprises the steps of using a human function connection network FLN as a background network, respectively mapping k genes with highest expression in diseases and m genes with lowest expression under the action of drugs into the network, using the k genes as seed genes, calculating prediction scores of the k genes and neighbor nodes, arranging the genes in a descending order according to the prediction scores, representing the condition that the m down-regulated genes are ranked in front through an area AUC under a characteristic curve of a subject, and using the AUC value as the score of the disease prediction drugs. Similarly, the m down-regulated genes are used as seed genes to calculate the fraction of the drug predicted disease. And synthesizing the two prediction scores to obtain the mutual prediction scores of the two groups of genes under the parameters k and m, sequencing the medicines in a descending order through the scores, calculating AUC by taking the medicines approved by FDA and the medicines used clinically as standards, obtaining the prediction result of UCDB, updating the parameters, repeating the experiment, and taking the result with the highest finally calculated AUC value as the final result of UCDB. Similarly, the prediction result of DCUB can be calculated. The accuracy of this method depends largely on the accuracy of the background network. In addition, the parameters of the disease are different in both modes, so a large number of experiments are required, which undoubtedly increases the difficulty of using the algorithm.

In addition, the join graph approach was first proposed by the join graph CMap database. The method utilizes transcriptome data to carry out reverse matching on the expression patterns of the drugs and the diseases, namely the expression patterns of genes in the diseases and the drugs are opposite, and the high expression in the diseases needs to be under the action of the drugs and vice versa. Whether the effect of the candidate drug on the disease is treatment or aggravation can be judged through the positive and negative matching scores so as to achieve the aim of screening the drug. The framework of the connection diagram method is mainly divided into three parts: gene expression profiles, disease query gene sets, and matching algorithms. At present, most of researches based on a connection diagram method framework are focused on gene expression profiles and matching algorithms, and the selection of a disease query gene set has a crucial influence on the accuracy of a prediction result. In addition, the methods select genes with remarkable differential expression, particularly those with high differential expression, as a query gene set of diseases, and completely do not consider pathogenic genes which are closely related to the diseases, so that the accuracy of the predicted result is low.

Disclosure of Invention

The invention aims to overcome the defects of the transcriptomics-based drug relocation method, and provides a drug relocation method based on differential expression data so as to provide a model for predicting an optimal disease query gene set and improve the accuracy of a prediction result.

The technical idea of the invention is as follows: finding an optimal query gene set for the disease through pathogenic genes and differential expression genes of the disease; constructing a linear model for predicting an optimal inquiry gene set of diseases according to experimental results of various diseases; and predicting the optimal query gene set of other diseases through the linear model to calculate the association score of the diseases and the medicines.

According to the technical idea, the implementation scheme of the invention comprises the following steps:

(1) downloading pathogenic gene data by using an OMIM database, and searching a pathogenic gene set L of the disease;

(2) downloading gene expression data, calculating each gene expression change value logFC and corresponding false positive FDR, and screening genes;

(3) two query gene sets were constructed:

(3a) deleting the top m genes and the bottom k genes in the screened gene set with obvious differential expression to obtain a disease de-noising differential gene set A, wherein k is more than or equal to 0, and m is more than or equal to 0;

(3b) sorting the pathogenic gene set L from small to large according to the logFC sequence of each gene, adding the gene with the logFC <0 to the top of the disease de-noising difference gene set A, and adding the gene with the logFC >0 to the bottom of the disease de-noising difference gene set A to obtain a disease gene set B of the difference and pathogenic genes;

(4) respectively calculating correlation value vectors AS of the difference gene set A and each medicine for denoising diseases_AAnd correlation value vector AS of disease gene set B of difference and pathogenic gene and each medicine_B；

(5) Calculating the accuracy P of the predicted drug of the denoised differential gene set A_AAnd calculating the difference and the accuracy P of the disease gene set B for predicting the medicine_B；

(6) Comparing the accuracy P obtained in the step (5)_AAnd P_B：

If P_BHigher than P_AIncreasing the values of the parameters k and m, deleting more noises, and repeating the steps (3) - (5) until P_B≤P_A(ii) a Otherwise, executing the step (7);

(7) recording the current difference gene set A of denoising the disease as an optimal inquiry gene set A 'of the disease, and determining a threshold value T at two ends of the optimal inquiry gene set A' of the disease_upAnd T_down；

(8) Setting the minimum value of the total threshold number of the sum of the top threshold number and the bottom threshold number as 20, and judging whenWhether the sum of the top threshold number and the bottom threshold number of the front is lower than a minimum value of 20: if yes, continuing to download other disease data from the TCGA database, and obtaining two end threshold values of the newly downloaded disease by adopting the steps (3) to (7)

And

executing (9) until the total threshold number sum is not less than 20; otherwise, directly executing (9);

(9) constructing a prediction model of the optimal difference gene set threshold of the disease:

(9a) randomly taking 100000 numerical points from standard normal distribution as background values, recording the numerical points as vectors E, using Kolmogorov-Smirnov test statistic J to represent distribution difference between differential expression value of disease genes and 100000 background points, using the distribution difference as independent variable X of model, and using threshold values T at two ends of optimal query gene set of all diseases_upAnd T_downA dependent variable Y as a model;

(9b) according to the parameters obtained in the step (9a), constructing a prediction model as a linear model of | Y | ═ aX + b, wherein a is the slope of the model, and b is the intercept of the model;

(10) predicting the optimal query gene set of other diseases by using a prediction model, and calculating the accuracy P of M drugs before drug ranking through (4) - (5)_MAnd then selecting the M medicines to obtain all candidate medicines with potential treatment effects on diseases, wherein M represents the number of medicines and is more than or equal to 10.

Compared with the prior art, the invention has the following advantages:

1. in the invention, in the process of acquiring the disease query gene set, pathogenic gene data are added, the query gene sets of two diseases are constructed by combining the differential genes and the pathogenic genes, the influence of the pathogenic genes on the diseases is fully considered, and the noise removal operation is carried out on the differential genes of the diseases by comparing the query gene sets of the two diseases.

2. The invention uses not less than 10 diseases to denoise the differential genes, uses the threshold value to express the gene set, uses the experimental data of various diseases and the threshold value thereof to construct the prediction model of the optimal differential gene set, uses the model to predict the optimal differential gene set for other diseases, and obtains more accurate medicine prediction accuracy compared with the prior art.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a graph showing the accuracy of the prediction of sarcoma drug relocation using the present invention.

Detailed Description

The embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings and specific embodiments.

This example illustrates breast cancer in the TCGA database of cancer genomes, where all diseases in the TCGA database where the causative gene is present are used.

Referring to fig. 1, the drug relocation method based on differential expression data is implemented by the following steps:

step 1, downloading pathogenic gene data.

This example utilizes the OMIM database to search the set of causative genes for breast cancer, for a total of 22 genes, designated set L.

And 2, downloading gene expression data, calculating each gene expression change value logFC and corresponding false positive FDR, and screening genes.

2.1) downloading breast cancer data and calculating the gene expression change value logFC:

2.1.1) downloading gene expression data of the breast cancer data from a TCGA database, and identifying the sample types of the gene expression data in the TCGA according to a Barcode code provided by the TCGA to obtain a cancer patient sample set and a normal person sample set;

2.1.2) calculating the expression change value logFC of the genes in the breast cancer according to the downloaded breast cancer data:

wherein e is_lThe expression value of the gene in a sample l is represented, l represents the sample serial number in the gene expression data, when S +1 is not less than l and not more than S + W, the sample l is a cancer patient sample, when S +1 is not less than l and not more than S + W, the sample l is a normal person sample, S represents the number of the cancer patient samples, and W represents the number of the normal person samples;

2.2) calculating the false positive FDR corresponding to the gene in the breast cancer:

2.2.1) constructing an expression value vector θ for a gene from gene expression data in breast cancer:

θ＝(e₁,e₂,…,e_l,…,e_s,e_s+1,e_s+2,…,e_s+f,…,e_S+W)，

wherein e is_lExpressing the expression value of the gene in the cancer sample l, and 1. ltoreq. l.ltoreq.S, e_S+fExpressing the expression value of the gene in a normal human sample S + f, wherein S + f is more than or equal to 1 and less than or equal to W, f represents the sample serial number of the normal human in the gene expression data, and S + W represents the total sample number of the gene expression data;

2.2.2) randomly scrambling the serial numbers of the samples in the expression value vector theta of the genes in the breast cancer to obtain the expression value vector labeled by the scrambled samples

Wherein,

expressing the expression value of the gene in the cancer sample l after the label is disturbed,

expressing the expression value of the gene in a normal human sample S + f after the label is disturbed;

2.2.3) vector θ^*Performing random test, and calculating theta in the v random test^*Is calculated as a statistical result value H_v：

2.2.4) repeating the step (2.2.3) for 1000 times, and calculating the statistical significance value F of each gene expression change value:

wherein C represents the checking statistic value, and P represents theta in the random test^*Is higher than the set of C, | P { H |_V|H_VV is more than C and more than or equal to 1 and less than or equal to 1000, and represents the number of elements in the set P;

2.2.5) rank the F from small to large, calculate the false positive FDR of the gene:

wherein N represents the total number of genes and i represents the ordering of the genes;

2.3) screening genes from the gene expression variance logFC and false positive FDR corresponding to the genes:

2.3.1) setting the threshold for differential changes in gene expression change value logFC to 1.5 and the threshold for false positive FDR to α to 0.05, respectively;

2.3.2) comparing the gene expression change value logFC and the false positive FDR with respective threshold values, and screening genes of which the logFC | is more than or equal to 1.5 and the FDR is less than or equal to 0.05 as the genes with obvious differential expression of the breast cancer, wherein 3166 genes with obvious differential expression of the breast cancer are selected in the example.

And 3, constructing two query gene sets of the breast cancer.

3.1) deleting the topmost m genes and the bottommost k genes in the screened gene set with obvious differential expression to obtain a de-noised differential gene set A, wherein k is more than or equal to 0, m is more than or equal to 0, and the parameter selected in the de-noised differential gene set A for the breast cancer in the example is k which is 0 and m which is 0;

3.2) sequencing a pathogenic gene set L of the breast cancer from small to large according to the logFC sequence of each gene, adding the gene with the logFC <0 to the top of a differential gene set A for denoising the breast cancer, and adding the gene with the logFC >0 to the bottom of the differential gene set A for denoising the breast cancer to obtain a disease gene set B of the difference of the breast cancer and the pathogenic gene;

step 4, respectively calculating correlation value vectors AS of the difference gene set A for breast cancer denoising and each medicine_AAnd the difference of breast cancer and the correlation value vector AS of the disease gene set B of the pathogenic gene and each medicine_B。

4.1) calculating the correlation value vector AS of the difference gene set A and each medicine for denoising the breast cancer_A：

4.1.1) taking a difference gene set A for denoising the breast cancer as a query list of a Kolmogorov-Smirnov method, and taking an ordered list of genes under the action of each medicament as a reference list set in the Kolmogorov-Smirnov method;

4.1.2) constructing a position vector V (1.. n) of each gene in the reference list set, wherein n represents the number of genes in the reference list;

4.1.3) will query logFC in the list>0 as a query list Q of Kolmogorov-Smirnov_upWill query logFC in the list<0 as a Down-Regulation query List Q of Kolmogorov-Smirnov_downSeparately, for each drug, a set of query genes Q is computationally investigated_upEnrichment score of ES_upAnd downregulation of the query Gene set Q_downEnrichment score of ES_down：

Wherein,

score representing genes ranked top in the query list and reference list

Denotes the score of a gene before the query list and after the reference list, p denotes the gene rank, V_up(p) represents the position of the genes ranked as p in the upper query list in the reference list, and s1 represents the number of genes in the upper query list;

scores representing genes ranked later in the query list and the reference list;

score, V, representing genes ranked later in the query list and earlier in the reference list_down(p) represents the position of the genes ranked as p in the query list below in the reference list, and s2 represents the number of genes in the query list below;

4.1.4) enrichment score ES of the query Gene set according to (4.1.3)_upAnd downregulating enrichment scores ES of the query gene set_downCalculating the correlation value as of the difference gene set A and each medicine for denoising the disease^A：

If ES_upAnd ES_downThe same sign, then as^A＝0，

Otherwise, as^A＝ES_up-ES_down；

4.1.5) as for all drugs in the differential Gene set A by^AThe scores are normalized:

wherein,

the j is more than or equal to 1 and less than or equal to q, q represents the total number of the medicaments,

indicates all the drug scores as in the differential gene set A^AThe maximum value of (a) is,

indicates all the drug scores as in the differential gene set A^AThe minimum value of (a) is determined,

representing the correlation value of the j < th > drug and the disease after normalization in the difference gene set A;

4.1.6) combining the medicines with the normalized values of the diseases thereof to obtain the correlation value vector of the difference gene set A for denoising the breast cancer and all medicines:

wherein,

expressing the normalized correlation value of the jth drug and the disease of the difference gene set A for denoising the breast cancer, wherein j is more than or equal to 1 and less than or equal to q, and q expresses the total number of the drugs;

4.2) calculating correlation value vector AS of disease gene set B of breast cancer difference and pathogenic genes and each medicine_B:

4.2.1) taking the disease gene set B as a query list of the Kolmogorov-Smirnov method, taking the ordered list of the genes under the action of each medicament as a reference list set in the Kolmogorov-Smirnov method, and adopting the same steps as (4.1.1) to (4.1.5) to calculate the difference of the breast cancer and the normalized correlation value of the disease gene set B of the pathogenic genes and the medicament j

4.2.2) combining the normalized values of the drugs and the breast cancer to obtain the difference of the breast cancer and the correlation value vector of the disease gene set B of the pathogenic genes and all the drugs:

wherein,

the correlation value of the j drug of the disease gene set B representing the difference of the breast cancer and the pathogenic gene after being normalized with the disease.

Step 5, respectively calculating the predicted medicine accuracy P of two gene sets, namely a difference gene set A for denoising diseases and a disease gene set B for differentiating and causing diseases_AAnd P_B。

5.1) utilizing the correlation value vector AS of the difference gene set A and all medicines for denoising the breast cancer in the step 4_ACalculating the predicted drug accuracy P of the difference gene set A for denoising the breast cancer_A；

5.1.1) according to AS_AOrdering the medicines in a small-to-large order to obtain a first ordered medicine sequence R in a difference gene set A for denoising the breast cancer_A；

5.1.2) according to the correlation value vector AS of the difference gene set A and all the medicines for denoising the breast cancer in the step 4_ACalculating a relevance value vector AS_AAbsolute value of (a):

wherein,

represents the absolute value of the correlation score of the jth drug and the disease in the differential gene set A, wherein j is more than or equal to 1 and less than or equal to q, and q represents the total number of the drugs;

5.1.3) according to absolute value | AS_AI rule from big to small, all the medicines are sequenced to obtain a second ordered medicine sequence AR of the difference gene set A for denoising the breast cancer_A；

5.1.4) downloading a drug set omega related to the breast cancer from a drug toxicity database CTD as a standard drug set related to the breast cancer;

5.1.5) two drug orderings R in the differential Gene set A for Breast cancer De-noising_AAnd AR_AAnd the standard drug set omega of breast cancer, calculating the first drug ranking R of breast cancer_AAccuracy P of_A1And second drug order AR for breast cancer_AAccuracy P of_A2：

Wherein M is_A1Is shown in the drug order R_ANumber of top ranked drugs, U_A1Represents M_A1Number in drug set Ω, M, of individual drugs_A2Expressed in the drug order AR_ANumber of top ranked drugs, U_A2Represents M_A2The number of drugs in the drug set Ω;

5.1.6) comparison of P_A1And P_A2If P is the size of_A1≥P_A2The accuracy P of medicine prediction is carried out on the difference gene set A of breast cancer denoising_A＝P_A1Otherwise, P_A＝P_A2；

5.2) correlation value vector AS of disease gene set B and all drugs according to the differences and pathogenic genes in step 4_BCalculating the accuracy P of the predicted drug_B：

5.2.1) correlation value vector AS of disease Gene set B with all drugs based on differences and disease genes_BThe same procedure as in (5.1.1) to (5.1.5) was used to obtain the difference in breast cancer and the accuracy P of the disease gene set B of the causative genes in both ranks_B1And P_B2；

5.2.2) comparison of P_B1And P_B2If P is the size of_B1≥P_B2The difference of breast cancerAccuracy P of drug prediction of disease gene set B of heterogenous and pathogenic genes_B＝P_B1Otherwise, P_B＝P_B2。

Step 6, comparing the accuracy P of the predicted medicine of the two inquiry gene sets A, B of the breast cancer_AAnd P_B。

Predicting accuracy P of medicine by using difference gene set A for denoising breast cancer_AAccuracy P of predicting drugs with disease Gene set B of differences and disease genes_BAnd (3) comparison:

if P_BHigher than P_AIncreasing the values of the parameters k and m, deleting more noises, and repeating the steps 3-5 until P_B≤P_A；

Otherwise, executing step 7;

step 7, recording the current difference gene set A for denoising the breast cancer as an optimal query gene set A 'of the breast cancer, and determining threshold values T at two ends of the optimal query gene set A' of the breast cancer_upAnd T_down。

7.1) constructing two normal distributions for the gene set obtained in step 2:

7.1.1) a normal distribution symmetrical about logFC ═ 1.5 was constructed for genes in the gene set with logFC > 0:

(logFC_g)_down＝-logFC_g+3,logFC_g>0,

therein, logFC_gExpression Change value of Gene g, (logFC)_g)_downRepresents a value symmetrical about logFC of 1.5 constructed by an existing gene g;

7.1.2) constructing a normal distribution for the genes of logFC <0 in the gene set that is symmetric about logFC-1.5:

(logFC_g)_up＝-logFC_g-3,logFC_g<0，

wherein (logFC)_g)_upRepresents a value symmetrical about logFC-1.5 constructed by an existing gene g;

7.2) calculating the mean value μ of the normal distribution in (7.1.1) symmetrical about logFC ═ 1.5_downHebiaoTolerance sigma_down:

Wherein G represents the number of genes in a normal distribution;

7.3) calculating the mean value μ of the normal distribution in (7.1.2) symmetrical about logFC-1.5_upAnd standard deviation σ_up：

7.4) the maximum and minimum values of logFC in the Breast cancer optimal query Gene set A' were respectively recorded as (logFC)_maxAnd (logFC)_minCalculating corresponding threshold value T according to the results of (7.2) - (7.3)_upAnd T_down：

In this example, the bottom gene threshold T of the optimal differential gene set A' determined for breast cancer_upTop gene threshold T of optimal differential gene set a ═ 2.03_down＝-1.56；

And 8, determining whether to download other disease data according to the sum of the current top threshold number and the current bottom threshold number.

8.1) adding the current top threshold quantity and the bottom threshold quantity to obtain the total threshold quantity sum;

8.2) setting the minimum value of the total threshold number of the sum of the top threshold number and the bottom threshold number to 20;

8.3) comparing the total threshold number sum with a set total threshold number minimum value of 20:

if sum<20, continuing to download other disease data from the TCGA database, and adopting the steps 3-7 to obtain two end thresholds of the newly downloaded disease

And

executing step 9 until the total threshold number sum is not less than 20;

otherwise, step 9 is performed directly.

And 9, constructing a prediction model of the optimal difference gene set threshold value of the disease.

9.1) determining independent variable X and dependent variable Y of the prediction model:

9.1.1) randomly taking 100000 numerical points from the standard normal distribution as background values and marking as a vector E;

9.1.2) characterization of the distribution differences between differential expression values of disease genes and vector E using Kolmogorov-Smirnov test statistic J as independent variable X of the model;

9.1.3) two-terminal threshold T of optimal query Gene set for all diseases_upAnd T_downA dependent variable Y as a model;

9.2) according to the parameters obtained in (9.1), constructing a prediction model as a linear model with | Y | ═ aX + b, wherein a is the slope of the model, b is the intercept of the model, and the values of a and b are calculated by the following formula:

wherein,

means representing the independent variable X

Means of absolute value of dependent variable Y

D represents the number of independent variables and dependent variables used to construct the prediction model, X_dDenotes the d-th value, Y, in the argument_dRepresenting the d-th value in the dependent variable.

In this example, the prediction model is: y | ═ 14.889X + 1.182.

And step 10, predicting the optimal query gene set of the sarcoma and carrying out drug prediction.

10.1) predicting the optimal query gene set of the sarcoma by using the prediction model constructed in the step 9, taking the predicted optimal query gene set of the sarcoma as an input set of Kolmogorov-Smirnov, and calculating the accuracy rates P of M drugs before drug ranking through the steps 4-5_M；

10.2) selecting the first M medicaments of the medicament ranking in the step (10.1) to obtain all candidate medicaments with potential treatment effect on diseases, wherein M represents the number of the medicaments, and M is more than or equal to 10.

In this example, the accuracy is P₂₀All the drugs listed in the top 20, M-20, were selected to obtain drug candidates with potential therapeutic effects on sarcomas.

The technical effects of the present invention will be described below with reference to simulation experiments.

1. Simulation conditions are as follows:

the computer hardware CPU of the simulation experiment is Intel Core (TM) i5, the memory of the computer hardware is 4G, and the computer software: RStudio integrated development software on a WINDOWS 7V system.

2. Simulation content:

screening differentially expressed genes of colorectal adenocarcinoma, breast cancer, lung adenocarcinoma, lung squamous cell carcinoma, hepatocellular carcinoma, prostate cancer, pancreatic ductal adenocarcinoma, bladder urothelial carcinoma, glioblastoma and head and neck squamous cell carcinoma by using the method to obtain upper and lower regulated gene thresholds of an optimal differential gene set, and constructing a model; and then the model is used for carrying out simulation experiments on the accuracy distribution of each ranking section of the prediction results of the potential therapeutic drugs for the sarcoma with the disease, and the results are shown in figure 2. The vertical axis in FIG. 2 represents the accuracy of each rank of the sarcoma drug-relocation prediction result, and the horizontal axis represents the number of drugs.

As can be seen from FIG. 2, the accuracy of the predicted results of the top 10 ranked drugs reaches 90%, and the accuracy of the predicted results of the top 20 ranked drugs reaches 70%. As can be seen from the accuracy of the sarcoma drug prediction result, the invention effectively improves the accuracy of the drug relocation prediction result.

Claims

1. A method of drug relocation based on differential expression data comprising the steps of:

(3) two query gene sets were constructed:

(6) Comparing the accuracy P obtained in the step (5)_AAnd P_B：

(8) Setting the minimum value of the total threshold number of the sum of the top threshold number and the bottom threshold number as 20, and judging whether the sum of the current top threshold number and the current bottom threshold number is lower than the minimum value of 20: if yes, continuing to download other disease data from the TCGA database, and obtaining two end threshold values of newly downloaded diseases by adopting the steps (3) - (7)

And

2. The method according to claim 1, wherein the expression change value logFC of each gene is calculated in (2) by the following formula:

wherein e is_lThe expression value of the gene in the sample l is shown, l is the sample number in the gene expression data, when S +1 is not less than l and not more than S + W, the sample l is the cancer patient sample, when S +1 is not less than l and not more than S + W, the sample l is the normal human sample, S is the number of the cancer patient samples, and W is the number of the normal human samples.

3. The method of claim 1, wherein the false positive FDR for gene correspondence is calculated in (2) as follows:

(2a) constructing an expression value vector θ of the gene from the gene expression data:

θ＝(e₁,e₂,…,e_l,…,e_S,e_S+1,e_S+2,…,e_S+f,…,e_S+W)，

wherein l represents the sample number of the cancer patient in the gene expression data, e_lExpressing the expression value of the gene in the cancer sample l, and 1. ltoreq. l.ltoreq.S, S representing the number of cancer patient samples, e_s+fExpressing the expression value of the gene in a normal human sample S + f, wherein S + f is more than or equal to 1 and less than or equal to W, f represents the sample number of the normal human in the gene expression data, W represents the number of normal human samples, and S + W represents a baseTotal number of samples due to expression data;

(2b) randomly scrambling the serial numbers of the samples in the expression value vector theta of the gene to obtain the expression value vector labeled by the scrambled samples

For vector theta^*Performing random test, and calculating theta in the v random test^*Is calculated as a statistical result value H_v：

(2c) Repeating the step (2b) for 1000 times, and calculating the statistical significance value F of each gene expression change value:

(2e) sorting the F from small to large, calculating the false positive FDR of the gene:

where N represents the total number of genes and i represents the ordering of the genes.

4. The method according to claim 1, wherein the screening of the gene from the gene expression change value logFC and the false positive FDR corresponding to the gene in (2) comprises setting a threshold for differential change of the gene expression change value logFC and a threshold for false positive FDR as α, respectively, and comparing the gene expression change value logFC and its false positive FDR with their respective thresholds to screen the gene with | logFC | ≧ and FDR ≦ α as the gene whose differential expression is significant.

5. The method according to claim 1, wherein (4) the correlation value vector AS of the denoised differential gene set A and each drug is calculated separately_AAnd correlation value vector AS of disease gene set B of difference and pathogenic gene and each medicine_BThe implementation is as follows:

(4a) taking a difference gene set A subjected to disease denoising AS a query list of a Kolmogorov-Smirnov method, taking an ordered list of genes under the action of each drug AS a reference list set in the Kolmogorov-Smirnov method, and calculating a correlation value vector AS of the disease query gene set A and all drugs_A；

(4a1) Constructing a position vector V (1.. n) of each gene in the reference list set, wherein n represents the number of genes in the reference list;

(4a2) query logFC in list>0 as a query list Q of Kolmogorov-Smirnov_upWill query logFC in the list<0 as a Down-Regulation query List Q of Kolmogorov-Smirnov_downSeparately, for each drug, a set of query genes Q is computationally investigated_upAnd downregulation of the query Gene set Q_downEnrichment score of ES_upAnd ES_down：

Wherein,

a score representing the gene ranked top in both the query list and the reference list,

denotes the score of a gene before the query list and after the reference list, p denotes the gene rank, V_up(p) represents the position of the genes ranked as p in the upper query list in the reference list, s1 represents the number of genes in the upper query list,

representing the scores of genes ranked further down in both the query list and the reference list,

(4a3) calculating the correlation value as of the difference gene set A and each medicine for denoising the disease^AIf ES in (4c)_upAnd ES_downThe same sign, then as^A0, otherwise, as^A＝ES_up-ES_downAs for all drugs by the following formula^AAnd (3) score normalization:

wherein,

represents the correlation value of the jth drug and the disease, j is more than or equal to 1 and less than or equal to q, q represents the total number of the drugs,

indicates the all-drug score as^AThe maximum value of (a) is,

indicates the all-drug score as^AThe minimum value of (a) is determined,

expressing the correlation value of the j < th > drug and the disease after normalization;

(4a4) combining the normalized values of the drugs and the diseases to obtain a correlation value vector:

(4b) calculating correlation value vector AS of disease gene set B of difference and pathogenic genes and all medicines_B；

(4b1) The disease gene set B is used as a query list of a Kolmogorov-Smirnov method, an ordered list of genes under the action of each medicament is used as a reference list set in the Kolmogorov-Smirnov method, and the normalized correlation values of the disease gene set B of the difference and pathogenic genes and the medicament j are calculated by adopting the same steps as (4a1) - (4a3)

(4b2) Combining the normalized values of the drugs and the diseases to obtain the correlation value vectors of the disease gene set B of the difference and the pathogenic genes and all the drugs:

6. the method of claim 1, wherein the accuracy P of the disease de-noised differential gene set A predictive drug is calculated in (5)_AAnd calculating the difference and the accuracy P of the disease gene set B for predicting the medicine_BThe implementation is as follows;

(5a) using the correlation value vector AS in (4)_ACalculating the accuracy P of the predicted drug of the denoised differential gene set A_A；

(5a1) The medicines are sequenced according to the sequence from small to large to obtain a first ordered medicine sequence R of the differential gene set A for denoising diseases_ACalculating a correlation value vector AS_AAbsolute value of (a):

wherein,

j is more than or equal to 1 and less than or equal to q, and q represents the total number of the medicaments;

(5a2) sequencing all the medicaments according to the rule that the absolute value is from large to small to obtain a second ordered medicament sequence AR of the differential gene set A for denoising the disease_A；

(5a3) Downloading a medicine set omega related to the disease from a related database, and respectively sorting R two medicines of a difference gene set A for denoising the disease_AAnd AR_ACalculation accuracy P_A1And P_A2：

Wherein M is_A1Is shown in the drug order R_ANumber of top ranked drugs, U_A1Is shown in the drug order R_AMiddle front M_A1The number of known drugs in the drug set omega, M_A2Expressed in the drug order AR_ANumber of top ranked drugs, U_A2Expressed in the drug order AR_AMiddle front M_A2The number of drugs known in the drug set Ω;

(5a4) comparison P_A1And P_A2If P is the size of_A1≥P_A2The accuracy P of the drug prediction of the disease de-noised difference gene set A_A＝P_A1Otherwise, P_A＝P_A2；

(5b) Using the correlation value vector AS in (4)_BCalculating the accuracy P of the drug predicted by the disease gene set B of differences and causative genes_B；

(5b1) Correlation value vector AS of disease gene set B and all drugs according to difference and pathogenic genes_BThe accuracies P of the disease gene sets B for which the differences and causative genes were obtained in the two ranks were calculated in the same steps as (5a1) - (5a3)_B1And P_B2；

(5b2) Comparison P_B1And P_B2If P is the size of_B1≥P_B2Accuracy P of drug prediction of disease Gene set B of differences and disease genes_B＝P_B1Otherwise, P_B＝P_B2。

7. The method of claim 1, wherein (7) a threshold T is determined across the set A' of disease-optimal query genes_upAnd T_downThe implementation is as follows:

(7a) genes of logFC >0 and logFC <0 in the gene set whose differential expression was significant, obtained for (2), respectively, constructed normal distributions symmetrical with respect to logFC and symmetrical with respect to logFC:

(logFC_g)_up＝-logFC_g-2*,logFC_g<0，

(logFC_g)_down＝-logFC_g+2*,logFC_g>0，

therein, logFC_gExpression Change value of Gene g, (logFC)_g)_upRepresents a value that is symmetric about logFC (logFC) constructed by an existing gene g_g)_downA threshold value indicating a difference change in gene expression change value logFC, which is a value symmetrical about logFC constructed by an existing gene g;

(7b) calculating the mean value μ of the normal distribution in (7a) symmetric about logFC_downAnd standard deviation σ_down:

Wherein G represents the number of genes in a normal distribution;

(7c) calculating the mean value μ of the normal distribution in (7a) about logFC-symmetry_upAnd standard deviation σ_up：

(7d) The maximum value and the minimum value of logFC in the disease optimal query gene set A' are respectively recorded as (logFC)_maxAnd (logFC)_minCalculating corresponding threshold value T according to the results of (7b) - (7c)_upAnd T_down：

8. The method of claim 1, wherein the values of slope a and intercept b of the prediction model | Y | ═ aX + b are calculated in (9b) by the following formula:

wherein,

means representing the independent variable X

Means of absolute value of dependent variable Y