CN108830040B

CN108830040B - Drug sensitivity prediction method based on cell line and drug similarity network

Info

Publication number: CN108830040B
Application number: CN201810578523.2A
Authority: CN
Inventors: 李敏; 王晓桐; 王建新
Original assignee: Central South University
Current assignee: Shenzhen Zaozhidao Technology Co ltd
Priority date: 2018-06-07
Filing date: 2018-06-07
Publication date: 2021-06-15
Anticipated expiration: 2038-06-07
Also published as: CN108830040A

Abstract

The invention discloses a drug sensitivity prediction method based on a cell line and drug similarity network, comprising the following steps: constructing a drug similarity network, a cell line similarity network and a drug-cell line relation network; obtaining a corresponding drug adjacency matrix, cell line adjacency matrix and drug-cell line relation initial matrix from the drug similarity network, the cell line similarity network and the drug-cell line relation network, respectively; and obtaining a drug sensitivity prediction matrix of the drug-cell lines by applying an unbalanced double random walk algorithm to the drug adjacency matrix, the cell line adjacency matrix and the drug-cell line relation initial matrix, wherein, after the walk defined by the unbalanced double random walk formula is finished, each element of the drug sensitivity prediction matrix is the predicted sensitivity value of the corresponding drug for the corresponding cell line. The method fully considers the characteristics of the drug similarity network and the cell line similarity network, and thereby improves the reliability of the drug sensitivity prediction result.

Description

Drug sensitivity prediction method based on cell line and drug similarity network

Technical Field

The invention belongs to the technical field of biomedicine, and particularly relates to a drug sensitivity prediction method based on a cell line and a drug similarity network.

Background

Over the past two decades, with substantial improvements in high-throughput analysis techniques, expectations have grown that personalized or precision medicine will become the future paradigm of medical science. Patients with the same cancer may respond differently to a particular drug treatment, and personalized medicine seeks to explain the cause of a particular patient's cancer at the molecular level and then tailor the treatment regimen to that patient. Compared with chemotherapy-based monotherapy approaches, personalized medicine observes tumor responses based on established molecular characteristics of cancer cells, in order to overcome some of the limitations of conventional symptom-directed diagnosis and treatment. The most important step in personalized medicine is the identification of biomarkers that can predict a patient's drug response, i.e., the patient's sensitivity to a drug. However, developing predictive biomarkers requires extensive experimentation and is expensive when carried out in humans or animal models. Many studies therefore perform large-scale drug screening on cultured human cell lines to identify predictive biomarkers. One of the earliest attempts was the NCI-60 study, which included a panel of 60 human cell lines and their responses to over 10 million compounds. The drug response results for the NCI-60 dataset show that different types of cancer have different drug response characteristics, and that different tumors of the same cancer type may have different molecular patterns.
Two organizations, the Cancer Cell Line Encyclopedia (CCLE) and the Cancer Genome Project (CGP), analyzed the pharmacological profiles of about 1,000 clinically relevant human cell lines and 149 cancer drugs, and gave examples of using an elastic net model to select the expression and mutation features that predict drug response. It has since been found that bioinformatics algorithms can be used to predict drug response.

Zhang et al. first calculated drug similarity and cell line similarity separately and then predicted drug sensitivity by constructing a two-layer network of drugs and cell lines. Their article proposed two hypotheses: 1) genetically similar cell lines may respond very similarly to a given drug; 2) structurally related drugs may have similar therapeutic effects because of their common molecular structure or targeting pattern. Both hypotheses were verified experimentally. In essence, drug sensitivity prediction based on the two-layer network mines the influence of drugs and cell lines on drug sensitivity, but the method does not consider the biological network characteristics of the drug similarity network and the cell line similarity network. Moreover, Zhang et al. fuse the networks through a heterogeneous network, which cannot fully exploit the topological structures of these networks. In addition, others have applied the MRSF algorithm to drug sensitivity prediction based on similarity networks; the MRSF algorithm likewise fails to fully exploit the biological network characteristics of drugs and cell lines and their topological structures, and its results are not ideal.

Therefore, in the prior art, the process of fusing drug similarity and cell line similarity into drug sensitivity prediction cannot fully exploit the topological structure of the networks, so the prediction performance is poor and needs to be improved.

Disclosure of Invention

The invention aims to provide a drug sensitivity prediction method based on a cell line and drug similarity network, which thoroughly mines the characteristics of, and the association between, the drug similarity network and the cell line similarity network, applies their fusion to drug sensitivity prediction, and improves the reliability of the drug sensitivity prediction result.

The invention provides a drug sensitivity prediction method based on a cell line and a drug similarity network, which comprises the following steps:

S1: constructing a drug similarity network, a cell line similarity network and a drug-cell line relation network;

wherein the drug similarity network comprises similarity values for any two drugs in the drug set, the cell line similarity network comprises similarity values for any two cell lines in the cell line set, and the drug-cell line relationship network comprises drug-cell lines of known sensitivity values and corresponding sensitivity values;

the drug-cell line of known sensitivity value indicates that the drug sensitivity data of the corresponding drug in the drug set to the corresponding cell line in the cell line set is known;

S2: obtaining a corresponding drug adjacency matrix, cell line adjacency matrix and drug-cell line relation initial matrix from the drug similarity network, the cell line similarity network and the drug-cell line relation network, respectively;

wherein the element values of the drug adjacency matrix, the cell line adjacency matrix and the drug-cell line relation initial matrix are determined, respectively, by the similarity value between the two corresponding drugs, the similarity value between the two corresponding cell lines, and whether the sensitivity value of the corresponding drug-cell line is known;

the drug adjacency matrix is an N-row, N-column matrix, and the cell line adjacency matrix is an M-row, M-column matrix; the drug-cell line relation initial matrix is an N-row, M-column matrix or an M-row, N-column matrix, where N is the number of drugs in the drug set and M is the number of cell lines in the cell line set;

S3: taking the drug-cell line relation initial matrix as the initial value of the drug sensitivity prediction matrix of the drug-cell lines, and updating that initial value by an unbalanced double random walk algorithm based on the drug adjacency matrix and the cell line adjacency matrix;

wherein, after the walk of the unbalanced double random walk algorithm is finished, each element of the updated drug sensitivity prediction matrix is the predicted sensitivity value of the corresponding drug for the corresponding cell line; the drug sensitivity prediction matrix of the drug-cell lines is an N-row, M-column matrix or an M-row, N-column matrix.

In the present invention, the association between a drug and a cell line is regarded as the drug sensitivity data of the drug for the cell line, and its value is expressed as a sensitivity value. Therefore, if the association data between a drug and a cell line is known in the data set, i.e., the drug sensitivity of the cell line to the drug is known, that drug-cell line pair is included in the drug-cell line relationship network and is regarded as a drug-cell line with known drug sensitivity.

1. On the one hand, the drug sensitivity data of drug-cell lines in existing data sets may be inconsistent or inaccurate and may need to be obtained or updated again. Because the drug similarity network and the cell line similarity network both influence the drug sensitivity of a drug-cell line, the invention applies these two networks to an unbalanced double random walk algorithm and uses the drug-cell lines with known drug sensitivity as the initial value, which is updated to obtain the drug sensitivity data, i.e., the sensitivity value, between each drug and each cell line to be predicted.

2. Alternatively, when the data of the drug similarity network and the cell line similarity network are updated or changed, the invention updates and re-predicts the sensitivity data of the existing drug-cell lines accordingly.

The drug-cell line relation initial matrix and the drug sensitivity prediction matrix of the drug-cell lines are both N-row, M-column matrices or both M-row, N-column matrices, i.e., the elements of the drug sensitivity prediction matrix correspond one to one to the elements of the drug-cell line relation initial matrix.

Further preferably, the unbalanced double random walk formula is as follows:

in the formula, $W^{t}$ and $W^{t+1}$ denote the drug sensitivity prediction matrices of the drug-cell lines after the t-th and (t+1)-th walk steps, respectively, with $W^{1} = R$; the left matrix and the right matrix are those corresponding to the (t+1)-th walk step; D denotes the drug adjacency matrix, C denotes the cell line adjacency matrix, and R denotes the drug-cell line relation initial matrix; $\lambda_{left}$ and $\lambda_{right}$ control the weights of the random walker in the drug similarity network and the cell line similarity network, respectively, and are both positive numbers; $\alpha$ is the restart parameter of the random walk, with value range [0, 1]; $l_{l}$ and $l_{r}$ denote the preset numbers of walk steps for the drug adjacency matrix D and the cell line adjacency matrix C, respectively, and are both positive integers.

According to the invention, once the preset walk steps of the unbalanced double random walk are completed, the output matrix is the drug sensitivity prediction matrix of the drug-cell lines to be obtained. Note that $l_{l}$ and $l_{r}$ are set independently and may therefore differ. If they differ, the left or right matrix whose walk steps are completed first takes the value 0 in subsequent iterations, and the walk continues until both step counts are completed; the $W^{t+1}$ obtained after all steps have been walked is the updated drug sensitivity prediction matrix of the drug-cell lines. Here, $W^{1} = R$ means that the initial value of each element of the drug sensitivity prediction matrix equals the value of the corresponding element of the drug-cell line relation initial matrix.
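The walk described above can be sketched in Python as follows. Because the formula itself is not reproduced in this text, the update rules below are an assumption modeled on the standard bi-random walk: a left walk `alpha * D @ W + (1 - alpha) * R` on the drug network, a right walk `alpha * W @ C + (1 - alpha) * R` on the cell line network, and a weighted average over whichever walks are still active. The function name `unbalanced_birw`, the default parameter values, and the normalization by the active weights are illustrative choices, not the definitive formula of the invention.

```python
import numpy as np

def unbalanced_birw(D, C, R, alpha=0.5, l_left=2, l_right=2,
                    lam_left=1.0, lam_right=1.0):
    """Hypothetical sketch of an unbalanced bi-random walk.

    D: (N, N) drug adjacency matrix
    C: (M, M) cell line adjacency matrix
    R: (N, M) drug-cell line relation initial matrix
    """
    R = np.asarray(R, dtype=float)
    W = R.copy()                                   # W^1 = R
    for t in range(1, max(l_left, l_right) + 1):
        numerator = np.zeros_like(W)
        denominator = 0.0
        if t <= l_left:                            # left walk still active
            left = alpha * (D @ W) + (1 - alpha) * R
            numerator += lam_left * left
            denominator += lam_left
        if t <= l_right:                           # right walk still active
            right = alpha * (W @ C) + (1 - alpha) * R
            numerator += lam_right * right
            denominator += lam_right
        # ASSUMPTION: a finished walk contributes 0 and its weight is dropped.
        W = numerator / denominator
    return W
```

With different step counts, for example `l_left=3` and `l_right=1`, only the left walk is applied in the second and third iterations, which is one way to realize the "unbalanced" behavior described in the text.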

Further preferably, the calculation formula of the elements of the drug adjacency matrix is as follows:

wherein d(i, j) is the value of the element in row i and column j of the drug adjacency matrix, and Pccd(i, j) is the similarity value in the drug similarity network of the two drugs corresponding to element d(i, j);

the element calculation formula of the cell line adjacency matrix is as follows:

wherein c(i, j) is the value of the element in row i and column j of the cell line adjacency matrix, and Pccc(i, j) is the similarity value in the cell line similarity network of the two cell lines corresponding to element c(i, j);

the element calculation formula of the drug-cell line relation initial matrix is as follows:

wherein r(i, j) is the value of the element in row i and column j of the drug-cell line relation initial matrix, and Pcce(i, j) is the sensitivity value in the drug-cell line relationship network of the drug and cell line corresponding to element r(i, j).

When calculating the values of the elements of the drug-cell line relation initial matrix, if the drug-cell line pair corresponding to an element is already in the drug-cell line relationship network, the sensitivity data of that drug-cell line is known, i.e., the sensitivity value is known.
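As an illustration of how the relation initial matrix might be filled, the following minimal sketch places each known sensitivity value at its (drug, cell line) position. The helper name `build_relation_matrix`, the dictionary input format, and the choice of 0 for pairs whose sensitivity is unknown are assumptions made for this example; the text only states that the element value depends on whether the sensitivity value is known.

```python
import numpy as np

def build_relation_matrix(known, n_drugs, n_cell_lines):
    """Hypothetical helper: build the N x M initial matrix R.

    known maps (drug_index, cell_line_index) -> known sensitivity value.
    ASSUMPTION: pairs absent from `known` are set to 0.
    """
    R = np.zeros((n_drugs, n_cell_lines))
    for (i, j), value in known.items():
        R[i, j] = value
    return R

# Example: 3 drugs, 2 cell lines, two known drug-cell line pairs.
R_example = build_relation_matrix({(0, 1): 0.7, (2, 0): 1.3}, 3, 2)
```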

Further preferably, the process of constructing the drug similarity network in S1 is as follows:

firstly, obtaining the quantitative values of the descriptors of the 1D and 2D structure of each drug in the drug set;

then, calculating a Pearson correlation coefficient between every two drugs in the drug set based on a Pearson correlation coefficient formula and descriptor quantitative values of the drugs to obtain a drug similarity network;

wherein a pearson correlation coefficient between the two drugs is equal to a similarity value between the corresponding two drugs;

the calculation formula of the pearson correlation coefficient between the two drugs is as follows:

wherein $r_{a,b}$ denotes the Pearson correlation coefficient between two drugs a and b; $X_{i}(a)$ denotes the quantitative value of the i-th descriptor of drug a, and the formula also uses the mean of the quantitative values of the descriptors of drug a and $SX_{a}$, the variance of those values; $Y_{i}(b)$ denotes the quantitative value of the i-th descriptor of drug b, and the formula likewise uses the mean of the quantitative values of the descriptors of drug b and $SY_{b}$, the variance of those values; N is the number of descriptors.
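A pairwise Pearson similarity of this kind can be computed with NumPy as sketched below. The function name `pearson_similarity` and the toy descriptor values are illustrative assumptions; `np.corrcoef` treats each row as one variable, so a matrix with one row per drug and one column per descriptor yields a drug-by-drug similarity matrix. A drug whose descriptors are all identical would produce an undefined (NaN) coefficient, which this sketch does not handle.

```python
import numpy as np

def pearson_similarity(features):
    """Pairwise Pearson correlation between the rows of `features`.

    features: (n_items, n_features) array, e.g. one row of descriptor
    values per drug. Returns an (n_items, n_items) similarity matrix.
    """
    return np.corrcoef(np.asarray(features, dtype=float))

# Toy descriptor table: 3 hypothetical drugs, 4 descriptors each.
descriptors = [[1, 2, 3, 4],
               [2, 4, 6, 8],
               [4, 3, 2, 1]]
S = pearson_similarity(descriptors)
```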

Further preferably, before the drug adjacency matrix is obtained according to the drug similarity network in S2, the method further includes correcting the similarity value between each two drugs in the drug similarity network by using logistic regression;

wherein the formula of the logistic regression is as follows:

$$L(x) = \frac{1}{1 + e^{c_{1}x + d_{1}}}$$

wherein L(x) is the corrected similarity value between two drugs, x is the similarity value between the two drugs before correction, e is the base of the natural logarithm, and $c_{1}$ and $d_{1}$ are adjustment parameters.

Let L(0) = 0.0001, so that $d_{1}$ = log(9999); the value of $c_{1}$ is tuned by cross-validation to obtain an optimal or target value. The logistic regression shifts smaller similarity values between drugs closer to 0 and amplifies larger similarity values.
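The correction can be illustrated with a short sketch, assuming the logistic form L(x) = 1 / (1 + e^(c·x + d)), which is consistent with L(0) = 0.0001 giving d = log(9999). The function name `logistic_correct` and the value c = -15 used in the example are hypothetical; the text states only that c is chosen by cross-validation. Under this form, c must be negative for larger similarities to map to larger corrected values.

```python
import math

def logistic_correct(x, c, d=math.log(9999)):
    """Correct a similarity value x with a logistic function.

    ASSUMPTION: L(x) = 1 / (1 + exp(c * x + d)); with d = log(9999),
    L(0) = 1 / (1 + 9999) = 0.0001.
    """
    return 1.0 / (1.0 + math.exp(c * x + d))

# Illustrative only: c = -15 is a hypothetical choice, not a value from the text.
low = logistic_correct(0.1, c=-15)   # small similarity is pushed toward 0
high = logistic_correct(0.9, c=-15)  # large similarity is pushed toward 1
```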

Further preferably, the construction process of the cell line similarity network in S1 is as follows:

firstly, acquiring the cell line gene profile data obtained by experimentally testing each cell line with gene probes;

wherein the cell line gene profile data comprise the expression value obtained by testing each cell line with each gene probe, one gene probe corresponding to one expression value per cell line;

then, calculating the variance of each gene probe from its expression values across all cell lines, and selecting, for each cell line, the expression values of the n gene probes with the largest variance;

finally, calculating the Pearson correlation coefficient of the gene profiles of every two cell lines in the cell line set, based on the Pearson correlation coefficient formula and the n expression values selected for each cell line;

wherein the Pearson's correlation coefficient for the gene profile between the two cell lines is equal to the similarity value between the corresponding two cell lines;

the Pearson correlation coefficient of the gene profiles of two cell lines is calculated as follows:

wherein $r_{c,d}$ denotes the Pearson correlation coefficient of the gene profiles of two cell lines c and d; $X_{i}(c)$ denotes the i-th expression value of cell line c, and the formula also uses the mean of the expression values of cell line c and $SX_{c}$, the variance of those values; $Y_{i}(d)$ denotes the i-th expression value of cell line d, and the formula likewise uses the mean of the expression values of cell line d and $SY_{d}$, the variance of those values.

The value of n ranges from 5% to 25% of the total number of gene probes. The cell line gene profile data of the invention are the values measured by gene probes through experimental reaction in a cell line, i.e., the expression values referred to in the invention.
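The probe selection and similarity computation for cell lines can be sketched as follows. The function name `cell_line_similarity`, the (probes × cell lines) layout of the input, and the toy values are assumptions for illustration only.

```python
import numpy as np

def cell_line_similarity(expr, n):
    """Similarity between cell lines from the n most variable probes.

    expr: (n_probes, n_cell_lines) array of expression values
          (ASSUMED layout: one row per gene probe).
    Returns an (n_cell_lines, n_cell_lines) Pearson correlation matrix.
    """
    expr = np.asarray(expr, dtype=float)
    probe_var = expr.var(axis=1)           # variance of each probe across cell lines
    top = np.argsort(probe_var)[-n:]       # indices of the n largest variances
    return np.corrcoef(expr[top].T)        # rows of expr[top].T are cell lines

# Toy data: 4 probes measured on 3 cell lines; keep the 3 most variable probes.
expr_example = [[1.0, 1.0, 1.0],    # zero variance, dropped when n = 3
                [0.0, 5.0, 10.0],
                [2.0, 2.0, 3.0],
                [10.0, 0.0, 5.0]]
S_cells = cell_line_similarity(expr_example, n=3)
```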

Further preferably, before the obtaining of the cell line adjacency matrix according to the cell line similarity network in S2, the method further includes correcting the similarity value between every two cell lines in the cell line similarity network by using logistic regression;

wherein the formula of the logistic regression is as follows:

$$L(y) = \frac{1}{1 + e^{c_{2}y + d_{2}}}$$

wherein L(y) is the corrected similarity value between two cell lines, y is the similarity value between the two cell lines before correction, e is the base of the natural logarithm, and $c_{2}$ and $d_{2}$ are adjustment parameters.

Let L(0) = 0.0001, so that $d_{2}$ = log(9999); the value of $c_{2}$ is tuned by cross-validation to obtain an optimal or target value. The logistic regression shifts smaller similarity values between cell lines closer to 0 and amplifies larger similarity values.

Advantageous effects

1. The drug sensitivity prediction method based on a cell line and drug similarity network applies the drug similarity network and the cell line similarity network directly to drug sensitivity prediction and calculates the drug sensitivity prediction matrix of the drug-cell lines with an unbalanced double random walk algorithm. This is the first application of the unbalanced random walk algorithm to drug sensitivity prediction. Because the algorithm diffuses rapidly through a network and can be applied to networks with different topological structures, it is a highly advantageous tool for handling biological networks with specific structures and network-based biological computation problems, and it improves the reliability of drug sensitivity prediction results. Experiments further verify that, compared with the integrated heterogeneous network drug sensitivity prediction method of Zhang et al. and the MRSF algorithm, the unbalanced random walk algorithm makes fuller use of the biological network information of drugs and cell lines and gives more accurate predictions. While remaining simple and practical, the method improves the accuracy of drug sensitivity prediction and provides valuable reference for researchers carrying out experimental analysis and deeper study of drug sensitivity.

2. The method uses variance calculation to select the genes in the cell line gene profile that have the greater influence on the cell lines, so as to construct a more accurate cell line similarity network; this reduces the noise in the cell line similarity network and further improves the reliability of the prediction result. In addition, a logistic regression algorithm is used to reduce the noise in the cell line similarity network and the drug similarity network respectively, so that the role of both networks as biological network features in drug sensitivity research is fully considered and drug sensitivity can be predicted more accurately.

Drawings

FIG. 1 is a flow chart of a method for predicting drug sensitivity based on cell lines and drug similarity networks provided by the present invention;

FIG. 2 shows the RMSE values for each drug obtained by the three methods on the CCLE dataset, using (a) leave-one-out cross-validation and (b) ten-fold cross-validation.

Detailed Description

The present invention will be further described with reference to the following examples.

The biological data sets used in this example are the Cancer Cell Line Encyclopedia (CCLE) and the Genomics of Drug Sensitivity in Cancer (GDSC); the specific data of the two data sets are detailed in Table 1 below.

The CCLE dataset consists of large-scale genomic data, including gene expression profiles, mutation status and copy number variation for 1,036 human cancer cell lines, together with eight-point dose-response curves for 24 chemical compounds across 504 cell lines. Gene expression profiles and drug sensitivity data (measured by the area under the dose-response curve) can be downloaded from the CCLE website (http://www.broadinstitute.org/CCLE). Of the 504 cell lines, 491 cancer cell lines have both drug sensitivity measurements and gene expression profile data. Of the 24 drugs in the CCLE dataset, 23 have corresponding SDF files in the PubChem database; one compound, LBW242, has no SDF file, so in the drug similarity network its similarity to every other compound is set to 0.

GDSC data can improve cancer treatment by identifying therapeutic biomarkers that indicate which patients are most likely to respond to anticancer drugs. The raw GDSC dataset used here contains data for 140 drugs, but only 139 of the compounds are found in the PubChem database, so the number of drugs in this experiment is 139. By cross-comparing the cell lines associated with these 139 compounds, 789 cell lines usable in this experiment were found. The number of drug-cell line pairs finally extracted from the original data set is 64,814; in the invention, the drug-cell line relationship data represent drug sensitivity data. Note that drug sensitivity can be measured in different ways (e.g., IC50 value, activity area value and AUC value) and that different measures are used in different method comparisons, so the resulting RMSE values cannot be compared against a single uniform standard.

TABLE 1 Experimental data in CCLE and GDSC data sets

The invention predicts drug sensitivity based on an unbalanced random walk technique. In this embodiment, two similarity networks, the drug similarity network (DSN) and the cell line similarity network (CSN), are first constructed using gene screening and logistic regression, and the drug-cell line relation data in the data set are extracted as drug sensitivity data to construct the drug-cell line relationship network. After the three networks (DSN, CSN and drug-cell line relationship network) are constructed, they are fed into the unbalanced double random walk algorithm proposed herein (birdsp algorithm), which computes the predicted drug sensitivity data. The drug sensitivity prediction method based on a cell line and drug similarity network provided in this embodiment comprises the following specific steps:

Step 1: constructing a drug similarity network DSN, a cell line similarity network CSN and a drug-cell line relation network;

1. Drug similarity network (DSN)

First, the compound name of each drug in the drug set is determined, and the corresponding SDF file describing the drug's chemical structure is retrieved from the PubChem database (https://pubchem.ncbi.nlm.nih.gov/) by compound name. The 1D and 2D structures in each drug's SDF file are then extracted with the PaDEL software (http://www.yapcwsoft.com/dd/padeldescriptor/), which is also used to compute the quantitative values of each drug's descriptors; the descriptors quantitatively describe the 1D and 2D structure of the compound. Finally, the Pearson correlation coefficients between the descriptors of the compounds are calculated, yielding the drug similarity network formed by the pairwise similarity relations between drugs.

The pearson correlation coefficient between each two drugs is calculated as follows:

From the formula, the coefficient $r_{a,b}$ takes values in [-1, 1]. When $r_{a,b}$ is close to 0, the two variables are uncorrelated; when $r_{a,b}$ is close to 1 or -1, they are strongly correlated.

2. Cell line similarity network (CSN)

Firstly, the cell line gene profile data obtained by experimentally testing each cell line with the gene probes in the data set are acquired. For example, in the data sets of the two organizations CCLE and GDSC, the cell line gene profile data were obtained by testing each cell line with 18,988 and 22,277 gene probes, respectively. The gene profile data are the values measured by the gene probes through experimental reaction in the cell lines: when a cell line is tested with a gene probe, one experimental expression value is obtained for that probe and cell line, and the invention refers to this as an expression value in the gene profile. The cell line gene profile data therefore comprise the expression value obtained by testing each cell line with each gene probe, with one gene probe corresponding to one expression value per cell line. For example, when the 491 cell lines in the CCLE data set are tested with 18,988 gene probes, each gene probe yields one expression value per cell line and thus corresponds to 491 expression values. The gene profile describes the type and abundance of the genes expressed by a specific cell or tissue in a specific state.

Then, the variance of each gene probe is calculated from its expression values across all cell lines, and the expression values of the n gene probes with the largest variance are selected for each cell line. For example, in the CCLE data set each gene probe has 491 expression values, from which the variance of that probe is calculated. In this example, the 1000 gene probes with the largest variance are selected, i.e., n = 1000, so that each cell line is represented by 1000 expression values.

Finally, calculating the Pearson correlation coefficient of the gene spectrum between every two cell lines in the cell line set based on a Pearson correlation coefficient formula and n expression values obtained by testing each cell line and the n gene probes correspondingly; the pearson correlation coefficient of the gene profile between the two cell lines is equal to the similarity value between the corresponding two cell lines.

The Pearson correlation coefficient of the gene profiles of every two cell lines is calculated as follows:

From the formula, the coefficient $r_{c,d}$ takes values in [-1, 1]. When $r_{c,d}$ is close to 0, the two variables are uncorrelated; when $r_{c,d}$ is close to 1 or -1, they are strongly correlated.

3. Drug-cell line relationship network

The drug-cell line relationship network comprises drug-cell lines of known sensitivity values and corresponding sensitivity values; by a drug-cell line of known sensitivity value is meant that the drug sensitivity data of the corresponding drug in the drug set to the corresponding cell line in the cell line set is known. Wherein, whether the drug sensitive data is known or not is determined according to whether the data set contains the relevant data of the corresponding drug-cell line or not. Drug sensitivity data, such as AUC or IC50 values, are obtained from direct experiments in CCLE and GDSC and other related databases, which are already present in the data set.

Step 2: denoising the drug similarity network DSN and the cell line similarity network CSN by logistic regression.

The similarity values of the drug similarity network DSN and the cell line similarity network CSN are both calculated with the Pearson correlation coefficient formula. From that formula, the Pearson correlation coefficient is the cosine of the angle between the two vectors obtained by centering the values of the two variables on their means. In other words, it measures the similarity of two drugs from a purely mathematical point of view. Drugs, however, carry biological meaning, and the Pearson correlation coefficient calculated in this way, i.e., the drug similarity coefficient, cannot fully reflect whether two drugs are actually similar and may even differ greatly from it. It is therefore necessary to correct the similarity values, which ultimately improves the reliability of the prediction result.
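The statement that the Pearson correlation coefficient equals the cosine of the angle between mean-centered vectors can be checked numerically. The sketch below uses a hypothetical helper `centered_cosine` and made-up vectors, and compares the result with `np.corrcoef`.

```python
import numpy as np

def centered_cosine(x, y):
    """Cosine of the angle between x and y after subtracting each mean."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Made-up values for two variables.
x_demo = [1.0, 2.0, 4.0, 7.0]
y_demo = [2.0, 1.0, 5.0, 6.0]
cos_value = centered_cosine(x_demo, y_demo)
pearson_value = float(np.corrcoef(x_demo, y_demo)[0, 1])
```

Both values agree up to floating-point error, which is why the coefficient captures only the geometric relation between the two value vectors and nothing about their biological meaning.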

1. Correction of similarity values between every two drugs in a drug similarity network

The similarity value between every two drugs in the drug similarity network is corrected using logistic regression. The formula of the logistic regression is as follows:

$$L(x) = \frac{1}{1 + e^{c_{1}x + d_{1}}}$$

wherein L(x) is the corrected similarity value between two drugs, x is the similarity value between the two drugs before correction, e is the base of the natural logarithm, and $c_{1}$ and $d_{1}$ are adjustment parameters. The magnitude of the drug similarity value can be controlled by adjusting the parameters $c_{1}$ and $d_{1}$. In this embodiment, L(0) is set to 0.0001, so that $d_{1}$ = log(9999), and the value of $c_{1}$ is tuned by cross-validation to obtain an optimal or target value. The logistic regression shifts smaller similarity values between drugs closer to 0 and amplifies larger similarity values. Through this procedure, the drug similarity value x is converted into a new similarity value L(x).

2. Correction of similarity values between every two cell lines in a cell line similarity network

And correcting the similarity value between every two cell lines in the cell line similarity network by using logistic regression. The formula of the logistic regression at this time is as follows:

wherein L (y) is the similarity value between the two cell lines after correction, y is the similarity value between the two cell lines before correction, e is a natural base number, c₂And d₂Are all adjustment parameters. The same principle can be realized by adjusting the parameter c₂And d₂To control the magnitude of the cell line similarity value. In this embodiment, L (0) is set to 0.0001, and thus d₂Has a value of log (9999), and c₂The value of (c) is adjusted to obtain an optimal or target value by cross-validation. Logistic regression was used to shift the smaller similarity values between cell lines closer to 0 and the larger similarity values were amplified. By using the above procedure, the cell line similarity value y is converted into a new similarity value l (y).

Step 3: Obtain the corresponding drug adjacency matrix D (N × N), cell line adjacency matrix C (M × M), and drug-cell line relation initial matrix R (N × M).

N is the number of drugs in the drug pool, and M is the number of cell lines in the cell line pool. The drug adjacency matrix D represents an adjacency matrix of a drug similarity network DSN, the cell line adjacency matrix C represents an adjacency matrix of a cell line similarity network CSN, and the drug-cell line relation initial matrix R represents a drug-cell line known association relation matrix constructed based on the drug-cell line relation network.

The values of the elements of the drug adjacency matrix D, the cell line adjacency matrix C, and the drug-cell line relation initial matrix R are calculated as follows.

The element calculation formula of the drug adjacency matrix is as follows:

where d(i, j) is the value of the element in row i and column j of the drug adjacency matrix, and Pccd(i, j) is the similarity value, in the drug similarity network, of the two drugs corresponding to element d(i, j). In this embodiment, Pccd(i, j) is the similarity value after denoising, that is, the value after logistic regression. In other feasible embodiments, if denoising by logistic regression is not performed, the similarity value is the one calculated directly with the Pearson correlation coefficient formula.

The formula for calculating the elements of the cell line adjacency matrix is as follows:

where c(i, j) is the value of the element in row i and column j of the cell line adjacency matrix, and Pccc(i, j) is the similarity value, in the cell line similarity network, of the two cell lines corresponding to element c(i, j). Similarly, in this embodiment, Pccc(i, j) is the similarity value after denoising, that is, the value after logistic regression. In other feasible embodiments, if denoising by logistic regression is not performed, the similarity value is the one calculated directly with the Pearson correlation coefficient formula.

The element calculation formula of the drug-cell line relation initial matrix is as follows:

where r(i, j) is the value of the element in row i and column j of the drug-cell line relation initial matrix, and Pcce(i, j) is the sensitivity value, in the drug-cell line relation network, of the drug and cell line corresponding to element r(i, j). That is, if sensitivity data exist for a given drug on a given cell line, the element r(i, j) is the corresponding sensitivity value; otherwise r(i, j) is 0.
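The rule for R stated above can be sketched as follows. This is only an illustration: the function name `build_initial_matrix`, the dictionary layout of the known data, and the example values are assumptions, not part of the described method.

```python
import numpy as np

def build_initial_matrix(n_drugs, n_cell_lines, known):
    """Build R (N x M): r(i, j) is the known sensitivity value, else 0.

    `known` maps (drug index, cell line index) -> sensitivity value.
    """
    R = np.zeros((n_drugs, n_cell_lines))
    for (i, j), value in known.items():
        R[i, j] = value
    return R

# Hypothetical example: 3 drugs, 2 cell lines, two known measurements.
known = {(0, 1): 0.8, (2, 0): 0.35}
R = build_initial_matrix(3, 2, known)
print(R)
```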

Step 4: Obtain the drug sensitivity prediction matrix W (N × M) of the drug-cell line by applying the unbalanced double random walk algorithm to the drug adjacency matrix D (N × N), the cell line adjacency matrix C (M × M), and the drug-cell line relation initial matrix R (N × M).

Each element of the drug sensitivity prediction matrix obtained when the walk with the unbalanced double random walk formula is finished is the predicted sensitivity value of the corresponding drug for the corresponding cell line. In this example, the drug sensitivity prediction matrix of the drug-cell line is a matrix with N rows and M columns. The value of its element w(i, j) represents the predicted drug sensitivity of drug i on cell line j. The initial value of W is the matrix R.

The unbalanced double random walk formula is as follows:

In the formula, the two resulting matrices are the left matrix and the right matrix corresponding to the (t+1)-th walk step, D is the drug adjacency matrix, C is the cell line adjacency matrix, and R is the drug-cell line relation initial matrix. λ_left and λ_right control the weight of the random walker in the drug similarity network and the cell line similarity network, respectively, and both are positive numbers. α is the restart parameter of the random walk, with a value range of [0, 1]. The R matrix participates in the walk, and the whole process can be regulated by changing the value of α. l_l and l_r are the preset numbers of walk steps for the drug adjacency matrix D and the cell line adjacency matrix C, respectively, and both are positive integers.
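Because the formula image is not reproduced in this text, the exact update rule cannot be recovered from it. The sketch below is therefore only one possible reading, modelled on a generic bi-random walk with restart. The left update α·D·W + (1 − α)·R, the right update α·W·C + (1 − α)·R, the λ-weighted average of the two, and the row normalisation of D and C are all assumptions. The names `bi_random_walk` and `row_normalize` and all numeric values are hypothetical.

```python
import numpy as np

def row_normalize(A):
    """Scale each row to sum to 1 (assumed normalisation; all-zero rows are kept)."""
    sums = A.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0
    return A / sums

def bi_random_walk(D, C, R, alpha=0.5, lam_left=1.0, lam_right=1.0, l_l=2, l_r=2):
    """Assumed unbalanced double random walk; returns W (N x M).

    The walk on the drug network runs for l_l steps and the walk on the
    cell line network for l_r steps, so the two sides may stop at
    different times ("unbalanced").
    """
    D_n = row_normalize(np.asarray(D, dtype=float))
    C_n = row_normalize(np.asarray(C, dtype=float))
    R = np.asarray(R, dtype=float)
    W = R.copy()  # the initial value of W is R
    for t in range(1, max(l_l, l_r) + 1):
        parts, weights = [], []
        if t <= l_l:  # left walk on the drug similarity network
            parts.append(alpha * (D_n @ W) + (1 - alpha) * R)
            weights.append(lam_left)
        if t <= l_r:  # right walk on the cell line similarity network
            parts.append(alpha * (W @ C_n) + (1 - alpha) * R)
            weights.append(lam_right)
        # Weighted combination of whichever walks are still active.
        W = sum(w * p for w, p in zip(weights, parts)) / sum(weights)
    return W

# Hypothetical toy data: 2 drugs, 3 cell lines.
D = np.array([[0.0, 0.9], [0.9, 0.0]])
C = np.array([[0.0, 0.5, 0.2], [0.5, 0.0, 0.7], [0.2, 0.7, 0.0]])
R = np.array([[0.8, 0.0, 0.0], [0.0, 0.0, 0.4]])
W = bi_random_walk(D, C, R, alpha=0.5, lam_left=1.0, lam_right=2.0, l_l=2, l_r=3)
print(W)
```

In this reading, setting α to 0 returns R unchanged, and larger values of α let the similarity networks contribute more to the prediction.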

Verification and simulation:

To evaluate the effectiveness of the method provided by the invention, it is compared with two other methods, Zhang's and MRSF, on two datasets, CCLE and GDSC. Leave-one-out validation and ten-fold cross-validation are used, and the Pearson correlation coefficient and the root mean square error (RMSE) between the predicted values and the true values are compared for the three methods, so that the effectiveness of the random-walk-based drug sensitivity prediction method provided by the invention can be observed.

a. Verification of algorithm performance based on Pearson correlation coefficient

The performance of the algorithm is evaluated by calculating the Pearson correlation coefficient between the predicted values and the true values of the drug sensitivity data; the larger the Pearson correlation coefficient, the better the performance of the algorithm. Notably, an algorithm is considered meaningless when its Pearson correlation coefficient is less than 0.6. Three algorithms, BiRWSSP (representing the method of the invention), Zhang's, and MRSF, were applied to the CCLE and GDSC datasets and tested by leave-one-out cross-validation and ten-fold cross-validation.
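As an illustration of this evaluation criterion, the Pearson correlation coefficient between predicted and true values can be computed as in the sketch below. The function name `pearson` and the sample values are hypothetical.

```python
import numpy as np

def pearson(pred, true):
    """Pearson correlation coefficient between predicted and true values."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    pc = pred - pred.mean()  # centred predictions
    tc = true - true.mean()  # centred true values
    return float((pc * tc).sum() / np.sqrt((pc ** 2).sum() * (tc ** 2).sum()))

# Hypothetical sensitivity values for one drug across four cell lines.
predicted = [0.2, 0.5, 0.7, 0.9]
observed = [0.1, 0.6, 0.65, 1.0]
print(pearson(predicted, observed))
```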

The second to fourth rows of Table 2 give the Pearson correlation values of the birdsp, Zhang's, and MRSF algorithms on the CCLE dataset, and the last three rows give the values of the same algorithms on the GDSC dataset. For each of the three algorithms, the average, minimum, and maximum of the Pearson correlation values are compared. The average Pearson value of the birdsp algorithm on the CCLE dataset is 0.9082, which is 12.6% and 1.6% higher than those of Zhang's and MRSF, respectively, indicating that the birdsp algorithm performs best. Of the compared algorithms, Zhang's performs worst, which shows that a great deal of information useful for drug sensitivity prediction remains to be mined from the two-layer drug-cell line network. On the GDSC dataset, the resulting values of the three algorithms are substantially the same, because the dataset is larger and less information is known than for the CCLE dataset.

Table 2. Pearson correlation coefficients between true and predicted values for the three algorithms under leave-one-out validation

The second to fourth rows of Table 3 give the Pearson correlation values of the birdsp, Zhang's, and MRSF algorithms on the CCLE dataset under ten-fold cross-validation, and the last three rows give the values of the same algorithms on the GDSC dataset. As in Table 2, the average, minimum, and maximum of the Pearson correlation values of the three algorithms are compared. The average Pearson value of the birdsp algorithm on the CCLE dataset is 0.9082, which is 12.6% and 1.6% higher than those of Zhang's and MRSF, respectively. On the GDSC dataset, the average Pearson value of the birdsp algorithm is 0.8723, which is 18.92% and 1.83% higher than those of Zhang's and MRSF, respectively. On both datasets the birdsp algorithm therefore performs best. Of the compared algorithms, Zhang's performs worst, which shows that a great deal of information useful for drug sensitivity prediction remains to be mined from the two-layer drug-cell line network.

Table 3. Pearson correlation coefficients between true and predicted values for the three algorithms under ten-fold cross-validation

b. Verification of algorithm performance based on RMSE

As can be seen from the definition and equation of RMSE, a smaller RMSE value means a smaller difference between the predicted values and the true values, that is, a better prediction.
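For reference, the standard RMSE definition can be sketched as follows. The function name `rmse` and the sample values are hypothetical.

```python
import numpy as np

def rmse(pred, true):
    """Root mean square error between predicted and true values."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean((pred - true) ** 2)))

# Hypothetical sensitivity values for one drug across four cell lines.
predicted = [0.2, 0.5, 0.7, 0.9]
observed = [0.1, 0.6, 0.65, 1.0]
print(rmse(predicted, observed))
```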

The algorithm is also evaluated from the perspective of RMSE. Zhang's method uses leave-one-out validation on both datasets to evaluate the quality of its drug sensitivity predictions, and the MRSF algorithm uses ten-fold cross-validation on both datasets. Drug sensitivity was therefore predicted on both the CCLE and GDSC datasets using both validation methods. Because gene screening (1000 drug-related genes) is performed before the similarity network is constructed, and logistic regression is applied in the drug similarity network and the cell line similarity network to correct the similarities, leave-one-out analysis was performed on both the processed data and the unprocessed data to keep the comparison as fair as possible. The results are shown in Table 4.

Table 4. Leave-one-out RMSE comparison of the BiRWSP algorithm on the two datasets

In the table, ul_us denotes data subjected to neither gene screening nor logistic regression, ul_s denotes data not subjected to gene screening but subjected to logistic regression, l_us denotes data subjected to gene screening but not to logistic regression, and l_s denotes data subjected to both gene screening and logistic regression. Gene screening means calculating the variance of each gene probe and selecting the n gene probes with the largest variance to calculate the cell line similarity values. Table 4 shows that the birdsp algorithm gives the smallest RMSE on the data subjected to both gene screening and logistic regression, so that data is used in the subsequent experiments.
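The gene screening step described above, followed by the Pearson-based cell line similarity, can be sketched as follows. This is a minimal illustration under assumptions: the layout of the expression matrix (cell lines in rows, gene probes in columns), the function name `cell_line_similarity`, and the toy values are hypothetical.

```python
import numpy as np

def cell_line_similarity(expr, n):
    """Keep the n highest-variance probes, then correlate cell lines.

    expr: array of shape (cell lines, gene probes).
    Returns an (M x M) matrix of Pearson correlation coefficients.
    """
    expr = np.asarray(expr, dtype=float)
    variances = expr.var(axis=0)             # variance of each probe
    top = np.argsort(variances)[::-1][:n]    # indices of the n largest
    return np.corrcoef(expr[:, top])         # rows are treated as variables

# Hypothetical toy data: 3 cell lines, 4 probes (the third probe is constant).
expr = np.array([[1.0, 5.0, 2.0, 9.0],
                 [2.0, 4.0, 2.0, 8.0],
                 [9.0, 1.0, 2.0, 0.0]])
S = cell_line_similarity(expr, n=3)
print(S)
```

Selecting probes by variance removes probes that carry little information for distinguishing cell lines, such as the constant third probe in the toy data.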

Table 5. Leave-one-out RMSE comparison of the three algorithms on the two datasets

The second to fourth rows of Table 5 give the RMSE values between the predicted and observed values of the birdsp, Zhang's, and MRSF algorithms on the CCLE dataset, and the last three rows give the RMSE values of the same algorithms on the GDSC dataset. The table shows that, on the CCLE dataset under leave-one-out validation, the birdsp algorithm performs best in both the average and the maximum RMSE; its average RMSE is 0.7206 and 0.0090 lower than those of the Zhang's and MRSF algorithms, respectively. On the GDSC dataset the birdsp algorithm also has the advantage. Zhang's algorithm has the largest RMSE, which again shows that many potential relationships remain to be inferred when similarity is used to predict drug sensitivity.

The second to fourth rows of Table 6 give the RMSE values between the predicted and observed values of the birdsp, Zhang's, and MRSF algorithms on the CCLE dataset, and the last three rows give the RMSE values of the same algorithms on the GDSC dataset. The table shows that, under ten-fold cross-validation, the birdsp algorithm performs best in both the average and the maximum RMSE: its average RMSE is 0.4652 and 0.0537 lower than those of the Zhang's and MRSF algorithms, respectively, and its maximum RMSE is 0.2197 and 0.0244 lower, respectively. The BiRWSP algorithm therefore also has the advantage under ten-fold cross-validation. Zhang's algorithm has the largest RMSE, which again shows that many potential relationships remain to be inferred when similarity is used to predict drug sensitivity.

Table 6. Ten-fold cross-validation RMSE comparison between predicted and true values for the three algorithms

Figure 2 shows the RMSE values calculated for each drug on the CCLE dataset by the three methods, using leave-one-out validation (top) and ten-fold cross-validation (bottom). Under leave-one-out validation, the birdsp algorithm has a smaller RMSE than the other two methods for 18 of the 24 drugs, and it also has the advantage in average RMSE. Under ten-fold cross-validation, the birdsp algorithm has a lower RMSE than the other two methods for 20 of the 24 drugs, and it again has the advantage in average RMSE. Therefore, birdsp also has the advantage when predicting sensitivity for each individual drug.

It should be emphasized that the examples described herein are illustrative and not restrictive, and thus the invention is not to be limited to the examples described herein, but rather to other embodiments that may be devised by those skilled in the art based on the teachings herein, and that various modifications, alterations, and substitutions are possible without departing from the spirit and scope of the present invention.

Claims

1. A drug sensitivity prediction method based on a cell line and a drug similarity network is characterized in that: the method comprises the following steps:

2. The method of claim 1, wherein: the unbalanced double random walk formula is as follows:

3. The method of claim 1, wherein: the element calculation formula of the drug adjacency matrix is as follows:

4. The method of claim 1, wherein: the construction process of the drug similarity network in S1 is as follows:

5. The method of claim 4, wherein: before the drug adjacency matrix is obtained according to the drug similarity network in S2, the similarity value between every two drugs in the drug similarity network is corrected by logistic regression;

wherein the formula of the logistic regression is as follows:

wherein L(x) is the similarity value between the two drugs after correction, x is the similarity value between the two drugs before correction, e is the base of the natural logarithm, and c₁ and d₁ are adjustment parameters.

6. The method of claim 1, wherein: the construction process of the cell line similarity network in S1 is as follows:

finally, calculating the Pearson correlation coefficient of the gene spectrum between every two cell lines in the cell line set based on a Pearson correlation coefficient formula and n expression values obtained by correspondingly testing each cell line and n gene probes;

7. The method of claim 6, wherein: before the cell line adjacency matrix is obtained according to the cell line similarity network in S2, the similarity value between every two cell lines in the cell line similarity network is corrected by logistic regression;

wherein the formula of the logistic regression is as follows: