WO2022170909A1 - Drug sensitivity prediction method, electronic device and computer-readable storage medium - Google Patents

Drug sensitivity prediction method, electronic device and computer-readable storage medium

Info

Publication number
WO2022170909A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
data
drug
prediction
sample data
Prior art date
Application number
PCT/CN2022/071509
Other languages
French (fr)
Chinese (zh)
Inventor
马少华
方璐
范家旗
冯懿琳
王旭康
王子天
戴琼海
Original Assignee
清华大学深圳国际研究生院
Priority date
Filing date
Publication date
Application filed by 清华大学深圳国际研究生院
Publication of WO2022170909A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/20 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/10 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/40 ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage

Definitions

  • The present application relates to the technical field of drug detection, and in particular to a drug sensitivity prediction method, an electronic device and a computer-readable storage medium.
  • Predicting drug responsiveness in cancer patients from their clinical characteristics and genomics is essential for helping clinicians formulate effective, less toxic treatment regimens.
  • Predictive models for drug response are often trained on different datasets.
  • The supervised learning methods used include regression models and classification models.
  • The former can generate specific drug sensitivity values, such as IC50 (the half-maximal inhibitory concentration), and the latter can generate drug response levels, such as high-sensitivity and low-sensitivity drug responses.
  • The present application aims to solve at least one of the technical problems existing in the prior art. To this end, the present application proposes a drug sensitivity prediction method that can quickly and accurately predict the drug responsiveness of clinical patients, reduce prediction cost and time cost, and improve the efficiency of drug efficacy prediction.
  • The present application also proposes an electronic device that implements the above drug sensitivity prediction method.
  • The present application also provides a computer-readable storage medium for carrying out the above drug sensitivity prediction method.
  • The drug sensitivity prediction method includes: acquiring gene sequencing data and drug characteristic data of the cancer cell tissue to be trained; preprocessing the gene sequencing data according to the drug characteristic data to obtain gene sample data; performing verification processing according to the gene sample data and the drug characteristic data to obtain a prediction model and a gene prediction list; and performing drug sensitivity prediction on the cancer cell tissue to be tested by using the prediction model and the gene prediction list.
  • The drug sensitivity prediction method has at least the following beneficial effects: gene sequencing data and drug characteristic data of the cancer cell tissue to be trained are acquired, the gene sequencing data is preprocessed according to the drug characteristic data to obtain gene sample data, and verification processing is carried out according to the gene sample data and the drug characteristic data to obtain a prediction model and a gene prediction list.
  • Using the gene prediction list and the prediction model, the drug sensitivity of the cancer cell tissue to be tested can be predicted, so that the drug responsiveness of clinical patients is predicted quickly and accurately, prediction cost and time cost are reduced, and the efficiency of drug efficacy prediction is improved.
  • In some embodiments, the gene sequencing data includes first sequencing data and the drug characteristic data includes drug sensitivity data; acquiring the gene sequencing data and drug characteristic data of the cancer cell tissue to be trained includes: acquiring the first sequencing data of the cancer cell tissue to be trained and the corresponding drug sensitivity data based on a genome database.
  • Preprocessing the gene sequencing data according to the drug characteristic data to obtain the gene sample data includes: standardizing the first sequencing data to obtain first sample data; screening the first sample data according to the drug sensitivity correlation coefficient between the first sample data and the drug sensitivity data to obtain second sample data; scoring the second sample data to obtain scoring parameters of the second sample data; and screening the second sample data based on the scoring parameters to obtain the gene sample data.
  • Performing the verification processing according to the gene sample data and the drug characteristic data to obtain a prediction model and a gene prediction list includes: acquiring the drug sensitivity correlation coefficient between the gene sample data and the drug sensitivity data; obtaining the scoring parameters of the gene sample data, the gene sample data including multiple gene fragments; sorting the multiple gene fragments in descending order according to the drug sensitivity correlation coefficient and the scoring parameters; performing verification processing on the gene fragments arranged in descending order to obtain the model parameters of the prediction model and the gene list size (the number of gene fragments in the list); and generating the gene prediction list according to the gene list size and determining the prediction model according to the model parameters.
  • In some embodiments, the gene sequencing data includes second sequencing data and the drug characteristic data includes drug effect classification data; acquiring the gene sequencing data and drug characteristic data of the cancer cell tissue to be trained includes: acquiring the second sequencing data and the drug effect classification data of the cancer cell tissue to be trained based on the genome atlas database.
  • Preprocessing the gene sequencing data according to the drug characteristic data to obtain gene sample data includes: standardizing the second sequencing data to obtain third sample data; and testing the third sample data according to the drug effect classification data to obtain the gene sample data.
  • Performing verification processing according to the gene sample data and the drug characteristic data to obtain a prediction model and a gene prediction list includes: acquiring gene scores of multiple gene fragments of the gene sample data; arranging the gene fragments in descending order according to the gene scores; performing cross-validation on the gene fragments arranged in descending order to obtain the model parameters of the prediction model and the gene list size (the number of gene fragments in the list); and generating the gene prediction list according to the gene list size and the corresponding gene fragments, and determining the prediction model according to the model parameters.
  • Predicting the drug sensitivity of the cancer cell tissue to be tested by using the prediction model and the gene prediction list includes: obtaining, according to the gene prediction list, the gene expression of the corresponding gene fragments of the cancer cell tissue to be tested; and inputting the gene expression into the prediction model to obtain the drug sensitivity result of the cancer cell tissue to be tested.
  • An electronic device includes: at least one processor, and a memory communicatively connected to the at least one processor; wherein the memory stores instructions which, when executed by the at least one processor, cause the at least one processor to implement the drug sensitivity prediction method according to the first aspect.
  • The electronic device has at least the following beneficial effects: by executing the drug sensitivity prediction method of the embodiment of the first aspect, drug responsiveness of clinical patients can be predicted quickly and accurately, prediction cost and time cost are reduced, and the efficiency of drug efficacy prediction is improved.
  • A computer-readable storage medium stores computer-executable instructions for causing a computer to execute the drug sensitivity prediction method according to the first aspect.
  • The computer-readable storage medium has at least the following beneficial effects: by executing the drug sensitivity prediction method of the embodiment of the first aspect, drug responsiveness of clinical patients can be predicted quickly and accurately, prediction cost and time cost are reduced, and the efficiency of drug efficacy prediction is improved.
  • FIG. 1 is a specific flowchart of the drug sensitivity prediction method in the embodiment of the present application;
  • FIG. 2 is a specific flowchart of step S200 of the drug sensitivity prediction method in the embodiment of the present application;
  • FIG. 3 is another specific flowchart of step S200 of the drug sensitivity prediction method in the embodiment of the present application;
  • FIG. 4 is a specific flowchart of step S300 of the drug sensitivity prediction method in the embodiment of the present application;
  • FIG. 5 is another specific flowchart of step S300 of the drug sensitivity prediction method in the embodiment of the present application;
  • FIG. 6 is a specific flowchart of step S400 of the drug sensitivity prediction method in the embodiment of the present application;
  • FIG. 7 is a diagram of a specific application example of the drug sensitivity prediction method in the embodiment of the present application.
  • The embodiments of the present application provide a drug sensitivity prediction method, an electronic device and a computer-readable storage medium, which can quickly predict the efficacy of cancer drugs based on a small number of genes, removing the dependence of drug efficacy prediction on time-consuming sequencing technologies such as RNA sequencing and reducing the cost of drug efficacy prediction.
  • The embodiments of the present application provide a drug sensitivity prediction method.
  • Referring to FIG. 1, a schematic flowchart of the drug sensitivity prediction method in the embodiment of the present application is shown. It specifically includes the following steps:
  • In step S100 of the embodiment of the present application, the gene sequencing data of the cancer cell tissue to be trained and the corresponding drug characteristic data of different drugs need to be obtained, wherein the gene sequencing data refers to the RNA (ribonucleic acid) sequencing data of the cancer cell tissue to be trained.
  • The drug characteristic data of the drug to be tested refers to the sensitivity data or drug effect data of different drugs applied to the cancer cell tissue to be trained, such as the IC50 (half-maximal inhibitory concentration) data of the drugs related to the cancer cell tissue to be trained.
  • IC50 is the concentration at which 50% inhibition is reached.
  • The half-inhibition concentration is used to measure sensitivity: the lower the IC50 value, the higher the sensitivity of the cancer cells to the drug.
  • Another example is the drug effect classification data, that is, clinical effect classification data, which represents the effect of clinical drug use on the cancer cell tissue in different effect grades.
  • The cancer cell tissue to be trained in the embodiments of the present application may be any cancer cell tissue selected from the gene database, or a cancer cell tissue sample of a clinical patient obtained from the gene database; the cancer cell tissue to be trained is used to provide training data for the subsequent establishment of the prediction model.
  • The gene sequencing data of the cancer cell tissue and the drug characteristic data of the drug to be tested can be obtained based on the genome database, wherein the genome database is the Genomics of Drug Sensitivity in Cancer (GDSC) database and the Cancer Cell Line Encyclopedia (CCLE). Specifically, the required information, that is, the gene sequencing data of the cancer cell tissue and the drug characteristic data of the drug to be tested, is obtained by consulting GDSC and CCLE.
  • The Genomics of Drug Sensitivity in Cancer (GDSC) database was developed by the Sanger Institute in the United Kingdom to collect the sensitivity and response of tumor cells to drugs. Variations in the cancer genome can affect the efficacy of clinical treatments, and different targets respond to drugs differently. Such data are therefore important for the discovery of potential tumor therapeutic targets.
  • The data in GDSC come from 75,000 experiments describing the response of about 200 anticancer drugs in more than 1,000 tumor cells.
  • The cancer genome mutation information in this database comes from the COSMIC database and includes oncogene point mutations, gene amplification and loss, tissue types, and expression profiles. Users can search the database at three levels: compound, oncogene and cell line.
  • The Cancer Cell Line Encyclopedia (CCLE) integrates genetic information such as DNA mutations, gene expression, and chromosomal copy number through large-scale deep sequencing of 947 human cancer cell lines covering more than 30 tissue sources.
  • RNA sequencing data is the data obtained by RNA-seq (transcriptome sequencing) technology.
  • The transcriptome refers to the collection of all transcription products in a cell under a certain physiological condition.
  • The research object of transcriptome sequencing is the sum of all RNAs that can be transcribed by a specific cell in a functional state, mainly including mRNA and ncRNA.
  • Drug sensitivity data refers to the IC50 data of the drug related to the cancer cell tissue.
  • The second sequencing data and drug effect classification data corresponding to the cancer cell tissue samples of clinical patients are obtained based on The Cancer Genome Atlas (TCGA). The TCGA database contains clinical data, genomic variation, mRNA (messenger RNA) expression, miRNA (micro RNA) expression, methylation and other data for various human cancers (tumors, including subtypes), making it a very important data source for cancer researchers.
  • step S200 the acquired gene sequencing data of the cancer cell tissue is preprocessed according to the acquired drug characteristic data to obtain preprocessed gene sample data.
  • step S200 specifically includes the steps:
  • S214 Perform screening processing on the second sample data based on the scoring parameter to obtain gene sample data.
  • In step S211, the acquired first sequencing data is standardized to obtain first sample data.
  • The normalization processing refers to normalizing the first sequencing data, that is, the RNA sequencing data of the cancer cell tissue to be trained, by gene length and by sequencing depth.
  • In this way the TPM (Transcripts Per Kilobase of exon model per Million mapped reads, the number of transcripts per kilobase of transcript per million mapped reads) corresponding to the first sequencing data can be obtained, and the first sequencing data is then screened according to its TPM to obtain the first sample data. For example, the first sequencing data whose TPM is lower than 1 is screened out, and the remaining first sequencing data is the first sample data.
  • The first sequencing data consists of multiple gene fragments.
  • After standardization, the first sequencing data can be screened based on the gene expression levels of the multiple gene fragments: gene fragments with a TPM lower than 1 are screened out, and gene fragments with a TPM higher than 1 are retained.
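  • The normalization and low-expression filter described above can be sketched as follows. This is a minimal illustration for a single sample, not the applicant's implementation: the function names, gene names, read counts and gene lengths are all hypothetical, and the threshold of 1 is the example value given in the text.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts to TPM for one sample.

    counts: dict gene -> read count
    lengths_kb: dict gene -> gene length in kilobases
    """
    # Step 1: normalize by gene length (reads per kilobase).
    rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    # Step 2: normalize by sequencing depth so the values sum to one million.
    scale = sum(rpk.values()) / 1_000_000
    return {gene: value / scale for gene, value in rpk.items()}


def filter_low_expression(tpm_values, threshold=1.0):
    """Keep only genes whose TPM is above the threshold."""
    return {gene: v for gene, v in tpm_values.items() if v > threshold}


# Hypothetical toy input.
counts = {"GENE_A": 500, "GENE_B": 10, "GENE_C": 0}
lengths_kb = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 1.5}

sample_tpm = tpm(counts, lengths_kb)
kept = filter_low_expression(sample_tpm)
```

  • How the threshold is applied when a gene is measured in many cell lines (for example on the mean or on every sample) is not stated in the text, so the sketch leaves that choice to the caller.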
  • Next, the first sample data is screened according to the drug sensitivity correlation coefficient between the first sample data and the drug sensitivity data of the corresponding drug to be tested, to obtain second sample data.
  • The drug sensitivity correlation coefficient is the Pearson correlation coefficient between the TPM of each gene in the first sample data and the IC50 data of the drug related to the first sample data, that is, of the drug to be tested. The Pearson correlation coefficient measures the degree of correlation between two variables and takes values between -1 and 1; here the two variables are the TPM of the gene and the IC50 data of the sample drug.
  • The first sample data whose Pearson correlation coefficient has an absolute value lower than 0.1 is screened out, that is, data with a Pearson correlation coefficient between -0.1 and 0.1 is removed, and the second sample data is obtained.
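  • A minimal sketch of the correlation-based screening, assuming each gene has one TPM value per cell line aligned with the IC50 list. The function names and toy values are hypothetical; the cutoff of 0.1 is the example value from the text, and treating a constant gene as uncorrelated is an added assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: a constant series counts as uncorrelated
    return cov / (sd_x * sd_y)


def filter_by_correlation(expr, ic50, min_abs_r=0.1):
    """Return {gene: r} for genes whose |r| with IC50 is at least min_abs_r.

    expr: dict gene -> list of TPM values, one per cell line, aligned with ic50
    """
    kept = {}
    for gene, values in expr.items():
        r = pearson(values, ic50)
        if abs(r) >= min_abs_r:
            kept[gene] = r
    return kept


# Hypothetical toy data: four cell lines.
ic50 = [1.0, 2.0, 3.0, 4.0]
expr = {
    "GENE_A": [2.0, 4.0, 6.0, 8.0],  # rises with IC50
    "GENE_B": [8.0, 6.0, 4.0, 2.0],  # falls with IC50
    "GENE_C": [5.0, 3.0, 3.0, 5.0],  # no linear relationship
}
correlated = filter_by_correlation(expr, ic50)
```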
  • Next, the second sample data is scored according to the drug sensitivity data of the relevant drug to obtain the scoring parameters of the second sample data, and the second sample data is screened by the scoring parameters to obtain the screened gene sample data.
  • Specifically, scoring is based on Fisher's linear discriminant method: the mean and standard deviation of the gene expression levels in some of the cancer cells to which the drug to be tested is applied are calculated, the scoring parameters are computed from these means and standard deviations, and the genes are screened by the scoring parameters to obtain the screened genes, that is, the gene sample data.
  • The gene sample data is a collection of multiple gene fragments.
  • For example, according to the drug sensitivity data of the drug, the mean E1 and standard deviation STD1 of the expression level of a gene fragment are calculated over the 15% of cancer cell lines with the highest IC50, and the mean E2 and standard deviation STD2 over the 15% of cancer cell lines with the lowest IC50. The scoring parameter is then calculated by the formula (E1 - E2)/(STD1 + STD2), and the second sample data with the highest scoring parameters is retained as the gene sample data. The gene sample data includes several gene fragments; the number of gene fragments can be set according to actual needs, and the second sample data is filtered according to that number.
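  • The scoring formula (E1 - E2)/(STD1 + STD2) can be sketched as follows. The sketch assumes the 15% refers to cell lines ranked by IC50 and uses the sample standard deviation; neither detail is confirmed by the text. All names and data are hypothetical.

```python
import statistics


def fisher_score(expression, ic50, fraction=0.15):
    """Score one gene as (E1 - E2) / (STD1 + STD2).

    expression, ic50: aligned lists with one entry per cell line.
    E1/STD1 come from the highest-IC50 group, E2/STD2 from the lowest.
    """
    order = sorted(range(len(ic50)), key=lambda i: ic50[i])
    # Use at least two cell lines per group so a standard deviation exists.
    m = max(2, round(len(ic50) * fraction))
    low = [expression[i] for i in order[:m]]
    high = [expression[i] for i in order[-m:]]
    e1, std1 = statistics.mean(high), statistics.stdev(high)
    e2, std2 = statistics.mean(low), statistics.stdev(low)
    denom = std1 + std2
    return (e1 - e2) / denom if denom else 0.0


def select_top(scores, n):
    """Names of the n genes with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical toy data: 20 cell lines with IC50 values 1..20.
ic50 = [float(v) for v in range(1, 21)]
expr = {
    "GENE_UP": [float(v) for v in range(1, 21)],  # tracks IC50
    "GENE_FLAT": [5.0, 6.0] * 10,                 # unrelated to IC50
}
scores = {gene: fisher_score(values, ic50) for gene, values in expr.items()}
top = select_top(scores, 1)
```

  • With 20 cell lines each group holds 3 lines, so GENE_UP scores (19 - 2)/(1 + 1) = 8.5 and is the gene retained.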
  • step S200 specifically includes the steps:
  • In step S221, the acquired second sequencing data is normalized to obtain third sample data. The normalization processing refers to normalizing the second sequencing data, that is, the RNA sequencing data of the cancer cell tissue, first by gene length and then by sequencing depth.
  • In this way the TPM corresponding to the second sequencing data can be obtained, and the second sequencing data is then screened according to its TPM; the screened second sequencing data is the third sample data.
  • The second sequencing data consists of multiple gene fragments. After standardization, the second sequencing data can be screened based on the expression levels of the multiple gene fragments: gene fragments with a TPM lower than 1 are screened out, and gene fragments with a TPM higher than 1 are retained.
  • In step S222, the third sample data is tested according to the classification data of the effect of the drug to be tested on the cancer cell tissue, and the gene sample data is obtained. Specifically, the third sample data is tested with the Mann-Whitney U test: the third sample data is divided into effective data and ineffective data according to the drug effect classification data, the gene expression of each gene fragment in the effective data is compared with the gene expression of that gene fragment in the ineffective data to calculate a data value (the P value of the test), and gene fragments whose data value is less than a certain value, for example less than 0.1, are taken as the gene sample data.
  • The classification data of the effect of the drug to be tested against a certain cancer cell tissue obtained from The Cancer Genome Atlas includes several grades, such as "complete remission", "partial remission", "stable disease" and "disease progression". "Complete remission", "partial remission" and "stable disease" can be defined as "effective", and "disease progression" can be defined as "ineffective"; the third sample data can then be divided into effective sample data and ineffective sample data according to the drug effect classification data.
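  • The grouping into "effective" and "ineffective" samples and the subsequent test can be sketched in pure Python as follows. The two-sided P value here uses the normal approximation without tie or continuity correction, which is a simplification chosen for brevity; a statistics library such as SciPy's `scipy.stats.mannwhitneyu` would normally be used instead. Gene names, expression values and helper names are hypothetical.

```python
import math

EFFECTIVE = {"complete remission", "partial remission", "stable disease"}


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average
        i = j + 1
    return ranks


def mann_whitney_p(x, y):
    """Two-sided P value of the Mann-Whitney U test (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = rank_with_ties(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def select_genes(expr, responses, alpha=0.1):
    """Return {gene: p} for genes whose P value is below alpha."""
    eff = [i for i, r in enumerate(responses) if r in EFFECTIVE]
    ineff = [i for i, r in enumerate(responses) if r not in EFFECTIVE]
    selected = {}
    for gene, values in expr.items():
        p = mann_whitney_p([values[i] for i in eff], [values[i] for i in ineff])
        if p < alpha:
            selected[gene] = p
    return selected


# Hypothetical toy data: 5 effective and 5 ineffective patients.
responses = (["complete remission"] * 2 + ["partial remission"] * 2
             + ["stable disease"] + ["disease progression"] * 5)
expr = {
    "GENE_A": [10, 11, 12, 13, 14, 1, 2, 3, 4, 5],  # clearly separated
    "GENE_B": [1, 3, 5, 7, 9, 2, 4, 6, 8, 10],      # interleaved
}
selected = select_genes(expr, responses)
```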
  • In step S300, verification is performed according to the gene sample data obtained by preprocessing and the drug characteristic data of the drug to be tested, and a prediction model and a gene prediction list of the drug to be tested are obtained.
  • The prediction model of the drug to be tested is a preset mathematical model whose optimal parameters are obtained through verification;
  • the gene prediction list refers to the gene fragments that play a key role in predicting the drug sensitivity of the drug to be tested in cancer cells.
  • step S300 specifically includes the steps:
  • S314 Generate a gene prediction list according to the number of gene lists, and determine a prediction model according to model parameters.
  • In steps S311 and S312, the drug sensitivity correlation coefficients between the multiple gene fragments in the gene sample data and the drug sensitivity data of the drug are obtained, and the scoring parameters of the gene sample data are obtained, wherein the drug sensitivity correlation coefficient refers to the drug sensitivity correlation coefficient obtained in step S212, and the scoring parameter refers to the scoring parameter obtained in step S213.
  • For example, the drug sensitivity correlation coefficient is set as S1 with a corresponding weight of 0.3, and the scoring parameter, namely the Fisher discriminant score, has a corresponding weight of 0.7; the gene fragments are sorted in descending order according to these two weighted quantities.
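  • One plausible reading of the weighting above is a linear combination of the two quantities followed by a descending sort, sketched below. The combined form 0.3 × S1 + 0.7 × (Fisher score) is an assumption, since the text gives only the weights; whether raw or absolute values are used is also not stated, so the sketch takes the inputs as given. Function and gene names are hypothetical.

```python
def rank_genes(correlation, fisher, w_corr=0.3, w_fisher=0.7):
    """Sort genes in descending order of a weighted sum of the two scores.

    correlation: dict gene -> drug sensitivity correlation coefficient (S1)
    fisher: dict gene -> Fisher discriminant score
    """
    combined = {
        gene: w_corr * correlation[gene] + w_fisher * fisher[gene]
        for gene in correlation
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical scores for three genes.
correlation = {"GENE_A": 0.8, "GENE_B": 0.2, "GENE_C": 0.5}
fisher = {"GENE_A": 0.1, "GENE_B": 0.9, "GENE_C": 0.5}

ranking = rank_genes(correlation, fisher)
```

  • Here GENE_B ranks first (0.06 + 0.63 = 0.69), ahead of GENE_C (0.50) and GENE_A (0.31), because the Fisher score carries the larger weight.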
  • In steps S313 and S314, verification processing is performed on the gene fragments in the gene sample data sorted in descending order to obtain the model parameters of the prediction model and the gene list size (the number of gene fragments in the list).
  • The verification process sequentially selects the first n gene fragments in the sorted gene sample data, where the value of n can be set according to actual needs; for example, n ranges from 10 to 30 genes.
  • Based on a K-nearest neighbor regression model, the model parameter k is enumerated, the k nearest neighbors are selected, the drug sensitivity data corresponding to the cancer cell tissue to be trained is predicted, and 5-fold cross-validation is performed to obtain the prediction result.
  • In this way, a prediction model for a given drug on the cancer cell tissue to be trained is established according to the K-nearest neighbor algorithm.
  • The prediction model has the optimal model parameters, that is, the optimal k value.
  • The number of gene fragments used, that is, the value of n, is determined at the same time: if the first n gene fragments yield the optimal model parameters of the prediction model, then the first n gene fragments in the gene sample data constitute the gene prediction list.
  • The model parameters are those for which the area under the curve (AUC) of the receiver operating characteristic (ROC) curve obtained after cross-validation of the prediction model is largest;
  • the corresponding number of gene fragments n and neighbor parameter k are determined as the final model parameters.
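  • The enumeration over n and k with 5-fold cross-validation can be sketched in pure Python as below. The text does not say how an ROC curve is obtained from predicted IC50 values, so this sketch assumes the true IC50 is binarized at its median and the predicted IC50 is used as the ranking score; the fold assignment (sample i goes to fold i mod 5) and all names and data are likewise hypothetical.

```python
import math
import statistics


def knn_predict(train_x, train_y, query, k):
    """Mean target of the k training points closest to query (Euclidean)."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cross_val_predict(x, y, k, folds=5):
    """Out-of-fold predictions; sample i is held out in fold i % folds."""
    preds = [0.0] * len(x)
    for f in range(folds):
        train = [i for i in range(len(x)) if i % folds != f]
        tx, ty = [x[i] for i in train], [y[i] for i in train]
        for i in range(f, len(x), folds):
            preds[i] = knn_predict(tx, ty, x[i], k)
    return preds


def auc(labels, scores):
    """Probability that a positive outranks a negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def grid_search(expr_ranked, ic50, n_values, k_values):
    """Return (best_auc, n, k); expr_ranked is sorted by gene score."""
    cutoff = statistics.median(ic50)
    labels = [v > cutoff for v in ic50]  # assumption: median split
    best = (-1.0, None, None)
    for n in n_values:
        # One feature vector per cell line, built from the top n genes.
        x = [[gene[i] for gene in expr_ranked[:n]] for i in range(len(ic50))]
        for k in k_values:
            score = auc(labels, cross_val_predict(x, ic50, k))
            if score > best[0]:
                best = (score, n, k)
    return best


# Hypothetical toy data: 10 cell lines, 2 ranked genes.
ic50 = [float(v) for v in range(1, 11)]
expr_ranked = [
    [float(v) for v in range(1, 11)],                     # informative gene
    [5.0, 1.0, 4.0, 2.0, 3.0, 5.0, 1.0, 4.0, 2.0, 3.0],   # uninformative gene
]
best_auc, best_n, best_k = grid_search(expr_ranked, ic50, [1, 2], [1, 2, 3])
```

  • In the text n ranges over 10 to 30 genes and k over 1 to 30; the toy ranges here are kept small so the example stays readable.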
  • As an example, the GDSC database contains the first sequencing data of each cell line, that is, the RNA sequencing data, together with the IC50 of the cell line under the action of different drugs.
  • Four chemotherapeutic drugs, paclitaxel, 5-fluorouracil, cyclophosphamide and cisplatin, are selected to illustrate the procedure described above.
  • RNA sequencing data of colorectal cancer cell lines and the IC50 data of paclitaxel are obtained from the GDSC database. After preprocessing, the genes in the RNA sequencing data are scored and ranked. For a given number of top-ranked genes, the IC50 data of paclitaxel is predicted for each k value of the K-nearest neighbor regression model from 1 to 30, the prediction results are cross-validated, the AUC value is calculated, and the maximum AUC value obtained over the different k values is recorded together with the corresponding k value. The top 11 genes after gene score sorting are then selected, the IC50 data of paclitaxel is again predicted for k values from 1 to 30, and the prediction results are cross-validated in the same way.
  • The gene prediction list for drug sensitivity prediction of paclitaxel thus includes the top n_max genes ranked by gene score, together with the optimal model parameter k_max of the K-nearest neighbor regression model.
  • The above operations are repeated for the other three drugs, 5-fluorouracil, cyclophosphamide and cisplatin, to obtain the gene prediction list of each drug and the corresponding optimal parameter k_max of the K-nearest neighbor regression model.
  • In this way, a prediction model and its optimal parameter k_max are obtained for each of the four drugs, and each of the four drugs has a corresponding gene prediction list;
  • the gene prediction lists of the four drugs are aggregated into one large gene prediction list set.
  • For a cancer cell tissue to be tested, the corresponding key gene fragments can be extracted directly according to the gene prediction list set. The key gene fragments are not specific to a single drug, which ensures data sufficiency.
  • The K-nearest neighbor regression models established above can then be used to predict the drug sensitivity of the tissue, that is, the IC50 value of each drug. The response to each drug can therefore be judged, and an appropriate medication plan can be formulated efficiently according to those responses.
  • step S300 specifically includes the steps:
  • S324 Generate a gene prediction list according to the number of gene lists and the corresponding multiple gene segments, and determine a prediction model according to model parameters.
  • the gene scores of multiple gene fragments in the gene sample data are obtained, wherein each gene score is the negative of the P value of the gene fragment calculated by the Mann-Whitney U test method mentioned in step S221, and the multiple gene fragments are sorted in descending order of the obtained gene scores.
  • in step S323, cross-validation is performed on the plurality of gene fragments arranged in descending order to obtain the model parameters of the prediction model and the number of gene lists, which are used as model parameters of a drug efficacy prediction model for the drug targeted by the plurality of gene fragments.
  • the prediction model refers to the K-nearest neighbor classification model for predicting the efficacy of a drug on cancer cells of a clinical patient.
  • the model has the optimal model parameters for the cancer tissue, and the model parameters include the optimal neighbor parameter and the gene fragment parameter.
  • the first n gene fragments in descending order are selected in sequence, where the value of n can be chosen according to actual needs, for example from 10 to 30, and the parameter k of the K-nearest neighbor classification model is enumerated; the parameter k indicates that k neighboring points are selected, and the value of k can be chosen according to actual needs.
  • the value of k is 1 to 30.
  • for the K-nearest neighbor model corresponding to each enumerated parameter k, whether the drug is effective is predicted, 5-fold cross-validation is performed on the "valid" or "invalid" predictions, and the accuracy and F1-score are calculated from the new prediction results obtained after cross-validation.
  • the F1 score is an indicator used in statistics to measure the accuracy of the binary classification model.
  • the above conditions are exemplified by taking the cancer cell tissue sample of a clinical patient as a colorectal cancer cell line and the drug to be tested as 5-fluorouracil.
  • RNA sequencing data of colorectal cancer cell line samples from clinical patients and the efficacy grading data of 5-fluorouracil were obtained from the TCGA database. After standardizing the RNA sequencing data, genes with TPM lower than 1 were discarded. For each gene fragment, the P value between the gene expression levels in the valid sample data and those in the invalid sample data was calculated by the Mann-Whitney U test method; the gene fragments were sorted in ascending order of P value, gene fragments with a P value greater than 0.1 were filtered out and gene fragments with a P value less than 0.1 were retained, and the samples in the RNA sequencing data were marked as valid data or invalid data according to the drug effect grading data.
  • Taking values of n from 10 to 30 as an example, first take the top 10 gene fragments of the colorectal cancer cells, sorted by gene score, use the K-nearest neighbor classification model to predict the IC50 data value of 5-fluorouracil, perform 5-fold cross-validation on the IC50 data value, obtain a valid or invalid prediction result of 5-fluorouracil for the colorectal cancer cell samples, and from the prediction results calculate the accuracy of the K-nearest neighbor classification model corresponding to each model parameter.
  • record the maximum accuracy and the values of the parameters k and n corresponding to it; then take the top 11 genes ranked by gene score and repeat the enumeration of the parameter k of the K-nearest neighbor classification model from 1 to 30; if the current maximum accuracy is greater than the maximum accuracy obtained when n equals 10, record the new maximum accuracy and the corresponding values of k and n.
  • in this way, the maximum accuracy over all cases and the corresponding n max and k max values are obtained.
  • for example, if the obtained value of n max is 15 and the value of k max is 5, the first 15 gene fragments are selected as the gene prediction list, that is, the key prediction gene list.
  • a gene set required for drug efficacy prediction, that is, a gene prediction list
  • a gene prediction list can be obtained, so that in actual drug efficacy prediction only a small number of genes need to be measured to predict drug efficacy. This improves prediction speed and reduces prediction cost; it also avoids the dependence of drug efficacy prediction on time-consuming sequencing technologies such as RNA sequencing, so that the cancer drug efficacy of a patient can be predicted quickly, which is suitable for individualized medication during or after surgery.
  • step S400 drug sensitivity prediction is performed by using the generated prediction model and gene prediction list, for example, drug sensitivity prediction of a certain drug to be tested with respect to a certain cancer cell tissue. Specifically, through the established prediction model and the key genes in the gene prediction list, rapid drug sensitivity prediction of cancer cells is performed.
  • step S400 further includes:
  • the gene expression level is input into the prediction model, and the drug sensitivity result of the cancer cell tissue to be tested is obtained.
  • in steps S410 and S420, according to the gene prediction list obtained in step S300, the corresponding key gene fragments are extracted from the cancer cell tissue to be tested, and the number of gene fragments extracted is determined by the number in the gene prediction list; after the corresponding gene fragments are extracted from the cancer cell tissue based on qPCR technology or gene chip technology, the gene expression level of each gene fragment can be measured quickly.
  • in steps S430 and S440, the gene expression level of each gene fragment is standardized in the same way as for the cancer cell tissue to be trained and then input into the prediction model, which outputs the corresponding prediction result for the cancer cell tissue to be tested, namely whether the drug is effective.
  • the prediction result represents the drug sensitivity result, for the current cancer cell tissue, of the drug corresponding to the prediction model. It should be noted that, in practical applications, a plurality of different prediction models can be established by performing steps S100 to S300, with a different prediction model for each drug to be tested, and the drugs to be tested are selected according to the type of cancer cell tissue.
  • colorectal cancer cell line is selected as the type of cancer cell tissue, and paclitaxel, 5-fluorouracil, cyclophosphamide, and cisplatin are selected for the colorectal cancer cell line.
  • steps S100 to S300 are respectively performed for the four chemotherapeutic drugs to generate a prediction model and a gene prediction list, and each chemotherapeutic drug corresponds to its own prediction model and gene prediction list.
  • the corresponding prediction model is selected, the gene samples of the extracted cancer cells are input, and the corresponding gene expression levels are obtained.
  • Predicting drug effects: the method can predict whether a given chemotherapeutic drug is effective against the cancer cells of a clinical patient; it can also predict the drug effect of a chemotherapeutic drug on cancer cells of the same line, that is, the IC50 data value.
  • The RNA sequencing data of the cancer cell line to be trained and the IC50 data indicating drug sensitivity, both from the GDSC and CCLE databases, are taken as an example.
  • The acquired RNA sequencing data are preprocessed; the preprocessing includes filtering based on gene expression, filtering based on the correlation between gene expression and the IC50 data, and filtering by Fisher's linear discriminant, after which some gene fragments are retained;
  • Use the K-nearest neighbor regression model for cross-validation enumerate the parameters of the K-nearest neighbor regression model, select the model parameters with the highest cross-validation accuracy, determine the optimal parameters of the prediction model and construct the generated gene prediction list;
  • the key gene fragments of cancer cells can be obtained according to the gene prediction list, the gene expression of key gene fragments can be obtained by qPCR technology or gene chip technology, and the gene expression can be used as the input parameter of the prediction model.
  • the prediction result of the drug sensitivity of the cancer cell tissue is obtained, that is, the predicted IC50 data value.
  • the RNA sequencing data of cancer tissue samples of clinical patients and the drug effect classification data of corresponding clinical drugs are obtained through the TCGA database.
  • The acquired RNA sequencing data are preprocessed, which includes filtering based on gene expression and filtering by the Mann-Whitney U test method, to obtain preprocessed gene fragments; the K-nearest neighbor regression model is used for cross-validation, the parameters of the K-nearest neighbor regression model are enumerated, the model parameters with the highest cross-validation accuracy are selected, the optimal parameters of the prediction model are determined, and the gene prediction list is constructed;
  • the key gene fragments of cancer cells of clinical patients can be obtained according to the gene prediction list, the gene expression of key gene fragments can be obtained by qPCR technology or gene chip technology, and the gene expression can be standardized as the input of the prediction model.
  • the clinical patient can be prescribed an appropriate method for drug use according to whether the drug is effective.
  • each candidate drug has a corresponding gene prediction list, namely gene prediction list 1, gene prediction list 2 and gene prediction list 3; in practical applications each gene prediction list contains about ten corresponding key genes.
  • the gene prediction lists of the three candidate drugs are assembled into a gene prediction list set; in addition, candidate drug 1, candidate drug 2 and candidate drug 3 each have their own trained prediction model, namely prediction model 1, prediction model 2 and prediction model 3, respectively.
  • on the other hand, tumor cancer cell tissue of the clinical patient is obtained, the gene expression levels of the corresponding gene fragments are measured by qPCR technology or gene chip technology according to the gene prediction list, and the gene expression levels are standardized and used as the input of the prediction model corresponding to each candidate drug.
  • the corresponding drug efficacy predictions for candidate drug 1, candidate drug 2 and candidate drug 3 are thereby obtained, that is, whether each of candidate drug 1, candidate drug 2 and candidate drug 3 is effective against the tumor cancer cell tissue of the clinical patient is predicted; based on the prediction results, an individualized medication plan is formulated for the clinical patient to achieve precision medicine.
  • the time taken by the entire drug efficacy prediction can be shortened, which makes it convenient to recommend a medication regimen in time during or after clinical surgery. This effectively avoids the dependence of drug efficacy prediction on RNA sequencing and other time-consuming sequencing technologies; and by reducing the gene set required for drug efficacy prediction, the drug responsiveness of clinical patients can be predicted quickly and accurately, reducing prediction cost and time cost and improving prediction efficiency.
  • an embodiment of the present application further provides an electronic device, including: at least one processor, and a memory communicatively connected to the at least one processor;
  • the processor is configured to execute the drug sensitivity prediction method mentioned in the embodiment of the first aspect by calling the computer program stored in the memory.
  • the memory can be used to store non-transitory software programs and non-transitory computer-executable programs, such as the drug sensitivity prediction method mentioned in the embodiments of the first aspect of the present application.
  • the processor implements the drug sensitivity prediction method mentioned in the embodiment of the first aspect above by running the non-transitory software program and instructions stored in the memory.
  • the memory may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data used in executing the drug sensitivity prediction method mentioned in the first aspect embodiment above.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device.
  • the memory may optionally include memory located remotely from the processor, and these remote memories may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the non-transitory software programs and instructions required to implement the drug sensitivity prediction method mentioned in the embodiment of the first aspect above are stored in the memory, and when executed by one or more processors, execute the method proposed in the embodiment of the first aspect above.
  • the embodiments of the present application further provide a computer-readable storage medium storing computer-executable instructions, where the computer-executable instructions are used to: execute the drug sensitivity prediction method mentioned in the embodiments of the first aspect;
  • the computer-readable storage medium stores computer-executable instructions; when executed by one or more control processors, e.g., by a processor in the electronic device of the second aspect embodiment, the instructions cause the one or more processors to execute the drug sensitivity prediction method mentioned in the embodiment of the first aspect.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and that can be accessed by a computer.
  • communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery media, as is well known to those of ordinary skill in the art.

Abstract

Disclosed are a drug sensitivity prediction method, an electronic device and a computer-readable storage medium, which relate to the technical field of drug testing. The method comprises: acquiring gene sequencing data and drug feature data of cancer cell tissue to be trained; pre-processing the gene sequencing data according to the drug feature data, so as to obtain gene sample data; performing verification processing according to the gene sample data and the drug feature data, so as to obtain a prediction model and a gene prediction list; and, by means of the gene prediction list and the prediction model, performing drug sensitivity prediction on cancer cell tissue to be tested. Drug response prediction for a clinical patient can therefore be realized quickly and precisely, thereby reducing prediction costs and time costs and improving the efficiency of drug effect prediction.

Description

Drug sensitivity prediction method, electronic device and computer-readable storage medium
Technical Field
The present application relates to the technical field of drug testing, and in particular to a drug sensitivity prediction method, an electronic device and a computer-readable storage medium.
Background
In the era of precision medicine, predicting the drug responsiveness of cancer patients based on their clinical characteristics and genomics is essential for helping clinicians formulate effective, low-toxicity treatment regimens. Prediction models for drug response are usually trained on different datasets. At present, the most widely used drug prediction models are based on supervised learning techniques, including regression models and classification models. The former generate specific drug sensitivity values, such as the IC50 (half maximal inhibitory concentration); the latter generate levels of drug response, such as high-sensitivity and low-sensitivity drug responses.
Some existing studies and methods seek to discover the relationship between the genome/transcriptome and the effect of cancer medication, so as to support cancer dosing regimens and improve the efficacy of cancer medication. However, current studies and solutions are still some distance from practical application and cannot be applied efficiently in clinical scenarios. For example, approaches that use supervised learning to predict drug responsiveness from the genome or transcriptome have several shortcomings: the data analysis is limited to existing databases and lacks experimental and clinical validation; the methods are based on RNA sequencing rather than small gene sets, so rapid gene expression measurement cannot be used, and RNA sequencing takes several days to several weeks, which is unsuitable for the intraoperative or immediately postoperative medication often required in clinical practice; and drug efficacy prediction stops at data analysis without proposing a concrete, rapid application scheme, so practical application is difficult, costly and slow.
Summary of the Invention
The present application aims to solve at least one of the technical problems in the prior art. To this end, the present application proposes a drug sensitivity prediction method that can quickly and accurately predict the drug responsiveness of clinical patients, reduce prediction cost and time cost, and improve the efficiency of drug efficacy prediction.
The present application also proposes an electronic device that implements the above drug sensitivity prediction method.
The present application also proposes a computer-readable storage medium that implements the above drug sensitivity prediction method.
The drug sensitivity prediction method according to an embodiment of the first aspect of the present application includes: acquiring gene sequencing data and drug feature data of cancer cell tissue to be trained; preprocessing the gene sequencing data according to the drug feature data to obtain gene sample data; performing verification processing according to the gene sample data and the drug feature data to obtain a prediction model and a gene prediction list; and performing drug sensitivity prediction on cancer cell tissue to be tested by means of the prediction model and the gene prediction list.
The drug sensitivity prediction method according to the embodiments of the present application has at least the following beneficial effects: gene sequencing data and drug feature data of the cancer cell tissue to be trained are acquired, the gene sequencing data are preprocessed according to the drug feature data to obtain gene sample data, verification processing is performed according to the gene sample data and the drug feature data to obtain a prediction model and a gene prediction list, and drug sensitivity prediction is performed on the cancer cell tissue to be tested by means of the gene prediction list and the prediction model. The drug responsiveness of clinical patients can thus be predicted quickly and accurately, reducing prediction cost and time cost and improving the efficiency of drug efficacy prediction.
According to some embodiments of the present application, the gene sequencing data includes first sequencing data, and the drug feature data includes drug sensitivity data; correspondingly, acquiring the gene sequencing data and drug feature data of the cancer cell tissue to be trained includes: acquiring the first sequencing data of the cancer cell tissue to be trained and the corresponding drug sensitivity data from a genome database.
According to some embodiments of the present application, preprocessing the gene sequencing data according to the drug feature data to obtain gene sample data includes: standardizing the first sequencing data to obtain first sample data; screening the first sample data according to the drug sensitivity correlation coefficient between the first sample data and the drug sensitivity data to obtain second sample data; performing scoring determination on the second sample data according to the drug sensitivity data to obtain scoring parameters of the second sample data; and screening the second sample data based on the scoring parameters to obtain the gene sample data.
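The first two preprocessing steps above (standardization, then screening by the correlation between gene expression and drug sensitivity) might be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes z-score standardization and the Pearson correlation coefficient, and the function names and the `threshold` value of 0.3 are hypothetical choices that the text does not specify.

```python
import math


def zscore_columns(matrix):
    """Standardize each gene (column) to zero mean and unit variance."""
    out_cols = []
    for col in zip(*matrix):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        # A gene with constant expression carries no information; map it to 0.
        out_cols.append([(v - mean) / std if std else 0.0 for v in col])
    return [list(row) for row in zip(*out_cols)]


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va and vb else 0.0


def filter_by_correlation(genes, matrix, ic50, threshold=0.3):
    """Keep genes whose |r| with the IC50 values is at least `threshold`.

    `matrix` has one row per sample and one column per gene, in the order
    given by `genes`. The threshold is an assumed, illustrative value.
    """
    kept = []
    for j, gene in enumerate(genes):
        r = pearson([row[j] for row in matrix], ic50)
        if abs(r) >= threshold:
            kept.append((gene, r))
    return kept
```

The retained `(gene, r)` pairs could then be passed to whatever scoring step is used next; the scoring itself is not sketched here.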
According to some embodiments of the present application, performing verification processing according to the gene sample data and the drug feature data to obtain a prediction model and a gene prediction list includes: obtaining the drug sensitivity correlation coefficient between the gene sample data and the drug sensitivity data, and obtaining the scoring parameters of the gene sample data, the gene sample data including a plurality of gene fragments; sorting the plurality of gene fragments in descending order according to the drug sensitivity correlation coefficient and the scoring parameters; performing verification processing on the plurality of gene fragments sorted in descending order to obtain the model parameters of the prediction model and the number of gene lists; and generating the gene prediction list according to the number of gene lists and determining the prediction model according to the model parameters.
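One way to picture the verification step is a small grid search: for each candidate list length n (the top n ranked genes) and each neighbor count k, a K-nearest neighbor regressor is cross-validated and the best (n, k) pair is kept. The sketch below makes several assumptions that are not taken from the text: it scores each pair by cross-validated mean squared error on IC50 (the embodiment elsewhere in this document records an AUC value, whose computation for continuous IC50 is not spelled out), it uses Euclidean distance, and all function names are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Predict a value as the mean target of the k nearest training samples."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: math.dist(train_x[i], query),
    )[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cv_error(x, y, k, folds=5):
    """Mean squared error of k-NN regression under k-fold cross-validation."""
    idx = list(range(len(x)))
    total = 0.0
    for f in range(folds):
        test = idx[f::folds]              # every `folds`-th sample is held out
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        tx = [x[i] for i in train]
        ty = [y[i] for i in train]
        for i in test:
            total += (knn_predict(tx, ty, x[i], k) - y[i]) ** 2
    return total / len(x)


def grid_search(expr, ic50, n_values, k_values):
    """Return (best_error, n_max, k_max) over top-n gene subsets and k values.

    `expr` has one row per sample; its columns are assumed to be already
    sorted by gene rank, so `row[:n]` selects the top n genes.
    """
    best = None
    for n in n_values:
        x = [row[:n] for row in expr]
        for k in k_values:
            err = cv_error(x, ic50, k)
            if best is None or err < best[0]:
                best = (err, n, k)
    return best
```

A call such as `grid_search(expr, ic50, range(10, 31), range(1, 31))` would mirror the ranges used as examples in this document; the first n genes of the ranking then form the gene prediction list.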
According to some embodiments of the present application, the gene sequencing data includes second sequencing data, and the drug feature data includes drug effect grading data; correspondingly, acquiring the gene sequencing data and drug feature data of the cancer cell tissue to be trained includes: acquiring the second sequencing data and the drug effect grading data of the cancer cell tissue to be trained from a genome atlas database.
According to some embodiments of the present application, preprocessing the gene sequencing data according to the drug feature data to obtain gene sample data includes: standardizing the second sequencing data to obtain third sample data; and testing the third sample data according to the drug effect grading data to obtain the gene sample data.
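The testing step is described elsewhere in this document as a Mann-Whitney U test between the samples in which the drug was effective and those in which it was not, with genes retained when P is below 0.1 and scored by the negative of P. The sketch below is an illustration under stated assumptions rather than the applicant's code: the P value uses the normal approximation with tie and continuity corrections (an exact test may be preferable for very small groups), and the function names are hypothetical.

```python
import math


def mann_whitney_p(a, b):
    """Two-sided P value of the Mann-Whitney U test (normal approximation)."""
    pooled = sorted((v, g) for g, vals in enumerate((a, b)) for v in vals)
    n1, n2 = len(a), len(b)
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2            # mean of the 1-based ranks i+1 .. j
        t = j - i                              # size of this group of tied values
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for m in range(i, j) if pooled[m][1] == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                             # all values identical
        return 1.0
    z = (abs(u - mean_u) - 0.5) / math.sqrt(var_u)   # continuity correction
    return min(1.0, math.erfc(max(z, 0.0) / math.sqrt(2)))


def rank_genes(genes, valid_rows, invalid_rows, p_cutoff=0.1):
    """Keep genes with P < p_cutoff and sort them by gene score (-P), descending."""
    scored = []
    for j, gene in enumerate(genes):
        p = mann_whitney_p([r[j] for r in valid_rows], [r[j] for r in invalid_rows])
        if p < p_cutoff:
            scored.append((gene, -p))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Sorting by the negated P value in descending order is equivalent to sorting by P in ascending order, so the most discriminating genes come first.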
According to some embodiments of the present application, performing verification processing according to the gene sample data and the drug feature data to obtain a prediction model and a gene prediction list includes: obtaining gene scores of a plurality of gene fragments of the gene sample data; sorting the plurality of gene fragments in descending order according to the gene scores; performing cross-validation on the plurality of gene fragments sorted in descending order to obtain the model parameters of the prediction model and the number of gene lists; and generating the gene prediction list according to the number of gene lists and the corresponding plurality of gene fragments, and determining the prediction model according to the model parameters.
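For the classification case, the cross-validation described above can be sketched as a search over the list length n and the neighbor count k, keeping the pair with the highest 5-fold accuracy. The code below is a hedged sketch: the function names, the Euclidean distance, the interleaved fold assignment and the tie-breaking of votes are all assumptions made for illustration, and labels are encoded as 1 (effective) and 0 (ineffective).

```python
import math
from collections import Counter


def knn_classify(train_x, train_y, query, k):
    """Majority vote among the k nearest training samples.

    On a tied vote, the label seen first among the nearest samples wins.
    """
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: math.dist(train_x[i], query),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_val_predict(x, y, k, folds=5):
    """Out-of-fold predictions: each sample is predicted by a model not trained on it."""
    preds = [None] * len(x)
    for f in range(folds):
        held_out = set(range(f, len(x), folds))
        tx = [x[i] for i in range(len(x)) if i not in held_out]
        ty = [y[i] for i in range(len(x)) if i not in held_out]
        for i in held_out:
            preds[i] = knn_classify(tx, ty, x[i], k)
    return preds


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy and F1 = 2TP / (2TP + FP + FN) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    acc = sum(t == p for t, p in pairs) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1


def select_n_and_k(expr, labels, n_values=range(10, 31), k_values=range(1, 31)):
    """Return (max_accuracy, n_max, k_max); columns of `expr` are sorted by gene score."""
    best = (-1.0, None, None)
    for n in n_values:
        x = [row[:n] for row in expr]
        for k in k_values:
            acc, _ = accuracy_and_f1(labels, cross_val_predict(x, labels, k))
            if acc > best[0]:          # only a strictly better accuracy replaces the record
                best = (acc, n, k)
    return best
```

Because the record is replaced only on a strict improvement, ties resolve to the smallest n and k tried, which favors the shortest gene prediction list.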
According to some embodiments of the present application, performing drug sensitivity prediction on the cancer cell tissue to be tested by means of the prediction model and the gene prediction list includes: obtaining the gene fragments corresponding to the cancer cell tissue to be tested according to the gene prediction list; obtaining the gene expression levels of the gene fragments; and inputting the gene expression levels into the prediction model to obtain the drug sensitivity result of the cancer cell tissue to be tested.
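At prediction time, each drug has its own gene prediction list and trained model, so the genes to be measured are the union of the lists and each model reads only its own genes. The following sketch shows one possible arrangement for a K-nearest neighbor classifier. The `DrugModel` class, its fields, and the gene and drug names used with it are hypothetical; the sketch also assumes that the measured expression values are standardized with the per-gene mean and standard deviation saved from the training data.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class DrugModel:
    """Everything needed to predict one drug from a short gene list."""
    genes: list      # gene prediction list for this drug
    k: int           # number of neighbors chosen during cross-validation
    train_x: list    # standardized training expression, columns follow `genes`
    train_y: list    # 1 = effective, 0 = ineffective
    mean: list       # per-gene training mean used for standardization
    std: list        # per-gene training standard deviation (non-zero)

    def predict(self, expression):
        """`expression` maps gene name to a measured level, e.g. from qPCR."""
        q = [
            (expression[g] - m) / s
            for g, m, s in zip(self.genes, self.mean, self.std)
        ]
        nearest = sorted(
            range(len(self.train_x)),
            key=lambda i: math.dist(self.train_x[i], q),
        )[: self.k]
        return Counter(self.train_y[i] for i in nearest).most_common(1)[0][0]


def required_genes(models):
    """Union of all per-drug gene lists: the genes that have to be measured."""
    return sorted({g for m in models.values() for g in m.genes})


def predict_all(models, expression):
    """Run every drug's model on one patient's measured expression levels."""
    return {drug: m.predict(expression) for drug, m in models.items()}
```

In use, `required_genes(models)` would give the panel to measure for a tissue sample, and `predict_all(models, expression)` would return one effective/ineffective label per candidate drug.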
An electronic device according to an embodiment of the second aspect of the present application includes: at least one processor, and a memory communicatively connected to the at least one processor; wherein the memory stores instructions, and the instructions are executed by the at least one processor so that the at least one processor, when executing the instructions, implements the drug sensitivity prediction method according to the first aspect.
The electronic device according to the present application has at least the following beneficial effects: by executing the drug sensitivity prediction method mentioned in the embodiment of the first aspect, the drug responsiveness of clinical patients can be predicted quickly and accurately, reducing prediction cost and time cost and improving the efficiency of drug efficacy prediction.
A computer-readable storage medium according to an embodiment of the third aspect of the present application stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to execute the drug sensitivity prediction method according to the first aspect.
The computer-readable storage medium according to the present application has at least the following beneficial effects: by executing the drug sensitivity prediction method mentioned in the embodiment of the first aspect, the drug responsiveness of clinical patients can be predicted quickly and accurately, reducing prediction cost and time cost and improving the efficiency of drug efficacy prediction.
Additional aspects and advantages of the present application will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the present application.
Description of the Drawings
FIG. 1 is a schematic flowchart of the drug sensitivity prediction method in an embodiment of the present application;
FIG. 2 is a schematic flowchart of step S200 of the drug sensitivity prediction method in an embodiment of the present application;
FIG. 3 is another schematic flowchart of step S200 of the drug sensitivity prediction method in an embodiment of the present application;
FIG. 4 is a schematic flowchart of step S300 of the drug sensitivity prediction method in an embodiment of the present application;
FIG. 5 is another schematic flowchart of step S300 of the drug sensitivity prediction method in an embodiment of the present application;
FIG. 6 is a schematic flowchart of step S400 of the drug sensitivity prediction method in an embodiment of the present application;
FIG. 7 is a diagram of a specific application example of the drug sensitivity prediction method in an embodiment of the present application.
具体实施方式Detailed ways
下面详细描述本申请的实施例,所述实施例的示例在附图中示出,其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的,仅用于解释本申请,而不能理解为对本申请的限制。The following describes in detail the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present application, but should not be construed as a limitation on the present application.
需要说明的是,在流程图中示出了逻辑顺序,但是在某些情况下,可以以不同流程图中的顺序执 行所示出或描述的步骤。如果涉及到“若干”,其含义是一个以上,如果涉及到“多个”,其含义是两个以上,如果涉及到“以下”,均应理解为包括本数。本文所提供的任何以及所有实例或示例性语言(“例如”、“如”等)的使用仅意图更好地说明本申请的实施例,并且除非另外要求,否则不会对本申请的范围施加限制。大于、小于、超过等理解为不包括本数,以上、以下、以内等理解为包括本数。如果有描述到第一、第二只是用于区分技术特征为目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量或者隐含指明所指示的技术特征的先后关系。It should be noted that, in the flowcharts, a logical order is shown, but in some cases, the steps shown or described may be performed in the order in a different flowchart. If it refers to "several", it means more than one, if it refers to "plurality", it means two or more, if it refers to "the following", it should be understood as including the number. The use of any and all examples or exemplary language ("for example," "such as," etc.) provided herein is merely intended to better illustrate the embodiments of the present application and does not impose limitations on the scope of the present application unless otherwise required . Greater than, less than, exceeding, etc. are understood as not including this number, and above, below, within, etc. are understood as including this number. If it is described that the first and the second are only for the purpose of distinguishing technical features, it cannot be understood as indicating or implying relative importance, or indicating the number of the indicated technical features or the order of the indicated technical features. relation.
It should be noted that, unless otherwise specified, the singular forms "a", "an" and "the" used in the embodiments are also intended to include the plural forms, unless the context clearly indicates otherwise. In addition, unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art. The terms used in this specification are for the purpose of describing specific embodiments only and are not intended to limit the present application. As used herein, the term "and/or" includes any combination of one or more of the associated listed items.
In the era of precision medicine, predicting the drug responsiveness of cancer patients from their clinical characteristics and genomics is essential for helping clinicians formulate effective, low-toxicity treatment regimens. Predictive models of drug response are usually trained on different datasets. The most widely used drug prediction models are currently based on supervised learning, using either regression models or classification models. The former produce a specific drug sensitivity value, such as IC50 (half maximal inhibitory concentration); the latter produce a level of drug response, such as a high-sensitivity or low-sensitivity response.
A number of studies and methods aim to discover the relationship between the genome or transcriptome and the effect of cancer medication, so as to support cancer dosing regimens and improve treatment efficacy. However, current research and solutions remain some distance from practical use and cannot be applied efficiently in clinical scenarios. For example, using supervised learning to predict drug responsiveness from the genome or transcriptome has the following shortcomings: data analysis is limited to existing databases and lacks experimental and clinical validation; the methods rely on RNA sequencing rather than small gene sets, so rapid gene expression assays cannot be used, and because RNA sequencing takes several days to several weeks it is unsuitable for the intraoperative or immediate postoperative medication that clinical practice often requires; and drug efficacy prediction stops at data analysis without proposing a specific, rapid application scheme, so practical application is difficult, costly and slow.
In view of this, the embodiments of the present application provide a drug sensitivity prediction method, an electronic device and a computer-readable storage medium that can rapidly predict the efficacy of cancer drugs from a small number of genes, removing the dependence of efficacy prediction on time-consuming sequencing technologies such as RNA sequencing and reducing the cost of efficacy prediction.
In a first aspect, the embodiments of the present application provide a drug sensitivity prediction method.
In some embodiments, FIG. 1 shows a schematic flowchart of the drug sensitivity prediction method of the embodiments of the present application. The method specifically includes the following steps:
S100, acquiring gene sequencing data and drug characteristic data of the cancer cell tissue to be trained;
S200, preprocessing the gene sequencing data according to the drug characteristic data to obtain gene sample data;
S300, performing verification processing according to the gene sample data and the drug characteristic data to obtain a prediction model and a gene prediction list;
S400, predicting the drug sensitivity of the cancer cell tissue to be tested by means of the prediction model and the gene prediction list.
In step S100, the gene sequencing data of the cancer cell tissue to be trained and the drug characteristic data of the corresponding drugs are acquired. The gene sequencing data refers to RNA (ribonucleic acid) sequencing data of the cancer cell tissue to be trained. The drug characteristic data of a drug to be tested refers to sensitivity data or drug effect data for the different drugs applied to that tissue. One example is IC50 (half maximal inhibitory concentration) data describing the sensitivity of the tissue to a related drug: IC50 is the concentration at which 50% inhibition is reached, it is used to measure antibody sensitivity, and the lower the IC50 value, the higher the sensitivity of the antibody. Another example is drug effect classification data for the cancer cell tissue, which represents the effect of clinical medication on that tissue and is divided into different effect grades.
In some embodiments, the cancer cell tissue to be trained may be any cancer cell tissue selected from a gene database, or it may be a cancer cell tissue sample of a clinical patient obtained from a gene database. The cancer cell tissue to be trained provides the training data used later to build the prediction model.
Taking a selected cancer cell tissue to be trained as an example, the gene sequencing data of the cancer cell tissue and the drug characteristic data of the drug to be tested can be obtained from genome databases, namely Genomics of Drug Sensitivity in Cancer (GDSC) and the Cancer Cell Line Encyclopedia (CCLE). Specifically, the required information, that is, the gene sequencing data of the cancer cell tissue and the drug characteristic data of the drug to be tested, is obtained by consulting GDSC and CCLE.
Genomics of Drug Sensitivity in Cancer (GDSC) was developed by the Sanger Institute in the United Kingdom and collects the sensitivity and response of tumor cells to drugs. Variations in the cancer genome affect the outcome of clinical treatment, and different targets respond to drugs very differently, so data of this kind are very important for discovering potential tumor therapeutic targets. The GDSC data come from 75,000 experiments describing the responses of more than 1,000 kinds of tumor cells to about 200 anticancer drugs. The cancer genome mutation information in the database comes from the COSMIC database and includes oncogene point mutations, gene amplifications and deletions, tissue types and expression profiles. Users can search the database at three levels: compound, oncogene and cell line. The responses of an oncogene or cell line to different drugs are listed in detail, and the results are presented in a graphical interface that includes statistical analysis, volcano plots and related literature. Both the search results and the entire database can be downloaded for subsequent analysis.
The Cancer Cell Line Encyclopedia integrates genetic information such as DNA mutations, gene expression and chromosomal copy number, obtained through large-scale deep sequencing of 947 human cancer cell lines covering more than 30 tissue origins.
By searching GDSC and CCLE directly, the first sequencing data corresponding to the cancer cell tissue and the corresponding drug sensitivity data are obtained. The first sequencing data refers to RNA sequencing data of the cancer cell tissue, that is, data obtained by RNA-seq (transcriptome sequencing). The transcriptome is the collection of all transcription products in a cell under a given physiological condition. Transcriptome sequencing studies the sum of all RNAs that a specific cell can transcribe in a given functional state, mainly mRNA and ncRNA. The drug sensitivity data refers to IC50 data for drugs related to the cancer cell tissue.
Taking cancer cell tissue samples of clinical patients as an example, the second sequencing data and the drug effect classification data corresponding to those samples are obtained from The Cancer Genome Atlas (TCGA). TCGA contains clinical data, genomic variation, mRNA (messenger RNA) expression, miRNA (microRNA) expression, methylation and other data for a wide range of human cancers (including tumor subtypes), and is an important data source for cancer researchers.
In step S200, the acquired gene sequencing data of the cancer cell tissue is preprocessed according to the acquired drug characteristic data to obtain preprocessed gene sample data.
In some embodiments, taking a selected cancer cell tissue to be predicted as an example and referring to FIG. 2, step S200 specifically includes the following steps:
S211, normalizing the first sequencing data to obtain first sample data;
S212, screening the first sample data according to the drug sensitivity correlation coefficient between the first sample data and the drug sensitivity data to obtain second sample data;
S213, scoring the second sample data according to the drug sensitivity data to obtain scoring parameters of the second sample data;
S214, screening the second sample data based on the scoring parameters to obtain the gene sample data.
In step S211, the acquired first sequencing data is normalized to obtain the first sample data. Normalization here means normalizing the first sequencing data, that is, the RNA sequencing data of the cancer cell tissue to be trained, for gene length and sequencing depth, which yields the TPM (Transcripts Per Kilobase of exon model per Million mapped reads) corresponding to the first sequencing data. The first sequencing data is then screened according to its TPM to obtain the first sample data; for example, entries with a TPM below 1 are filtered out, and the remaining first sequencing data is the first sample data. In practical applications, the first sequencing data comprises multiple gene fragments. After normalization, the first sequencing data can be screened on the basis of gene expression levels: gene fragments with a TPM below 1 are removed, and gene fragments with a TPM above 1 are retained.
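The normalization and screening in step S211 can be sketched as follows. This is a minimal illustration for a single sample, not the applicant's implementation: the function names, gene names, read counts and gene lengths are all hypothetical, and keeping a gene whose TPM is exactly 1 is an assumption, since the description only speaks of values below and above 1.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts of one sample to TPM.

    counts: read counts per gene; lengths_kb: gene lengths in kilobases.
    """
    # Normalize for gene length first (reads per kilobase).
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    # Then normalize for sequencing depth so that the values sum to one million.
    scale = sum(rpk) / 1_000_000
    return [r / scale for r in rpk]


def filter_low_expression(genes, tpm_values, threshold=1.0):
    """Drop genes whose TPM is below the threshold (1 in the description).

    Assumption: a TPM exactly equal to the threshold is kept.
    """
    return [g for g, t in zip(genes, tpm_values) if t >= threshold]


genes = ["GENE_A", "GENE_B", "GENE_C"]  # hypothetical gene names
counts = [500, 1000, 0]                 # hypothetical read counts
lengths_kb = [1.0, 2.0, 1.5]            # hypothetical gene lengths
values = tpm(counts, lengths_kb)
kept = filter_low_expression(genes, values)
```

Normalizing for length before depth is what makes TPM values comparable across samples, because every sample then sums to the same total.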
In step S212, the first sample data is screened according to the drug sensitivity correlation coefficient between the first sample data and the drug sensitivity data of the corresponding drug to be tested, to obtain the second sample data. The drug sensitivity correlation coefficient is the Pearson correlation coefficient between the TPM of each gene in the first sample data and the IC50 data of the drug in question. The Pearson correlation coefficient measures the degree of correlation between two variables and takes a value between -1 and 1; here the two variables are the TPM of the gene and the IC50 data of the drug applied to the sample. The Pearson correlation coefficient between the TPM of the first sample data and the IC50 data of the drug is computed, and gene fragments whose coefficient has an absolute value below 0.1 are filtered out, giving the second sample data.
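A sketch of the correlation screen in step S212, under stated assumptions: expression values are held in a plain dictionary keyed by hypothetical gene names, each list runs over the same cell lines as the IC50 list, and a gene with constant expression is treated as uncorrelated. None of these choices come from the application itself.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant vector counts as uncorrelated
    return cov / (sx * sy)


def filter_by_correlation(expression, ic50, min_abs_r=0.1):
    """Keep genes whose |r| between TPM and IC50 is at least min_abs_r."""
    kept = {}
    for gene, values in expression.items():
        r = pearson(values, ic50)
        if abs(r) >= min_abs_r:
            kept[gene] = r
    return kept


ic50 = [1.0, 2.0, 3.0, 4.0]  # hypothetical IC50 values, one per cell line
expression = {               # hypothetical TPM values per gene
    "GENE_A": [10.0, 20.0, 30.0, 40.0],  # rises with IC50
    "GENE_B": [5.0, 1.0, 1.0, 5.0],      # symmetric, no linear trend
    "GENE_C": [8.0, 6.0, 4.0, 2.0],      # falls with IC50
}
kept = filter_by_correlation(expression, ic50)
```

Screening on the absolute value keeps both positively and negatively correlated genes, which matches the description's use of the absolute value of the coefficient.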
In steps S213 and S214, after the second sample data has been obtained by screening the first sample data with the drug sensitivity correlation coefficient, the second sample data is scored according to the drug sensitivity data of the relevant drug to obtain its scoring parameters, and the second sample data is then screened using the scoring parameters to obtain the gene sample data. For example, the scoring is based on Fisher's linear discriminant method: the mean and standard deviation of the gene expression levels are computed over part of the cancer cell tissues to which the drug to be tested is applied, the scoring parameters are computed from these means and standard deviations, and the genes of the cancer cell tissue are screened by scoring parameter. The genes that remain form the gene sample data, which in practice is a collection of multiple gene fragments.
In practical applications, the mean E1 and standard deviation STD1 of the expression level of a gene fragment are computed over the 15% of cancer cell lines with the highest IC50 for the drug, and the mean E2 and standard deviation STD2 are computed over the 15% of cancer cell lines with the lowest IC50. The scoring parameter is then computed as (E1-E2)/(STD1+STD2). The part of the second sample data with the highest scoring parameters is retained as the gene sample data. The gene sample data includes several gene fragments; the number of gene fragments to keep can be set according to actual needs, and the second sample data is screened down to that number.
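The scoring parameter (E1-E2)/(STD1+STD2) can be sketched as below. The sketch reads the 15% groups as the cell lines with the highest and lowest IC50 values, which is an interpretation of the text. The use of the population standard deviation, the minimum group size of two, and the zero returned when both deviations vanish are further assumptions, and the data are invented.

```python
import math
import statistics


def fisher_score(expr, ic50, fraction=0.15):
    """Score one gene: (E1 - E2) / (STD1 + STD2).

    expr: expression of the gene in each cell line; ic50: aligned IC50 values.
    E1, STD1 come from the highest-IC50 group, E2, STD2 from the lowest.
    """
    # Assumption: at least two cell lines per group so a deviation exists.
    n_group = max(2, round(len(ic50) * fraction))
    order = sorted(range(len(ic50)), key=lambda i: ic50[i])
    low = [expr[i] for i in order[:n_group]]
    high = [expr[i] for i in order[-n_group:]]
    # Assumption: population standard deviation; the text does not specify.
    e1, std1 = statistics.mean(high), statistics.pstdev(high)
    e2, std2 = statistics.mean(low), statistics.pstdev(low)
    denom = std1 + std2
    if denom == 0:
        return 0.0  # assumption: an undefined score is treated as 0
    return (e1 - e2) / denom


ic50 = [float(i) for i in range(1, 21)]              # 20 hypothetical cell lines
expr = [1.0, 2.0, 3.0] + [5.0] * 14 + [7.0, 8.0, 9.0]  # hypothetical TPM values
score = fisher_score(expr, ic50)
```

With 20 cell lines each group holds three lines, so the score compares the three most resistant lines with the three most sensitive ones.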
In some embodiments, taking the cancer cell tissue of a clinical patient as an example and referring to FIG. 3, step S200 specifically includes the following steps:
S221, normalizing the second sequencing data to obtain third sample data;
S222, testing the third sample data according to the drug effect classification data to obtain the gene sample data.
In step S221, the acquired second sequencing data is normalized to obtain the third sample data. Normalization here means normalizing the second sequencing data, that is, the RNA sequencing data of the cancer cell tissue, first for gene length and then for sequencing depth, which yields the TPM corresponding to the second sequencing data. The second sequencing data is then screened according to its TPM to obtain the third sample data; for example, entries with a TPM below 1 are filtered out, and the remaining second sequencing data is the third sample data. In practical applications, the second sequencing data comprises multiple gene fragments. After the second sequencing data has been normalized, it can be screened on the basis of the expression levels of the gene fragments: gene fragments with a TPM below 1 are removed, and gene fragments with a TPM above 1 are retained.
In step S222, the third sample data is tested according to the drug effect classification data of the drug to be tested on the cancer cell tissue, to obtain the tested gene sample data. Specifically, the third sample data is tested using the Mann-Whitney U test. The third sample data is divided into effective data and ineffective data according to the drug effect classification data; for each gene fragment, the test is computed from its expression levels in the effective data and its expression levels in the ineffective data, and the result is taken as the data value of that gene fragment. Gene fragments whose data value is below a certain threshold, for example below 0.1, are retained as the gene sample data. The drug effect classification data obtained from The Cancer Genome Atlas for a drug to be tested against a given cancer cell tissue contains several categories, such as "complete remission", "partial remission", "stable disease" and "disease progression". "Complete remission", "partial remission" and "stable disease" can be defined as "effective" and "disease progression" as "ineffective", so that the third sample data can be divided into effective sample data and ineffective sample data according to the drug effect classification data.
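The test in step S222 can be illustrated with a small, dependency-free sketch. It assumes that the data value compared against 0.1 is the two-sided P value of the Mann-Whitney U test, and it computes that P value with a normal approximation that has no tie or continuity correction, which is a simplification and not a detail from the application. A real analysis would more likely call a statistics library such as SciPy's `mannwhitneyu`. Gene names and expression values are hypothetical.

```python
import math

# Response grades treated as "effective", following the grouping in the text.
EFFECTIVE = {"complete remission", "partial remission", "stable disease"}


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def mann_whitney_p(x, y):
    """Two-sided P value by normal approximation (simplifying assumption)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def select_genes(expression, responses, alpha=0.1):
    """Keep genes whose P value between the two response groups is below alpha."""
    kept = {}
    for gene, values in expression.items():
        eff = [v for v, r in zip(values, responses) if r in EFFECTIVE]
        ineff = [v for v, r in zip(values, responses) if r not in EFFECTIVE]
        p = mann_whitney_p(eff, ineff)
        if p < alpha:
            kept[gene] = p
    return kept


responses = (["complete remission"] * 3 + ["stable disease"] * 2
             + ["disease progression"] * 5)
expression = {  # hypothetical TPM values, one per patient sample
    "GENE_A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],  # groups fully separated
    "GENE_B": [1, 3, 5, 7, 9, 2, 4, 6, 8, 10],  # groups interleaved
}
kept = select_genes(expression, responses)
```

Only the gene whose expression separates the effective and ineffective groups passes the screen; the interleaved gene does not.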
In step S300, verification is performed according to the gene sample data obtained by preprocessing and the drug characteristic data of the drug to be tested, to obtain the prediction model and the gene prediction list for that drug. The prediction model of the drug to be tested is obtained by verification on a preset mathematical model, which yields the optimal parameters of the prediction model. The gene prediction list consists of the gene fragments in the cancer cell tissue that play a key role in predicting sensitivity to the drug to be tested.
In some embodiments, taking a selected cancer cell tissue to be predicted as an example and referring to FIG. 4, step S300 specifically includes the following steps:
S311, obtaining the drug sensitivity correlation coefficient between the gene sample data and the drug sensitivity data, and obtaining the scoring parameters of the gene sample data;
S312, arranging the multiple gene fragments in descending order according to the drug sensitivity correlation coefficient and the scoring parameters;
S313, performing verification processing on the gene fragments arranged in descending order to obtain the model parameters of the prediction model and the gene list number;
S314, generating the gene prediction list according to the gene list number, and determining the prediction model according to the model parameters.
In steps S311 and S312, the drug sensitivity correlation coefficients between the gene fragments in the gene sample data and the drug sensitivity data of the drug are obtained, together with the scoring parameters of the gene sample data. The drug sensitivity correlation coefficient is the one obtained in step S212, and the scoring parameter is the one obtained in step S213. The two are combined under a given weight assignment to compute a score for each gene fragment in the gene sample data, and the gene fragments are arranged in descending order of that score. In practical applications, the drug sensitivity correlation coefficient is denoted S1 with a weight of 0.3, and the scoring parameter, that is, the Fisher discriminant score, is denoted S2 with a weight of 0.7, so the score of a gene fragment is S=0.3*S1+0.7*S2. The gene fragments are then arranged in descending order of the computed score, giving the sorted gene sample data.
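The weighted score S=0.3*S1+0.7*S2 and the descending sort can be sketched as follows. The gene names and input values are hypothetical, and the sketch passes S1 and S2 through unchanged because the text does not state whether S1 enters as a signed or an absolute value.

```python
def rank_genes(corr, fisher, w_corr=0.3, w_fisher=0.7):
    """Return (gene, score) pairs sorted by S = w_corr*S1 + w_fisher*S2, descending."""
    scores = {g: w_corr * corr[g] + w_fisher * fisher[g] for g in corr}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


corr = {"GENE_A": 0.5, "GENE_B": 0.2, "GENE_C": 0.9}    # hypothetical S1 values
fisher = {"GENE_A": 1.0, "GENE_B": 3.0, "GENE_C": 0.5}  # hypothetical S2 values
ranked = rank_genes(corr, fisher)
top_genes = [gene for gene, _ in ranked]
```

Because the Fisher score carries the larger weight, a gene with a modest correlation but a high discriminant score can still rank first, as GENE_B does here.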
In steps S313 and S314, verification processing is performed on the gene fragments in the sorted gene sample data to obtain the model parameters of the prediction model and the gene list number. In the verification processing, the first n gene fragments of the sorted gene sample data are selected, where n can be set according to actual needs, for example from 10 to 30 genes. Based on a K-nearest neighbor regression model, the model parameter k is enumerated, the k nearest neighboring points are selected, the drug sensitivity data of the cancer cell tissue to be trained is predicted, and 5-fold cross-validation is performed to obtain the prediction result. In practical applications, a prediction model of the response of the cancer tissue cells to be trained to a given drug is built from the acquired gene sample data using the K-nearest neighbor algorithm. The prediction model has optimal model parameters, that is, an optimal value of k, together with a number n of gene fragments from the gene sample data: if the first n gene fragments give the optimal model parameters of the prediction model, those first n gene fragments form the gene prediction list.
In a possible application example, the model parameters are determined as follows: the number of gene fragments n and the neighbor parameter k that give the largest area under the curve (AUC) of the receiver operating characteristic (ROC) curve after cross-validation of the prediction model are taken as the final model parameters.
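For one fixed pair of n and k, K-nearest neighbor regression with 5-fold cross-validation can be sketched as below. The Euclidean distance, the unweighted mean of the neighbors and the interleaved fold assignment are assumptions, since the application does not specify them, and the AUC computation is left out because the text does not say how continuous IC50 predictions are turned into classes. The data are invented.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Predict a value as the mean target of the k nearest training points."""
    # Assumption: Euclidean distance and an unweighted mean.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_val_predict(xs, ys, k, folds=5):
    """Out-of-fold predictions; sample i belongs to fold i % folds (assumption)."""
    preds = [None] * len(xs)
    for f in range(folds):
        train_idx = [i for i in range(len(xs)) if i % folds != f]
        test_idx = [i for i in range(len(xs)) if i % folds == f]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            preds[i] = knn_predict(train_x, train_y, xs[i], k)
    return preds


# Hypothetical data: one expression feature per cell line, IC50 rising linearly.
xs = [[float(i)] for i in range(10)]
ys = [2.0 * i for i in range(10)]
preds = cross_val_predict(xs, ys, k=2)
```

Each prediction is made by a model that never saw the sample being predicted, which is what makes the cross-validated score a fair basis for comparing different values of n and k.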
In a possible application example, the cancer cell tissue to be trained is a set of colorectal cancer cell lines. The GDSC database stores the first sequencing data (RNA sequencing data) of each cell line and the IC50 data of the cell lines under different drugs. Four chemotherapy drugs are selected: paclitaxel, 5-fluorouracil, cyclophosphamide and cisplatin. The example below is described under these conditions.
The RNA sequencing data of the colorectal cancer cell lines and the IC50 data of paclitaxel are obtained from the GDSC database. After the RNA sequencing data has been preprocessed, the genes are scored and ranked. The top 10 genes in the ranking are selected, the IC50 data of paclitaxel is predicted with the K-nearest neighbor regression model for each k from 1 to 30, the prediction results are cross-validated, and the AUC is computed; the largest AUC over the different k values is recorded together with the k value at which it occurs. Next, the top 11 genes in the ranking are selected and the same procedure is repeated for k from 1 to 30, recording the new largest AUC and the corresponding k value. The procedure is repeated for every gene count n from 10 to 30, finally giving the overall largest AUC and the values of n and k at which it occurs, denoted n_max and k_max. These values n_max and k_max are taken as the model parameters of the K-nearest neighbor regression model. From n_max it follows that the gene prediction list for paclitaxel sensitivity prediction consists of the top n_max genes in the ranking, and the optimal model parameter of the K-nearest neighbor regression model is k_max.
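The enumeration over n from 10 to 30 and k from 1 to 30 amounts to a grid search, sketched below. The `evaluate` callable stands for whatever cross-validated AUC computation is used; it and the toy scoring function are hypothetical placeholders, not part of the application.

```python
def grid_search(evaluate, n_values=range(10, 31), k_values=range(1, 31)):
    """Return (best_score, n_max, k_max) maximizing evaluate(n, k).

    evaluate: callable taking the gene count n and the neighbor count k and
    returning a score such as a cross-validated AUC (supplied by the caller).
    """
    best = None
    for n in n_values:
        for k in k_values:
            score = evaluate(n, k)
            # Keep the first combination that reaches the highest score.
            if best is None or score > best[0]:
                best = (score, n, k)
    return best


def toy_evaluate(n, k):
    """Hypothetical stand-in for a cross-validated AUC, peaking at n=17, k=5."""
    return 1.0 - 0.01 * abs(n - 17) - 0.02 * abs(k - 5)


best_score, n_max, k_max = grid_search(toy_evaluate)
```

The top n_max genes of the ranking would then form the gene prediction list, and k_max would be the neighbor parameter of the final model.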
The above procedure is then repeated for the other three drugs, 5-fluorouracil, cyclophosphamide and cisplatin, to obtain the gene prediction list of each drug and the optimal parameter k_max of its K-nearest neighbor regression model. In this way, a prediction model with its optimal parameter k_max can be built for each of the four chemotherapy drugs, and each drug has its own gene prediction list. In practical applications, the gene prediction lists of the four chemotherapy drugs can be merged into one larger set of gene prediction lists. When a cancer cell tissue to be tested needs to be predicted, the corresponding key gene fragments can be extracted directly from this set; because the key gene fragments are not specific to a single chemotherapy drug, the data is sufficient for all of them.
When the drug sensitivity of the cancer cell tissue of a new colorectal cancer cell line to these four drugs needs to be predicted, the K-nearest neighbor regression models built above can be used to predict the IC50 value for each drug. The response to each drug can then be assessed, and a suitable medication plan can be drawn up efficiently on the basis of those responses.
In some embodiments, taking the cancer cell tissue of a clinical patient as an example and referring to FIG. 5, step S300 specifically includes the following steps:
S321, obtaining gene scores of the multiple gene fragments of the gene sample data;
S322, arranging the multiple gene fragments in descending order according to the gene scores;
S323, performing cross-validation on the gene fragments arranged in descending order to obtain the model parameters of the prediction model and the gene list number;
S324, generating the gene prediction list according to the gene list number and the corresponding gene fragments, and determining the prediction model according to the model parameters.
In steps S321 and S322, the gene scores of the multiple gene fragments in the gene sample data are obtained. The gene score of a gene fragment is the negative of its P value, where the P value is calculated by the Mann-Whitney U test mentioned in step S221. The gene fragments are then sorted in descending order of gene score.
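The scoring and ranking of steps S321 and S322 can be sketched as follows. The P value here uses the normal approximation of the Mann-Whitney U test without tie or continuity correction, which is a simplification chosen to keep the example dependency-free; the text does not say which variant is used. Gene names and expression values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def mann_whitney_p(group_a, group_b):
    """Two-sided P value, normal approximation, no tie/continuity correction."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def rank_genes(expr_effective, expr_ineffective):
    """Score each gene as -P and return gene names in descending score order."""
    scores = {
        gene: -mann_whitney_p(expr_effective[gene], expr_ineffective[gene])
        for gene in expr_effective
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy expression values per gene, split by clinical label.
expr_effective = {"GENE_A": [10, 11, 12, 13], "GENE_B": [5, 1, 7, 3]}
expr_ineffective = {"GENE_A": [1, 2, 3, 4], "GENE_B": [2, 6, 4, 8]}
ranking = rank_genes(expr_effective, expr_ineffective)
print(ranking)
```

Because the score is the negated P value, sorting by descending score is the same as sorting by ascending P value, so the most discriminating genes come first.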
In step S323, cross-validation is performed on the gene fragments sorted in descending order to obtain the model parameters of the prediction model and the gene list size, which together parameterize the drug efficacy prediction model for the drug concerned. Here the prediction model is a K-nearest-neighbor classification model that predicts the efficacy of a given drug on the cancer cell tissue of a clinical patient. The model has optimal model parameters for that cancer cell tissue, namely the optimal nearest-neighbor parameter and the gene fragment parameter. Specifically, the first n gene fragments of the sorted list are selected in turn, where n can be chosen according to actual needs, for example from 10 to 30. For each n, the parameter k of the K-nearest-neighbor classification model is enumerated, where k is the number of neighboring points used and can likewise be chosen according to actual needs, for example from 1 to 30. The K-nearest-neighbor classification model for each enumerated k predicts whether the drug is effective, and 5-fold cross-validation is performed on the "effective" and "ineffective" labels. The accuracy and the F1 score are then calculated from the cross-validated predictions. The k and n that give the largest accuracy and F1 score are taken as the optimal model parameters of the prediction model, and the first n gene fragments form the gene prediction list. The F1 score is a statistical measure of the accuracy of a binary classification model.
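The two metrics named above can be computed as in this short sketch. Treating "effective" as the positive class is an assumption made for illustration, and the label lists are toy values.

```python
def accuracy_and_f1(y_true, y_pred, positive="effective"):
    """Return (accuracy, F1) for binary labels; F1 = 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    return accuracy, f1


y_true = ["effective", "effective", "ineffective", "ineffective", "effective"]
y_pred = ["effective", "ineffective", "ineffective", "effective", "effective"]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)
```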
In a possible application example, the cancer cell tissue sample of the clinical patient is a colorectal cancer cell line and the drug under test is 5-fluorouracil. The procedure is illustrated under these conditions.
RNA sequencing data of colorectal cancer cell line samples from clinical patients and efficacy grading data for the clinical use of 5-fluorouracil are obtained from the TCGA database. After the RNA sequencing data are standardized, genes with TPM below 1 are discarded. For each gene fragment, the gene expression levels in the effective samples and in the ineffective samples are compared by the Mann-Whitney U test to obtain a P value. The gene fragments are sorted in ascending order of P value; those with a P value greater than 0.1 are filtered out and those with a P value less than 0.1 are retained. The RNA sequencing data are labeled as effective or ineffective according to the drug efficacy grading data.
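The two filters in this example (TPM of at least 1, P value below 0.1) can be sketched as follows. Applying the TPM cutoff to the mean TPM across samples is an assumption, since the text does not say how TPM is aggregated; gene names and numbers are hypothetical.

```python
def filter_genes(mean_tpm, p_values, min_tpm=1.0, max_p=0.1):
    """Keep genes with mean TPM >= min_tpm and P < max_p, sorted by ascending P."""
    kept = [
        gene
        for gene in p_values
        if mean_tpm[gene] >= min_tpm and p_values[gene] < max_p
    ]
    return sorted(kept, key=lambda gene: p_values[gene])


# Hypothetical per-gene summaries.
mean_tpm = {"GENE_A": 12.0, "GENE_B": 0.4, "GENE_C": 3.0, "GENE_D": 8.0}
p_values = {"GENE_A": 0.03, "GENE_B": 0.01, "GENE_C": 0.25, "GENE_D": 0.002}

retained = filter_genes(mean_tpm, p_values)
print(retained)
```

GENE_B is dropped for low expression despite its small P value, and GENE_C is dropped because its P value exceeds the cutoff.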
Taking n from 10 to 30 as an example, the first 10 gene fragments of the colorectal cancer cells, sorted by gene score, are taken first. For each value of k from 1 to 30, the K-nearest-neighbor classification model predicts the response to 5-fluorouracil under 5-fold cross-validation, giving a prediction of effective or ineffective for each colorectal cancer cell sample. The accuracy of the K-nearest-neighbor classification model is calculated from these predictions for each parameter setting, and the maximum accuracy is recorded together with the corresponding k and n. Next, the first 11 genes sorted by gene score are taken and the enumeration of k from 1 to 30 is repeated. If the maximum accuracy obtained is greater than the maximum obtained for n equal to 10, the new accuracy and the corresponding k and n are recorded; otherwise nothing is recorded. These operations are repeated for every n from 10 to 30, finally giving the maximum accuracy over all cases and the corresponding n_max and k_max. For example, if the result for colorectal cancer and 5-fluorouracil is n_max = 15 and k_max = 5, the first 15 gene fragments are selected as the gene prediction list, that is, the list of key predictive genes.
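The enumeration over n and k described above can be sketched with a small pure-Python K-nearest-neighbor classifier. The fold assignment (sample index modulo 5), Euclidean distance, and majority voting are assumptions made for illustration, and the toy search uses n and k from 1 to 3 rather than the 10 to 30 and 1 to 30 ranges of the text. The expression matrix is made up, with columns assumed to be already sorted by descending gene score.

```python
import math
from collections import Counter


def knn_classify(train_x, train_y, query, k):
    """Majority vote among the k nearest training samples."""
    nearest = sorted(zip((math.dist(x, query) for x in train_x), train_y))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cv_accuracy(x, y, k, folds=5):
    """Accuracy under k-fold cross-validation; fold = sample index modulo folds."""
    correct = 0
    for fold in range(folds):
        train_idx = [i for i in range(len(x)) if i % folds != fold]
        test_idx = [i for i in range(len(x)) if i % folds == fold]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if knn_classify(train_x, train_y, x[i], k) == y[i]:
                correct += 1
    return correct / len(x)


def grid_search(expr, labels, n_values, k_values):
    """Return (best accuracy, n_max, k_max); update only on a strict improvement."""
    best = (-1.0, None, None)
    for n in n_values:
        x = [row[:n] for row in expr]  # first n genes of the sorted list
        for k in k_values:
            acc = cv_accuracy(x, labels, k)
            if acc > best[0]:
                best = (acc, n, k)
    return best


# Toy data: 10 samples x 3 genes; only the first gene separates the classes.
expr = [
    [10.0, 3.0, 7.0], [11.0, 8.0, 1.0], [9.0, 5.0, 4.0],
    [10.5, 2.0, 9.0], [9.5, 6.0, 2.0], [1.0, 4.0, 8.0],
    [2.0, 7.0, 3.0], [0.5, 1.0, 6.0], [1.5, 9.0, 5.0], [1.2, 3.5, 0.5],
]
labels = ["effective"] * 5 + ["ineffective"] * 5

best_acc, n_max, k_max = grid_search(expr, labels, range(1, 4), range(1, 4))
print(best_acc, n_max, k_max)
```

The strict comparison mirrors the text: a later (n, k) pair replaces the recorded one only when its accuracy is greater, so ties keep the earliest pair.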
In this embodiment of the present application, executing step S300 yields the gene set required for drug efficacy prediction, that is, the gene prediction list. In actual drug efficacy prediction, only a small number of genes therefore need to be measured, which increases prediction speed and reduces prediction cost. It also removes the dependence of drug efficacy prediction on time-consuming sequencing technologies such as RNA sequencing, so that cancer drug efficacy can be predicted quickly for a patient, which suits individualized medication during or after surgery.
In step S400, drug sensitivity prediction is performed using the generated prediction model and gene prediction list, for example predicting the sensitivity of a certain cancer cell tissue to a certain drug under test. Specifically, the established prediction model and the key genes in the gene prediction list are used to predict the drug sensitivity of the cancer cell tissue rapidly.
In some embodiments, referring to FIG. 6, step S400 specifically further includes:
S410: obtain the gene fragments corresponding to the cancer cell tissue under test according to the gene prediction list;
S420: obtain the gene expression levels of the gene fragments;
S430: input the gene expression levels into the prediction model to obtain the drug sensitivity result of the cancer cell tissue under test.
In steps S410 and S420, the corresponding key gene fragments are extracted from the cancer cell tissue under test according to the gene prediction list obtained in step S300; the number of gene fragments extracted is determined by the number in the gene prediction list. After the corresponding gene fragments are extracted from the cancer cell tissue using qPCR, gene chips, or similar techniques, the gene expression level of each gene fragment can be measured quickly.
In steps S430 and S440, the gene expression level of each gene fragment is standardized in the same way as for the cancer cell tissue used for training and is then input into the prediction model. The model outputs a prediction of whether the drug corresponding to the prediction model is effective for the cancer cell tissue under test; this prediction is the drug sensitivity result of the current cancer cell tissue for that drug. It should be noted that, in practice, multiple prediction models can be established by performing steps S100 to S300, one for each drug under test, and the drugs under test are selected according to the type of cancer cell tissue. In the embodiments of the present application, for example, the cancer cell tissue type is the colorectal cancer cell line, for which four chemotherapy drugs are selected: paclitaxel, 5-fluorouracil, cyclophosphamide, and cisplatin. Steps S100 to S300 are performed separately for each of the four chemotherapy drugs to generate a prediction model and a gene prediction list, so that each drug has its own prediction model and gene prediction list. When the effect of a chemotherapy drug on a clinical patient or on a colorectal cancer cell line needs to be predicted, the corresponding prediction model is selected, the gene sample extracted from the cancer cell tissue is used as input, and the corresponding gene expression levels are obtained to predict the drug effect. In this way it is possible to predict whether a chemotherapy drug is effective for the cancer cell tissue of a clinical patient, or to estimate the effect of a chemotherapy drug on cancer cell tissue of the same line, that is, the IC50 value.
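Applying "the same standardization as for the training tissue" to a new measurement can be sketched as follows. Per-gene z-scoring with the training mean and standard deviation is an assumption; the text does not name the standardization method at this point, and the numbers are toy values.

```python
import statistics


def fit_standardizer(train_rows):
    """Per-gene mean and population standard deviation from the training samples."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for a constant gene to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(row, means, stds):
    """Z-score one sample using statistics learned from the training data."""
    return [(value - m) / s for value, m, s in zip(row, means, stds)]


# Toy training matrix: rows are samples, columns are two key genes.
train_rows = [[1.0, 10.0], [3.0, 30.0]]
means, stds = fit_standardizer(train_rows)

# A new measurement (for example from qPCR) of the same two genes.
new_sample = standardize([2.0, 40.0], means, stds)
print(new_sample)
```

Reusing the training statistics, rather than recomputing them on the new sample, keeps the test input on the same scale the model was fitted on.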
In a possible application example, taking the prediction of the IC50 of a drug on a cancer cell line as an example, the embodiments of the present application obtain RNA sequencing data of the cancer cell lines used for training and IC50 data representing drug sensitivity from the GDSC and CCLE databases. The RNA sequencing data are preprocessed by filtering on gene expression level, filtering on the correlation between gene expression level and IC50 data, and filtering by Fisher's linear discriminant, so that only some of the gene fragments are retained. Cross-validation is then performed with a K-nearest-neighbor regression model: the parameters of the model are enumerated, the parameters that give the highest cross-validation accuracy are selected, and the optimal parameters of the prediction model and the resulting gene prediction list are determined. To predict the drug sensitivity of a cancer cell tissue, the key gene fragments of the tissue are obtained according to the gene prediction list, their gene expression levels are measured by qPCR or gene chip technology, and the gene expression levels are used as the input of the prediction model to obtain the drug sensitivity prediction for the tissue, that is, the predicted IC50 value.
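The correlation-based filter mentioned in this preprocessing can be sketched as follows. Using the Pearson coefficient and a cutoff of 0.3 on its absolute value are both assumptions for illustration; this passage does not specify the coefficient or a threshold, and the data are toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def filter_by_correlation(expr_by_gene, ic50, min_abs_r=0.3):
    """Keep genes whose expression correlates with IC50 (threshold is assumed)."""
    kept = {}
    for gene, expr in expr_by_gene.items():
        r = pearson(expr, ic50)
        if abs(r) >= min_abs_r:
            kept[gene] = r
    return kept


# Toy data: expression of each gene across four cell lines, and their IC50.
ic50 = [1.0, 2.0, 3.0, 4.0]
expr_by_gene = {
    "GENE_A": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with IC50
    "GENE_B": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with IC50
}
kept = filter_by_correlation(expr_by_gene, ic50)
print(kept)
```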
In a possible application example, taking the prediction of whether a drug is effective for the cancer cell tissue of a clinical patient as an example, RNA sequencing data of cancer tissue cell samples from clinical patients and the corresponding drug efficacy grading data for clinical medication are obtained from the TCGA database. The RNA sequencing data are preprocessed by filtering on gene expression level and filtering by the Mann-Whitney U test to obtain the preprocessed gene fragments. Cross-validation is then performed with a K-nearest-neighbor classification model: the parameters of the model are enumerated, the parameters that give the highest cross-validation accuracy are selected, and the optimal parameters of the prediction model and the resulting gene prediction list are determined. To predict the drug effect on the cancer cell tissue of a clinical patient, the key gene fragments of the tissue are obtained according to the gene prediction list, their gene expression levels are measured by qPCR or gene chip technology, and the standardized gene expression levels are used as the input of the prediction model. The output is the efficacy prediction of the drug corresponding to the prediction model for the patient's cancer cell tissue, that is, whether the drug is effective or ineffective, and a suitable medication plan can be specified for the patient accordingly.
In a possible application example, drug efficacy is predicted for the tumor cancer cell tissue of a clinical patient. Referring to FIG. 7, there are several candidate drugs for the tumor cancer cell tissue, for example candidate drug 1, candidate drug 2, and candidate drug 3. Each candidate drug has a corresponding gene prediction list, namely gene prediction list 1, gene prediction list 2, and gene prediction list 3; in practice each gene prediction list contains roughly a dozen key genes. The gene prediction lists of the three candidate drugs can be merged into one set, the gene prediction list set. Candidate drug 1, candidate drug 2, and candidate drug 3 also each have a trained prediction model, namely prediction model 1, prediction model 2, and prediction model 3. The tumor cancer cell tissue of the clinical patient is then obtained, and the gene expression levels of the corresponding gene fragments are measured by qPCR or gene chip technology according to the gene prediction lists. After standardization, the gene expression levels are used as the input of the prediction model of each candidate drug, giving an efficacy prediction for each of candidate drug 1, candidate drug 2, and candidate drug 3, that is, whether each candidate drug is effective for the patient's tumor cancer cell tissue. An individualized dosing regimen is formulated for the patient on the basis of these predictions, achieving precision medicine.
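The flow of FIG. 7, one measured expression profile dispatched to each candidate drug's own gene list and model, can be sketched as below. The `threshold_model` function is a hypothetical stand-in for a trained prediction model, not the K-nearest-neighbor model of the text, and the gene names and expression values are made up.

```python
def predict_all(expression, gene_lists, models):
    """Run each drug's model on the expression of that drug's key genes."""
    results = {}
    for drug, genes in gene_lists.items():
        features = [expression[gene] for gene in genes]
        results[drug] = models[drug](features)
    return results


def threshold_model(features):
    """Hypothetical placeholder model: effective if mean expression > 0.5."""
    mean = sum(features) / len(features)
    return "effective" if mean > 0.5 else "ineffective"


# Standardized expression of the merged gene panel for one patient sample.
expression = {"GENE_A": 1.2, "GENE_B": -0.4, "GENE_C": 0.9}
gene_lists = {
    "candidate drug 1": ["GENE_A", "GENE_B"],
    "candidate drug 2": ["GENE_C"],
}
models = {drug: threshold_model for drug in gene_lists}

predictions = predict_all(expression, gene_lists, models)
print(predictions)
```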
In the embodiments of the present application, using qPCR or gene chips to measure the gene expression levels of the gene fragments quickly shortens the time needed for the whole drug efficacy prediction, so that medication regimen recommendations can be given promptly during or after clinical surgery. This removes the dependence of drug efficacy prediction on time-consuming sequencing technologies such as RNA sequencing. Reducing the gene set required for drug efficacy prediction also allows the drug response of clinical patients to be predicted quickly and accurately, lowering the cost and time of prediction and improving prediction efficiency.
In a second aspect, an embodiment of the present application further provides an electronic device, including: at least one processor, and a memory communicatively connected to the at least one processor;
wherein the processor executes the drug sensitivity prediction method of the first-aspect embodiments by calling a computer program stored in the memory.
As a non-transitory computer-readable storage medium, the memory can store non-transitory software programs and non-transitory computer-executable programs, such as the drug sensitivity prediction method of the first-aspect embodiments of the present application. The processor implements that method by running the non-transitory software programs and instructions stored in the memory.
The memory may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data used in executing the drug sensitivity prediction method of the first-aspect embodiments. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The non-transitory software programs and instructions required to implement the drug sensitivity prediction method of the first-aspect embodiments are stored in the memory and, when executed by one or more processors, carry out that method.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the drug sensitivity prediction method of the first-aspect embodiments.
In some embodiments, the computer-readable storage medium stores computer-executable instructions that are executed by one or more control processors, for example by a processor in the electronic device of the second-aspect embodiments, causing the one or more processors to execute the drug sensitivity prediction method of the first-aspect embodiments.
The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, or appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
In the description of this specification, references to the terms "some embodiments", "examples", "specific examples", or "some examples" mean that the specific features or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present application. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example.

Claims (10)

  1. A drug sensitivity prediction method, characterized by comprising:
    acquiring gene sequencing data and drug characteristic data of cancer cell tissue used for training;
    preprocessing the gene sequencing data according to the drug characteristic data to obtain gene sample data;
    performing verification processing according to the gene sample data and the drug characteristic data to obtain a prediction model and a gene prediction list;
    performing drug sensitivity prediction on cancer cell tissue to be tested by means of the prediction model and the gene prediction list.
  2. The drug sensitivity prediction method according to claim 1, characterized in that the gene sequencing data comprise first sequencing data and the drug characteristic data comprise drug sensitivity data;
    correspondingly, the acquiring gene sequencing data and drug characteristic data of cancer cell tissue used for training comprises:
    acquiring the first sequencing data of the cancer cell tissue used for training and the corresponding drug sensitivity data from a genome database.
  3. The drug sensitivity prediction method according to claim 2, characterized in that the preprocessing the gene sequencing data according to the drug characteristic data to obtain gene sample data comprises:
    standardizing the first sequencing data to obtain first sample data;
    screening the first sample data according to a drug sensitivity correlation coefficient between the first sample data and the drug sensitivity data to obtain second sample data;
    scoring the second sample data according to the drug sensitivity data to obtain a scoring parameter of the second sample data;
    screening the second sample data based on the scoring parameter to obtain the gene sample data.
  4. The drug sensitivity prediction method according to claim 3, characterized in that the performing verification processing according to the gene sample data and the drug characteristic data to obtain a prediction model and a gene prediction list comprises:
    acquiring a drug sensitivity correlation coefficient between the gene sample data and the drug sensitivity data, and acquiring a scoring parameter of the gene sample data, the gene sample data comprising a plurality of gene fragments;
    sorting the plurality of gene fragments in descending order according to the drug sensitivity correlation coefficient and the scoring parameter;
    performing verification processing on the plurality of gene fragments sorted in descending order to obtain model parameters of the prediction model and a gene list size;
    generating the gene prediction list according to the gene list size, and determining the prediction model according to the model parameters.
  5. The drug sensitivity prediction method according to claim 1, characterized in that the gene sequencing data comprise second sequencing data and the drug characteristic data comprise drug efficacy grading data;
    correspondingly, the acquiring gene sequencing data and drug characteristic data of cancer cell tissue used for training comprises:
    acquiring the second sequencing data and the drug efficacy grading data of the cancer cell tissue used for training from a genome atlas database.
  6. The drug sensitivity prediction method according to claim 5, characterized in that the preprocessing the gene sequencing data according to the drug characteristic data to obtain gene sample data comprises:
    standardizing the second sequencing data to obtain third sample data;
    testing the third sample data according to the drug efficacy grading data to obtain the gene sample data.
  7. The drug sensitivity prediction method according to claim 6, characterized in that the performing verification processing according to the gene sample data and the drug characteristic data to obtain a prediction model and a gene prediction list comprises:
    acquiring gene scores of a plurality of gene fragments of the gene sample data;
    sorting the plurality of gene fragments in descending order according to the gene scores;
    performing cross-validation on the plurality of gene fragments sorted in descending order to obtain model parameters of the prediction model and a gene list size;
    generating the gene prediction list according to the gene list size and the corresponding plurality of gene fragments, and determining the prediction model according to the model parameters.
  8. The drug sensitivity prediction method according to claim 4 or 7, characterized in that the performing drug sensitivity prediction on the cancer cell tissue to be tested by means of the prediction model and the gene prediction list comprises:
    acquiring gene fragments corresponding to the cancer cell tissue to be tested according to the gene prediction list;
    acquiring gene expression levels of the gene fragments;
    inputting the gene expression levels into the prediction model to obtain a drug sensitivity result of the cancer cell tissue to be tested.
  9. An electronic device, characterized by comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions that are executed by the at least one processor, so that the at least one processor, when executing the instructions, implements the drug sensitivity prediction method according to any one of claims 1 to 8.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the drug sensitivity prediction method according to any one of claims 1 to 8.
PCT/CN2022/071509 2021-02-09 2022-01-12 Drug sensitivity prediction method, electronic device and computer-readable storage medium WO2022170909A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110175355.4 2021-02-09
CN202110175355.4A CN112951327B (en) 2021-02-09 2021-02-09 Drug sensitivity prediction method, electronic device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2022170909A1 (en) 2022-08-18

Family

ID=76244452

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071509 WO2022170909A1 (en) 2021-02-09 2022-01-12 Drug sensitivity prediction method, electronic device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN112951327B (en)
WO (1) WO2022170909A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115458188A (en) * 2022-11-11 2022-12-09 神州医疗科技股份有限公司 Mining method and system for drug high-efficiency response candidate marker
CN117079716A (en) * 2023-09-13 2023-11-17 江苏运动健康研究院 Deep learning prediction method of tumor drug administration scheme based on gene detection

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951327B (en) * 2021-02-09 2023-10-27 清华大学深圳国际研究生院 Drug sensitivity prediction method, electronic device, and computer-readable storage medium
CN113362895A (en) * 2021-06-15 2021-09-07 上海基绪康生物科技有限公司 Comprehensive analysis method for predicting anti-cancer drug response related gene
CN116597902B (en) * 2023-04-24 2023-12-01 浙江大学 Method and device for screening multiple groups of chemical biomarkers based on drug sensitivity data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012122017A2 (en) * 2011-03-04 2012-09-13 Cornell University Method for rapid identification of drug targets and drug mechanisms of action in human cells
CN105005693A (en) * 2015-07-08 2015-10-28 中国科学院合肥物质科学研究院 Genetic material specificity based tumor cell drug sensitivity evaluation method
US20160224723A1 (en) * 2015-01-29 2016-08-04 The Trustees Of Columbia University In The City Of New York Method for predicting drug response based on genomic and transcriptomic data
CN107609326A (en) * 2017-07-26 2018-01-19 同济大学 Drug sensitivity prediction method in the accurate medical treatment of cancer
CN110310703A (en) * 2019-06-25 2019-10-08 中国人民解放军军事科学院军事医学研究院 Prediction technique, device and the computer equipment of drug
CN111223577A (en) * 2020-01-17 2020-06-02 江苏大学 Deep learning-based synergistic anti-tumor multi-drug combination effect prediction method
CN112951327A (en) * 2021-02-09 2021-06-11 清华大学深圳国际研究生院 Drug sensitivity prediction method, electronic device and computer-readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190214136A1 (en) * 2017-07-11 2019-07-11 Regents Of The University Of Minnesota Predictive biomarkers of drug response in malignancies

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115458188A (en) * 2022-11-11 2022-12-09 神州医疗科技股份有限公司 Mining method and system for drug high-efficiency response candidate marker
CN117079716A (en) * 2023-09-13 2023-11-17 江苏运动健康研究院 Deep learning prediction method of tumor drug administration scheme based on gene detection
CN117079716B (en) * 2023-09-13 2024-04-05 江苏运动健康研究院 Deep learning prediction method of tumor drug administration scheme based on gene detection

Also Published As

Publication number Publication date
CN112951327A (en) 2021-06-11
CN112951327B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
WO2022170909A1 (en) Drug sensitivity prediction method, electronic device and computer-readable storage medium
JP7368483B2 (en) An integrated machine learning framework for estimating homologous recombination defects
US11621083B2 (en) Cancer evolution detection and diagnostic
CN112888459B (en) Convolutional neural network system and data classification method
JP6681337B2 (en) Device, kit and method for predicting the onset of sepsis
CN109689891A (en) The method of segment group spectrum analysis for cell-free nucleic acid
WO2022141775A1 (en) Construction method for tumor immune checkpoint inhibitor therapy effectiveness evaluation model based on dna methylation spectrum
CN110770838A (en) Method and system for determining clonality of somatic mutations
CN113362894A (en) Method for predicting syndromal cancer driver gene
CN113862351B (en) Kit and method for identifying extracellular RNA biomarkers in body fluid sample
US20210090686A1 (en) Single cell rna-seq data processing
CN111540410B (en) System and method for predicting a smoking status of an individual
CN111785319B (en) Drug repositioning method based on differential expression data
US20200294622A1 (en) Subtyping of TNBC And Methods
EP4297037A1 (en) Device for determining an indicator of presence of hrd in a genome of a subject
Shu et al. Mergeomics: integration of diverse genomics resources to identify pathogenic perturbations to biological systems
CN116486913B (en) System, apparatus and medium for de novo predictive regulatory mutations based on single cell sequencing
US20230230655A1 (en) Methods and systems for assessing fibrotic disease with deep learning
CN117594118A (en) Method for predicting tumor genome biomarker by combining convolutional neural network with network medical method
Menand Machine learning based novel biomarkers discovery for therapeutic use in" pan-gyn" cancers
Ramírez 1 GENOMICS, BIOINFORMATICS AND DATA MINING: AN OVERVIEW
Mansmann et al. Classification and prediction in pharmacogenetics–context, construction and validation
WO2023023125A1 (en) Methods for characterizing infections and methods for developing tests for the same
CN117275585A (en) Method for constructing lung cancer early-screening model based on LP-WGS and DNA methylation and electronic equipment
dos Santos Valente Development of computational tools for the integrated analysis of DNA microarray data with applications in cancer research

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22752071; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 22752071; Country of ref document: EP; Kind code of ref document: A1)